Commit Graph

4806 Commits

Author SHA1 Message Date
Alexander V. Chernikov
41ce0e34ea [fib algo] Update fib_gen counter under FIB_MOD_LOCK.
MFC after:	3 days
2021-04-28 20:23:03 +00:00
Alexander V. Chernikov
f9668e42b4 Add rib_walk_from() wrapper for selective rib tree traversal.
Provide wrapper for the rnh_walktree_from() rib callback.
As currently `struct rib_head` is considered internal to the
 routing subsystem, this wrapper is necessary to maintain isolation
 from the external code.

Differential Revision: https://reviews.freebsd.org/D29971
MFC after:	1 week
2021-04-28 08:09:45 +00:00
Alexander V. Chernikov
8a0d57baec [fib algo] Delay algo init at fib growth to allow reliable use of the rib KPI.
Currently, most of the rib(9) KPI does not use rnh pointers, using
 fibnum and family parameters to determine the rib pointer instead.
This works well except for the case when we initialize new rib pointers
 during fib growth.
In that case, there is no mapping between fib/family and the new rib,
 as an entirely new rib pointer array is populated.

Address this by delaying fib algo initialization till after switching
 to the new pointer array and updating the number of fibs.
Set datapath pointer to the dummy function, so the potential callers
 won't crash the kernel in the brief moment when the rib exists, but
 no fib algo is attached.

This change avoids creating duplicates of existing rib functions
 with altered signatures.

Differential Revision: https://reviews.freebsd.org/D29969
MFC after:	1 week
2021-04-27 22:10:08 +00:00
Alexander V. Chernikov
439d087d0b [fib algo] always commit static routes synchronously.
Modular fib lookup framework features logic that allows
 route update batching for the algorithms that cannot easily
 apply the routing change without rebuilding. As a result,
 dataplane lookups may return old data until the sync
 takes place. With the default sync timeout of 50ms, it is
 possible that a new binary like ping(8) executed right after
 route(8) will still use the old fib data.

To address some aspects of the problem, the framework executes
 all rtable changes without RTF_GATEWAY synchronously.

To fix the aforementioned problem, this diff extends sync
 execution to all RTF_STATIC routes (e.g. ones maintained by
 route(8)).
This fixes a bunch of tests in the networking space.
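
The rule described above can be modelled in a few lines. This is a userspace sketch, not the kernel code: the function name `needs_sync_commit` is hypothetical, and the flag values are assumed to mirror `RTF_GATEWAY` and `RTF_STATIC` from `<net/route.h>`.

```python
# Userspace model only; flag values assumed to match <net/route.h>.
RTF_GATEWAY = 0x2
RTF_STATIC = 0x800


def needs_sync_commit(rt_flags: int) -> bool:
    """Return True when a route change should bypass update batching."""
    # Routes without a gateway were already committed synchronously.
    if not rt_flags & RTF_GATEWAY:
        return True
    # The change extends this to static routes, e.g. ones added by route(8).
    return bool(rt_flags & RTF_STATIC)
```

Under this model a dynamically learned gateway route (no `RTF_STATIC`) is the only kind that may still wait for the batched sync.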

Reported by:	ci, arichardson
MFC after:	2 weeks
2021-04-27 08:31:40 +00:00
Alexander V. Chernikov
25682e6a49 Fix rtsock sockaddr alignment.
b31fbebeb3 introduced alloc_sockaddr_aligned() which, in fact,
 failed to produce aligned addresses.

Reported by:	Oskar Holmlund <oskar.holmlund at yahoo.com>
MFC after:	immediately
2021-04-27 08:04:19 +00:00
Alexander V. Chernikov
bc5ef45aec Fix dtrace CTF for the rib_head.
33cb3cb2e3 introduced a `rib_head` structure field under the
FIB_ALGO define. This may be problematic for the CTF, as some
 of the files including `route_var.h` do not have `fib_algo`
 defined.

Make dtrace happy by making the field unconditional.

Suggested by:	markj
2021-04-27 07:47:53 +00:00
Kristof Provost
5f5bf88949 pfsync: Expose PFSYNCF_OK flag to userspace
Add 'syncok' field to ifconfig's pfsync interface output. This allows
userspace to figure out when pfsync has completed the initial bulk
import.

Reviewed by:	donner
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29948
2021-04-26 14:31:17 +02:00
Kristof Provost
6fcc8e042a pf: Allow multiple labels to be set on a rule
Allow up to 5 labels to be set on each rule.
This offers more flexibility in using labels. For example, it replaces
the custom 'schedule' keyword used by pfSense to terminate states
according to a schedule.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29936
2021-04-26 14:14:21 +02:00
Patrick Kelsey
ca7005f189 iflib: Improve mapping of TX/RX queues to CPUs
iflib now supports mapping each (TX,RX) queue pair to the same CPU
(default), to separate CPUs, or to a pair of physical and logical CPUs
that share the same L2 cache.  The mapping mechanism supports unequal
numbers of TX and RX queues, with the excess queues always being
mapped to consecutive physical CPUs.  When the platform cannot
distinguish between physical and logical CPUs, all are treated as
physical CPUs.  See the comment on get_cpuid_for_queue() for the
entire matrix.

The following device-specific tunables influence the mapping process:
dev.<device>.<unit>.iflib.core_offset       (existing)
dev.<device>.<unit>.iflib.separate_txrx     (existing)
dev.<device>.<unit>.iflib.use_logical_cores (new)

The following new, read-only sysctls provide visibility of the mapping
results:
dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu

When an iflib driver allocates TX softirqs without providing reference
RX IRQs, iflib now binds those TX softirqs to CPUs using the above
mapping mechanism (that is, treats them as if they were TX IRQs).
Previously, such bindings were left up to the grouptaskqueue code and
thus fell outside of the iflib CPU mapping strategy.

Reviewed by:	kbowling
Tested by:	olivier, pkelsey
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D24094
2021-04-26 01:06:34 -04:00
Alexander V. Chernikov
7d222ce3c1 Fix NOINET[6],!VIMAGE builds after FIB_ALGO addition to GENERIC
Reported by:	jbeich
PR:		255390
2021-04-21 05:53:42 +01:00
Alexander V. Chernikov
67372fb3e0 Fix NOINET[6] build after enabling FIB_ALGO in GENERIC.
Submitted by:	jbeich
PR:		255389
2021-04-21 02:49:18 +01:00
Alexander V. Chernikov
c23385612d [fib algo] Do not print algo attach/detach message on boot
MFC after:	1 day
2021-04-25 08:58:06 +00:00
Alexander V. Chernikov
a81e2e7890 Make gcc happy by initializing error in rib_handle_ifaddr_info().
2021-04-25 08:44:59 +00:00
Stefan Eßer
6409e59427 Fix build with gcc
Correctly declare function without arguments as f(void) instead of f().
2021-04-25 10:15:17 +02:00
Alexander V. Chernikov
5d1403a79a [rtsock] Enforce netmask/RTF_HOST consistency.
Traditionally we had 2 sources of information on whether an
 add/delete route request targets a network or a host route:
netmask (RTA_NETMASK) and the RTF_HOST flag.

The former one is tricky: netmask can be empty or can explicitly
 specify the host netmask. Parsing netmask sockaddr requires per-family
 parsing and that's what rtsock code traditionally avoided. As a result,
 consistency was not enforced and it was possible to specify a network with
 the RTF_HOST flag and vice versa.

Continue normalization efforts from D29826 and D29826 and ensure that
 RTF_HOST flag always reflects host/network data from netmask field.

Differential Revision: https://reviews.freebsd.org/D29958
MFC after:	2 days
2021-04-24 22:41:27 +00:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Andrew Gallatin
3183d0b680 iflib: initialize LRO unconditionally
Changes to the LRO code have exposed a bug in iflib where devices
which are not capable of doing LRO are still calling
tcp_lro_flush_all(), even when they have not initialized the LRO
context. This used to be mostly harmless, but the LRO code now sets
the VNET based on the ifp in the lro context and will try to access it
through a NULL ifp resulting in a panic at boot.

To fix this, we unconditionally initialize LRO so that we have a
valid LRO context when calling tcp_lro_flush_all(). One alternative is
to check the device capabilities before calling tcp_lro_flush_all() or
adding a new state flag in the ctx. However, it seems unwise to add an
extra, mostly useless test for higher performance devices when we can
just initialize LRO for all devices.

Reviewed by: erj, hselasky, markj, olivier
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29928
2021-04-23 05:55:20 -04:00
Alexander V. Chernikov
33cb3cb2e3 Fix rib generation count for fib algo.
Currently, PCB caching mechanism relies on the rib generation
 counter (rnh_gen) to invalidate cached nhops/LLE entries.

With certain fib algorithms, it is now possible that the
 datapath lookup state applies RIB changes with some delay.
In that scenario, PCB cache will invalidate on the RIB change,
 but the new lookup may result in the same nexthop being returned.
When fib algo finally gets in sync with the RIB changes, PCB cache
 will not receive any notification and will end up caching the stale data.

To fix this, introduce additional counter, rnh_gen_rib, which is used
 only when FIB_ALGO is enabled.
This counter is incremented by the control plane. Each time when fib algo
 synchronises with the RIB, it updates rnh_gen to the current rnh_gen_rib value.
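
The two-counter scheme can be pictured with a toy model. The field names `rnh_gen` and `rnh_gen_rib` come from the message above; the classes and method names are hypothetical and exist only for illustration.

```python
class RibHead:
    """Toy model of the two generation counters, not the kernel structure."""

    def __init__(self) -> None:
        self.rnh_gen = 0      # generation that PCB caches compare against
        self.rnh_gen_rib = 0  # generation of the RIB itself

    def rib_change(self) -> None:
        # Control plane: a route was added, changed or deleted.
        self.rnh_gen_rib += 1

    def algo_sync(self) -> None:
        # Fib algo caught up with the RIB: publish the RIB generation.
        self.rnh_gen = self.rnh_gen_rib


class PcbCache:
    """Cached lookup result tagged with the generation it was taken at."""

    def __init__(self, rh: RibHead) -> None:
        self.rh = rh
        self.gen = rh.rnh_gen

    def is_valid(self) -> bool:
        return self.gen == self.rh.rnh_gen
```

In this model a cache entry stays valid after `rib_change()` and is invalidated only by `algo_sync()`, which is the moment a fresh lookup can actually return the new nexthop.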

Differential Revision: https://reviews.freebsd.org/D29812
Reviewed by:	donner
MFC after:	2 weeks
2021-04-20 22:02:41 +00:00
Alexander V. Chernikov
b31fbebeb3 Relax rtsock message restrictions.
Address multiple issues with strict rtsock message validation.

D28668 "normalisation" approach was based on the assumption that
 we always have at least "standard" sockaddr len.
It turned out to be false - certain older applications like quagga
 or routed abuse sin[6]_len field and set it to the offset to the
 first fully-zero bit in the mask. It is impossible to normalise
 such sockaddrs without reallocation.

With that in mind, change the approach to use a distinct memory
 buffer for the altered sockaddrs. This allows supporting the older
 software while maintaining the guarantee on the "standard" sockaddrs.

PR:	255273,255089
Differential Revision:	https://reviews.freebsd.org/D29826
MFC after:	3 days
2021-04-20 21:34:19 +00:00
Alexander V. Chernikov
758c9d54d4 Improve error reporting in rtsock.c
MFC after:	3 days
2021-04-19 20:36:41 +00:00
Kristof Provost
42ec75f83a pf: Optionally attempt to preserve rule counter values across ruleset updates
Usually rule counters are reset to zero on every update of the ruleset.
With keepcounters set pf will attempt to find matching rules between old
and new rulesets and preserve the rule counters.

MFC after:	4 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29780
2021-04-19 14:31:47 +02:00
Kristof Provost
4f1f67e888 pf: PFRULE_REFS should not be user-visible
Split the PFRULE_REFS flag from the rule_flag field. PFRULE_REFS is a
kernel-internal flag and should not be exposed to or read from
userspace.

MFC after:	4 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29778
2021-04-19 14:31:47 +02:00
Jonah Caplan
0e4025bffa bridgestp: validate timer values in config BPDU
IEEE Std 802.1D-2004 Section 17.14 defines permitted ranges for timers.
Incoming BPDU messages should be checked against the permitted ranges.
The rest of 17.14 appears to be enforced already.
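
A minimal sketch of such a range check, written as a userspace model. The numeric ranges below are my reading of the 802.1D-2004 timer table and should be verified against the standard; the function name is hypothetical. BPDU timer fields are taken to be in units of 1/256 second.

```python
# Permitted ranges in seconds; an assumption to verify against 802.1D-2004.
HELLO_TIME_RANGE = (1.0, 2.0)
MAX_AGE_RANGE = (6.0, 40.0)
FORWARD_DELAY_RANGE = (4.0, 30.0)

BPDU_TICKS_PER_SEC = 256  # BPDU timers are encoded in 1/256 s units


def bpdu_timers_valid(max_age: int, hello_time: int, fwd_delay: int) -> bool:
    """Check raw config BPDU timer values against the permitted ranges."""

    def in_range(raw: int, bounds: tuple[float, float]) -> bool:
        low, high = bounds
        return low * BPDU_TICKS_PER_SEC <= raw <= high * BPDU_TICKS_PER_SEC

    return (
        in_range(max_age, MAX_AGE_RANGE)
        and in_range(hello_time, HELLO_TIME_RANGE)
        and in_range(fwd_delay, FORWARD_DELAY_RANGE)
    )
```

A BPDU that fails the check would presumably be dropped rather than applied; what the driver does with it is not stated in the message.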

PR:		254924
Reviewed by:	kp, donner
Differential Revision:	https://reviews.freebsd.org/D29782
2021-04-19 12:09:18 +02:00
Alexander V. Chernikov
0abb6ff590 fib algo: do not reallocate datapath index for datapath ptr update.
Fib algo uses a per-family array indexed by the fibnum to store
 lookup function pointers and per-fib data.

Each algorithm rebuild currently requires re-allocating this array
 to support atomic change of two pointers.

As in reality most of the changes actually involve changing only
 data pointer, add a shortcut performing in-flight pointer update.

MFC after:	2 weeks
2021-04-18 16:12:13 +01:00
Alexander V. Chernikov
e2f79d9e51 Fib algo: extend KPI by allowing algo to set datapath pointers.
Some algorithms may require updating datapath and control plane
 algo pointers after the (batched) updates.

Export fib_set_datapath_ptr() to allow setting the new datapath
 function or data pointer from the algo.
Add fib_set_algo_ptr() to allow updating algo control plane
 pointer from the algo.
Add fib_epoch_call() epoch(9) wrapper to simplify freeing old
 datapath state.

Reviewed by:		zec
Differential Revision: https://reviews.freebsd.org/D29799
MFC after:		1 week
2021-04-18 16:12:12 +01:00
Alexander V. Chernikov
6b8ef0d428 Add batched update support for the fib algo.
Initial fib algo implementation was built on a very simple set of
 principles w.r.t updates:

1) algorithm is either able to apply the change synchronously (DIR24-8)
 or requires full rebuild (bsearch, lradix).
2) framework falls back to rebuild on every error (memory allocation,
 nhg limit, other internal algo errors, etc).

This change brings the new "intermediate" concept - batched updates.
Algorithm can indicate that the particular update has to be handled in
 batched fashion (FLM_BATCH).
The framework will write this update and other updates to the temporary
 buffer instead of pushing them to the algo callback.
Depending on the update rate, the framework will batch 50..1024 ms of updates
 and submit them to a different algo callback.

This functionality is handy for the slow-to-rebuild algorithms like DXR.
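
The control flow described above might look roughly like this in a userspace model. Only `FLM_BATCH` is named in the message; the other result names, the `Framework` class and its callbacks are hypothetical, and the timer is reduced to an explicit `on_timer()` call.

```python
from enum import Enum, auto


class FlmResult(Enum):
    SUCCESS = auto()  # change applied synchronously
    REBUILD = auto()  # algo needs a full rebuild
    BATCH = auto()    # algo asks for batched handling (FLM_BATCH)


class Framework:
    """Toy model of the update path; not the kernel implementation."""

    def __init__(self, change_cb, batch_cb) -> None:
        self.change_cb = change_cb  # per-update algo callback
        self.batch_cb = batch_cb    # separate callback taking a batch
        self.pending = []           # temporary buffer of updates
        self.rebuilds = 0

    def on_rib_change(self, change) -> None:
        if self.pending:
            # Already batching: buffer instead of calling the algo.
            self.pending.append(change)
            return
        result = self.change_cb(change)
        if result is FlmResult.BATCH:
            self.pending.append(change)
        elif result is FlmResult.REBUILD:
            self.rebuilds += 1

    def on_timer(self) -> None:
        # Runs once the batching interval has elapsed.
        if self.pending:
            batch, self.pending = self.pending, []
            self.batch_cb(batch)
```

Once one update has been marked for batching, later updates go to the buffer as well, so the algo sees them in order in a single batch callback.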

Differential Revision:	https://reviews.freebsd.org/D29588
Reviewed by:	zec
MFC after:	2 weeks
2021-04-14 23:54:11 +01:00
Tai-hwa Liang
d9b61e7153 if_firewire: fixing panic upon packet reception for VNET build
netisr_dispatch_src() needs valid VNET pointer or firewire_input() will panic
when receiving a packet.

Reviewed by:	glebius
MFC after:	2 weeks
2021-04-13 22:59:58 +00:00
Kurosawa Takahiro
2aa21096c7 pf: Implement the NAT source port selection of MAP-E Customer Edge
MAP-E (RFC 7597) requires special care for selecting source ports
in NAT operation on the Customer Edge because some bits of the port
number are used by the Border Relay to identify the other side of the
IPv4-over-IPv6 tunnel.

PR:		254577
Reviewed by:	kp
Differential Revision:	https://reviews.freebsd.org/D29468
2021-04-13 10:53:18 +02:00
Alexander V. Chernikov
afbb64f1d8 Fix vlan creation for the older ifconfig(8) binaries.
Reported by:	allanjude
MFC after:	immediately
2021-04-11 18:13:09 +01:00
Alexander V. Chernikov
7f5f3fcc32 Fix direct route installation with net/bird.
Slightly relax the gateway validation rules imposed by
 2fe5a79425, by requiring only the first 8 bytes (everything
 before sdl_data) to be present in the AF_LINK gateway.

Reported by:	olivier
2021-04-10 16:31:16 +01:00
Alexander V. Chernikov
63dceebe68 Appease -Wsign-compare in radix.c
Differential Revision:	https://reviews.freebsd.org/D29661
Submitted by:	zec
MFC after:	2 weeks
2021-04-10 13:48:25 +00:00
Alexander V. Chernikov
caf2f62765 Allow specifying the debugnet fib via sysctl/tunable.
Differential Revision:	https://reviews.freebsd.org/D29593
Reviewed by:		donner
MFC after:		2 weeks
2021-04-10 13:47:49 +00:00
Kristof Provost
d710367d11 pf: Implement nvlist variant of DIOCGETRULE
MFC after:	4 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29559
2021-04-10 11:16:01 +02:00
Kristof Provost
5c62eded5a pf: Introduce nvlist variant of DIOCADDRULE
This will make future extensions of the API much easier.
The intent is to remove support for DIOCADDRULE in FreeBSD 14.

Reviewed by:	markj (previous version), glebius (previous version)
MFC after:	4 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29557
2021-04-10 11:16:00 +02:00
Alexander V. Chernikov
ee2cf2b360 Implement better rebuild-delay fib algo policy.
The intent is to better handle time intervals with a large number of RIB
updates (e.g. BGP peer going up or down), while still keeping a low sync
delay for the remaining scenarios.

The implementation is the following: updates are grouped into
buckets of 50ms each. If the number of updates within the current bucket
 exceeds the threshold of 500 routes/sec (i.e. 25 updates per 50ms bucket
interval), the update is delayed for another 50ms. This can be repeated
 until the maximum update delay (1 sec) is reached.

All 3 variables are runtime tunables:

* net.route.algo.fib_max_sync_delay_ms: 1000
* net.route.algo.bucket_change_threshold_rate: 500
* net.route.algo.bucket_time_ms: 50
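
To make the arithmetic concrete, here is a small userspace model of the policy. The function names are hypothetical and the kernel logic may differ in detail. The defaults are the three tunable values listed above; taking them at face value, 500 routes/sec over a 50 ms bucket works out to 25 updates per bucket.

```python
def threshold_per_bucket(rate_per_sec: int = 500, bucket_ms: int = 50) -> int:
    """Convert the per-second rate threshold into a per-bucket count."""
    return rate_per_sec * bucket_ms // 1000


def sync_delay_ms(
    updates_per_bucket: list[int],
    rate_per_sec: int = 500,
    bucket_ms: int = 50,
    max_delay_ms: int = 1000,
) -> int:
    """Return the sync delay for a sequence of consecutive bucket counts.

    updates_per_bucket[i] is the number of RIB updates seen in the i-th
    bucket after the first unsynced change.
    """
    limit = threshold_per_bucket(rate_per_sec, bucket_ms)
    delay = bucket_ms  # a quiet RIB is synced after one bucket
    for count in updates_per_bucket:
        if count <= limit or delay >= max_delay_ms:
            break
        delay += bucket_ms  # busy bucket: push the sync one bucket further
    return min(delay, max_delay_ms)
```

With the defaults, a single quiet bucket gives a 50 ms delay, while a sustained burst is capped at the 1000 ms maximum.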

Differential Review:	https://reviews.freebsd.org/D29588
MFC after:		2 weeks
2021-04-09 21:33:03 +01:00
Alexander V. Chernikov
9e5243d7b6 Enforce check for using the return result for ifa?_try_ref().
Suggested by:	hps
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29504
2021-04-05 03:35:19 +01:00
Kristof Provost
4967f672ef pf: Remove unused variable rt_listid from struct pf_krule
Reviewed by:	donner
MFC after:	4 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29639
2021-04-08 13:24:35 +02:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Vincenzo Maffione
361e950180 iflib: add support for netmap offsets
Follow-up change to a6d768d845.
This change adds iflib support for netmap offsets, enabling
applications to use offsets on any driver backed by iflib.
2021-04-05 07:54:47 +00:00
Vincenzo Maffione
9bad2638cc netmap: restore commit a56e6334d1
The fix in a56e6334d1
was accidentally reverted by commit 45c67e8f6b.
2021-04-02 10:45:47 +00:00
Vincenzo Maffione
45c67e8f6b netmap: several typo fixes
No functional changes intended.
2021-04-02 07:01:20 +00:00
Konstantin Belousov
baacf70137 vxlan: correct interface MTU when using hw offloads
Otherwise it breaks when offloads like checksum or TSO are used,
because second (encapsulated) ip_output() processing passes fragments of
the encapsulated packet down to the hardware interface.

Diagnosed by:	hselasky
Reviewed by:	np
Sponsored by:	Nvidia Networking / Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29501
2021-03-31 14:38:26 +03:00
Konstantin Belousov
e243367b64 mbuf: add a way to mark flowid as calculated from the internal headers
In some settings offload might calculate hash from decapsulated packet.
Reserve a bit in packet header rsstype to indicate that.

Add m_adj_decap() that acts similarly to m_adj(), but also either clears
flowid if it is not marked as inner, or transfers it to the decapsulated
header, clearing the inner indicator. It depends on the internals of m_adj()
that reuses the argument packet header for the result.

Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets.

Reviewed by:	ae, hselasky, np
Sponsored by:	Nvidia Networking / Mellanox Technologies
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28773
2021-03-31 14:38:26 +03:00
Alexander V. Chernikov
0c2a0e0380 Fix typo in the 9fa8d1582b.
Reported by:	cy
2021-03-29 23:42:48 +00:00
Alexander V. Chernikov
9fa8d1582b Put bandaid for nhgrp_dump_sysctl() malloc KASSERT().
Recent rtsock changes widened epoch and covered nhgrp_dump_sysctl(),
  resulting in `netstat -4On` triggering a KASSERT.

MFC after:	1 day
2021-03-29 23:12:11 +00:00
Alexander V. Chernikov
0f30a36ded Rename variables inside nexthop group consider_resize() code.
No functional changes.

MFC after: 3 days
2021-03-29 23:06:13 +00:00
Alexander V. Chernikov
9095dc7da4 Fix nexthop group index array scaling.
The current code has the limit of 127 nexthop groups due to the
 wrongly-checked bitmask_copy() return value.

PR: 254303
Reported by:	Aleks <a.ivanov at veesp.com>
MFC after: 1 day
2021-03-29 23:00:17 +00:00
Vincenzo Maffione
660a47cb99 netmap: monitor: add a flag to distinguish packet direction
The netmap monitor intercepts any TX/RX packets on the monitored
port. However, before this change there was no way to tell
whether an intercepted packet was being transmitted or received
on the monitored port.
A TXMON flag in the netmap slot has been added for this purpose.
2021-03-29 16:32:54 +00:00
Vincenzo Maffione
a6d768d845 netmap: add kernel support for the "offsets" feature
This feature enables applications to ask netmap to transmit or
receive packets starting at a user-specified offset from the
beginning of the netmap buffer. This is meant to ease those
packet manipulation operations such as pushing or popping packet
headers, that may be useful to implement software switches,
routers and other packet processors.
To use the feature, drivers (e.g., iflib, vtnet, etc.) must have
explicit support. This change does not add support for any driver,
but introduces the necessary kernel changes. However, offsets support
is already included for VALE ports and pipes.
2021-03-29 16:29:01 +00:00
you@x
21d0c01226 netmap: iflib: add nm_config callback
This per-driver callback is invoked by netmap when it wants
to align the number of TX/RX netmap rings and/or the number of
TX/RX netmap slots to the actual state configured in the hardware.
The alignment happens when netmap mode is switched on (with no
active netmap file descriptors for that netmap port), or when
collecting netmap port information.

MFC after:	1 week
2021-03-29 09:31:18 +00:00
Alexander V. Chernikov
6f43c72b47 Zero struct weightened_nhop fields in nhgrp_get_addition_group().
`struct weightened_nhop` has a spare 32 bits between the fields due to
 the alignment (on amd64).
Not zeroing these spare bits results in duplicate nhop groups
 in the kernel due to the way comparison works.

MFC after:	1 day
2021-03-20 08:26:03 +00:00
Alexander V. Chernikov
24cd2796cf Fix !VNET build broken by 66f138563b.
2021-03-25 00:31:08 +00:00
Alexander V. Chernikov
66f138563b Plug nexthop group refcount leak.
In the case of batch route deletion via rib_walk_del(), when
 some paths of a multipath route got deleted, the old
 multipath group was not freed.

PR:    254496
Reported by:   Zhenlei Huang <zlei.huang@gmail.com>
MFC after:     1 day
2021-03-24 23:52:18 +00:00
Alexander V. Chernikov
c00e2f573b Fix build for non-vnet non-multipath kernels broken by
a0308e48ec.
2021-03-23 23:35:23 +00:00
Alexander V. Chernikov
a0308e48ec Fix panic when destroying interface with ECMP routes.
Reported by:	Zhenlei Huang <zlei.huang at gmail.com>
PR:		254496
MFC after:	immediately
2021-03-23 22:03:20 +00:00
Adrian Chadd
25bfa44860 Add device and ifnet logging methods, similar to device_printf / if_printf
* device_printf() is effectively a printf
* if_printf() is effectively a LOG_INFO

This allows subsystems to log device/netif stuff using different log levels,
rather than having to invent their own way to prefix unit/netif names.

Differential Revision: https://reviews.freebsd.org/D29320
Reviewed by: imp
2021-03-22 00:02:34 +00:00
Alexander V. Chernikov
2476178e6b Fix kassert panic when inserting multipath routes from multiple threads.
Reported by:	Marco Zec <zec at fer.hr>
MFC after:	immediately
2021-03-21 18:15:29 +00:00
Kyle Evans
f187d6dfbf base: remove if_wg(4) and associated utilities, manpage
After lengthy discussions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree.  This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
2021-03-17 09:14:48 -05:00
Alexander V. Chernikov
e4ac3f7463 Fix fib algo rebuild delay calculation.
Submitted by:	Marco Zec <zec at fer.hr>
MFC after:	3 days
2021-03-15 21:09:07 +00:00
Kyle Evans
74ae3f3e33 if_wg: import latest fixup work from the wireguard-freebsd project
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues.  This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
  completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
  and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
  tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
  the interface's home vnet so that it can act as the sole network
  connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
  complete.  It is additionally supported by the upstream
  wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
  aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib.  iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations.  This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after:	1 month (maybe)
2021-03-14 23:52:04 -05:00
Gordon Bergling
5666643a95 Fix some common typos in comments
- occured -> occurred
- normaly -> normally
- controling -> controlling
- fileds -> fields
- insterted -> inserted
- outputing -> outputting

MFC after:	1 week
2021-03-13 18:26:15 +01:00
Kristof Provost
cecfaf9bed pf: Fully remove interrupt events on vnet cleanup
swi_remove() removes the software interrupt handler but does not remove
the associated interrupt event.
This is visible when creating and removing a vnet jail in `procstat -t
12`.

We can remove it manually with intr_event_destroy().

PR:		254171
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29211
2021-03-12 12:12:43 +01:00
Wei Hu
a491581f3f Hyper-V: hn: Enable vSwitch RSC support in hn netvsc driver
Receive Segment Coalescing (RSC) in the vSwitch is a feature available in
Windows Server 2019 hosts and later. It reduces the per packet processing
overhead by coalescing multiple TCP segments when possible. This happens
mostly when TCP traffic flows between different guests on the same host.
This patch adds netvsc driver support for this feature.

The patch also updates NVS version to 6.1 as needed for RSC
enablement.

MFC after:	2 weeks
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D29075
2021-03-12 04:35:16 +00:00
Kristof Provost
5e9dae8e14 pf: Factor out pf_krule_free()
Reviewed by:	melifaro@
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29194
2021-03-11 10:39:43 +01:00
Alexander V. Chernikov
b1d63265ac Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

PR:	253998
Reported by:	rashey at superbox.pl
MFC after:	3 days

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures is to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

Reviewers: #network, kp, bz

Reviewed By: kp

Subscribers: imp, ae

Differential Revision: https://reviews.freebsd.org/D29116
2021-03-10 21:10:14 +00:00
Kyle Evans
0dd691b412 iflib: allow clone detach if not yet init
If we hit an error during init, then we'll unwind our state and attempt
to detach the device -- don't block it.

This was discovered by creating a wg0 with missing parameters; said
failure ended up leaving this orphaned device in place and ended up
panicking the system upon enumeration of the dev.* sysctl space.

Reviewed by:	gallatin, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D29145
2021-03-09 13:49:13 -06:00
Mark Johnston
ffe3def903 iflib: Make if_shared_ctx_t a pointer to const
This structure is shared among multiple instances of a driver, so we
should ensure that it doesn't somehow get treated as if there's a
separate instance per interface.  This is especially important for
software-only drivers like wg.

DEVICE_REGISTER() still returns a void * and so the per-driver sctx
structures are not yet defined with the const qualifier.

Reviewed by:	gallatin, erj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29102
2021-03-08 12:39:06 -05:00
Tai-hwa Liang
092f3f0812 net: fixing a memory leak in if_deregister_com_alloc()
Drain the callbacks upon if_deregister_com_alloc() such that the
if_com_free[type] won't be nullified before if_destroy().

Taking fwip(4) as an example, before this fix, kldunload if_fwip will
go through the following:

  1. fwip_detach()
  2. if_free() -> schedule if_destroy() through NET_EPOCH_CALL
  3. fwip_detach() returns
  4. firewire_modevent(MOD_UNLOAD) -> if_deregister_com_alloc()
  5. kernel complains about:
	Warning: memory type fw_com leaked memory on destroy (1 allocations, 64 bytes leaked).
  6. EPOCH runs if_destroy() -> if_free_internal()

By this time, if_com_free[if_alloctype] is NULL since it's already
nullified by if_deregister_com_alloc(); hence, firewire_free() won't
have a chance to release the allocated fw_com.

Reviewed by:	hselasky, glebius
MFC after:	2 weeks
2021-03-06 14:43:16 +00:00
Kristof Provost
29698ed904 pf: Mark struct pf_pdesc as kernel only
This structure is only used by the kernel module internally. It's not
shared with user space, so hide it behind #ifdef _KERNEL.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-03-05 09:21:06 +01:00
Kristof Provost
448732b8e2 altq: Increase maximum number of CBQ and HFSC classes
In some configurations we need more classes than ALTQ supports by
default.  Increase the maximum number of classes we allow.
This will only cost us a comparatively trivial amount of memory, so
there's little reason not to do so.

If ever we find we want even more we may want to consider turning these
defines into a tunable, but for now do the easy thing.

Reviewed by:	donner@
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29034
2021-03-04 20:58:22 +01:00
Kristof Provost
bb4a7d94b9 net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.
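
As a rough illustration of what such accessors compute (a sketch only, not
the macros added by this commit): the first 32-bit word of an IPv6 header
holds a 4-bit version, an 8-bit traffic class and a 20-bit flow label, and
the traffic class splits into a 6-bit DSCP and a 2-bit ECN field. The
function names below are hypothetical, they take the word already converted
to host byte order, and the committed macros may align the DSCP value
differently.

```c
#include <stdint.h>

/* Hypothetical helpers; 'flow' is the first header word in host byte order. */
static inline uint8_t
sketch_ipv6_traffic_class(uint32_t flow)
{
	/* Skip the 20-bit flow label, keep the 8-bit traffic class. */
	return ((flow >> 20) & 0xff);
}

static inline uint8_t
sketch_ipv6_dscp(uint32_t flow)
{
	/* DSCP is the upper six bits of the traffic class. */
	return (sketch_ipv6_traffic_class(flow) >> 2);
}

static inline uint8_t
sketch_ipv6_ecn(uint32_t flow)
{
	/* ECN is the lower two bits of the traffic class. */
	return (sketch_ipv6_traffic_class(flow) & 0x03);
}
```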

Reviewed by:	ae (previous version), rscheff, tuexen, rgrimes
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29056
2021-03-04 20:56:48 +01:00
Marcin Wojtas
09c3f04ff3 iflib: add support for admin completion queues
For interfaces with admin completion queues, introduce a new devmethod
IFDI_ADMIN_COMPLETION_HANDLE and a corresponding flag IFLIB_HAS_ADMINCQ.

This provides an option for handling any admin cq logic, which cannot be
run from an interrupt context.

Said method is called from within iflib's admin task, making it safe to
sleep.

Reviewed by: mmacy
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28708
2021-03-03 00:40:47 +01:00
Kristof Provost
f5537cd069 bridgestp: Ensure we send STP on VLAN interfaces
Reviewed by:	donner@
MFC after:	1 week
X-MFC-with:	711ed156b9
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28916
2021-02-25 10:16:25 +01:00
Marcin Wojtas
ef567155d3 Fix powerpc build after 6dd69f0064
Commit 6dd69f0064 ("iflib: introduce isc_dma_width")
failed to build on powerpc due to implicit type conversion
error. Fix that.

Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
2021-02-25 02:35:41 +01:00
Marcin Wojtas
6dd69f0064 iflib: introduce isc_dma_width
Some DMA controllers are unable to address the full host memory space
and are instead limited to a subset of address range (e.g. 48-bit).

Allow the driver to specify the maximum allowed DMA addressing width
(in bits) for the NIC hardware, by introducing a new field in
if_softc_ctx.

If said field is omitted (set to 0), the lowaddr of DMA window bounds
defaults to BUS_SPACE_MAXADDR.
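
A minimal sketch of the width-to-lowaddr mapping described above, assuming
a 64-bit bus address type. SKETCH_MAXADDR stands in for BUS_SPACE_MAXADDR
(whose real value is platform-dependent) and the function name is
hypothetical; this is not the code from the commit.

```c
#include <stdint.h>

#define SKETCH_MAXADDR	UINT64_MAX	/* stand-in for BUS_SPACE_MAXADDR */

/* Map a DMA addressing width in bits to the highest usable address. */
static uint64_t
sketch_dma_lowaddr(unsigned int width)
{
	/* 0 means "not specified"; 64 or more cannot narrow the window. */
	if (width == 0 || width >= 64)
		return (SKETCH_MAXADDR);
	return ((UINT64_C(1) << width) - 1);
}
```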

Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28706
2021-02-25 00:25:39 +01:00
Mark Johnston
b6999635b1 iflib: Avoid double counting in rxeof
iflib_rxeof() was counting everything twice.  This was introduced when
pfil hooks were added to the iflib receive path.  We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.

PR:		253583
Reviewed by:	gallatin, erj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28900
2021-02-24 10:08:53 -05:00
Kristof Provost
38c0951386 bridge: Remove members when assigned to a new vnet
When the bridge is moved to a different vnet we must remove all of its
member interfaces (and span interfaces), because we don't know if those
will be moved along with it. We don't want to hold references to
interfaces not in our vnet.

Reviewed by:	donner@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28859
2021-02-23 13:54:07 +01:00
Kristof Provost
89fa9c34d7 bridge/stp: Ensure we enter NET_EPOCH whenever we can send traffic
Reviewed by:	donner@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28858
2021-02-23 13:54:07 +01:00
Kristof Provost
711ed156b9 bridge: Support STP on VLAN devices
VLAN devices have type IFT_L2VLAN, so the STP code mistakenly believed
they couldn't be used for STP. That's not the case, so add the
IFT_L2VLAN to the check.

Reviewed by:	donner@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28857
2021-02-23 13:54:06 +01:00
Alexander V. Chernikov
5964172837 Simplify ifa/ifp refcounting in the routing stack.
The routing stack control depends on quite a tree of functions to
 determine the proper attributes of a route such as a source address (ifa)
 or transmit ifp of a route.

When actually inserting a route, the stack needs to ensure that ifa and ifp
 point to entities that are still valid.
Validity means slightly more than just pointer validity - the stack needs a guarantee
 that the provided objects are not scheduled for deletion.

Currently, callers either ignore it (most ifp parts, historically) or try to
 use refcounting (ifa parts). Even in the case of ifa refcounting it's not always
 implemented in a fully safe manner. For example, some codepaths inside
 rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
 the possibility of referencing a scheduled-for-deletion ifa.

Instead of trying to fix all of the callers by enforcing proper refcounting,
 switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
 other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
 (ifa_try_ref(), if_try_ref()) and fail if any of these fails.

Reviewed by:	donner
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28837
2021-02-22 23:37:59 +00:00
Alexander V. Chernikov
7563019bc6 Add if_try_ref() to simplify refcount handling inside epoch.
When we have an ifp pointer and the code is running inside epoch,
 epoch guarantees the pointer will not be freed.
However, the following case can still happen:

* in thread 1 we drop to refcount=0 for ifp and schedule its deletion.
* in thread 2 we use this ifp and reference it
* destroy callout kicks in
* unhappy user reports a bug

This can happen with the current implementation of ifnet_byindex_ref(),
 as we're not holding any locks preventing ifnet deletion by a parallel thread.

To address it, add if_try_ref(), allowing to return failure when
 referencing ifp with refcount=0.
Additionally, enforce the existing if_ref() with a KASSERT to provide a
 cleaner error in such scenarios.

Finally, fix ifnet_byindex_ref() by using if_try_ref() and returning NULL
 if the latter fails.
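
The "reference only if still live" idea can be sketched in portable C11 as
below. This is an illustration of the pattern, not the kernel
implementation: struct sketch_obj and both function names are hypothetical,
and the real code uses the kernel's own refcount primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_obj {
	atomic_uint refcount;	/* 0 means destruction has been scheduled */
};

/* Take a reference only if the object is still live. */
static bool
sketch_try_ref(struct sketch_obj *o)
{
	unsigned int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return (true);
	}
	return (false);
}

/* Drop a reference; returns true when the caller released the last one. */
static bool
sketch_rele(struct sketch_obj *o)
{
	return (atomic_fetch_sub(&o->refcount, 1) == 1);
}
```

A caller running inside the epoch would treat a false return from
sketch_try_ref() as "object not found" instead of resurrecting it.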

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28836
2021-02-22 23:37:59 +00:00
Alexander V. Chernikov
e5b394f2d0 Fix setting static entries for arp/ndp.
rtsock message validation changes committed in 2fe5a79425
 did not take llinfo messages into account.

Add a special validation case for RTA_GATEWAY llinfo messages.

MFC after:	2 days
2021-02-20 18:26:35 +00:00
Mark Johnston
0f9544d03e iflib: Fix detach of pseudo interfaces
In commit 38bfc6dee3 we added an IFDI_DETACH() call to
iflib_pseudo_deregister() since it looked like it was missing.  One is
present in the error-handling path of iflib_pseudo_register().  However,
the detach actually comes from the DEVICE_DETACH() method for the
above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
destroying a pseudo interface.

Fix the problem by not calling IFDI_DETACH() from the device detach
routine.  This way we can ensure that iflib de-initialization always
happens in a consistent order.  It also ensures that you can't do silly
things like "devctl detach <pseudo ifnet>", which would previously
detach the driver without tearing down the corresponding ifnet.

PR:		253541
Reviewed by:	erj
MFC after:	1 week
Fixes:		38bfc6dee3
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28774
2021-02-19 17:10:41 -05:00
Alexander V. Chernikov
f9e1cd6c99 Fix arp/ndp deletion broken by 2fe5a79425.
Changes in the 2fe5a79425 moved dst sockaddr masking from the
 routing control plane to the rtsock code.

It broke arp/ndp deletion.
It turns out, arp/ndp perform RTM_GET request first to get an
 interface index necessary for the deletion.
Then they simply stamp the reply with RTF_LLDATA and set the
 command to RTM_DELETE.
As a result, kernel receives request with non-empty RTA_NETMASK
 and clears RTA_DST host bits before passing the message to the
 lla code.

De facto, the only needed bits are RTA_DST, RTA_GATEWAY and the
 subset of rtm_flags.

With that in mind, fix the interface by clearing RTA_NETMASK
 for every message with RTF_LLDATA.

While here, cleanup arp/ndp code a bit.

MFC after:	1 day
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D28804
2021-02-19 21:17:17 +00:00
John Baldwin
2ccf971ace iflib: Cast the result of iflib_netmap_txq_init() to void.
This fixes a warning from GCC for kernels without netmap since the
return value is never used.

Reviewed by:	vmaffione, erj
Differential Revision:	https://reviews.freebsd.org/D28598
2021-02-19 12:52:53 -08:00
Alexander V. Chernikov
a4513bace0 Fix NOINET6 build broken by 2fe5a79425.
Reported by:	mjg
2021-02-16 21:49:48 +00:00
Alexander V. Chernikov
2fe5a79425 Fix dst/netmask handling in routing socket code.
Traditionally routing socket code did almost zero checks on
 the input message except for the most basic size checks.

This resulted in the unclear KPI boundary for the routing system code
 (`rtrequest*` and now `rib_action()`) w.r.t message validness.

Multiple potential problems and nuances exist:
* Host bits in RTAX_DST sockaddr. Existing applications do send prefixes
 with hostbits uncleared. Even `route(8)` does this, as they hope the kernel
 would do the job of fixing it. Code inside `rib_action()` needs to handle
 it on its own (see `rt_maskedcopy()` ugly hack).
* There are multiple ways of adding the host route: it can be DST without
 netmask or DST with /32(/128) netmask. Also, RTF_HOST has to be set correspondingly.
 Currently, these 2 options create 2 DIFFERENT routes in the kernel.
* No sockaddr length/content checking for the "secondary" fields exists: nothing
 stops an rtsock application from sending a sockaddr_in with a length of 25 (instead of 16).
 Kernel will accept it, install to RIB as is and propagate to all rtsock consumers,
 potentially triggering bugs in their code. Same goes for sin_port, sin_zero, etc.

The goal of this change is to make rtsock verify all sockaddr and prefix consistency.
Said differently, `rib_action()` or its internals should NOT need to change any of the
 sockaddrs supplied by `rt_addrinfo` structure due to incorrectness.

To be more specific, this change implements the following:
* sockaddr cleanup/validation check is added immediately after getting sockaddrs from rtm.
* Per-family dst/netmask checks clear host bits in dst and zero all dst/netmask "secondary" fields.
* The same netmask checking code converts /32(/128) netmasks to "host" route case
 (NULL netmask, RTF_HOST), removing the dualism.
* Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only actually
 supported ones (inet, inet6, link).
* Automatically convert `sockaddr_sdl` (AF_LINK) gateways to
  `sockaddr_sdl_short`.
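
A minimal IPv4-only sketch of the dst/netmask normalization listed above,
using plain integers in host byte order instead of sockaddrs. The struct,
its fields and the function name are hypothetical; is_host merely stands in
for the RTF_HOST flag, and the real rtsock code handles more cases.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_prefix {
	uint32_t dst;		/* IPv4 destination, host byte order */
	uint32_t mask;		/* IPv4 netmask, host byte order */
	bool	 has_mask;	/* false: host route, no netmask */
	bool	 is_host;	/* stands in for the RTF_HOST flag */
};

static void
sketch_normalize(struct sketch_prefix *p)
{
	if (p->has_mask && p->mask == UINT32_MAX) {
		/* A /32 netmask describes the same route as "no netmask". */
		p->has_mask = false;
		p->mask = 0;
	}
	if (p->has_mask) {
		/* Clear host bits so equal prefixes compare equal. */
		p->dst &= p->mask;
		p->is_host = false;
	} else {
		p->is_host = true;
	}
}
```

With this, 192.168.1.66/24 becomes 192.168.1.0/24, and 10.0.0.1/32 collapses
into the same representation as a host route for 10.0.0.1.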

Reported by:	Guy Yur <guyyur at gmail.com>
Reviewed By:	donner
Differential Revision: https://reviews.freebsd.org/D28668
MFC after:	3 days
2021-02-16 20:30:04 +00:00
Alexander V. Chernikov
600eade2fb Add ifa_try_ref() to simplify ifa handling inside epoch.
More and more code migrates from lock-based protection to the NET_EPOCH
 umbrella. It requires some logic changes, including, notably, refcount
 handling.

When we have an `ifa` pointer and we're running inside epoch we're
 guaranteed that this pointer will not be freed.
However, the following case can still happen:
 * in thread 1 we drop to 0 refcount for ifa and schedule its deletion.
 * in thread 2 we use this ifa and reference it
 * destroy callout kicks in
 * unhappy user reports bug

To address it, new `ifa_try_ref()` function is added, allowing to return
 failure when we try to reference `ifa` with 0 refcount.
Additionally, existing `ifa_ref()` is enforced with `KASSERT` to provide
 cleaner error in such scenarios.

Reviewed By: rstone, donner
Differential Revision: https://reviews.freebsd.org/D28639
MFC after:	1 week
2021-02-16 20:14:50 +00:00
Allan Jude
922cf8ac43 Use iflib_if_init_locked() during media change instead of iflib_init_locked().
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for media changes.
iflib_if_init_locked() calls stop then init, so fixes the problem.

PR:	253473
MFC after:	3 days
Reviewed by:	markj
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D28667
2021-02-16 19:02:00 +00:00
Alexander V. Chernikov
64d5c27777 Remove now-unused RTF_RNH_LOCKED route flag.
MFC after:	1 week
2021-02-15 20:49:59 +00:00
Alexander V. Chernikov
a375ec52a7 Fix ifa refcount leak during route addition.
Reported by:	rstone
Reviewed by:	rstone
MFC after:	1 day
2021-02-13 00:06:14 +00:00
Alexander V. Chernikov
8ca99aecf7 Fix various NOINET* builds broken by 145bf6c0af.
Reported by:	mjg, bdragon
2021-02-12 20:36:20 +00:00
Alexander V. Chernikov
8170a7d438 Fix interface route addition with net/bird.
The case of adding interface route by specifying interface
 address as the gateway was missed during code refactoring.
Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY
 is not set.

Reviewed by:	donner
MFC after:	3 days
2021-02-12 19:45:35 +00:00
Alexander V. Chernikov
145bf6c0af Fix blackhole/reject routes.
Traditionally *BSD routing stack required to supply some
 interface data for blackhole/reject routes. This led to
 varieties of hacks in routing daemons when inserting such routes.
With the recent routing stack changes, gateway sockaddr without
 RTF_GATEWAY started to be treated differently, purely as link
 identifier.

This change broke net/bird, which installs blackhole routes with
 127.0.0.1 gateway without RTF_GATEWAY flags.

Fix this by automatically constructing necessary gateway data at
 rtsock level if RTF_REJECT/RTF_BLACKHOLE is set.

Reported by:	Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
Reviewed by:	donner
MFC after:	1 week
2021-02-11 23:08:55 +00:00
Kristof Provost
6d2a10d96f Widen ifnet_detach_sxlock coverage
Widen the ifnet_detach_sxlock to cover the entire vnet sysuninit code.
This ensures that we can't end up having the vnet_sysuninit free the UDP
pcb while the detach code is running and trying to purge the UDP pcb.

MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28530
2021-02-11 16:12:29 +01:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
Wrong version of the change was pushed inadvertently.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
adc4ea97bd Turn off forgotten multipath debug messages
Reported by:	mike tancsa<mike at sentex.net>
MFC after:	3 days
2021-02-08 21:42:20 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we lose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Alexander V. Chernikov
eb0b1b33d5 Enable multipath routing by default.
ROUTE_MPATH was added to the GENERIC kernel in r368648.

According to the plan in D27428, it was enabled with `net.route.multipath` sysctl set to 0.
Given enough time has passed, this change enables route multipath by default.

The goal is to ship FreeBSD 13 with multipath turned on.

Reviewed By: donner, olivier
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28423
2021-02-03 08:49:58 +00:00
Sai Rajesh Tallamraju
38bfc6dee3 iflib: Free resources in a consistent order during detach
Memory and PCI resources are freed with no particular order.  This could
cause use-after-frees when detaching following a failed attach.  For
instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but
iflib_tqg_detach() attempts to access this array. Similarly, adapter
queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to
access adapter queues to free PCI resources.

MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27634
2021-02-01 11:15:54 -05:00
Jonah Caplan
88be0e1120 bridge: fix STP roles and protos strings
Add the missing commas that got lost in e5539fb618.

PR:		252532
Reviewed by:	kp@, donner@, freqlabs@
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28425
2021-02-01 15:27:06 +01:00
Alexander V. Chernikov
78c93a1721 Use process fib for inet/inet6 fib_algo sysctls.
This allows to set/query fib algo for non-default fibs.

MFC after:	3 days
2021-01-31 10:50:08 +00:00
Alexander V. Chernikov
151ec796a2 Fix the design problem with delayed algorithm sync.
Currently, if the immutable algorithm like bsearch or radix_lockless
 receives rtable update notification, it schedules algorithm rebuild.
This rebuild is executed by the callout after ~50 milliseconds.

It is possible that a script adding an interface address and then a route
 with the gateway bound to that address will fail. It can happen due
 to the fact that fib is not updated by the time the route addition
 request arrives.

Fix this by allowing synchronous algorithm rebuilds based on certain
 conditions. By default, these conditions assume:
1) less than net.route.algo.fib_sync_limit=100 routes
2) routes without gateway.

* Move algo instance build entirely under rib WLOCK.
 Rib lock is only used for control plane (except radix algo, but there
  are no rebuilds).
* Add rib_walk_ext_locked() function to allow RIB iteration with
 rib lock already held.
* Fix rare potential callout use-after-free for fds by binding fd
 callout to the relevant rib rmlock. In that case, callout_stop()
 under rib WLOCK guarantees no callout will be executed afterwards.

MFC after:	3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
dd9163003c Add rib_subscribe_locked() and rib_unsubscribe_locked() to support
subscriptions during RIB modifications.
Add new subscriptions to the beginning of the lists instead of
 the end. This fixes the situation when new subscription is created
 in the callback for the existing subscription, leading to the
 subscription notification handler picking it up.

MFC after: 3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
ab6d9aaed7 Move business logic from rebuild_fd_callout() into rebuild_fd().
This simplifies code a bit and allows for future non-callout
 callers to request rebuild.

MFC after:	3 days
2021-01-30 23:25:57 +00:00
Alexander V. Chernikov
f8b7ebea49 Improve fib_algo debug messages.
* Move per-prefix debug lines under LOG_DEBUG2
* Create fib instance counter to distinguish log messages between
 instances
* Add more messages on rebuild reason.

MFC after:	3 days
2021-01-30 23:25:56 +00:00
Alexander V. Chernikov
cb984c62d7 Fix multipath support for rib_lookup_info().
The initial plan was to remove rib_lookup_info() before
 FreeBSD 13. As several customers are still remaining,
 fix rib_lookup_info() for the multipath use case.
2021-01-29 23:14:24 +00:00
Alexander V. Chernikov
53729367d3 Fix subinterface vlan creation.
D26436 introduced support for stacked vlans that changed the way vlans
 are configured.  In particular, this change broke setups that have
 same-number vlans as subinterfaces.

Vlan support was initially created assuming "vlanX" semantics. In this paradigm,
 automatic number assignment supported by cloning (ifconfig vlan create) was a
 natural fit.
When "ifaceX.Y" support was added, allowing to have the same vlan number on
 multiple devices, cloning code became more complex, as there is no
unified "vlan" namespace anymore. Such interfaces got the first spare
index from "vlan" cloner. This, in turn, led to the following problem:
 ifconfig ix0.333 create -> index 1
 ifconfig ix0.444 create -> index 2
 ifconfig vlan2 create -> allocation failure

This change fixes such allocations by using cloning indexes only for
 "vlanX" interfaces.

Reviewed by:            hselasky
MFC after:		3 days
Differential Revision:  https://reviews.freebsd.org/D27505
2021-01-29 21:43:20 +00:00
Gleb Smirnoff
3f43ada98c Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2021-01-29 11:46:24 -08:00
Randall Stewart
1a714ff204 This pulls over all the changes that are in the netflix
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.

    Sponsored by: Netflix Inc.
    Differential Revision:  https://reviews.freebsd.org/D28357
2021-01-28 11:53:05 -05:00
Kristof Provost
35dabb7b9c altq: Fix typo in features sysctl description
Reported by:	Jose Luis Duran
2021-01-27 16:42:14 +01:00
Kristof Provost
27b2aa4938 altq: Remove unused arguments from altq_attach()
Minor cleanup, no functional change.

Reviewed by:		donner@
Differential Revision:	https://reviews.freebsd.org/D28304
2021-01-25 19:58:22 +01:00
Kristof Provost
e111d79806 Add FEATURE sysctls for ALTQ disciplines
This will allow userspace to more easily figure out if ALTQ is built
into the kernel and what disciplines are supported.

Reviewed by:		donner@
Differential Revision:	https://reviews.freebsd.org/D28302
2021-01-25 19:58:22 +01:00
Vincenzo Maffione
f80efe5016 iflib: netmap: move per-packet operation out of fragments loop
MFC after:	1 week
2021-01-24 21:38:59 +00:00
Vincenzo Maffione
aceaccab65 iflib: netmap: add support for NS_MOREFRAG
The NS_MOREFRAG flag can be set in a netmap slot to represent a
multi-fragment packet. Only the last fragment of a packet does
not have the flag set. On TX rings, the flag may be set by the
userspace application. The kernel will look at the flag and use it
to properly set up the NIC TX descriptors.
On RX rings, the kernel may set the flag if the packet received
was split across multiple netmap buffers. The userspace application
should look at the flag to know when the packet is complete.
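
On the receive side, the rule above amounts to walking slots until one
without the flag is found. The sketch below uses a simplified, hypothetical
slot structure instead of the real netmap one, and SKETCH_MOREFRAG is only
a stand-in for NS_MOREFRAG; an application would use the definitions from
the netmap headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MOREFRAG	0x0020	/* stand-in for NS_MOREFRAG */

struct sketch_slot {
	uint16_t len;		/* bytes used in this buffer */
	uint16_t flags;
};

/*
 * Return the number of slots making up the packet starting at slots[0],
 * storing the total length in *total.  Returns 0 if the last fragment
 * is not among the n available slots.
 */
static size_t
sketch_packet_slots(const struct sketch_slot *slots, size_t n, uint32_t *total)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		sum += slots[i].len;
		if ((slots[i].flags & SKETCH_MOREFRAG) == 0) {
			/* Last fragment: the packet is complete. */
			*total = sum;
			return (i + 1);
		}
	}
	return (0);
}
```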

Submitted by:	rajesh1.kumar_amd.com
Reviewed by:	vmaffione
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27799
2021-01-24 21:20:59 +00:00
Andrew Gallatin
0c864213ef iflib: Fix a NULL pointer deref
rxd_frag_to_sd() can be called with a NULL pf_rv parameter in the current
code. This patch fixes the NULL pointer dereference in that
case, thus avoiding a possible panic.

Submitted by: rajesh1.kumar at amd.com
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D28115
2021-01-21 09:47:06 -05:00
Alexander V. Chernikov
9d6567bc30 Fix panic on vnet creation if fib algo has been set to fixed value.
Make fixed algo property per-VNET instead of global.
2021-01-17 20:32:25 +00:00
Alexander V. Chernikov
f9e0752e35 Create new in6_purgeifaddr() which purges bound ifa prefix if
it gets unused.

Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6
 ifaddrs. in6_purgeaddr() does not trigger prefix removal if
 number of linked ifas goes to 0, as this is a low-level function.
 As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but
 keeps corresponding IPv6 prefixes.

Fix this by creating higher-level wrapper which handles unused
 prefix usecase and use it in if_purgeifaddrs().

Differential revision:	https://reviews.freebsd.org/D28128
2021-01-17 20:32:25 +00:00
Alexander V. Chernikov
81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It makes rtinit() clash with ifa_maintain_loopback_route() for IPv4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction was moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- The function providing net.add_addr_allfibs handling and route rtsock notifications
 is the new routing table interface.
- The rtsock ifa notification has been moved out as well. The resulting set of functions is only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Alexander V. Chernikov
a6b7689718 Remove redundant rtinit() calls from tuntap.
Removed code iterates over if_addrhead and tries to remove
 routes for each ifa.
This is exactly the thing that if_purgeaddrs() does, and
 if_purgeaddrs() is already called at the end.

Reviewed by:		glebius
MFC after:		2 weeks
Differential revision:	https://reviews.freebsd.org/D28106
2021-01-13 10:03:15 +00:00
Ryan Libby
c86fa3b8d7 pf: quiet -Wredundant-decls for pf_get_ruleset_number
In e86bddea9f sys/netpfil/pf/pf.h grew a
declaration of pf_get_ruleset_number.  Now delete the old declaration
from sys/net/pfvar.h.

Reviewed by:	kp
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D28081
2021-01-10 21:53:15 -08:00
Alexander V. Chernikov
685de460bc Use static initializers for fib algo to shift initialization
to an earlier stage. This allows to register modules loaded at
 boot time.

Reported by:	olivier
2021-01-11 00:16:54 +00:00
Vincenzo Maffione
55f0ad5fde netmap: restore hwofs and support it in iflib
Restore the hwofs functionality temporarily disabled by
7ba6ecf216 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable hwofs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.

PR:		252453
MFC after:	1 week
2021-01-10 22:51:15 +00:00
Vincenzo Maffione
8aa8484cbf iflib: fix build failure in case DEV_NETMAP is not defined
This addresses the build failure introduced by
3d65fd97e8.

MFC with: 3d65fd97e8
2021-01-10 14:43:58 +00:00
Vincenzo Maffione
4ba9ad0dc3 iflib: add assert to prevent out-of-bounds array access
The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.

MFC after:	2 weeks
2021-01-10 13:59:20 +00:00
Vincenzo Maffione
3d65fd97e8 netmap: iflib: enable/disable krings on any interface reinit
Since 1d238b07d5, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().

Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().

PR:		252453
MFC after:	1 week
2021-01-10 12:04:08 +00:00
Vincenzo Maffione
3189ba6167 netmap: iflib: fix asserts in netmap_fl_refill()
When netmap_fl_refill() is called at initialization time (e.g.,
during netmap_iflib_register()), nic_i must be 0, since the
free list is reinitialized. At the end of the refill cycle, nic_i
must still be zero, because exactly N descriptors (N is the ring size)
are refilled.
This patch therefore fixes the assertions to check on nic_i rather
than on nm_i. The current netmap_reset() may in fact cause nm_i
to be != 0 while the device is resetting: this may happen when
multiple non-cooperating processes open different subsets of the
available netmap rings.

PR:	    252518
MFC after:  1 week
2021-01-09 21:35:07 +00:00
Vincenzo Maffione
1d238b07d5 netmap: iflib: stop krings during interface reset
When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.

PR:	    252453
MFC after:  1 week
2021-01-09 21:01:46 +00:00
Matt Macy
81be655266 iflib: ensure that tx interrupts enabled and cleanups
Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.

This change causes the tx taskq thread to be run
if there are completed descriptors.

While here:

- make timer interrupt delay a sysctl

- simplify txd_db_check handling

- comment on INTR types

Background on the change:

Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race in this approach that would lock up the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.

The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate; as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.

Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.

Reported by: mav@
Tested by: mav@
Reviewed by:    gallatin@
MFC after:      1 month
Sponsored by:   iXsystems
Differential Revision:  https://reviews.freebsd.org/D27683
2021-01-07 14:07:35 -08:00
Alexander V. Chernikov
d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Kristof Provost
5a3b9507d7 pf: Convert pfi_kkif to use counter_u64
Improve caching behaviour by using counter_u64 rather than variables
shared between cores.

The result of converting all counters to counter(9) (i.e. this full
patch series) is a significant improvement in throughput. As tested by
olivier@, on Intel Xeon E5-2697Av4 (16Cores, 32 threads) hardware with
Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4) nics we see:

x FreeBSD 20201223: inet packets-per-second
+ FreeBSD 20201223 with pf patches: inet packets-per-second
+--------------------------------------------------------------------------+
|                                                                        + |
| xx                                                                     + |
|xxx                                                                    +++|
||A|                                                                       |
|                                                                       |A||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       9216962       9526356       9343902     9371057.6     116720.36
+   5      19427190      19698400      19502922      19546509     109084.92
Difference at 95.0% confidence
        1.01755e+07 +/- 164756
        108.584% +/- 2.9359%
        (Student's t, pooled s = 112967)
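
The summary lines above can be re-derived from the two table rows. The
sketch below assumes a two-sample Student's t interval with equal sample
sizes; the critical value 2.306 (95%, two-sided, 8 degrees of freedom) is
taken from a standard t table and is not part of the original output.

```python
import math

# Summary rows copied from the ministat table above.
n = 5
avg_base, sd_base = 9371057.6, 116720.36
avg_patched, sd_patched = 19546509.0, 109084.92

# Assumption: two-sided 95% critical value for n + n - 2 = 8 degrees of freedom.
T_CRIT_8DF = 2.306

diff = avg_patched - avg_base
# With equal sample sizes the pooled variance is the mean of the two variances.
pooled_sd = math.sqrt((sd_base ** 2 + sd_patched ** 2) / 2)
half_width = T_CRIT_8DF * pooled_sd * math.sqrt(2 / n)
pct = 100 * diff / avg_base

print(f"difference: {diff:.6g} +/- {half_width:.0f}")
print(f"relative:   {pct:.3f}%")
print(f"pooled s:   {pooled_sd:.0f}")
```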

Reviewed by:	philip
MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27763
2021-01-05 23:35:37 +01:00
Kristof Provost
26c841e2a4 pf: Allocate and free pfi_kkif in separate functions
Factor out allocating and freeing pfi_kkif structures. This will be
useful when we change the counters to be counter_u64, so we don't have
to deal with that complexity in the multiple locations where we allocate
pfi_kkif structures.

No functional change.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27762
2021-01-05 23:35:37 +01:00
Kristof Provost
320c11165b pf: Split pfi_kif into a user and kernel space structure
No functional change.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27761
2021-01-05 23:35:37 +01:00
Kristof Provost
c3adacdad4 pf: Change pf_krule counters to use counter_u64
This improves the cache behaviour of pf and results in improved
throughput.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27760
2021-01-05 23:35:37 +01:00
Kristof Provost
c7bdafe2f1 pf: Remove unused fields from pf_krule
The u_* counters are used only to communicate with userspace, as
userspace cannot use counter_u64. As pf_krule is not passed to userspace
these fields are now obsolete.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27759
2021-01-05 23:35:36 +01:00
Kristof Provost
e86bddea9f pf: Split pf_rule into kernel and user space versions
No functional change intended.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27758
2021-01-05 23:35:36 +01:00
Kristof Provost
dc865dae89 pf: Migrate pf_rule and related structs to pf.h
As part of the split between user and kernel mode structures we're
moving all user space usable definitions into pf.h.

No functional change intended.

MFC after:      2 weeks
Sponsored by:   Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27757
2021-01-05 23:35:36 +01:00
Kristof Provost
fbbf270eef pf: Use counter_u64 in pf_src_node
Reviewed by:	philip
MFC after:      2 weeks
Sponsored by:   Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27756
2021-01-05 23:35:36 +01:00
Kristof Provost
17ad7334ca pf: Split pf_src_node into a kernel and userspace struct
Introduce a kernel version of struct pf_src_node (pf_ksrc_node).

This will allow us to improve the in-kernel data structure without
breaking userspace compatibility.

Reviewed by:	philip
MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27707
2021-01-05 23:35:36 +01:00
Alexander V. Chernikov
9c0ff6a8bb Remove now-unused RT_GATEWAY* definitions.
They were used to simplify nexthop transition, hence not needed
 anymore.
2021-01-04 21:45:46 +00:00
Hans Petter Selasky
747feea146 Streamline the infiniband code according to the ethernet code.
Fix LINT-NOIP kernel build.

Submitted by:	rlibby @
Differential Revision:	https://reviews.freebsd.org/D27861
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-31 10:07:02 +01:00
Hans Petter Selasky
ec52ff6d14 Streamline the infiniband code according to the ethernet code.
Specifically implement the if_requestencap callback function for infiniband.
Most of the changes are simply a cut and paste of the equivalent ethernet part.

Reviewed by:	melifaro @
Differential Revision:	https://reviews.freebsd.org/D27631
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-29 18:01:57 +01:00
Hans Petter Selasky
19ecb5e8da Fix for IPoIB over lagg(4).
Need to update both link layer address and broadcast address when active link changes for IP over infiniband.
This is because the broadcast address contains the so-called P-key, which is interface dependent.

Reviewed by:	kib @
Differential Revision:	https://reviews.freebsd.org/D27658
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-29 17:35:06 +01:00
Ryan Libby
833dbf1e22 route: quiet -Wredundant-decls
Remove declaration duplicated in
f5baf8bb12

Reviewed by:	melifaro
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27790
2020-12-27 16:32:27 -08:00
Alexander V. Chernikov
f733d9701b Fix default route handling in radix4_lockless algo.
Improve nexthop debugging.

Reported by:	Florian Smeets <flo at smeets.xyz>
2020-12-26 22:51:02 +00:00
Alexander V. Chernikov
4e19e0d92a Use light-weight versions of routing lookup functions in ng_netflow.
Use recently-added combination of `fib[46]_lookup_rt()` which
 returns rtentry & raw nexthop with `rt_get_inet[6]_plen()` which
 returns address/prefix length of prefix inside `rt`.

Add `nhop_select_func()` wrapper around inlined `nhop_select()` to
 allow callers external to the routing subsystem select the proper
 nexthop from the multipath group without including internal headers.

New calls do not require reference counting objects and reduce
 the amount of copied/processed rtentry data.

Differential Revision: https://reviews.freebsd.org/D27675
2020-12-26 11:27:38 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces a framework that allows dynamically attaching
 or detaching longest prefix match (lpm) lookup algorithms
 to speed up datapath route table lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless data structure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless data structure, optimized
  for large-fib (D27412)
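
To illustrate the idea behind a binary-search lookup over a flat array,
here is a minimal userland sketch. It is not the bsearch4 implementation:
build_table() and lookup() are hypothetical names, and the approach
(flattening prefixes into sorted, non-overlapping address ranges and
bisecting over the range starts) is only one plausible way to do it.

```python
import bisect
import ipaddress


def build_table(routes):
    """Flatten {prefix: nexthop} into sorted range starts plus their nexthops."""
    nets = [(ipaddress.ip_network(p), nh) for p, nh in routes.items()]
    points = {0}
    for net, _ in nets:
        points.add(int(net.network_address))
        end = int(net.broadcast_address) + 1
        if end < 2 ** 32:
            points.add(end)
    starts = sorted(points)
    nexthops = []
    for start in starts:
        # The covering prefixes are constant within a range, so test its start.
        best = None
        for net, nh in nets:
            if int(net.network_address) <= start <= int(net.broadcast_address):
                if best is None or net.prefixlen > best[0]:
                    best = (net.prefixlen, nh)
        nexthops.append(best[1] if best else None)
    return starts, nexthops


def lookup(table, addr):
    """Longest-prefix match: binary search for the range containing addr."""
    starts, nexthops = table
    i = bisect.bisect_right(starts, int(ipaddress.IPv4Address(addr))) - 1
    return nexthops[i]
```

The build step is quadratic in the number of routes, which is tolerable
only for very small tables; the lookup itself is a single bisect over an
immutable array, so readers need no lock.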

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framework adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Ryan Libby
2fb4a03d55 rtsock: quiet -Wunused-variable in LINT-NOIP kernels
Fixup after r368769 / d68fb8d978.

Reported by:	mjg
Reviewed by:	melifaro
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27730
2020-12-24 12:34:18 -08:00
Mark Johnston
92be2847e8 rtsock: Avoid copying uninitialized padding bytes
When copying sockaddrs out to userspace, we pad them to a multiple of
the platform alignment (sizeof(long)).  However, some sockaddr sizes,
such as struct sockaddr_dl, are not an integer multiple of the
alignment, so we may end up copying out uninitialized bytes.

Fix this by always bouncing through a pre-zeroed sockaddr_storage.
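
The bounce-buffer approach can be modelled in userland as below. The
constants are assumptions (8-byte long on an LP64 platform, 128-byte
sockaddr_storage), and copy_out_padded() is a hypothetical stand-in for
the kernel copy-out path, not the actual rtsock code.

```python
ALIGN = 8            # assumption: sizeof(long) on an LP64 platform
STORAGE_SIZE = 128   # assumption: sizeof(struct sockaddr_storage)


def roundup(length, align=ALIGN):
    """Round length up to the next multiple of align."""
    return (length + align - 1) // align * align


def copy_out_padded(sa: bytes) -> bytes:
    """Copy sa through a pre-zeroed buffer so the padding bytes are always 0."""
    if len(sa) > STORAGE_SIZE:
        raise ValueError("sockaddr larger than sockaddr_storage")
    bounce = bytearray(STORAGE_SIZE)  # pre-zeroed, like a bzero()ed struct
    bounce[:len(sa)] = sa
    return bytes(bounce[:roundup(len(sa))])
```

For a 54-byte address this yields 56 bytes, the last two of which are
guaranteed to be zero rather than whatever followed the source object in
memory.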

Reported by:	KASAN
Reviewed by:	melifaro
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27729
2020-12-23 11:16:40 -05:00
Kristof Provost
1c00efe98e pf: Use counter(9) for pf_state byte/packet tracking
This improves cache behaviour by not writing to the same variable from
multiple cores simultaneously.

pf_state is only used in the kernel, so can be safely modified.

Reviewed by:	Lutz Donnerhacke, philip
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27661
2020-12-23 12:03:21 +01:00
Kristof Provost
c3f69af03a pf: Fix unaligned checksum updates
The algorithm we use to update checksums only works correctly if the
updated data is aligned on 16-bit boundaries (relative to the start of
the packet).

Import the OpenBSD fix for this issue.
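
Why alignment matters can be shown with a small model of the Internet
checksum. This is not the imported OpenBSD code: the function names are
hypothetical, and the incremental update uses the RFC 1624 formula
HC' = ~(~HC + ~m + m'). A byte at an odd packet offset is the low half of
its 16-bit word, so it must be placed there before updating.

```python
def fold(total):
    """Fold carries back in (ones' complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(data: bytes) -> int:
    """Full Internet checksum over data, computed from scratch."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    return ~fold(total) & 0xFFFF


def update16(cksum, old, new):
    """Incrementally update cksum for one 16-bit word changing old -> new."""
    return ~fold((~cksum & 0xFFFF) + (~old & 0xFFFF) + new) & 0xFFFF


def update_byte(cksum, offset, old_byte, new_byte):
    """Update for a single byte, honouring its position in the 16-bit word."""
    shift = 0 if offset % 2 else 8  # odd offset: low byte, even: high byte
    return update16(cksum, old_byte << shift, new_byte << shift)
```

Treating the changed byte as if it always sat in the high half of a word
gives a wrong checksum whenever the offset is odd, which is the class of
bug the commit describes.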

PR:		240416
Obtained from:	OpenBSD
MFC after:	1 week
Reviewed by:	tuexen (previous version)
Differential Revision:	https://reviews.freebsd.org/D27696
2020-12-23 12:03:20 +01:00
Hans Petter Selasky
ddce63fcb6 Remove unneeded variable initialization.
And switch from int to bool while at it.

Reviewed by:	melifaro@
Differential Revision:	https://reviews.freebsd.org/D27725
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-23 12:04:46 +01:00
Konstantin Belousov
994e47023a vxlan: stop checking CSUM_ENCAP_VXLAN when converting inner CSUM flags into normal, for decapsulation.
The packet, if processed at this point, was already parsed to be UDP
directed to a vxlan port.

Connect-X 4+ does not provide an easy method to infer which parser
processed the packet, so the driver cannot set the flag without a lot of
effort that would only serve to satisfy the formal requirements.

Reviewed by:	bryanv, np
Sponsored by:	Mellanox Technologies/NVidia Networking
Differential revision:	https://reviews.freebsd.org/D27449
MFC after:	1 week
2020-12-23 10:54:06 +02:00
Alexander V. Chernikov
d68fb8d978 Switch direct rt fields access in rtsock.c to newly-created field accessors.
rtsock code was built around the assumption that each rtentry record
 in the system radix tree is a ready-to-use sockaddr. This assumption
 turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
 deembedding hack.

Change the code to decouple rtentry internals from rtsock code using
 newly-created rtentry accessors. This will allow to eventually eliminate
 both of the hacks and change rtentry dst/mask format.

Differential Revision:	https://reviews.freebsd.org/D27451
2020-12-18 22:00:57 +00:00
Brooks Davis
f3f2ee76ad style(9): Correct whitespace in struct definitions
struct ifconf and struct ifreq use the odd style "struct<tab>foo".
struct ifdrv seems to have tried to follow this but was committed with
spaces in place of most tabs resulting in "struct<space><space>ifdrv".

MFC after:	3 days
2020-12-11 01:00:07 +00:00
Gleb Smirnoff
5ee33a9076 Fixup r368446 with KERN_TLS.
2020-12-08 23:54:09 +00:00
Gleb Smirnoff
e1074ed6a0 The list of ports in configuration path shall be protected by locks,
epoch shall be used only for fast path.  Thus use LAGG_XLOCK() in
lagg_[un]register_vlan.  This fixes sleeping in epoch panic.

PR:		240609
2020-12-08 16:46:00 +00:00
Gleb Smirnoff
87bf9b9cbe Convert LAGG_RLOCK() to NET_EPOCH_ENTER(). No functional changes.
2020-12-08 16:36:46 +00:00
Mark Johnston
c065d4e5e9 iflib: Avoid leaking the freelist bitmaps upon driver detach
Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27342
2020-12-07 14:53:14 +00:00
Mark Johnston
102540192c iflib: Detach tasks upon device registration failure
In some error paths we would fail to detach from the iflib taskqueue
groups.  Also move the detach code into its own subroutine instead of
duplicating it.

Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27342
2020-12-07 14:52:57 +00:00
Alexander V. Chernikov
df9053920f Add IPv4/IPv6 rtentry prefix accessors.
Multiple consumers like ipfw, netflow or new route lookup algorithms
 need to get the prefix data out of struct rtentry.
Instead of providing direct access to the rtentry, create IPv4/IPv6
 accessors to abstract struct rtentry internals and avoid including
 internal routing headers for external consumers.

While here, move struct route_nhop_data to the public header, so external
 consumers can actually use lookup functions returning rt&nhop data.

Differential Revision:	https://reviews.freebsd.org/D27416
2020-12-03 22:23:57 +00:00
Kristof Provost
7f883a9b5b net: Revert vnet/epair cleanup race mitigation
Revert the mitigation code for the vnet/epair cleanup race (done in r365457).
r368237 introduced a more reliable fix.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-12-01 16:34:43 +00:00
Kristof Provost
e133271fc1 if: Fix panic when destroying vnet and epair simultaneously
When destroying a vnet and an epair (with one end in the vnet) we often
panicked. This was the result of the destruction of the epair, which destroys
both ends simultaneously, happening while vnet_if_return() was moving the
struct ifnet to its home vnet. This can result in a freed ifnet being re-added
to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
used.

Prevent this race by ensuring that vnet_if_return() cannot run at the same time
as if_detach() or epair_clone_destroy().

PR:		238870, 234985, 244703, 250870
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27378
2020-12-01 16:23:59 +00:00
Alexander V. Chernikov
77df2c21cb Renumber NHR_* flags after NHR_IFAIF removal in r368127.
Suggested by:	rpokala
2020-11-30 21:42:55 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Matt Macy
2338da0373 Import kernel WireGuard support
Data path largely shared with the OpenBSD implementation by
Matt Dunwoodie <ncon@nconroy.net>

Reviewed by:	grehan@freebsd.org
MFC after:	1 month
Sponsored by:	Rubicon LLC, (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26137
2020-11-29 19:38:03 +00:00
Alexander V. Chernikov
3b1654cb14 Introduce rib_walk_ext_internal() to allow iteration with rnh pointer.
This solves the case when rib is not yet attached/detached to/from the
 system rib array.

Differential Revision:	https://reviews.freebsd.org/D27406
2020-11-29 13:54:49 +00:00
Alexander V. Chernikov
f47fa26065 Add nhop_ref_any() to unify referencing nhop or nexthop group.
It allows code within routing subsystem to transparently reference nexthops
 and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH
 details.

Differential Revision:	https://reviews.freebsd.org/D27410
2020-11-29 13:52:06 +00:00
Alexander V. Chernikov
b712e3e343 Refactor fib4/fib6 functions.
No functional changes.

* Make lookup part of fib<4|6>_lookup_debugnet() separate functions
 (fib<4|6>_lookup_rt()). These will be used in the control plane code
 requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
 This change simplifies the switch to alternative lookup implementations,
 which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision:	https://reviews.freebsd.org/D27405
2020-11-29 13:41:49 +00:00
Alexander V. Chernikov
98d5c4e5c8 Add tracking for rib/nhops/nhgrp objects and provide cumulative number accessors.
The resulting KPI can be used by routing table consumers to estimate the required
 scale for route table export.

* Add tracking for rib routes
* Add accessors for number of nexthops/nexthop objects
* Simplify rib_unsubscribe: store rnh we're attached to instead of requiring it up
 again on destruction. This helps in the cases when rnh is not linked yet/already unlinked.

Differential Revision:	https://reviews.freebsd.org/D27404
2020-11-29 13:27:24 +00:00
Alexander V. Chernikov
ef6ef7e5da Add nhgrp_get_idx() as a counterpart for nhop_get_idx().
It allows the routing-related code to reference nexthop groups by index
 instead of storing a pointer.
2020-11-28 15:46:40 +00:00
Alexander V. Chernikov
7a6dc73c98 Cleanup nexthops request flags:
* remove NHR_IFAIF as it was used by previous version of nexthop KPI
* update NHR_REF description
2020-11-28 15:11:59 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
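
The need for the extra b_pages[] slot follows from simple page arithmetic,
sketched here. PAGE_SIZE of 4096 is an assumption (it is platform
dependent) and pages_spanned() is a hypothetical helper, not a kernel
function.

```python
PAGE_SIZE = 4096     # assumption: platform page size
MAXPHYS = 1 << 20    # 1M, the new default from this commit


def pages_spanned(addr, length, page_size=PAGE_SIZE):
    """Number of pages touched by the byte range [addr, addr + length)."""
    if length == 0:
        return 0
    first = addr // page_size
    last = (addr + length - 1) // page_size
    return last - first + 1


aligned = pages_spanned(0, MAXPHYS)        # starts on a page boundary
unaligned = pages_spanned(512, MAXPHYS)    # starts 512 bytes into a page
print(aligned, unaligned)
```

A maxphys-sized buffer that starts in the middle of a page touches one
page more than atop(maxphys), which is what the +1 accounts for.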

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Kristof Provost
bca0e1d2ac if: Fix non-VIMAGE build
if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not
enabled.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-11-25 17:15:24 +00:00
Kristof Provost
a779388f8b if: Protect V_ifnet in vnet_if_return()
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.

We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.

We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.

Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().
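
The locking pattern described above (unlink under the lock, move after
dropping it) looks roughly like this in a userland model. Everything here
is hypothetical: a threading.Lock stands in for IFNET_WLOCK, interfaces
are plain dicts, and if_vmove is passed in as a callback.

```python
import threading

ifnet_lock = threading.Lock()  # stands in for IFNET_WLOCK in this sketch


def vnet_if_return(cur_vnet, v_ifnet, if_vmove):
    """Return borrowed interfaces to their home vnet without holding the lock
    across the (potentially blocking) move."""
    with ifnet_lock:
        # List manipulation only: pick and unlink the interfaces to move.
        pending = [ifp for ifp in v_ifnet if ifp["home"] != cur_vnet]
        v_ifnet[:] = [ifp for ifp in v_ifnet if ifp["home"] == cur_vnet]
    # Lock released: the slow move can take other locks safely.
    for ifp in pending:
        if_vmove(ifp, ifp["home"])
    return pending
```

Because the interfaces are already unlinked from the list when the lock is
dropped, no other thread walking the list can pick them up while they are
in flight.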

Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27279
2020-11-25 15:07:22 +00:00
Kristof Provost
a60100fdfc if: Remove ifnet_rwlock
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.

Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27278
2020-11-25 10:56:38 +00:00
Alexander V. Chernikov
7511a63825 Refactor rib iterator functions.
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_foreach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
 consistent and prepare for upcoming iterator optimizations.

Differential Revision:	https://reviews.freebsd.org/D27219
2020-11-22 20:21:10 +00:00
Mitchell Horne
70af7ce99a Make net/ifq.h C++ friendly
Don't use "new" as an identifier, and add explicit casts from void *.

As a general policy, FreeBSD doesn't make any C++ compatibility
guarantees for kernel headers like it does for userland, but it is a
small effort to do so in this case, to the benefit of a downstream
consumer (NetApp).

Reviewed by:	rscheff
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27286
2020-11-20 14:45:45 +00:00
Andrew Gallatin
8732245d29 LACP: When suppressing distributing, return ENOBUFS
When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historically drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack.  This means that any TCP connection with active
traffic during a 3-second window when an LACP link comes or goes
would get closed.

TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, let's
return ENOBUFS instead.  This allows TCP connections to be preserved.
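
The effect of the changed return value can be modelled as below. This is
a deliberately simplified sketch of the behaviour described in this
message, not the lagg or TCP code: lacp_output() and on_output_error()
are hypothetical names.

```python
import errno

# Errors a TCP-like sender treats as transient in this model.
TRANSIENT = {errno.ENOBUFS}


def lacp_output(suppress_distributing, legacy=False):
    """Return 0 on success, or the errno used while distribution is suppressed."""
    if not suppress_distributing:
        return 0
    return errno.ENETDOWN if legacy else errno.ENOBUFS


def on_output_error(err):
    """What the sender does with the value returned by the lower layers."""
    if err == 0:
        return "sent"
    if err in TRANSIENT:
        return "retry later"
    return "close connection"
```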

I've tested this by repeatedly bouncing links on a Netflix CDN server
under a moderate (20Gb/s) load and observed ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).

Reviewed by:	jhb, jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27188
2020-11-18 14:55:49 +00:00
Mark Johnston
54bf96fb4f iflib: Free full mbuf chains when draining transmit queues
Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
Reviewed by:	gallatin, hselasky
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27179
2020-11-11 18:00:06 +00:00
Andrey V. Elsukov
2f4ffa9f72 Fix possible NULL pointer dereference.
lagg(4) replaces if_output method of its child interfaces and expects
that this method can be called only by child interfaces. But it is
possible that lagg_port_output() could be called by children of child
interfaces. In this case ifnet's if_lagg field is NULL. Add check that
lp is not NULL.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2020-11-11 15:53:36 +00:00
Mitchell Horne
4a3fc6e22e Fix definition of rn_addmask()
Add the missing static keyword present in the declaration.

Reviewed by:	melifaro
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27024
2020-11-08 19:02:22 +00:00
Alexander V. Chernikov
2d39824195 Switch net.add_addr_allfibs default to 0.
The goal of the fib support is to provide multiple independent
 routing tables, isolated from each other.
net.add_addr_allfibs default tries to shift gears in the opposite
 direction, unconditionally inserting all addresses to all of the fibs.

There are use cases when this is necessary, however this is not a
 default expected behaviour, especially compared to other implementations.

Provide WARNING message for the setups with multiple fibs to notify
 potential users of the feature.

Differential Revision:	https://reviews.freebsd.org/D26076
2020-11-08 18:27:49 +00:00
Alexander V. Chernikov
76e6b37f6b Temporarily revert setting net.add_addr_allfibs to 0.
It was accidentally swept in with r367486.
Revert to allow for proper commit message & warning.
2020-11-08 18:11:12 +00:00
Alexander V. Chernikov
770495f4c0 Fix build broken by r367484: add route_ifaddrs.c.
Pointy hat to: melifaro
Reported by:	jenkins
2020-11-08 13:30:44 +00:00
Alexander V. Chernikov
bad6b23606 Move all ifaddr route creation business logic to net/route/route_ifaddr.c
Differential Revision:	https://reviews.freebsd.org/D26318
2020-11-08 11:12:00 +00:00
Konstantin Belousov
80ba361b2f if_media.c SIOCGMEDIAX handler: improve loop
Stop advancing counter past the current iteration number at the start
of iteration.  This removes the need of subtracting one when
calculating index for copyout, and arguably fixes off-by-one reporting
of copied out elements when copyout failed.

Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies / NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27073
2020-11-03 14:33:04 +00:00
Konstantin Belousov
1fbbe9dbf5 net/if_media.c: improve IFMEDIA_DEBUG output.
Use consistent output format for hex.
Print both media and mask where relevant.

Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:38:30 +00:00
Konstantin Belousov
e399f19dba Cleanup of net/if_media.c: simplify cleanup loop in ifmedia_removeall().
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:36:21 +00:00
Konstantin Belousov
899322fdfa Cleanup of net/if_media.c: some style.
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:30:17 +00:00
Konstantin Belousov
2193fb16b5 Cleanup of net/if_media.c: switch to ANSI C function definitions.
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:25:35 +00:00
Mitchell Horne
ced0f52457 net: add ETHER_IS_IPV6_MULTICAST
This can be used to detect if an ethernet address is specifically an
IPv6 multicast address, defined in accordance to RFC 2464.

ETHER_IS_MULTICAST is still preferred in the general case.
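
The distinction between the two checks can be sketched as follows. The
macro body is not shown in this log, so this assumes it tests for the
33:33 prefix that RFC 2464 assigns to IPv6 multicast; the Python function
names and parse_mac() are hypothetical.

```python
def parse_mac(text: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def ether_is_multicast(addr: bytes) -> bool:
    """General case: the group bit is the low bit of the first octet."""
    return bool(addr[0] & 0x01)


def ether_is_ipv6_multicast(addr: bytes) -> bool:
    """RFC 2464: IPv6 multicast maps to Ethernet addresses starting 33:33."""
    return addr[0] == 0x33 and addr[1] == 0x33
```

Every address matching the IPv6 check also has the group bit set, so the
general check remains a superset of the specific one.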

Reviewed by:	ae
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26611
2020-10-30 13:32:58 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
Vincenzo Maffione
be7a6b3d84 iflib: fix typo bug introduced by r367093
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted in passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.

PR:	248652
Reported by:	sg@efficientip.com
MFC with: r367093
2020-10-28 21:06:17 +00:00
Vincenzo Maffione
17cec474c0 iflib: add per-tx-queue netmap timer
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
  - The netmap_tx_irq() function gets called by iflib_timer(), which
    gets scheduled with tick granularity (hz). This is not frequent
    enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
    result is that the transmitting netmap application is not woken
    up fast enough to saturate the link with small packets.
  - The iflib_timer() function also calls isc_txd_credits_update()
    to ask for more TX completion updates. However, this violates
    the netmap requirement that only txsync can access the TX queue
    for datapath operations. Only netmap_tx_irq() may be called out
    of the txsync context.

This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.

This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.
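
The granularity argument can be checked with some arithmetic. The sketch
below is not iflib code; the function names are hypothetical, and it
assumes standard Ethernet framing overhead of 20 bytes per frame
(preamble, start-of-frame delimiter and inter-frame gap) and, for the
comparison, a 1 ms tick (hz = 1000).

```c
#include <stdint.h>

/*
 * Packets per second at line rate for a given frame size.  Each frame
 * occupies frame_bytes + 20 bytes on the wire (7 preamble, 1 SFD,
 * 12 inter-frame gap).
 */
uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	return (link_bps / ((uint64_t)(frame_bytes + 20) * 8));
}

/* Packets that must be in flight to keep the link busy for one period. */
uint64_t
packets_per_period(uint64_t pps, uint64_t period_us)
{
	return (pps * period_us / 1000000);
}
```

For a 10 Gbps link and 64-byte frames, line_rate_pps() gives 14880952
packets per second. Under these assumptions a 1 ms wakeup interval would
require about 14880 packets queued per wakeup to keep the link busy,
while a 90 microsecond interval requires about 1339.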

Also, fix the timer usage to make sure that each queue is serviced
by a different CPU.

PR:	248652
Reported by:	sg@efficientip.com
MFC after:	2 weeks
2020-10-27 21:53:33 +00:00
Hans Petter Selasky
1355e2dc4f More style fixes (partial revert of r366994).
Suggested by:		danfe@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 13:07:50 +00:00
Hans Petter Selasky
1d3a22e765 Fix order of header files:
sys/systm.h should come right after sys/param.h

Suggested by:		kib@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 10:52:09 +00:00
Hans Petter Selasky
01630a496b Run code through "clang-format -style=file" with some additional fixes.
No functional change.

Suggested by:		kib@ and emaste@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 10:23:21 +00:00
Navdeep Parhar
610d345953 if_vxlan(4): csum_flags_to_inner_flags takes the tunnel protocol as a parameter.
No functional change.
2020-10-22 17:05:55 +00:00