Commit Graph

4669 Commits

Author SHA1 Message Date
Kristof Provost
fbbf270eef pf: Use counter_u64 in pf_src_node
Reviewd by:	philip
MFC after:      2 weeks
Sponsored by:   Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27756
2021-01-05 23:35:36 +01:00
Kristof Provost
17ad7334ca pf: Split pf_src_node into a kernel and userspace struct
Introduce a kernel version of struct pf_src_node (pf_ksrc_node).

This will allow us to improve the in-kernel data structure without
breaking userspace compatibility.

Reviewed by:	philip
MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27707
2021-01-05 23:35:36 +01:00
Alexander V. Chernikov
9c0ff6a8bb Remove now-unused RT_GATEWAY* definitions.
They were used to simplify nexthop transition, hence not needed
 anymore.
2021-01-04 21:45:46 +00:00
Hans Petter Selasky
747feea146 Streamline the infiniband code according to the ethernet code.
Fix LINT-NOIP kernel build.

Submitted by:	rlibby @
Differential Revision:	https://reviews.freebsd.org/D27861
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-31 10:07:02 +01:00
Hans Petter Selasky
ec52ff6d14 Streamline the infiniband code according to the ethernet code.
Specifically implement the if_requestencap callback function for infiniband.
Most of the changes are simply a cut and paste of the equivalent ethernet part.

Reviewed by:	melifaro @
Differential Revision:	https://reviews.freebsd.org/D27631
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-29 18:01:57 +01:00
Hans Petter Selasky
19ecb5e8da Fix for IPoIB over lagg(4).
Need to update both link layer address and broadcast address when active link changes for IP over infiniband.
This is because the broadcast address contains the so-called P-key, which is interface dependent.

Reviewed by:	kib @
Differential Revision:	https://reviews.freebsd.org/D27658
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-29 17:35:06 +01:00
Ryan Libby
833dbf1e22 route: quiet -Wredundant-decls
Remove declaration duplicated in
f5baf8bb12

Reviewed by:	melifaro
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27790
2020-12-27 16:32:27 -08:00
Alexander V. Chernikov
f733d9701b Fix default route handling in radix4_lockless algo.
Improve nexthop debugging.

Reported by:	Florian Smeets <flo at smeets.xyz>
2020-12-26 22:51:02 +00:00
Alexander V. Chernikov
4e19e0d92a Use light-weight versions of routing lookup functions in ng_netflow.
Use recently-added combination of `fib[46]_lookup_rt()` which
 returns rtentry & raw nexthop with `rt_get_inet[6]_plen()` which
 returns address/prefix length of prefix inside `rt`.

Add `nhop_select_func()` wrapper around inlined `nhop_select()` to
 allow callers external to the routing subsystem select the proper
 nexthop from the multipath group without including internal headers.

New calls does not require reference counting objects and reduce
 the amount of copied/processed rtentry data.

Differential Revision: https://reviews.freebsd.org/D27675
2020-12-26 11:27:38 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Ryan Libby
2fb4a03d55 rtsock: quiet -Wunused-variable in LINT-NOIP kernels
Fixup after r368769 / d68fb8d978.

Reported by:	mjg
Reviewed by:	melifaro
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27730
2020-12-24 12:34:18 -08:00
Mark Johnston
92be2847e8 rtsock: Avoid copying uninitialized padding bytes
When copying sockaddrs out to userspace, we pad them to a multiple of
the platform alignment (sizeof(long)).  However, some sockaddr sizes,
such as struct sockaddr_dl, are not an integer multiple of the
alignment, so we may end up copying out uninitialized bytes.

Fix this by always bouncing through a pre-zeroed sockaddr_storage.

Reported by:	KASAN
Reviewed by:	melifaro
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27729
2020-12-23 11:16:40 -05:00
Kristof Provost
1c00efe98e pf: Use counter(9) for pf_state byte/packet tracking
This improves cache behaviour by not writing to the same variable from
multiple cores simultaneously.

pf_state is only used in the kernel, so can be safely modified.

Reviewed by:	Lutz Donnerhacke, philip
MFC after:	1 week
Sponsed by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D27661
2020-12-23 12:03:21 +01:00
Kristof Provost
c3f69af03a pf: Fix unaligned checksum updates
The algorithm we use to update checksums only works correctly if the
updated data is aligned on 16-bit boundaries (relative to the start of
the packet).

Import the OpenBSD fix for this issue.

PR:		240416
Obtained from:	OpenBSD
MFC after:	1 week
Reviewed by:	tuexen (previous version)
Differential Revision:	https://reviews.freebsd.org/D27696
2020-12-23 12:03:20 +01:00
Hans Petter Selasky
ddce63fcb6 Remove not needed variable initialization.
And switch from int to bool while at it.

Reviewed by:	melifaro@
Differential Revision:	https://reviews.freebsd.org/D27725
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-23 12:04:46 +01:00
Konstantin Belousov
994e47023a vxlan: stop checking CSUM_ENCAP_VXLAN when converting inner CSUM flags into normal, for decapsulation.
The packet, if processed at this point, was already parsed to be UDP
directed to a vxlan port.

Connect-X 4+ does not provide easy method to infer which parser
processed the packet, so driver cannot set the flag without a lot of
efforts which are only to satisfy the formal requirements.

Reviewed by:	bryanv, np
Sponsored by:	Mellanox Technologies/NVidia Networking
Differential revision:	https://reviews.freebsd.org/D27449
MFC after:	1 week
2020-12-23 10:54:06 +02:00
Alexander V. Chernikov
d68fb8d978 Switch direct rt fields access in rtsock.c to newly-create field acessors.
rtsock code was build around the assumption that each rtentry record
 in the system radix tree is a ready-to-use sockaddr. This assumptions
 turned out to be not quite true:
* masks have their length tweaked, so we have rtsock_fix_netmask() hack
* IPv6 addresses have their scope embedded, so we have another explicit
 deembedding hack.

Change the code to decouple rtentry internals from rtsock code using
 newly-created rtentry accessors. This will allow to eventually eliminate
 both of the hacks and change rtentry dst/mask format.

Differential Revision:	https://reviews.freebsd.org/D27451
2020-12-18 22:00:57 +00:00
Brooks Davis
f3f2ee76ad style(9): Correct whitespace in struct definitions
struct ifconf and struct ifreq use the odd style "struct<tab>foo".
struct ifdrv seems to have tried to follow this but was committed with
spaces in place of most tabs resulting in "struct<space><space>ifdrv".

MFC after:	3 days
2020-12-11 01:00:07 +00:00
Gleb Smirnoff
5ee33a9076 Fixup r368446 with KERN_TLS. 2020-12-08 23:54:09 +00:00
Gleb Smirnoff
e1074ed6a0 The list of ports in configuration path shall be protected by locks,
epoch shall be used only for fast path.  Thus use LAGG_XLOCK() in
lagg_[un]register_vlan.  This fixes sleeping in epoch panic.

PR:		240609
2020-12-08 16:46:00 +00:00
Gleb Smirnoff
87bf9b9cbe Convert LAGG_RLOCK() to NET_EPOCH_ENTER(). No functional changes. 2020-12-08 16:36:46 +00:00
Mark Johnston
c065d4e5e9 iflib: Avoid leaking the freelist bitmaps upon driver detach
Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27342
2020-12-07 14:53:14 +00:00
Mark Johnston
102540192c iflib: Detach tasks upon device registration failure
In some error paths we would fail to detach from the iflib taskqueue
groups.  Also move the detach code into its own subroutine instead of
duplicating it.

Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27342
2020-12-07 14:52:57 +00:00
Alexander V. Chernikov
df9053920f Add IPv4/IPv6 rtentry prefix accessors.
Multiple consumers like ipfw, netflow or new route lookup algorithms
 need to get the prefix data out of struct rtentry.
Instead of providing direct access to the rtentry, create IPv4/IPv6
 accessors to abstract struct rtentry internals and avoid including
 internal routing headers for external consumers.

While here, move struct route_nhop_data to the public header, so external
 customers can actually use lookup functions returning rt&nhop data.

Differential Revision:	https://reviews.freebsd.org/D27416
2020-12-03 22:23:57 +00:00
Kristof Provost
7f883a9b5b net: Revert vnet/epair cleanup race mitigation
Revert the mitigation code for the vnet/epair cleanup race (done in r365457).
r368237 introduced a more reliable fix.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-12-01 16:34:43 +00:00
Kristof Provost
e133271fc1 if: Fix panic when destroying vnet and epair simultaneously
When destroying a vnet and an epair (with one end in the vnet) we often
panicked. This was the result of the destruction of the epair, which destroys
both ends simultaneously, happening while vnet_if_return() was moving the
struct ifnet to its home vnet. This can result in a freed ifnet being re-added
to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
used.

Prevent this race by ensuring that vnet_if_return() cannot run at the same time
as if_detach() or epair_clone_destroy().

PR:		238870, 234985, 244703, 250870
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27378
2020-12-01 16:23:59 +00:00
Alexander V. Chernikov
77df2c21cb Renumber NHR_* flags after NHR_IFAIF removal in r368127.
Suggested by:	rpokala
2020-11-30 21:42:55 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Matt Macy
2338da0373 Import kernel WireGuard support
Data path largely shared with the OpenBSD implementation by
Matt Dunwoodie <ncon@nconroy.net>

Reviewed by:	grehan@freebsd.org
MFC after:	1 month
Sponsored by:	Rubicon LLC, (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26137
2020-11-29 19:38:03 +00:00
Alexander V. Chernikov
3b1654cb14 Introduce rib_walk_ext_internal() to allow iteration with rnh pointer.
This solves the case when rib is not yet attached/detached to/from the
 system rib array.

Differential Revision:	https://reviews.freebsd.org/D27406
2020-11-29 13:54:49 +00:00
Alexander V. Chernikov
f47fa26065 Add nhop_ref_any() to unify referencing nhop or nexthop group.
It allows code within routing subsystem to transparently reference nexthops
 and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH
 details.

Differential Revision:	https://reviews.freebsd.org/D27410
2020-11-29 13:52:06 +00:00
Alexander V. Chernikov
b712e3e343 Refactor fib4/fib6 functions.
No functional changes.

* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
 (fib<46>_lookup_rt()). These will be used in the control plane code
 requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
 This change simplifies the switch to alternative lookup implementations,
 which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision:	https://reviews.freebsd.org/D27405
2020-11-29 13:41:49 +00:00
Alexander V. Chernikov
98d5c4e5c8 Add tracking for rib/nhops/nhgrp objects and provide cumulative number accessors.
The resulting KPI can be used by routing table consumers to estimate the required
 scale for route table export.

* Add tracking for rib routes
* Add accessors for number of nexthops/nexthop objects
* Simplify rib_unsubscribe: store rnh we're attached to instead of requiring it up
 again on destruction. This helps in the cases when rnh is not linked yet/already unlinked.

Differential Revision:	https://reviews.freebsd.org/D27404
2020-11-29 13:27:24 +00:00
Alexander V. Chernikov
ef6ef7e5da Add nhgrp_get_idx() as a counterpart for nhop_get_idx().
It allows the routing-related code to reference nexthop groups by index
 instead of storing a pointer.
2020-11-28 15:46:40 +00:00
Alexander V. Chernikov
7a6dc73c98 Cleanup nexthops request flags:
* remove NHR_IFAIF as it was used by previous version of nexthop KPI
* update NHR_REF description
2020-11-28 15:11:59 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Kristof Provost
bca0e1d2ac if: Fix non-VIMAGE build
if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not
enabled.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-11-25 17:15:24 +00:00
Kristof Provost
a779388f8b if: Protect V_ifnet in vnet_if_return()
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.

We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.

We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.

Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().

Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27279
2020-11-25 15:07:22 +00:00
Kristof Provost
a60100fdfc if: Remove ifnet_rwlock
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.

Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27278
2020-11-25 10:56:38 +00:00
Alexander V. Chernikov
7511a63825 Refactor rib iterator functions.
* Make rib_walk() order of arguments consistent with the rest of RIB api
* Add rib_walk_ext() allowing to exec callback before/after iteration.
* Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
* Rename rt_forach_fib_walk -> rib_foreach_table_walk
* Move rib_foreach_table_walk{_del} to route/route_helpers.c
* Slightly refactor rib_foreach_table_walk{_del} to make the implementation
 consistent and prepare for upcoming iterator optimizations.

Differential Revision:	https://reviews.freebsd.org/D27219
2020-11-22 20:21:10 +00:00
Mitchell Horne
70af7ce99a Make net/ifq.h C++ friendly
Don't use "new" as an identifier, and add explicit casts from void *.

As a general policy, FreeBSD doesn't make any C++ compatibility
guarantees for kernel headers like it does for userland, but it is a
small effort to do so in this case, to the benefit of a downstream
consumer (NetApp).

Reviewed by:	rscheff
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27286
2020-11-20 14:45:45 +00:00
Andrew Gallatin
8732245d29 LACP: When suppressing distributing, return ENOBUFS
When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historiclally drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack.  This means that any TCP connection with active
traffic during a 3-second windown when an LACP link comes or goes
would get closed.

TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, lets
return ENOBUFS instead.  This allows TCP connections to be preserved.

I've tested this by repeatedly bouncing links on a Netlfix CDN server
under a moderate (20Gb/s) load and overved ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).

Reviewed by:	jhb, jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27188
2020-11-18 14:55:49 +00:00
Mark Johnston
54bf96fb4f iflib: Free full mbuf chains when draining transmit queues
Submitted by:	Sai Rajesh Tallamraju <stallamr@netapp.com>
Reviewed by:	gallatin, hselasky
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D27179
2020-11-11 18:00:06 +00:00
Andrey V. Elsukov
2f4ffa9f72 Fix possible NULL pointer dereference.
lagg(4) replaces if_output method of its child interfaces and expects
that this method can be called only by child interfaces. But it is
possible that lagg_port_output() could be called by children of child
interfaces. In this case ifnet's if_lagg field is NULL. Add check that
lp is not NULL.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2020-11-11 15:53:36 +00:00
Mitchell Horne
4a3fc6e22e Fix definition of rn_addmask()
Add the missing static keyword present in the declaration.

Reviewed by:	melifaro
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27024
2020-11-08 19:02:22 +00:00
Alexander V. Chernikov
2d39824195 Switch net.add_addr_allfibs default to 0.
The goal of the fib support is to provide multiple independent
 routing tables, isolated from each other.
net.add_addr_allfibs default tries to shift gears in the opposite
 direction, unconditionally inserting all addresses to all of the fibs.

There are use cases when this is necessary, however this is not a
 default expected behaviour, especially compared to other implementations.

Provide WARNING message for the setups with multiple fibs to notify
 potential users of the feature.

Differential Revision:	https://reviews.freebsd.org/D26076
2020-11-08 18:27:49 +00:00
Alexander V. Chernikov
76e6b37f6b Temporarily revert setting net.add_addr_allfibs to 0.
It accidentally sweeped in r367486.
Revert to allow for proper commit message & warning.
2020-11-08 18:11:12 +00:00
Alexander V. Chernikov
770495f4c0 Fix build broken by r367484: add route_ifaddrs.c.
Pointy hat to: melifaro
Reported by:	jenkins
2020-11-08 13:30:44 +00:00
Alexander V. Chernikov
bad6b23606 Move all ifaddr route creation business logic to net/route/route_ifaddr.c
Differential Revision:	https://reviews.freebsd.org/D26318
2020-11-08 11:12:00 +00:00
Konstantin Belousov
80ba361b2f if_media.c SIOCGMEDIAX handler: improve loop
Stop advancing counter past the current iteration number at the start
of iteration.  This removes the need of subtracting one when
calculating index for copyout, and arguably fixes off-by-one reporting
of copied out elements when copyout failed.

Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies / NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27073
2020-11-03 14:33:04 +00:00
Konstantin Belousov
1fbbe9dbf5 net/if_media.c: improve IFMEDIA_DEBUG output.
Use consistent output format for hex.
Print both media and mask where relevant.

Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:38:30 +00:00
Konstantin Belousov
e399f19dba Cleanup of net/if_media.c: simplify cleanup loop in ifmedia_removeall().
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:36:21 +00:00
Konstantin Belousov
899322fdfa Cleanup of net/if_media.c: some style.
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:30:17 +00:00
Konstantin Belousov
2193fb16b5 Cleanup of net/if_media.c: switch to ANSI C function definitions.
Reviewed by:	hselasky
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27034
2020-11-01 16:25:35 +00:00
Mitchell Horne
ced0f52457 net: add ETHER_IS_IPV6_MULTICAST
This can be used to detect if an ethernet address is specifically an
IPv6 multicast address, defined in accordance to RFC 2464.

ETHER_IS_MULTICAST is still preferred in the general case.

Reviewed by:	ae
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26611
2020-10-30 13:32:58 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
Vincenzo Maffione
be7a6b3d84 iflib: fix typo bug introduced by r367093
Code was supposed to call callout_reset_sbt_on() rather than
callout_reset_sbt(). This resulted into passing a "cpu" value
to a "flag" argument. A recipe for subtle errors.

PR:	248652
Reported by:	sg@efficientip.com
MFC with: r367093
2020-10-28 21:06:17 +00:00
Vincenzo Maffione
17cec474c0 iflib: add per-tx-queue netmap timer
The way netmap TX is handled in iflib when TX interrupts are not
used (IFC_NETMAP_TX_IRQ not set) has some issues:
  - The netmap_tx_irq() function gets called by iflib_timer(), which
    gets scheduled with tick granularity (hz). This is not frequent
    enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
    result is that the transmitting netmap application is not woken
    up fast enough to saturate the link with small packets.
  - The iflib_timer() functions also calls isc_txd_credits_update()
    to ask for more TX completion updates. However, this violates
    the netmap requirement that only txsync can access the TX queue
    for datapath operations. Only netmap_tx_irq() may be called out
    of the txsync context.

This change introduces per-tx-queue netmap timers, using microsecond
granularity to ensure that netmap_tx_irq() can be called often enough
to allow for maximum packet rate. The timer routine simply calls
netmap_tx_irq() to wake up the netmap application. The latter will
wake up and call txsync to collect TX completion updates.

This change brings back line rate speed with small packets for ixgbe.
For the time being, timer expiration is hardcoded to 90 microseconds,
in order to avoid introducing a new sysctl.
We may eventually implement an adaptive expiration period or use another
deferred work mechanism in place of timers.

Also, fix the timers usage to make sure that each queue is serviced
by a different CPU.

PR:	248652
Reported by:	sg@efficientip.com
MFC after:	2 weeks
2020-10-27 21:53:33 +00:00
Hans Petter Selasky
1355e2dc4f More style fixes (partial revert of r366994).
Suggested by:		danfe@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 13:07:50 +00:00
Hans Petter Selasky
1d3a22e765 Fix order of header files:
sys/systm.h should come right after sys/param.h

Suggested by:		kib@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 10:52:09 +00:00
Hans Petter Selasky
01630a496b Run code through "clang-format -style=file" with some additional fixes.
No functional change.

Suggested by:		kib@ and emaste@
Differential Revision:	https://reviews.freebsd.org/D26254
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-24 10:23:21 +00:00
Navdeep Parhar
610d345953 if_vxlan(4): csum_flags_to_inner_flags takes the tunnel protocol as a parameter.
No functional change.
2020-10-22 17:05:55 +00:00
Hans Petter Selasky
ce329aa256 Compile fix for MIPS, MIPS64, POWERPC and POWERPC64.
Add missing include files.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 12:22:08 +00:00
Hans Petter Selasky
a92c4bb62a Add support for IP over infiniband, IPoIB, to lagg(4). Currently only
the failover protocol is supported due to limitations in the IPoIB
architecture. Refer to the lagg(4) manual page for how to configure
and use this new feature. A new network interface type,
IFT_INFINIBANDLAG, has been added, similar to the existing
IFT_IEEE8023ADLAG .

ifconfig(8) has been updated to accept a new laggtype argument when
creating lagg(4) network interfaces. This new argument is used to
distinguish between ethernet and infiniband type of lagg(4) network
interface. The laggtype argument is optional and defaults to
ethernet. The lagg(4) command line syntax is backwards compatible.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:47:12 +00:00
Hans Petter Selasky
9d40cf60d6 Factor out generic IP over infiniband, IPoIB, definitions and code
into net/if_infiniband.c and net/infiniband.h . No functional change
intended.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:09:53 +00:00
Alexander V. Chernikov
c7cffd65c5 Add support for stacked VLANs (IEEE 802.1ad, AKA Q-in-Q).
802.1ad interfaces are created with ifconfig using the "vlanproto" parameter.
Eg., the following creates a 802.1Q VLAN (id #42) over a 802.1ad S-VLAN
(id #5) over a physical Ethernet interface (em0).

ifconfig vlan5 create vlandev em0 vlan 5 vlanproto 802.1ad up
ifconfig vlan42 create vlandev vlan5 vlan 42 inet 10.5.42.1/24

VLAN_MTU, VLAN_HWCSUM and VLAN_TSO capabilities should be properly
supported. VLAN_HWTAGGING is only partially supported, as there is
currently no IFCAP_VLAN_* denoting the possibility to set the VLAN
EtherType to anything else than 0x8100 (802.1ad uses 0x88A8).

Submitted by:	Olivier Piras
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D26436
2020-10-21 21:28:20 +00:00
Alexander V. Chernikov
0c325f53f1 Implement flowid calculation for outbound connections to balance
connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523
2020-10-18 17:15:47 +00:00
Marcin Wojtas
1148702e43 Add SADB_SAFLAGS_ESN flag
This flag is going to be used by IKE daemon to signal if
Extended Sequence Number feature is going to be used.

Value for this flag was taken from OpenBSD source code
6b4cbaf181

Submitted by:           Patryk Duda <pdk@semihalf.com>
Reviewed by:            ae
Differential revision:  https://reviews.freebsd.org/D22366
Obtained from:          Semihalf
Sponsored by:           Stormshield
2020-10-16 11:22:29 +00:00
Richard Scheffenegger
868aabb470 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2020-10-09 12:06:43 +00:00
Konstantin Belousov
cefdb89514 Fix typo.
Sponsored by:	Mellanox Technologies/NVIDIA Networking
MFC after:	3 days
2020-10-07 10:58:56 +00:00
Kristof Provost
4af1bd8157 bridge: call member interface ioctl() without NET_EPOCH
We're not allowed to hold NET_EPOCH while sleeping, so when we call ioctl()
handlers for member interfaces we cannot be in NET_EPOCH.  We still need some
protection of our CK_LISTs, so hold BRIDGE_LOCK instead.

That requires changing BRIDGE_LOCK into a sleepable lock, and separating the
BRIDGE_RT_LOCK, to protect bridge_rtnode lists. That lock is taken in the data
path (while in NET_EPOCH), so it cannot be a sleepable lock.

While here document the locking strategy.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D26418
2020-10-06 19:19:56 +00:00
John Baldwin
56fb710f1b Store the send tag type in the common send tag header.
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway.  This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).

In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.

Reviewed by:	gallatin, hselasky, np, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26689
2020-10-06 17:58:56 +00:00
Alexander V. Chernikov
1b95005e95 Fix route flags update during RTM_CHANGE.
Nexthop lookup was not consireding rt_flags when doing
 structure comparison, which lead to an original nexthop
 selection when changing flags. Fix the case by adding
 rt_flags field into comparison and rearranging nhop_priv
 fields to allow for efficient matching.
Fix `route change X/Y flags` case - recent changes
 disallowed specifying RTF_GATEWAY flag without actual gateway.
 It turns out, route(8) fills in RTF_GATEWAY by default, unless
 -interface flag is specified. Fix regression by clearing
 RTF_GATEWAY flag instead of failing.
Fix route flag reporting in RTM_CHANGE messages by explicitly
 updating rtm_flags after operation competion.
Add IPv4/IPv6 tests for flag-only route changes.
2020-10-04 13:24:58 +00:00
Alexander V. Chernikov
9c584fa4bc Remove ROUTE_MPATH-related warnings introduced in r366390.
Reported by:	mjg
2020-10-03 14:37:54 +00:00
Alexander V. Chernikov
fedeb08b6a Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2020-10-03 10:47:17 +00:00
Vincenzo Maffione
adf41f0788 netmap: fix constness warnings generated by "-Wcast-qual"
Submitted by:	milosz.kaniewski@gmail.com
MFC after:	3 days
2020-10-03 09:33:29 +00:00
Ed Maste
c1aedfcbd9 add SIOCGIFDATA ioctl
For interfaces that do not support SIOCGIFMEDIA (for which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state.  While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms.

SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison.  This implementation mirrors that of other
similar ioctls in FreeBSD.

Submitted by:	Roy Marples <roy@marples.name>
Reviewed by:	markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D26538
2020-09-28 16:54:39 +00:00
Alexander V. Chernikov
2259a03020 Rework part of routing code to reduce difference to D26449.
* Split rt_setmetrics into get_info_weight() and rt_set_expire_info(),
 as these two can be applied at different entities and at different times.
* Start filling route weight in route change notifications
* Pass flowid to UDP/raw IP route lookups
* Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact
 that rtentry can contain multiple nexthops.

Differential Revision:	https://reviews.freebsd.org/D26497
2020-09-21 20:02:26 +00:00
Alexander V. Chernikov
1440f62266 Remove unused nhop_ref_any() function.
Remove "opt_mpath.h" header where not needed.

No functional changes.
2020-09-20 21:32:52 +00:00
Alexander V. Chernikov
c4bcfe98e2 Fix gw updates / flag updates during route changes.
* Zero gw_sdl if switching to interface route - the assumption
 that underlying storage is zeroed is incorrect with route changes.
* Apply proper flag mask to rte.

Reported by:	vangyzen
2020-09-20 12:31:48 +00:00
Navdeep Parhar
b092fd6c97 if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.
This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:37:57 +00:00
Navdeep Parhar
830edb4561 Add two new ifnet capabilities for hw checksumming and TSO for VXLAN traffic.
These are similar to the existing VLAN capabilities.

Reviewed by:	kib@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2020-09-18 02:10:28 +00:00
Mitchell Horne
ceff9b9d25 if_media: definitions for 40GE LM4 ethernet media type
Reviewed by:	erj
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26276
2020-09-16 14:45:16 +00:00
Alexander V. Chernikov
2b32d93e55 Fix RADIX_MPATH build broken by r365521.
Reported by:	jenkins, Hartmann, O. <ohartmann at walstatt.org>
2020-09-10 07:05:31 +00:00
Alexander V. Chernikov
aa8f9f90ff Update nexthop handling for route addition/deletion in preparation for mpath.
Currently kernel requests deletion for the certain routes with specified gateway,
 but this gateway is not actually checked. With multipath routes, internal gateway
 checking becomes mandatory. Add the logic performing this check.

Generalise RTF_PINNED routes to the generic route priorities, simplifying the logic.

Add lookup_prefix() function to perform exact match search based on data in @info.

Differential Revision:	https://reviews.freebsd.org/D26356
2020-09-09 22:07:54 +00:00
Alexander V. Chernikov
cd6298d5c5 Retain marking net.fibs sysctl as a tunable.
Suggested by:	avg
2020-09-09 21:45:18 +00:00
Alexander V. Chernikov
4a8201c13a Fix panic with net.fibs tunable set in loader.conf.
Fix by removing forgotten CTLFLAG_RWTUN flag from the sysctl,
 loader variable will be read later in vnet_rtables_init().

Reported by:	mav
2020-09-08 21:39:34 +00:00
Kristof Provost
a969635b83 net: mitigate vnet / epair cleanup races
There's a race where dying vnets move their interfaces back to their original
vnet, and if_epair cleanup (where deleting one interface also deletes the other
end of the epair). This is commonly triggered by the pf tests, but also by
cleanup of vnet jails.

As we've not yet been able to fix the root cause of the issue work around the
panic by not dereferencing a NULL softc in epair_qflush() and by not
re-attaching DYING interfaces.

This isn't a full fix, but makes a very common panic far less likely.

PR:		244703, 238870
Reviewed by:	lutz_donnerhacke.de
MFC after:	4 days
Differential Revision:	https://reviews.freebsd.org/D26324
2020-09-08 14:54:10 +00:00
Alexander V. Chernikov
05aca418f4 Consistently use the same gateway when adding/deleting interface routes.
Use the same link-level gateway when adding or deleting interface routes.
This helps nexthop checking in the upcoming multipath changes.

Differential Revision:	https://reviews.freebsd.org/D26317
2020-09-07 10:13:54 +00:00
Ed Maste
4fa9815a3d rtsock.c: remove extraneous space
Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D26249
2020-09-05 16:13:36 +00:00
Alexander V. Chernikov
8f07963360 Fix regression for IPv6 loopback routes.
After nexthop introduction, loopback routes for the interface addresses
 were created without embedding actual interface index in the gateway.
 The latter is needed to pass the IPv6 scope during transmission via loopback..

Fix the regression by actually using passed gateway data with interface index.

Differential Revision:	https://reviews.freebsd.org/D26306
2020-09-03 22:24:52 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Vincenzo Maffione
35d8a463e8 iflib: leave only 1 receive descriptor unused
The pidx argument of isc_rxd_flush() indicates which is the last valid
receive descriptor to be used by the NIC. However, current code has
multiple issues:
  - Intel drivers write pidx to their RDT register, which means that
    NICs will only use the descriptors up to pidx-1 (modulo ring size N),
    and won't actually use the one pointed by pidx. This does not break
    reception, but it is anyway confusing and suboptimal (the NIC will
    actually see only N-2 descriptors as available, rather than N-1).
    Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic).
  - The semantic used by Intel (RDT is one descriptor past the last
     valid one) is used by most (if not all) NICs, and it is also used
     on the TX side (also in iflib). Since iflib is not currently
     using this semantic for RX, it must decrement fl->ifl_pidx
     (modulo N) before calling isc_rxd_flush(), and then the
     per-driver callback implementation must increment the index
     again (to match the real semantic). This is confusing and suboptimal.
  -  The iflib refill function is also called at initialization.
     However, in case the ring size is smaller than 128 (e.g. if_mgb),
     the refill function will actually prepare all the receive
     descriptors (N), without leaving one unused, as most of NICs assume
     (e.g. to avoid RDT to overrun RDH). I can speculate that the code
     looks like this right now because this issue showed up during
     testing (e.g. with if_mgb), and it was easy to workaround by
     decrementing pidx before isc_rxd_flush().

The goal of this change is to simplify the code (removing a bunch
of instructions from the RX fast path), and to make the semantic of
isc_rxd_flush() consistent across drivers. To achieve this, we:
  - change the semantics of the pidx argument to the usual one (that
    is the index one past the last valid one), so that both iflib and
    drivers avoid the decrement/increment dance.
  - fix the initialization code to prepare at most N-1 descriptors.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26191
2020-09-01 20:41:47 +00:00
Alexander V. Chernikov
b8d2d479cd Revert uma zone alignemnt cache unadvertenly committed in r364950. 2020-08-29 12:04:13 +00:00
Alexander V. Chernikov
6498f66f7c Fix build with RADIX_MPATH.
Reported by:	Hartmann, O <ohartmann@walstatt.org>
2020-08-29 11:04:24 +00:00
Alexander V. Chernikov
7c89a3b63f Move fib_rte_to_nh_flags() from net/route_var.h to net/route/nhop_ctl.c.
No functional changes.
Initially this function was created to perform runtime flag conversions
 for the previous incarnation of fib lookup functions. As these functions
 got deprecated, move the function to the file with the only remaining
 caller. Lastly, rename it to convert_rt_to_nh_flags() to follow the
 naming notation.
2020-08-28 23:01:56 +00:00
Alexander V. Chernikov
a624ca3dff Move net/route/shared.h definitions to net/route/route_var.h.
No functional changes.

net/route/shared.h was created in the inital phases of nexthop conversion.
It was intended to serve the same purpose as route_var.h - share definitions
 of functions and structures between the routing subsystem components. At
 that time route_var.h was included by many files external to the routing
 subsystem, which largerly defeats its purpose.

As currently this is not the case anymore and amount of route_var.h includes
 is roughly the same as shared.h, retire the latter in favour of the former.
2020-08-28 22:50:20 +00:00
Alexander V. Chernikov
b122304f6a Further split nhop creation and rtable operations.
As nexthops are immutable, some operations such as route attribute changes
 require nexthop fetching, forking, modification and route switching.
These operations are not atomic, so they may need to be retried multiple
 times in presence of multiple speakers changing the same route.

This change introduces "synchronisation" primitive: route_update_conditional(),
 simplifying logic for route changes and upcoming multipath operations.

Differential Revision:	https://reviews.freebsd.org/D26216
2020-08-28 21:59:10 +00:00
Vincenzo Maffione
ae750d5cdf iflib: netmap: publish all the receive buffer
At initialization time, the netmap RX refill function used to
prepare the NIC RX ring with N-1 buffers rather than N (with
N equal to the number of descriptors in the NIC RX ring).
This is not how netmap is supposed to work, as it would keep
kring->nr_hwcur not in sync with the NIC "next index to refill"
(i.e., fl->ifl_pidx). Instead we prepare N buffers, although we
still publish (with isc_rxd_flush()) only the first N-1 buffers,
to avoid the NIC producer pointer to overrun the NIC consumer
pointer (for NICs where this is a real issue, e.g. Intel ones).

MFC after:	2 weeks
2020-08-25 15:19:45 +00:00
Alexander V. Chernikov
592d300e34 Remove RT_LOCK mutex from rte.
rtentry lock traditionally served 2 purposed: first was protecting refcounts,
 the second was assuring consistent field access/changes.
Since route nexthop introduction, the need for the former disappeared and
 the need for the latter reduced.
To be more precise, the following rte field are mutable:

rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info)
rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal)
rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info)
rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK)
rt_chain (deletion chain pointer, updated with RIB_WLOCK)
All of them are updated under RIB_WLOCK, so the only remaining concern is the reading.

rt_nhop and rt_weight (addressed in this review) are read under rib lock and
 stored in the rib_cmd_info, so the caller has no problem with consitency.
rte_flags is currently read unlocked in rtsock reporting (however the scope
 is only RTF_UP flag, which is pretty static).
rt_expire is currently read unlocked in rtsock reporting.
rt_chain accesses are safe, as this is only used at route deletion.

rt_expire and rte_flags reads will be dealt in a separate reviews soon.

Differential Revision:	https://reviews.freebsd.org/D26162
2020-08-24 20:23:34 +00:00
Vincenzo Maffione
de5b46107c iflib: fix isc_rxd_flush call in netmap_fl_refill()
The semantic of the pidx argument of isc_rxd_flush() is the
last valid index of in the free list, rather than the next
index to be published. However, netmap was still using the
old convention. While there, also refactor the netmap_fl_refill()
to simplify a little bit and add an assertion.

MFC after:	2 weeks
2020-08-24 11:44:20 +00:00
Alexander V. Chernikov
eb1c7adb70 Finish r364492 by renaming rt_flags to rte_flags for multipath code. 2020-08-22 20:02:40 +00:00
Alexander V. Chernikov
93bfd365d2 Rename rt_flags to rte_flags && reduce number of rt_nhop accesses.
No functional changes.

Most of the routing flags are stored in the netxtop instead of rtentry.
Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code
 checking routing flags.

In the new multipath code, rt->rt_nhop may actually point to nexthop group
 instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->...
 accesses.

Differential Revision:	https://reviews.freebsd.org/D26156
2020-08-22 19:30:56 +00:00
Mateusz Guzik
c93d310f87 Fix tinderbox build after r364465 2020-08-22 07:43:38 +00:00
Alexander V. Chernikov
f5247a232a Make net.fibs growable.
Allow to dynamically grow the amount of fibs in each vnet.

This change alters current behavior. Currently, if one defines
 ROUTETABLES > 1 in the kernel config, each vnet will be created
 with the number of fibs defined in the kernel config.
 After this commit vnets will be created with fibs=1.

Dynamic net.fibs is not compatible with net.add_addr_allfibs.
 The plan is to deprecate the latter and make
 net.add_addr_allfibs=0 default behaviour.

Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26062
2020-08-21 21:34:52 +00:00
Warner Losh
773e541e8d Use devctl.h instead of bus.h to reduce newbus pollution.
There's no need for these parts of the kernel to know about newbus,
so narrow what is included to devctl.h for device_notify_*.

Suggested by: kib@
2020-08-21 00:03:24 +00:00
Bjoern A. Zeeb
2ae32ca2c8 For consistency and to avoid any problems getting past the 31bit
boundry change the last two IF_Mbps(2500) and additionally one
IF_Mbps(5000) to ULL as well.

MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (d/b/a "Netgate")
2020-08-17 13:51:25 +00:00
Qing Li
a5154bb2e5 Correct the mask byte order when checking for reserved bits.
Reviewed by:	gnn
Approved by:	gnn
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26071
2020-08-15 16:48:58 +00:00
Alexander V. Chernikov
bec053ffe0 Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25637
2020-08-15 11:37:44 +00:00
Alexander V. Chernikov
2f23f45b20 Simplify dom_<rtattach|rtdetach>.
Remove unused arguments from dom_rtattach/dom_rtdetach functions and make
  them return/accept 'struct rib_head' instead of 'void **'.
Declare inet/inet6 implementations in the relevant _var.h headers similar
  to domifattach / domifdetach.
Add rib_subscribe_internal() function to accept subscriptions to the rnh
  directly.

Differential Revision:	https://reviews.freebsd.org/D26053
2020-08-14 21:29:56 +00:00
Bryan Drewery
3869d41465 lagg: Avoid adding a port to a lagg device being destroyed.
The lagg_clone_destroy() handles detach and waiting for ifconfig callers
to drain already.

This narrows the race for 2 panics that the tests triggered. Both were a
consequence of adding a port to the lagg device after it had already detached
from all of its ports. The link state task would run after lagg_clone_destroy()
free'd the lagg softc.

    kernel:trap_fatal+0xa4
    kernel:trap_pfault+0x61
    kernel:trap+0x316
    kernel:witness_checkorder+0x6d
    kernel:_sx_xlock+0x72
    if_lagg.ko:lagg_port_state+0x3b
    kernel:if_down+0x144
    kernel:if_detach+0x659
    if_tap.ko:tap_destroy+0x46
    kernel:if_clone_destroyif+0x1b7
    kernel:if_clone_destroy+0x8d
    kernel:ifioctl+0x29c
    kernel:kern_ioctl+0x2bd
    kernel:sys_ioctl+0x16d
    kernel:amd64_syscall+0x337

    kernel:trap_fatal+0xa4
    kernel:trap_pfault+0x61
    kernel:trap+0x316
    kernel:witness_checkorder+0x6d
    kernel:_sx_xlock+0x72
    if_lagg.ko:lagg_port_state+0x3b
    kernel:do_link_state_change+0x9b
    kernel:taskqueue_run_locked+0x10b
    kernel:taskqueue_run+0x49
    kernel:ithread_loop+0x19c
    kernel:fork_exit+0x83

PR:		244168
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D25284
2020-08-13 22:06:27 +00:00
Alexander V. Chernikov
6cbadc4234 Move rtzone handling code to net/route_ctl.c
After moving the route control plane code from net/route.c,
 all rtzone users ended up being in net/route_ctl.c.
Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well
 to have everything in a single place.

While here, remove custom initializers from the zone.
It was added originally to avoid setup/teardown of costy per-cpu couters.
With these counters removed, the only remaining job was avoiding rte mutex
 setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex
 will soon be removed. With that in mind, there is no sense in keeping
 custom zone callbacks.

Differential Revision:	https://reviews.freebsd.org/D26051
2020-08-13 18:35:29 +00:00
Mitchell Horne
f7d79f6c6d Correctly set error in rt_mpath_unlink
It is possible for rn_delete() to return NULL. If this happens, then set
*perror to ESRCH, as is done in the rest of the function.

Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D25871
2020-08-12 16:43:20 +00:00
Vincenzo Maffione
6d84e76a25 iflib: netmap: improve rxsync to support IFLIB_HAS_RXCQ
For drivers with IFLIB_HAS_RXCQ set, there is a separate completion
queue. In this case, the netmap rxsync routine needs to update
rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof().
This improves the situation for vmx(4) and bnxt(4) drivers, which
use iflib and have the IFLIB_HAS_RXCQ bit set.

PR:	248494
MFC after:	3 weeks
2020-08-12 14:45:31 +00:00
Vincenzo Maffione
530960be8d iflib: refactor netmap_fl_refill and fix off-by-one issue
First, fix the initialization of the fl->ifl_rxd_idxs array,
which was affected by an off-by-one bug.
Once there, refactor the function to use better names for
local variables, optimize the variable assignments, and
merge the bus_dmamap_sync() inner loop with the outer one.

PR:	248494
MFC after:	3 weeks
2020-08-12 14:17:38 +00:00
Alexander V. Chernikov
8a0917c35b Do not enter epoch in add_route(), as it is already called in epoch.
Reviewed by:	glebius
2020-08-11 07:23:07 +00:00
Alexander V. Chernikov
136a1f8da8 Make <add|del|change>_route() static to finish the transition to the new kpi.
Discussed with:	glebius
2020-08-11 07:21:32 +00:00
Alexander V. Chernikov
9a00f6d067 Fix rib_subscribe() waitok flag by performing allocation outside epoch.
Make in6_inithead() use rib_subscribe with waitok to achieve reliable
 subscription allocation.

Reviewed by:	glebius
2020-08-11 07:05:30 +00:00
Vincenzo Maffione
c9d886cd7f iflib: netmap: drop redundant check
The validity of head is already checked by nm_rxsync_prologue().

MFC after:	2 weeks
2020-08-06 21:37:38 +00:00
Vincenzo Maffione
ee07345d20 iflib: netmap: don't increment ifl_cidx on the wrong free list
Netmap only uses free list 0 to keep it consistent with its
one-to-one mapping between each netmap ring and a device RX
(or TX) queue.
However, the current iflib_netmap_rxsync() routine was
mistakenly updating the ifl_cidx field of both free lists.

PR:		248494
MFC after:	2 weeks
2020-08-06 21:32:25 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplyify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Matt Macy
0ae0e8d2bd iflib: fix LOR with bpf detach
Reported by:	grehan@
Approved by:	grehan@
MFC after:	1 week
Sponsored by:	Netgate
Differential Revision: https://reviews.freebsd.org/D25530
2020-07-27 01:17:59 +00:00
Alexander V. Chernikov
e1c05fd290 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
 to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
2020-07-21 19:56:13 +00:00
Vincenzo Maffione
ac11d85740 iflib: initialize netmap with the correct number of descriptors
In case the network device has a RX or TX control queue, the correct
number of TX/RX descriptors is contained in the second entry of the
isc_ntxd (or isc_nrxd) array, rather than in the first entry.
This case is correctly handled by iflib_device_register() and
iflib_pseudo_register(), but not by iflib_netmap_attach().
If the first entry is larger than the second, this can result in a
panic. This change fixes the bug by introducing two helper functions
that also lead to some code simplification.

PR:	247647
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D25541
2020-07-20 21:08:56 +00:00
Alexander V. Chernikov
725871230d Temporarly revert r363319 to unbreak the build.
Reported by:	CI
Pointy hat to: melifaro
2020-07-19 10:53:15 +00:00
Alexander V. Chernikov
8cee15d9e4 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25546
2020-07-19 09:29:27 +00:00
Kristof Provost
93ed6ade6e bridge: Don't sleep during epoch
While it doesn't trigger INVARIANTS or WITNESS on head it does in stable/12.
There's also no reason for it, as we can easily report the out of memory error
to the caller (i.e. userspace). All of these can already fail.

PR:		248046
MFC after:	3 days
2020-07-18 12:43:11 +00:00
Kyle Evans
cef5fc74c2 tuntap: drop redundant if_mtu assignment in tuncreate
ether_ifattach will immediately clobber if_mtu with ETHERMTU anyways, just
let it happen.

MFC after:	1 week
2020-07-16 15:02:11 +00:00
Kyle Evans
fa2ab81e32 ether_ifattach: set mtu before calling if_attach()
if_attach() -> if_attach_internal() will call if_attachdomain1(ifp) any time
an ethernet interface is setup *after*
SI_SUB_PROTO_IFATTACHDOMAIN/SI_ORDER_FIRST.  This eventually leads to
nd6_ifattach() -> nd6_setmtu0() stashing off ifp->if_mtu in ndi->maxmtu
*before* ifp->if_mtu has been properly set in some scenarios, e.g., USB
ethernet adapter plugged in later on.

For interfaces that are created in early boot, we don't have this issue as
domains aren't constructed enough for them to attach and thus it gets
deferred to domainifattach at SI_SUB_PROTO_IFATTACHDOMAIN/SI_ORDER_SECOND
*after* the mtu has been set earlier in ether_ifattach().

PR:		248005
Submitted by:	Mathew <mjanelle blackberry com>
MFC after:	1 week
2020-07-16 13:37:32 +00:00
Alexander V. Chernikov
4c7ba83f9d Switch inet6 default route subscription to the new rib subscription api.
Old subscription model allowed only single customer.

Switch inet6 to the new subscription api and eliminate the old model.

Differential Revision:	https://reviews.freebsd.org/D25615
2020-07-12 11:24:23 +00:00
Alexander V. Chernikov
edc37a66e3 Add destructor for the rib subscription system to simplify users code.
Subscriptions are planned to be used by modules such as route lookup engines.
In that case that's the module task to properly unsibscribe before detach.
However, the in-kernel customer - inet6 wants to track default route changes.
To avoid having inet6 store per-fib subscriptions, handle automatic
 destruction internally.

Differential Revision:	https://reviews.freebsd.org/D25614
2020-07-12 11:18:09 +00:00
Mark Johnston
26dd427800 Split nhop_ref_object().
Now nhop_ref_object() unconditionally acquires a reference, and the new
nhop_try_ref_object() uses refcount_acquire_if_not_zero() to
conditionally acquire a reference.  Since the former is cheaper, use it
when we know that the initial counter value is non-zero.  No functional
change intended.

Reviewed by:	melifaro
Differential Revision:	https://reviews.freebsd.org/D25535
2020-07-06 21:20:57 +00:00
Mark Johnston
b256d25c50 iflib: Fix some nits in the rx refill code.
- Get rid of the ifl_vm_addrs array.  It is not used by any existing
  consumer, so we are just dirtying a couple of cache lines for no
  reason.
- Use uma_zalloc(fl->ifl_zone) instead of m_cljget().  Otherwise
  m_cljget() is doing unnecessary work to look up the correct zone, when
  iflib already knows what that zone is.
- ifl_gen is only used when INVARIANTS is on, so make that more clear.
- Fix some style nits and inconsistencies.

Reviewed by:	gallatin
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25490
2020-07-06 14:52:21 +00:00
Mark Johnston
a363e1d4d0 iflib: Fix handling of mbuf cluster allocation failures.
When refilling an rx freelist, make sure we only update the hardware
producer index if at least one cluster was allocated.  Otherwise the
NIC is programmed to write a previously used cluster, typically
resulting in a use-after-free when packet data is written by the
hardware.

Also make sure that we don't update the fragment index cursor if the
last allocation attempt didn't succeed.  For at least Intel drivers,
iflib assumes that the consumer index and fragment index cursor stay in
lockstep, but this assumption was violated in the face of cluster
allocation failures.

Reported and tested by:	pho
Reviewed by:	gallatin, hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25489
2020-07-06 14:52:09 +00:00
Alexander V. Chernikov
6ad7446c6f Complete conversions from fib<4|6>_lookup_nh_<basic|ext> to fib<4|6>_lookup().
fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees
 over pointer validness and requiring on-stack data copying.

With no callers remaining, remove fib[46]_lookup_nh_ functions.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25445
2020-07-02 21:04:08 +00:00
Ryan Moeller
e5539fb618 libifconfig: Add function to get bridge status
The new function operates similarly to ifconfig_lagg_get_lagg_status and
likewise is accompanied by a function to free the bridge status data structure.

I have included in this patch the relocation of some strings describing STP
parameters and the PV2ID macro from ifconfig into net/if_bridgevar.h as they
are useful for consumers of libifconfig.

Reviewed by:	kp, melifaro, mmacy
Approved by:	mmacy (mentor)
MFC after:	1 week
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D25460
2020-07-01 02:32:41 +00:00
Vincenzo Maffione
9503233f87 iflib: fix compilation issue introduced in r362621
The ifp local variable is useful even without netmap
and altq, as it is used to check for IFF_DRV_RUNNING.

MFC after:	2 weeks
2020-06-25 20:43:21 +00:00
Vincenzo Maffione
d8b2d26b15 iflib: netmap: add support for partial ring openings
Reviewed by:	gallatin
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D25254
2020-06-25 19:44:24 +00:00
Vincenzo Maffione
88a688663a iflib: netmap: add per-tx-queue netmap support
Reviewed by:	gallatin
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D25253
2020-06-25 19:35:43 +00:00
Vincenzo Maffione
0ff2126795 iflib: netmap: fix rsync index overrun
In the current iflib_netmap_rxsync, there is nothing that prevents
kring->nr_hwtail to overrun kring->nr_hwcur during the descriptor
import phase. This may cause errors in netmap applications, such as:

em1 RX0: fail 'head < kring->nr_hwcur || head > kring->nr_hwtail'
    h 795 c 795 t 282 rh 795 rc 795 rt 282 hc 282 ht 282

Reviewed by:	gallatin
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25252
2020-06-23 20:23:56 +00:00
Tycho Nightingale
c774294c57 To avoid a startup script race change net.bpf.optimize_writers from
CTLFLAG_RW to CTLFLAG_RWTUN to allow it to be modified by a loader
tunable.

Sponsored by:	Dell EMC Isilon
2020-06-23 13:57:53 +00:00
Matt Macy
9aeca21324 iflib: fix cloneattach fail and generalize pseudo device handling
- a cloneattach failure will not currently be handled correctly,
  jump to the right target

- pseudo devices are all treat as if they're ethernet devices -
  this often doesn't make sense

MFC after:	1 week
Sponsored by:	Netgate, Inc.
Differential Revision:	https://reviews.freebsd.org/D25083
2020-06-21 22:02:49 +00:00
Pawel Biernacki
9daf71541c net.link.generic.ifdata.<ifindex>.linkspecific: rework handler
This OID was added in r17352 but the write path of IFDATA_LINKSPECIFIC
seems unused as there are no in-base writers, and as far as I can tell
we had issues with this code before, see PR 219472.  Drop the write path
to make the handler read-only as described in comments and man-pages.
It can be marked as MPSAFE now.

Reviewed by:	bdragon, kib, melifaro, wollman
Approved by:	kib (mentor)
Sponsored by:	Mysterious Code Ltd.
Differential Revision:	https://reviews.freebsd.org/D25348
2020-06-21 18:40:17 +00:00
Vincenzo Maffione
0a182b4c63 iflib: netmap: enter/exit netmap mode after device stops
Avoid possible race conditions by calling nm_set_native_flags()
and nm_clear_native_flags() only after the device has been
stopped.

MFC after:	1 week
2020-06-14 21:07:12 +00:00
Ravi Pokala
2a73c8f5e1 Decode the "LACP Fast Timeout" LAGG option flag
r286700 added the "lacp_fast_timeout" option to `ifconfig', but we forgot to
include the new option in the string used to decode the option bits. Add
"LACP_FAST_TIMO" to LAGG_OPT_BITS.

Also, s/LAGG_OPT_LACP_TIMEOUT/LAGG_OPT_LACP_FAST_TIMO/g , to be clearer that
the flag indicates "Fast Timeout" mode.

Reported by:	Greg Foster <gfoster at panasas dot com>
Reviewed by:	jpaetzel
MFC after:	1 week
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D25239
2020-06-11 22:46:08 +00:00
Alexander V. Chernikov
a287a973e3 Switch rtsock code to using newly-create rib_action() KPI call.
This simplifies the code and allows to further split rtentry and nexthop,
 removing one of the blockers for multipath code introduction, described in
 D24141.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D25192
2020-06-10 07:46:22 +00:00
Vincenzo Maffione
e136e9c88f iflib: netmap: honor netmap_irx_irq return values
In the receive interrupt routine, always call netmap_rx_irq().
The latter function will return != NM_IRQ_PASS if netmap is not
active on that specific receive queue, so that the driver can go
on with iflib_rxeof(). Note that netmap supports partial opening,
where only a subset of the RX or TX rings can be open in netmap mode.
Checking the IFCAP_NETMAP flag is not enough to make sure that the
queue is indeed in netmap mode.
Moreover, in case netmap_rx_irq() returns NM_IRQ_RESCHED, it means
that netmap expects the driver to call netmap_rx_irq() again as soon
as possible. Currently, this may happen when the device is attached
to a VALE switch.

Reviewed by:	gallatin
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D25167
2020-06-09 19:15:43 +00:00
John Baldwin
00a4311adc Refer to AES-CBC as "aes-cbc" rather than "rijndael-cbc" for IPsec.
At this point, AES is the more common name for Rijndael128.  setkey(8)
will still accept the old name, and old constants remain for
compatiblity.

Reviewed by:	cem, bcr (manpages)
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D24964
2020-06-04 22:58:37 +00:00
Andrey V. Elsukov
dd4490fdab Add if_reassing method to all tunneling interfaces.
After r339550 tunneling interfaces have started handle appearing and
disappearing of ingress IP address on the host system.
When such interfaces are moving into VNET jail, they lose ability to
properly handle ifaddr_event_ext event. And this leads to need to
reconfigure tunnel to make it working again.

Since moving an interface into VNET jail leads to removing of all IP
addresses, it looks consistent, that tunnel configuration should also
be cleared. This is what will do if_reassing method.

Reported by:	John W. O'Brien <john saltant com>
MFC after:	1 week
2020-06-03 13:02:31 +00:00
Alexander V. Chernikov
41e66f4eca Add rib subscription API.
Currently there is no easy way of subscribing for the routing table changes.
The only existing way is to set ifa_rtrequest callback in the each protocol
 ifaddr, which is not convenient or extandable.

This change provides generic notification subscription mechanism, that will
 replace current ifa_rtrequest one and allow other applications such as
 accelerated routing lookup modules subscribe for the changes.

In particular, this change provides 2 hooks: 1) synchronous one
 (RIB_NOTIFY_IMMEDIATE), called under RIB_WLOCK, which ensures exact
 ordering of the changes and 2) async one, (RIB_NOTIFY_DELAYED)
 that is called after the change w/o holding locks. The latter one does not
 provide any notification ordering guarantee.

Differential Revision:  https://reviews.freebsd.org/D25070
2020-06-01 21:52:24 +00:00
Alexander V. Chernikov
46cc6153d4 Finish r361706: add sys/net/route/route_ctl.h, missed in previous commit. 2020-06-01 21:51:20 +00:00
Alexander V. Chernikov
da187ddb3d * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D25067
2020-06-01 20:49:42 +00:00
Alexander V. Chernikov
e7403d0230 Revert r361704, it accidentally committed merged D25067 and D25070. 2020-06-01 20:40:40 +00:00
Alexander V. Chernikov
79674562b8 * Add rib_<add|del|change>_route() functions to manipulate the routing table.
The main driver for the change is the need to improve notification mechanism.
Currently callers guess the operation data based on the rtentry structure
 returned in case of successful operation result. There are two problems with
 this appoach. First is that it doesn't provide enough information for the
 upcoming multipath changes, where rtentry refers to a new nexthop group,
 and there is no way of guessing which paths were added during the change.
 Second is that some rtentry fields can change during notification and
 protecting from it by requiring customers to unlock rtentry is not desired.

Additionally, as the consumers such as rtsock do know which operation they
 request in advance, making explicit add/change/del versions of the functions
 makes sense, especially given the functions don't share a lot of code.

With that in mind, introduce rib_cmd_info notification structure and
 rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer.
 It will be used in upcoming generalized notifications.

* Move definitions of the new functions and some other functions/structures
 used for the routing table manipulation to a separate header file,
 net/route/route_ctl.h. net/route.h is a frequently used file included in
 ~140 places in kernel, and 90% of the users don't need these definitions.

Reviewed by:	ae
Differential Revision: https://reviews.freebsd.org/D25067
2020-06-01 20:32:02 +00:00
Matt Macy
1f93e931d9 Fix panics when using iflib pseudo device support
Reviewed by:	gallatin@, hselasky@
MFC after:	1 week
Sponsored by:	Netgate, Inc.
Differential Revision:	https://reviews.freebsd.org/D23710
2020-05-31 18:42:00 +00:00
John Baldwin
28d2a72bbf Consistently include opt_ipsec.h for consumers of <netipsec/ipsec.h>.
This fixes ipsec.ko to include all of IPSEC_DEBUG.

Reviewed by:	imp
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25046
2020-05-29 19:22:40 +00:00
Alexander V. Chernikov
cb86ca48bf Unlock rtentry before calling for epoch(9) destruction as the destruction
may happen immediately, leading to panic.

Reported by:	bdragon
2020-05-28 07:23:27 +00:00
Alexander V. Chernikov
4d2c2509f2 Move <add|del|change>_route() functions to route_ctl.c in preparation of
multipath control plane changed described in D24141.

Currently route.c contains core routing init/teardown functions, route table
 manipulation functions and various helper functions, resulting in >2KLOC
 file in total. This change moves most of the route table manipulation parts
 to a dedicated file, simplifying planned multipath changes and making
 route.c more manageable.

Differential Revision:	https://reviews.freebsd.org/D24870
2020-05-23 19:06:57 +00:00
Alexander V. Chernikov
a82f62ec2d Remove refcounting from rtentry.
After making rtentry reclamation backed by epoch(9) in r361409, there is
 no reason in keeping reference counting code.

Differential Revision:	https://reviews.freebsd.org/D24867
2020-05-23 12:15:47 +00:00
Alexander V. Chernikov
2bbab0af6d Use epoch(9) for rtentries to simplify control plane operations.
Currently the only reason of refcounting rtentries is the need to report
 the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
 is used by some of the customers to get the matching prefix along
 with nexthops, in more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
 nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision:	https://reviews.freebsd.org/D24866
2020-05-23 10:21:02 +00:00
Pawel Biernacki
9033ad5f15 sysctl: fix setting net.isr.dispatch during early boot
Fix another collateral damage of r357614: netisr is initialised way before
malloc() is available hence it can't use sysctl_handle_string() that
allocates temporary buffer.  Handle that internally in
sysctl_netisr_dispatch_policy().

PR:		246114
Reported by:	delphij
Reviewed by:	kib
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D24858
2020-05-16 17:05:44 +00:00
Mark Johnston
c1be839971 pf: Add a new zone for per-table entry counters.
Right now we optionally allocate 8 counters per table entry, so in
addition to memory consumed by counters, we require 8 pointers worth of
space in each entry even when counters are not allocated (the default).

Instead, define a UMA zone that returns contiguous per-CPU counter
arrays for use in table entries.  On amd64 this reduces sizeof(struct
pfr_kentry) from 216 to 160.  The smaller size also results in better
slab efficiency, so memory usage for large tables is reduced by about
28%.

Reviewed by:	kp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24843
2020-05-16 00:28:12 +00:00
Kyle Evans
c79cee7136 kernel: provide panicky version of __unreachable
__builtin_unreachable doesn't raise any compile-time warnings/errors on its
own, so problems with its usage can't be easily detected. While it would be
nice for this situation to change and compilers to at least add a warning
for trivial cases where local state means the instruction can't be reached,
this isn't the case at the moment and likely will not happen.

This commit adds an __assert_unreachable, whose intent is incredibly clear:
it asserts that this instruction is unreachable. On INVARIANTS builds, it's
a panic(), and on non-INVARIANTS it expands to  __unreachable().

Existing users of __unreachable() are converted to __assert_unreachable,
to improve debuggability if this assumption is violated.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D23793
2020-05-13 18:07:37 +00:00
Warner Losh
83b4342743 Kill trailing newline while I'm here... 2020-05-12 23:46:52 +00:00
Alexander V. Chernikov
4a6ee281d9 Remove unused rnh_close callback from rtable & cleanup depends.
rnh_close callbackes was used by the in[6]_clsroute() handlers,
 doing cleanup in the route cloning code. Route cloning was eliminated
 somewhere around r186119. Last callback user was eliminated in r186215,
 11 years ago.

Differential Revision:	https://reviews.freebsd.org/D24793
2020-05-11 06:09:18 +00:00
Alexander V. Chernikov
d223372545 Remove rtalloc1(_fib) KPI.
Last user of rtalloc1() KPI has been eliminated in rS360631.
As kernel is now fully switched to use new routing KPI defined in
rS359823, remove old lookup functions.

Differential Revision:	https://reviews.freebsd.org/D24776
2020-05-10 09:34:48 +00:00
Alexander V. Chernikov
656442a718 Embed dst sockaddr into rtentry and remove rte packet counter
Currently each rtentry has dst&gateway allocated separately from another zone,
 bloating cache accesses.

Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning,
 leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64).
Fields needed for the datapath are destination sockaddr and rt_nhop.

So far it doesn't look like there is other routable addressing protocol other
 than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes.
With that in mind, embed dst into struct rtentry, making the first 24 bytes
 of rtentry within 128 bytes. That is enough to make IPv6 address within first
 128 bytes.

It is still pretty easy to add code for supporting separately-allocated dst,
 however it doesn't make a lot of sense in having such code without a use case.

As rS359823 moved the gateway to the nexthop structure, the dst embedding change
 removes the need for any additional allocations done by rt_setgate().

Lastly, as a part of cleanup, remove counter(9) allocation code, as this field
 is not used in packet processing anymore.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24669
2020-05-08 21:06:10 +00:00
Alexander V. Chernikov
682b902daf Add rib_lookup() sockaddr lookup wrapper and make ifa_ifwithroute use it.
Create rib_lookup() wrapper around per-af dataplane lookup functions.
This will help in the cases of having control plane af-agnostic code.

Switch ifa_ifwithroute() to use this function instead of rtalloc1().

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24731
2020-05-07 08:11:36 +00:00
Alexander V. Chernikov
d94be4ccaa Switch DDB show route to direct rnh_matchaddr() call instead of rtalloc1().
Eliminate the last rtalloc1() call to finish transition to the new routing
KPI defined in r359823.

Differential Revision:	https://reviews.freebsd.org/D24663
2020-05-04 15:07:57 +00:00
Alexander V. Chernikov
4f08f052ad Simplify address parsing in DDB show route command.
Use db_get_line() to overcome parser limitation.

Differential Revision:	https://reviews.freebsd.org/D24662
2020-05-04 15:00:19 +00:00
Alexander V. Chernikov
9e02229580 Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields.
After converting routing subsystem customers to use nexthop objects
 defined in r359823, some fields in struct rtentry became unused.

This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry
 along with the code initializing and updating these fields.

Cleanup of the remaining fields will be addressed by D24669.

This commit also changes the implementation of the RTM_CHANGE handling.
Old implementation tried to perform the whole operation under radix WLOCK,
 resulting in slow performance and hacks like using RTF_RNH_LOCKED flag.
New implementation looks up the route nexthop under radix RLOCK, creates new
 nexthop and tries to update rte nhop pointer. Only last part is done under
 WLOCK.
In the hypothetical scenarious where multiple rtsock clients
 repeatedly issue RTM_CHANGE requests for the same route, route may get
 updated between read and update operation. This is addressed by retrying
 the operation multiple (3) times before returning failure back to the
 caller.

Differential Revision:	https://reviews.freebsd.org/D24666
2020-05-04 14:31:45 +00:00
Mark Johnston
814fa34dfb Increase the iflib txq callout mutex name length to 32 bytes.
With a length of 16, the name ("<if name>:TX(<qid>):callout") typically
gets truncated.

PR:		245712
Reported by:	ghuckriede@blackberry.com
MFC after:	1 week
2020-04-30 15:39:04 +00:00
Alexander V. Chernikov
8c61eb2107 Convert more rtentry field accesses into nhop fields accesses.
Continue routing subsystem conversion to nhop objects defined in r359823.
Use fields from nhop structure instead of "struct rtentry" fields.
This is one of the last changes prior to removing rt_ifp, rt_ifa,
 rt_gateway and rt_mtu from struct rtentry.

Differential Revision:	https://reviews.freebsd.org/D24609
2020-04-29 21:54:09 +00:00
Alexander V. Chernikov
74787ef47b Add nhop to the ifa_rtrequest() callback.
With the upcoming multipath changes described in D24141,
 rt->rt_nhop can potentially point to a nexthop group instead of
 an individual nhop.
To simplify caller handling of such cases, change ifa_rtrequest() callback
 to pass changed nhop directly.

Differential Revision:	https://reviews.freebsd.org/D24604
2020-04-29 19:28:56 +00:00
Alexander V. Chernikov
fc88ecd31a Move route-specific ddb commands to route/route_ddb.c
Currently functionality resides in rtsock.c, which is a controlling
 interface, partially external to the routing subsystem.
Additionally, DDB-supporting functionality is > 100SLOC, which deserves
 a separate file.

Given that, move this functionality to a newly-created net/route/ subdir.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24561
2020-04-28 20:00:17 +00:00
Alexander V. Chernikov
e7d8af4f65 Move route_temporal.c and route_var.h to net/route.
Nexthop objects implementation, defined in r359823,
 introduced sys/net/route directory intended to hold all
 routing-related code. Move recently-introduced route_temporal.c and
 private route_var.h header there.

Differential Revision:	https://reviews.freebsd.org/D24597
2020-04-28 19:14:09 +00:00
Alexander V. Chernikov
fe6da72759 Move struct rtentry definition to nhop_var.h.
One of the goals of the new routing KPI defined in r359823
 is to entirely hide`struct rtentry` from the consumers.
It will allow to improve routing subsystem internals and deliver
 features much faster.

This is one of the last changes, effectively moving struct rtentry
 definition to a net/route_var.h header, internal to the routing subsystem.

Differential Revision:	https://reviews.freebsd.org/D24580
2020-04-28 18:42:30 +00:00
Alexander V. Chernikov
4043ee3cd7 Convert rtalloc_mpath_fib() users to the new KPI.
New fib[46]_lookup() functions support multipath transparently.
Given that, switch the last rtalloc_mpath_fib() calls to
 dib4_lookup() and eliminate the function itself.

Note: proper flowid generation (especially for the outbound traffic) is a
 bigger topic and will be handled in a separate review.
This change leaves flowid generation intact.

Differential Revision:	https://reviews.freebsd.org/D24595
2020-04-28 08:06:56 +00:00
Alexander V. Chernikov
1b0051bada Eliminate now-unused parts of old routing KPI.
r360292 switched most of the remaining routing customers to a new KPI,
 leaving a bunch of wrappers for old routing lookup functions unused.

Remove them from the tree as a part of routing cleanup.

Differential Revision:	https://reviews.freebsd.org/D24569
2020-04-28 07:25:34 +00:00
Eric Joyner
45818bf1a0 iflib: Stop interface before (un)registering VLAN
This patch is intended to solve a specific problem that iavf(4)
encounters, but what it does can be extended to solve other issues.

To summarize the iavf(4) issue, if the PF driver configures VLAN
anti-spoof, then the VF driver needs to make sure no untagged traffic is
sent if a VLAN is configured, and vice-versa. This can be an issue when
a VLAN is being registered or unregistered, e.g. when a packet may be on
the ring with a VLAN in it, but the VLANs are being unregistered. This
can cause that tagged packet to go out and cause an MDD event.

To fix this, include a new interface-dependent function that drivers can
implement named IFDI_NEEDS_RESTART(). Right now, this function is called
in iflib_vlan_unregister/register() to determine whether the interface
needs to be stopped and started when a VLAN is registered or
unregistered. The default return value of IFDI_NEEDS_RESTART() is true,
so this fixes the MDD problem that iavf(4) encounters, since the
interface rings are flushed during a stop/init.

A future change to iavf(4) will implement that function just in case the
default value changes, and to make it explicit that this interface reset
is required when a VLAN is added or removed.

Reviewed by:	gallatin@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D22086
2020-04-27 22:02:44 +00:00
Alexander V. Chernikov
55f57ca9ac Convert debugnet to the new routing KPI.
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.

Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24555
2020-04-26 18:42:38 +00:00
Kristof Provost
fffd27e5f3 bridge: epoch-ification
Run the bridge datapath under epoch, rather than under the
BRIDGE_LOCK().

We still take the BRIDGE_LOCK() whenever we insert or delete items in
the relevant lists, but we use epoch callbacks to free items so that
it's safe to iterate the lists without the BRIDGE_LOCK.

Tests on mercat5/6 shows this increases bridge throughput significantly,
from 3.7Mpps to 18.6Mpps.

Reviewed by:	emaste, philip, melifaro
MFC after:	2 months
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24250
2020-04-26 16:22:35 +00:00
Alexander V. Chernikov
57e70e4471 Fix userland build broken by r360292. 2020-04-25 09:25:06 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov
aaad3c4fca Convert rtentry field accesses into nhop field accesses.
One of the goals of the new routing KPI defined in r359823 is to entirely
 hide`struct rtentry` from the consumers. It will allow to improve routing
 subsystem internals and deliver more features much faster.

This commit is mostly mechanical change to eliminate direct struct rtentry
 field accesses.

The only notable difference is AF_LINK gateway encoding.

AF_LINK gw is used in routing stack for operations with interface routes
 and host loopback routes.
In the former case it indicates _some_ non-NULL gateway, as the interface
 is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting.
In the latter case the interface index inside gateway was used by the IPv6
 datapath to verify address scope for link-local interfaces.

Kernel uses struct sockaddr_dl for this type of gateway. This structure
 allows for specifying rich interface data, such as mac address and interface
 name. However, this results in relatively large structure size - 52 bytes.
Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside
 in the first 8 bytes of the structure.

In the new KPI, struct nhop_object tries to be cache-efficient, hence
 embodies gateway address inside the structure. In the AF_LINK case it
 stores stortened version of the structure - struct sockaddr_dl_short,
 which occupies 16 bytes. After D24340 changes, the data inside AF_LINK
 gateway will not be used in the kernel at all, leaving rtsock as the only
 potential concern.

The difference in rtsock reporting:

(old)
got message of size 240 on Thu Apr 16 03:12:13 2020
RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

(new)
got message of size 200 on Sun Apr 19 09:46:32 2020
RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED>
locks:  inits:
sockaddrs: <DST,GATEWAY,NETMASK>
 10.0.0.0 link#5 255.255.255.0

Note 40 bytes different (52-16 + alignment).
However, gateway is still a valid AF_LINK gateway with proper data filled in.

It is worth noting that these particular messages (interface routes) are mostly
 ignored by routing daemons:
* bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages.
* quagga/frr ignores routes without gateway

More detailed overview on how rtsock messages are used by the
 routing daemons to reconstruct the kernel view, can be found in D22974.

Differential Revision:	https://reviews.freebsd.org/D24519
2020-04-23 08:04:20 +00:00
Kristof Provost
fac24ad7f0 bridge: Simplify mac address generation
Unconditionally use ether_gen_addr() to generate bridge mac addresses.  This
function is now less likely to generate duplicate mac addresses across jails.
The old hand rolled hostid based code adds no value.

Reviewed by:	bz
Differential Revision:	https://reviews.freebsd.org/D24432
2020-04-18 08:00:58 +00:00
Kristof Provost
3f8bc99c4c ethersubr: Make the mac address generation more robust
If we create two (vnet) jails and create a bridge interface in each we end up
with the same mac address on both bridge interfaces.
These very often conflicts, resulting in same mac address in both jails.

Mitigate this problem by including the jail name in the mac address.

Reviewed by:	kevans, melifaro
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D24383
2020-04-18 07:50:30 +00:00
Alexander V. Chernikov
ae4b62595e Unbreak build by reverting if_bridge part of r360047.
Pointy hat to: melifaro
2020-04-17 18:22:37 +00:00
Alexander V. Chernikov
6745294280 Finish r191148: replace rtentry with route in if_bridge if_output() callback.
Generic if_output() callback signature was modified to use struct route
 instead of struct rtentry in r191148, back in 2009.

Quoting commit message:

 Change if_output to take a struct route as its fourth argument in order
 to allow passing a cached struct llentry * down to L2

Fix bridge_output() to match this signature and update the remaining
 comment in if_var.h.

Reviewed by:	kp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24394
2020-04-17 17:05:58 +00:00
Adrian Chadd
2538a4b1b4 Remove an duplicate definition of nhops_dump_sysctl()
One of the source files included both nhop.h and shared.h, leading to this
clashing.

Tested with: mips-gcc cross toolchain
2020-04-16 23:28:47 +00:00
Alexander V. Chernikov
5cdc484410 Fix userland build broken by r360014. 2020-04-16 17:53:23 +00:00
Alexander V. Chernikov
539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Alexander V. Chernikov
dd4776f0cc Reorganise nd6 notification code to avoid direct rtentry field access.
One of the goals of the new routing KPI defined in r359823 is to entirely hide
 `struct rtentry` from the consumers. Doing so will allow to improve routing
 subsystem internals and deliver features more easily. This change is one of
  the ongoing changes to eliminate direct struct rtentry field accesses.

It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification
 code to avoid accessing most of the rtentry fields.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24404
2020-04-14 22:48:33 +00:00
Alexander V. Chernikov
1968b7eb21 Postpone multipath seed init till SI_SUB_LAST, as it is needed only after
some useland program installs multiple paths to the same destination.

While here, make multipath init conditional.

Discussed with:	cem,ian
2020-04-14 07:38:34 +00:00
Andrew Gallatin
bd673b9942 lagg: stop double-counting output errors and counting drops as errors
Before this change, lagg double-counted errors from lagg members, and counted
every drop by a lagg member as an error.  Eg, if lagg sent a packet, and the
underlying hardware driver dropped it, a counter would be incremented by both
lagg and the underlying driver.

This change attempts to fix that by incrementing lagg's counters only for
errors that do not come from underlying drivers.

Reviewed by:	hselasky, jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24331
2020-04-13 23:06:56 +00:00
Alexander V. Chernikov
a666325282 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2020-04-12 14:30:00 +00:00
Alexander V. Chernikov
62d95afacb Fix build by adding forgotten header to radix_mpath.c after r359797. 2020-04-11 09:38:45 +00:00
Alexander V. Chernikov
4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Alexander V. Chernikov
aef2d5fb0e Split rtrequest1_fib() into smaller manageable chunks.
No functional changes.

* Move route addition / route deletion code from rtrequest1_fib()
  to add_route() and del_route() respectively.
* Rename rtrequest1_fib_change() to change_route() for consistency.
* Shrink the scope of ugly info #defines.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24349
2020-04-10 16:27:27 +00:00