freebsd-dev

Author	SHA1	Message	Date
Kristof Provost	f4096a7c8a	net: make ethernet.h self-contained Reviewed by: imp Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33501	2021-12-17 12:38:35 +01:00
Kristof Provost	c658610b92	pf: make pfvar.h self-contained Ensure that the pfvar.h header can be included without including any other headers. Reviewed by: imp Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33499	2021-12-17 12:38:34 +01:00
Kristof Provost	b29c145cc1	if_stf: make if_stf.h self-contained Ensure that the if_stf.h header can be included without including any other headers. Reviewed by: imp Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33498	2021-12-17 12:38:34 +01:00
Warner Losh	c6df6f5322	Create wrapper for Giant taken for newbus Create a wrapper for newbus to take giant and for busses to take it too. bus_topo_lock() should be called before interacting with newbus routines and unlocked with bus_topo_unlock(). If you need the topology lock for some reason, bus_topo_mtx() will provide that. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D31831	2021-12-09 17:04:45 -07:00
Mateusz Guzik	e735fa3212	net/if.c: plug set-but-not-used vars Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-12-09 20:39:40 +00:00
Gleb Smirnoff	7e0bba4d80	ifnet: make V_if_index static to if.c This requires moving net.link.generic sysctl declaration from if_mib.c to if.c. Ideally if_mib.c needs just to be merged to if.c, but they have different license texts. Differential revision: https://reviews.freebsd.org/D33263	2021-12-06 09:32:31 -08:00
Gleb Smirnoff	d74b7baeb0	ifnet_byindex() actually requires network epoch Sweep over potentially unsafe calls to ifnet_byindex() and wrap them in epoch. Most of the code touched remains unsafe, as the returned pointer is being used after epoch exit. Mark that with a comment. Validate the index argument inside the function, reducing argument validation requirement from the callers and making V_if_index private to if.c. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D33263	2021-12-06 09:32:31 -08:00
Gleb Smirnoff	7b40b00fad	ifnet: merge ifindex_alloc(), ifnet_setbyindex(), if_grow() and call magic Now it is possible to just merge all this complexity into single linear function. Note that IFNET_WLOCK() is a sleepable lock, so we can M_WAITOK and epoch_wait_preempt(). Reviewed by: melifaro, bz, kp Differential revision: https://reviews.freebsd.org/D33262	2021-12-06 09:32:31 -08:00
Gleb Smirnoff	6ff4cac2ee	ifnet: initial if_grow() shall always succeed So let's just call malloc() directly. This also avoids hidden doubling of default V_if_indexlim. Reviewed by: melifaro, bz, kp Differential revision: https://reviews.freebsd.org/D33261	2021-12-06 09:32:31 -08:00
Gleb Smirnoff	450394af27	ifnet: use ck_pr(3) store & load setting ifnet pointer in ifindex The lockless access to the array is protected by the network epoch. Reviewed by: bz, kp Differential revision: https://reviews.freebsd.org/D33260	2021-12-06 09:32:30 -08:00
Gleb Smirnoff	8062e5759c	ifnet: allocate index at the end of if_alloc_domain() Now that if_alloc_domain() never fails and actually doesn't expose ifnet to outside we can eliminate IFNET_HOLD and two step index allocation. Reviewed by: kp Differential revision: https://reviews.freebsd.org/D33259	2021-12-06 09:32:30 -08:00
Gleb Smirnoff	ad2a0aec29	nhop: hash ifnet pointer instead of if_index Yet another problem created by VIMAGE/if_vmove/epair design that relocates ifnet between vnets and changes if_index. Since if_index changes, nhop hash values also changes, unlink_nhop() isn't able to find entry in hash and leaks the nhop. Since nhop references ifnet, the latter is also leaked. As result running network tests leaks memory on every single test that creates vnet jail. While here, rewrite whole hash_priv() to use static initializer, per Alexander's suggestion. Reviewed by: melifaro	2021-12-04 10:05:46 -08:00
Kristof Provost	6d4baa0d01	if_pflog: fix packet length There were two issues with the new pflog packet length. The first is that the length is expected to be a multiple of sizeof(long), but we'd assumed it had to be a multiple of sizeof(uint32_t). The second is that there's some broken software out there (such as Wireshark) that makes incorrect assumptions about the amount of padding. That is, Wireshark assumes there's always three bytes of padding, rather than however much is needed to get to a multiple of sizeof(long). Fix this by adding extra padding, and a fake field to maintain Wireshark's assumption. Reported by: Ozkan KIRIK <ozkan.kirik@gmail.com> Tested by: Ozkan KIRIK <ozkan.kirik@gmail.com> MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33236	2021-12-04 08:42:55 +01:00
Cy Schubert	db0ac6ded6	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit `266f97b5e9`, reversing changes made to `a10253cffe`. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.	2021-12-02 14:45:04 -08:00
Cy Schubert	266f97b5e9	wpa: Import wpa_supplicant/hostapd commit 14ab4a816 This is the November update to vendor/wpa committed upstream 2021-11-26. MFC after: 1 month	2021-12-02 13:35:14 -08:00
Gleb Smirnoff	9e93d2b335	ifnet: enable & fix if_debug build Fixes: `ce40632a31`	2021-12-02 10:59:43 -08:00
Gleb Smirnoff	93c67567e0	Remove "options PCBGROUP" With upcoming changes to the inpcb synchronisation it is going to be broken. Even its current status after the move of PCB synchronization to the network epoch is very questionable. This experimental feature was sponsored by Juniper but ended never to be used in Juniper and doesn't exist in their source tree [sjg@, stevek@, jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@]. I'm up to resurrecting it back if there is any interest from anybody. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33020	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	1cec1c5831	Allow to compile RSS without PCBGROUP. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33019	2021-12-02 10:48:48 -08:00
Zhenlei Huang	73d41cc730	if_epair: Also mark the flag of pair b with IFF_KNOWSEPOCH Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D33210	2021-12-01 15:54:23 +01:00
Kristof Provost	439da7f06d	if_stf: KASAN fix In in_stf_input() we grabbed a pointer to the IPv4 header and later did an m_pullup() before we look at the IPv6 header. However, m_pullup() could rearrange the mbuf chain and potentially invalidate the pointer to the IPv4 header. Avoid this issue by copying the IP header rather than getting a pointer to it. Reported by: markj, Jenkins (KASAN job) Reviewed by: markj MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33192	2021-11-30 17:35:15 +01:00
Mateusz Guzik	2cedfc3f7e	if_epair: ifdef vars only used with ALTQ Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-24 21:28:54 +00:00
Gleb Smirnoff	3bc40f39fd	if_free: add a comment explaining why ifindex_free() is performed here	2021-11-22 19:59:27 -08:00
Gleb Smirnoff	fe499a8452	ifnet: merge if_destroy() and if_free_internal() into one New function has more meaningful name if_free_deferred() and has its header comment fixed to reflect reality. NFC	2021-11-22 19:53:12 -08:00
Gleb Smirnoff	4787572d05	ifnet: make if_alloc_domain() never fail The last consumer of if_com_alloc() is firewire. It never fails to allocate. Most likely the if_com_alloc() KPI will go away together with if_fwip(), less likely new consumers of if_com_alloc() will be added, but they would need to follow the no fail KPI.	2021-11-22 19:49:57 -08:00
Gleb Smirnoff	1e3ca25d92	ifnet: make if_alloc_domain() static	2021-11-22 19:49:57 -08:00
Gleb Smirnoff	ce40632a31	ifnet: append if_debug.c to if.c With this change if_index can become static. There is nothing that if_debug.c would want to isolate from if.c. Potentially if.c wants to share everything with if_debug.c. Move Bjoern's copyright to if.c. Reviewed by: bz	2021-11-22 19:49:57 -08:00
Gleb Smirnoff	8a6f38c8ac	ifnet: garbage collect drbr_*_drv(). They were left in `62d76917b8` but after years proved not to be useful.	2021-11-22 19:49:57 -08:00
Kristof Provost	b46512f704	if_stf: add dtrace probe points Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33038	2021-11-20 19:29:01 +01:00
Kristof Provost	19dc644511	if_stf: add 6rd support Implement IPv6 Rapid Deployment (RFC5969) on top of the existing 6to4 (RFC3056) if_stf code. PR: 253328 Reviewed by: hrs Obtained from: pfSense Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33037	2021-11-20 19:29:01 +01:00
Kristof Provost	3142d4f622	lagg: fix unused-but-set-variable MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-19 22:01:27 +01:00
Andriy Gapon	1bfdb812c7	iflib_stop: drain rx tasks to prevent any data races iflib_stop modifies iflib data structures that are used by _task_fn_rx, most prominently the free lists. So, iflib_stop has to ensure that the rx task threads are not active. This should help to fix a crash seen when iflib_if_ioctl (e.g., SIOCSIFCAP) is called while there is already traffic flowing. The crash has been seen on VMWare guests with vmxnet3 driver. My guess is that on physical hardware the couple of 1ms delays that iflib_stop has after disabling interrupts are enough for the queued work to be completed before any iflib state is touched. But on busy hypervisors the guests might not get enough CPU time to complete the work, thus there can be a race between the taskqueue threads and the work done to handle an ioctl, specifically in iflib_stop and iflib_init_locked. PR: 259458 Reviewed by: markj MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D32926	2021-11-19 10:00:38 +02:00
Kristof Provost	8e492101ec	pf: add COMPAT_FREEBSD13 for DIOCKEEPCOUNTERS DIOCKEEPCOUNTERS used to overlap with DIOCGIFSPEEDV0, which has been fixed in 14, but remains in stable/12 and stable/13. Support the old, overlapping, call under COMPAT_FREEBSD13. Reviewed by: jhb Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D33001	2021-11-17 03:09:20 +01:00
Mateusz Guzik	79554f2b6c	net: whack "set but not used" warnings in net/rtsock.c ... except for one where the error is ignored. Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-14 17:20:46 +00:00
Mateusz Guzik	c681cce925	net: whack "set but not used" warnings in net/pfil.c Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-14 17:19:58 +00:00
Mateusz Guzik	5a4e46f6ec	net: whack "set but not used" warnings in net/if.c Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-14 17:15:08 +00:00
Kristof Provost	047c4e365d	pf: renumber DIOCKEEPCOUNTERS We accidentally had two ioctls use the same base number (DIOCKEEPCOUNTERS and DIOCGIFSPEEDV{0,1}). We get away with that on most platforms because the size of the argument structures is different. This does break CHERI, and is generally a bad idea anyway. Renumber to avoid this collision. Reported by: jhb	2021-11-14 15:36:59 +01:00
Kristof Provost	8e45fed3ae	if_stf: enable use in vnet jails The cloner must be per-vnet so that cloned interfaces get destroyed when the vnet goes away. Otherwise we fail assertions in vnet_if_uninit(): panic: vnet_if_uninit:475 tailq &V_ifnet=0xfffffe01665fe070 not empty cpuid = 19 time = 1636107064 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015d0cac60 vpanic() at vpanic+0x187/frame 0xfffffe015d0cacc0 panic() at panic+0x43/frame 0xfffffe015d0cad20 vnet_if_uninit() at vnet_if_uninit+0x7b/frame 0xfffffe015d0cad30 vnet_destroy() at vnet_destroy+0x170/frame 0xfffffe015d0cad60 prison_deref() at prison_deref+0x9b0/frame 0xfffffe015d0cadd0 sys_jail_remove() at sys_jail_remove+0x119/frame 0xfffffe015d0cae00 amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe015d0caf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe015d0caf30 --- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x8011e920a, rsp = 0x7fffffffe788, rbp = 0x7fffffffe810 --- KDB: enter: panic MFC after: 3 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32849	2021-11-09 09:39:53 +01:00
Kristof Provost	3576121c8b	if_stf: style(9) pass As stated in style(9): "Values in return statements should be enclosed in parentheses." MFC after: 3 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32848	2021-11-09 09:39:53 +01:00
Kristof Provost	8ca6c11a7c	if_gif: fix vnet shutdown panic If an if_gif exists and has an address assigned inside a vnet when the vnet is shut down we failed to clean up the address, leading to a panic when we ip_destroy() and the V_in_ifaddrhashtbl is not empty. This happens because of the VNET_SYS(UN)INIT order, which means we destroy the if_gif interface before the addresses can be purged (and if_detach() does not remove addresses, it assumes this will be done by the stack teardown code). Set subsystem SI_SUB_PSEUDO just like if_bridge so the cleanup operations happen in the correct order. MFC after: 3 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32835	2021-11-08 12:00:00 +01:00
Wojciech Macek	acdfc09639	lagg: update capabilities on SIOCSIFMTU Some NICs might have limited capabilities when Jumbo frames are used. For example, some neta interfaces only support TX csum offload when the packet size is lower than a value specified in DT. Fix it by re-reading capabilities of children interfaces after MTU has been successfully changed. Found by: Jerome Tomczyk <jerome.tomczyk@stormshield.eu> Reviewed by: jhb Obtained from: Semihalf Sponsored by: Stormshield Differential revision: https://reviews.freebsd.org/D32724	2021-11-06 10:43:08 +01:00
Kristof Provost	76c5eecc34	pf: Introduce ridentifier Allow users to set a number on rules which will be exposed as part of the pflog header. The intent behind this is to allow users to correlate rules across updates (remember that pf rules continue to exist and match existing states, even if they're removed from the active ruleset) and pflog. Obtained from: pfSense MFC after: 3 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32750	2021-11-05 09:39:56 +01:00
Bjoern A. Zeeb	1a8f198fa6	epair: remove "All rights reserved" Remove "All rights reserved" from The FreeBSD Foundation owned copyrights on epair code and documentation. Approved by: emaste (FreeBSD Foundation)	2021-11-02 16:50:26 +00:00
Bjoern A. Zeeb	3dd5760aa5	if_epair: rework Rework if_epair(4) to no longer use netisr and dpcpu. Instead use mbufq and swi_net. This simplifies the code and seems to make it work better and no longer hang. Work largely by bz@, with minor tweaks by kp@. Reviewed by: bz, kp MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D31077	2021-11-02 09:23:46 +01:00
Mateusz Guzik	8f3d786cb3	pf: remove the flags argument from pf_unlink_state All consumers call it with PF_ENTER_LOCKED. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-11-01 20:59:14 +01:00
Kristof Provost	62d2dcafb7	if_epair: delete mbuf tags Remove all (non-persistent) tags when we transmit a packet. Real network interfaces do not carry any tags either, and leaving tags attached can produce unexpected results. Reviewed by: bz, glebius MFC after: 3 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32663	2021-10-28 10:41:16 +02:00
Mark Johnston	426682b05a	bpf: Fix the write filter for detached descriptors A BPF descriptor only has an associated interface descriptor once it is attached to an interface, e.g., with BIOCSETIF. Avoid dereferencing a NULL pointer in filt_bpfwrite() if the BPF descriptor is not attached. Reviewed by: ae Reported by: syzbot+ae45d5166afe15a5a21d@syzkaller.appspotmail.com Fixes: `ded77e0237` ("Allow the BPF to be select for write.") Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32561	2021-10-26 10:00:39 -04:00
Gleb Smirnoff	c8ee75f231	Use network epoch to protect local IPv4 addresses hash. The modification to the hash are already naturally locked by in_control_sx. Convert the hash lists to CK lists. Remove the in_ifaddr_rmlock. Assert the network epoch where necessary. Most cases when the hash lookup is done the epoch is already entered. Cover a few cases, that need entering the epoch, which mostly is initial configuration of tunnel interfaces and multicast addresses. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D32584	2021-10-22 14:40:53 -07:00
Gleb Smirnoff	6aae3517ed	Retire synchronous PPP kernel driver sppp(4). The last two drivers that required sppp are cp(4) and ce(4). These devices are still produced and can be purchased at Cronyx <http://cronyx.ru/hardware/wan.html>. Since Roman Kurakin <rik@FreeBSD.org> has quit them, they no longer support FreeBSD officially. Later they have dropped support for Linux drivers too. As of mid-2020 they don't even have a developer to maintain their Windows driver. However, their support verbally told me that they could provide aid to a FreeBSD developer with documentation in case if there appears a new customer for their devices. These drivers have a feature to not use sppp(4) and create an interface, but instead expose the device as netgraph(4) node. Then, you can attach ng_ppp(4) with help of ports/net/mpd5 on top of the node and get your synchronous PPP. Alternatively you can attach ng_frame_relay(4) or ng_cisco(4) for HDLC. Actually, last time I used cp(4) back in 2004, using netgraph(4) instead of sppp(4) was already the right way to do. Thus, remove the sppp(4) related part of the drivers and enable by default the netgraph(4) part. Further maintenance of these drivers in the tree shouldn't be a big deal. While doing that, remove some cruft and enable cp(4) compilation on amd64. The ce(4) for some unknown reason marks its internal DDK functions with __attribute__ fastcall, which most likely is safe to remove, but without hardware I'm not going to do that, so ce(4) remains i386-only. Reviewed by: emaste, imp, donner Differential Revision: https://reviews.freebsd.org/D32590 See also: https://reviews.freebsd.org/D23928	2021-10-22 11:41:36 -07:00
Gleb Smirnoff	2144431c11	Remove in_ifaddr_lock acquisition to access in_ifaddrhead. An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to protect against use after free. Next step would be to CK-ify the in_addr hash and get rid of the... Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D32434	2021-10-13 10:04:46 -07:00
Hartmut Brandt	ded77e0237	Allow the BPF to be select for write. This is needed for boost:asio which otherwise fails to handle BPFs. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D31967	2021-10-10 17:03:51 +02:00
Alexander V. Chernikov	7e64580b5f	routing: Use the same index space for both nexthop and nexthop groups. This simplifies userland object handling along with kernel-level nexthop handling in fib algo framework. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D32342	2021-10-08 07:58:55 +00:00
Kristof Provost	76c2e71c4c	pf: remove unused field from pf_kanchor The 'match' field is only used in the userspace version of the struct (pf_anchor). MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-10-07 19:50:22 +02:00
Kristof Provost	5062afff9d	pfctl: userspace adaptive syncookies configuration Hook up the userspace bits to configure syncookies in adaptive mode. MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D32136	2021-09-29 15:11:54 +02:00
Kristof Provost	bf8637181a	pf: implement adaptive mode Use atomic counters to ensure that we correctly track the number of half open states and syncookie responses in-flight. This determines if we activate or deactivate syncookies in adaptive mode. MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D32134	2021-09-29 15:11:54 +02:00
Kristof Provost	63b3c1c770	pf: support dummynet Allow pf to use dummynet pipes and queues. We re-use the currently unused IPFW_IS_DUMMYNET flag to allow dummynet to tell us that a packet is being re-injected after being delayed. This is needed to avoid endlessly looping the packet between pf and dummynet. MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31904	2021-09-24 11:41:25 +02:00
Arnaud Ysmal	0b92a7fe47	LACP: Do not wait response for marker messages not sent The error returned when a marker message can not be emitted on a port is not handled. This causes the lacp to block all emissions until the timeout of 3 seconds is reached. To fix this issue, I just clear the LACP_PORT_MARK flag when the packet could not be emitted. Differential revision: https://reviews.freebsd.org/D30467 Obtained from: Stormshield	2021-09-23 10:57:11 +02:00
John Baldwin	c782ea8bb5	Add a switch structure for send tags. Move the type and function pointers for operations on existing send tags (modify, query, next, free) out of 'struct ifnet' and into a new 'struct if_snd_tag_sw'. A pointer to this structure is added to the generic part of send tags and is initialized by m_snd_tag_init() (which now accepts a switch structure as a new argument in place of the type). Previously, device driver ifnet methods switched on the type to call type-specific functions. Now, those type-specific functions are saved in the switch structure and invoked directly. In addition, this more gracefully permits multiple implementations of the same tag within a driver. In particular, NIC TLS for future Chelsio adapters will use a different implementation than the existing NIC TLS support for T6 adapters. Reviewed by: gallatin, hselasky, kib (older version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31572	2021-09-14 11:43:41 -07:00
Mark Johnston	b1746faad6	debugnet: Include some required headers Don't depend on pollution from net/vnet.h. PR: 258496 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-09-14 11:02:45 -04:00
Kristof Provost	b64f7ce98f	pf: qid and pqid can be uint16_t tag2name() returns a uint16_t, so we don't need to use uint32_t for the qid (or pqid). This reduces the size of struct pf_kstate slightly. That in turn buys us space to add extra fields for dummynet later. Happily these fields are not exposed to user space (there are user space versions of them, but they can just stay uint32_t), so there's no ABI breakage in modifying this. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31873	2021-09-10 17:07:57 +02:00
Mark Johnston	b1e6a792d6	net: Enter a net epoch around protocol if_up/down notifications When traversing a list of interface addresses, we need to be in a net epoch section, and protocol ctlinput routines need a stable reference to the address. Reported by: syzbot+3219af764ead146a3a4e@syzkaller.appspotmail.com Reviewed by: kp, melifaro MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31889	2021-09-10 09:07:40 -04:00
Alexander V. Chernikov	4b631fc832	routing: fix source address selection rules for IPv4 over IPv6. Current logic always selects an IFA of the same family from the outgoing interfaces. In IPv4 over IPv6 setup there can be just single non-127.0.0.1 ifa, attached to the loopback interface. Create a separate rt_getifa_family() to handle entire ifa selection for the IPv4 over IPv6. Differential Revision: https://reviews.freebsd.org/D31868 MFC after: 1 week	2021-09-07 21:41:05 +00:00
Kristof Provost	bb25e36e13	pf: remove unused function prototype MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-09-07 16:38:49 +02:00
Kristof Provost	312f5f8a4f	altq: mark callouts as mpsafe There's no reason to acquire the Giant lock while executing the ALTQ callouts. While here also remove a few backwards compatibility defines for long obsolete FreeBSD versions. Reviewed by: mav Suggested by: mav Differential Revision: https://reviews.freebsd.org/D31835	2021-09-04 17:26:10 +02:00
Kristof Provost	4cab80a8df	pf: Add counters for syncookies Count when we send a syncookie, receive a valid syncookie or detect a synflood. Reviewed by: kbowling MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31713	2021-09-01 12:02:19 +02:00
Alexander V. Chernikov	0a3a377aee	routing: Disallow zero nexthop weights in nexthop groups. Adding such nexthops breaks calc_min_mpath_slots() assumptions, thus resulting in the incorrect nexthop group creation and eventually leading to panic. Reported by: avg MFC after: 1 week	2021-09-01 07:16:24 +00:00
Alexander V. Chernikov	639d7abec6	routing: simplify malloc flags in alloc_nhgrp(). MFC after: 1 week	2021-08-31 08:14:16 +00:00
Alexander V. Chernikov	f84c30106e	routing: Fix newly-added rt_get_inet[6]_parent() api. Correctly handle the case when no default route is present. Reported by: Konrad <konrad.kreciwilk at korbank.pl>	2021-08-30 21:10:37 +00:00
Alexander V. Chernikov	d98954e229	routing: Bring back the ability to specify transmit interface via its name. Some software references outgoing interfaces by specifying name instead of index. Use rti_ifp from rt_addrinfo if provided instead of always using address interface when constructing nexthop. PR: 255678 Reported by: martin.larsson2 at gmail.com MFC after: 1 week	2021-08-29 20:05:14 +00:00
Kristof Provost	2b10cf85f8	pf: Introduce nvlist variant of DIOCGETSTATUS Make it possible to extend the GETSTATUS call (e.g. when we want to add new counters, such as for syncookie support) by introducing an nvlist-based alternative. MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31694	2021-08-29 14:59:04 +02:00
Luiz Otavio O Souza	eb680a63de	if_bridge: add ALTQ support Similar to the recent addition of ALTQ support to if_vlan. Reviewed by: donner Obtained from: pfsense MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31675	2021-08-26 11:23:44 +02:00
Luiz Otavio O Souza	2e5ff01d0a	if_vlan: add the ALTQ support to if_vlan. Inspired by the iflib implementation, allow ALTQ to be used with if_vlan interfaces. Reviewed by: donner Obtained from: pfsense MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31647	2021-08-25 08:56:45 +02:00
Kristof Provost	159258afb5	altq: Fix panics on rmc_restart() rmc_restart() is called from a timer, but can trigger traffic. This means the curvnet context will not be set. Use the vnet associated with the interface we're currently processing to set it. We also have to enter net_epoch here, for the same reason. Reviewed by: mjg MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31642	2021-08-23 21:35:41 +02:00
Zhenlei Huang	62e1a437f3	routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549). Implement kernel support for RFC 5549/8950. * Relax control plane restrictions and allow specifying IPv6 gateways for IPv4 routes. This behavior is controlled by the net.route.rib_route_ipv6_nexthop sysctl (on by default). * Always pass final destination in ro->ro_dst in ip_forward(). * Use ro->ro_dst to extract packet family inside if_output() routines. Consistently use RO_GET_FAMILY() macro to handle ro=NULL case. * Pass extracted family to nd6_resolve() to get the LLE with proper encap. It leverages recent lltable changes committed in `c541bd368f`. Presence of the functionality can be checked using ipv4_rfc5549_support feature(3). Example usage: route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0 Differential Revision: https://reviews.freebsd.org/D30398 MFC after: 2 weeks	2021-08-22 22:56:08 +00:00
Vincenzo Maffione	98399ab06f	netmap: import changes from upstream - make sure rings are disabled during resets - introduce netmap_update_hostrings_mode(), with support for multiple host rings - always initialize ni_bufs_head in netmap_if ni_bufs_head was not properly initialized when no external buffers were requested and contained the ni_bufs_head from the last request. This was causing spurious buffer frees when alternating between apps that used external buffers and apps that did not use them. - check na validity under lock on detach - netmap_mem: fix leak on error path - nm_dispatch: fix compilation on Raspberry Pi MFC after: 2 weeks	2021-08-22 09:31:05 +00:00
Alexander V. Chernikov	c541bd368f	lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries. Currently we use pre-calculated headers inside LLE entries as prepend data for `if_output` functions. Using these headers allows saving some CPU cycles/memory accesses on the fast path. However, this approach makes adding L2 header for IPv4 traffic with IPv6 nexthops more complex, as it is not possible to store multiple pre-calculated headers inside lle. Additionally, the solution space is limited by the fact that PCB caching saves LLEs in addition to the nexthop. Thus, add support for creating special "child" LLEs for the purpose of holding custom family encaps and store mbufs pending resolution. To simplify handling of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE. Such LLEs are not visible when iterating LLE table. Their lifecycle is bound to the "parent" LLE - it is not possible to delete "child" when parent is alive. Furthermore, "child" LLEs are static (RTF_STATIC), avoiding complex state machine used by the standard LLEs. nd6_lookup() and nd6_resolve() now accept an additional argument, family, allowing to return such child LLEs. This change uses `LLE_SF()` macro which packs family and flags in a single int field. This is done to simplify merging back to stable/. Once this code lands, most of the cases will be converted to use a dedicated `family` parameter. Differential Revision: https://reviews.freebsd.org/D31379 MFC after: 2 weeks	2021-08-21 17:34:35 +00:00
Luiz Otavio O Souza	c138424148	lagg: don't update link layer addresses on destroy When the lagg is being destroyed it is not necessary to update the lladdr of all the lagg members every time we update the primary interface. Reviewed by: scottl Obtained from: pfSense MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31586	2021-08-19 10:49:32 +02:00
Franco Fichtner	bb250fae9e	gre: simplify RSS ifdefs Use the early break to avoid else definitions. When RSS gains a runtime option previous constructs would duplicate and convolute the existing code. While here init flowid and skip magic numbers and late default assignment. Reviewed by: melifaro, kbowling Obtained from: OPNsense MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D31584	2021-08-18 10:05:29 -07:00
Kristof Provost	a051ca72e2	Introduce m_get3() Introduce m_get3() which is similar to m_get2(), but can allocate up to MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE). This simplifies the bpf improvement in `f13da24715`. Suggested by: glebius Differential Revision: https://reviews.freebsd.org/D31455	2021-08-18 08:48:27 +02:00
Stephan de Wit	66fa12d8fb	iflib: emulate counters in netmap mode When iflib devices are in netmap mode the driver counters are no longer updated making it look from userspace tools that traffic has stopped. Reported by: Franco Fichtner <franco@opnsense.org> Reviewed by: vmaffione, iflib (erj, gallatin) Obtained from: OPNsense MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D31550	2021-08-18 00:17:43 -07:00
Alexander V. Chernikov	36e15b717e	routing: Fix crashes with dpdk_lpm[46] algo. When a prefix gets deleted from the RIB, dpdk_lpm algo needs to know the nexthop of the "parent" prefix to update its internal state. The glue code, which utilises RIB as a backing route store, uses fib[46]_lookup_rt() for the prefix destination after its deletion to fetch the desired nexthop. This approach does not work when deleting less-specific prefixes while more-specific ones are still present. For example, if 10.0.0.0/24, 10.0.0.0/23 and 10.0.0.0/22 exist in RIB, deleting 10.0.0.0/23 would result in 10.0.0.0/24 being returned as a search result instead of 10.0.0.0/22. This, in turn, results in the failed datastructure update: part of the deleted /23 prefix will still contain the reference to an old nexthop. This leads to the use-after-free behaviour, ending with the eventual crashes. Fix the logic flaw by properly fetching the prefix "parent" via newly-created rt_get_inet[6]_parent() helpers. Differential Revision: https://reviews.freebsd.org/D31546 PR: 256882,256833 MFC after: 1 week	2021-08-17 20:46:22 +00:00
Mark Johnston	24fe461284	ether: Add a KMSAN check for transmitted frames This helps ensure that outbound packet data is initialized per KMSAN. Sponsored by: The FreeBSD Foundation	2021-08-11 16:33:41 -04:00
Ed Maste	9feff969a0	Remove "All Rights Reserved" from FreeBSD Foundation sys/ copyrights These ones were unambiguous cases where the Foundation was the only listed copyright holder (in the associated license block). Sponsored by: The FreeBSD Foundation	2021-08-08 10:42:24 -04:00
Alexander V. Chernikov	9748eb7427	Simplify nhop operations in ip_output(). Consistently use `nh` instead of always dereferencing ro->ro_nh inside the if block. Always use nexthop mtu, as it provides guarantee that mtu is accurate. Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses of updating ro flags based on different nexthop. Differential Revision: https://reviews.freebsd.org/D31451 Reviewed by: kp MFC after: 2 weeks	2021-08-08 09:19:27 +00:00
Alexander V. Chernikov	0b79b007eb	[lltable] Restructure nd6 code. Factor out lltable locking logic from lltable_try_set_entry_addr() into a separate lltable_acquire_wlock(), so the latter can be used in other parts of the code w/o duplication. Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c and nd6_nbr.c. Move lle creation logic from nd6_resolve_slow() into a separate nd6_get_llentry() to simplify the former. These changes serve as a pre-requisite for implementing RFC8950 (IPv4 prefixes with IPv6 nexthops). Differential Revision: https://reviews.freebsd.org/D31432 MFC after: 2 weeks	2021-08-07 09:59:11 +00:00
Alexander V. Chernikov	f3a3b06121	[lltable] Unify datapath feedback mechanism. Use newly-created llentry_request_feedback(), llentry_mark_used() and llentry_get_hittime() to request datapath usage check and fetch the results in the same fashion both in IPv4 and IPv6. While here, simplify llentry_provide_feedback() wrapper by eliminating 1 condition check. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31390	2021-08-04 22:52:43 +00:00
Alexander V. Chernikov	5b42b494d5	Fix typo in rib_unsibscribe<_locked>(). Submitted by: Zhenlei Huang<zlei.huang at gmail.com> Differential Revision: https://reviews.freebsd.org/D31356	2021-08-01 13:29:52 +00:00
Alexander V. Chernikov	054948bd81	[multipath][nhops] Fix random crashes with high route churn rate. When certain multipath route begins flapping really fast, it may result in creating multiple identical nexthop groups. The code responsible for unlinking unused nexthop groups had an implicit assumption that there could be only one nexthop group for the same combination of nexthops with weights. This assumption resulted in always unlinking the first "identical" group, instead of the desired one. Such action, in turn, produced a used-but-unlinked nhg along with freed-and-linked nhg, ending up in random crashes. Similarly, it is possible that multiple identical nexthops get created in the case of high route churn, resulting in the same problem when deleting one of such nexthops. Fix by matching the nexthop/nexthop group pointer when deleting the item. Reported by: avg MFC after: 1 week	2021-08-01 10:07:37 +00:00
Kristof Provost	b69019c14c	pf: remove DIOCGETSTATESNV While nvlists are very useful in maximising flexibility for future extensions their performance is simply unacceptably bad for the getstates feature, where we can easily want to export a million states or more. The DIOCGETSTATESNV call has been MFCd, but has not hit a release on any branch, so we can still remove it everywhere. Reviewed by: mjg MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31099	2021-07-30 11:45:28 +02:00
Bryan Drewery	7cbf1de38e	debugnet: Fix false-positive assertions for dp_state debugnet_handle_arp: An assertion is present to ensure the pcb is only modified when the state is DN_STATE_INIT. Because debugnet_arp_gw() is asynchronous it is possible for ARP replies to come in after the gateway address is known and the state already changed. debugnet_handle_ip: Similarly it is possible for packets to come in, from the expected server, during the gateway mac discovery phase. This can happen from testing disconnects / reconnects in quick succession. This later causes some acks to be sent back but hit an assertion because the state is wrong. Reviewed by: cem, debugnet_handle_arp: markj, vangyzen Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D31327	2021-07-28 16:34:14 -07:00
Kristof Provost	01ad0c0079	net: disallow MTU changes on bridge member interfaces if_bridge member interfaces should always have the same MTU as the bridge itself, so disallow MTU changes on interfaces that are part of an if_bridge. Reviewed by: donner Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31304	2021-07-28 22:03:30 +02:00
Kristof Provost	3330649382	if_bridge: allow MTU changes if_bridge used to only allow MTU changes if the new MTU matched that of all member interfaces. This doesn't really make much sense, in that we really shouldn't be allowed to change the MTU of bridge member in the first place. Instead we now change the MTU of all member interfaces. If one fails we revert all interfaces back to the original MTU. We do not address the issue where bridge member interface MTUs can be changed here. Reviewed by: donner Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31288	2021-07-28 22:01:12 +02:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Kristof Provost	9ef8cd0b79	vlan: deduplicate bpf_setpcp() and pf_ieee8021q_setpcp() These two functions were identical, so move them into the common vlan_set_pcp() function, exposed in the if_vlan_var.h header. Reviewed by: donner MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31275	2021-07-26 23:13:31 +02:00
Luiz Otavio O Souza	1e7fe2fbb9	bpf: Add an ioctl to set the VLAN Priority on packets sent by bpf This allows the use of VLAN PCP in dhclient, which is required for certain ISPs (such as Orange.fr). Reviewed by: bcr (man page) MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31263	2021-07-26 23:13:31 +02:00
Mateusz Guzik	02cf67ccf6	pf: switch rule counters to pf_counter_u64 Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-25 10:22:17 +02:00
Mateusz Guzik	d40d4b3ed7	pf: switch kif counters to pf_counter_u64 Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-25 10:22:17 +02:00
Mateusz Guzik	fc4c42ce0b	pf: switch pf_status.fcounters to pf_counter_u64 Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-25 10:22:16 +02:00
Mateusz Guzik	defdcdd564	pf: add hybrid 32- and 64-bit counters Numerous counters got migrated from straight uint64_t to the counter(9) API. Unfortunately the implementation comes with a significant performance hit on some platforms and cannot be easily fixed. Work around the problem by implementing a pf-specific variant. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-25 10:22:16 +02:00
Mateusz Guzik	d9cc6ea270	pf: hide struct pf_kstatus behind ifdef _KERNEL Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-23 17:34:43 +00:00
Mark Johnston	0dcef81de9	Add required sysctl name length checks to various handlers Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-07-23 10:47:13 -04:00
Kyle Evans	51221b68fb	tuntap: clean up cc --analyze One complaint of a dead-store, smack it with a __diagused.	2021-07-21 19:14:43 -05:00
Kristof Provost	32271c4d38	pf: clean up syncookie callout on vnet shutdown Ensure that we cancel any outstanding callouts for syncookies when we terminate the vnet. MFC after: 1 week Sponsored by: Modirum MDPay	2021-07-20 21:13:25 +02:00
Mateusz Guzik	907257d696	pf: embed a pointer to the lock in struct pf_kstate This shaves calculation which in particular helps on arm. Note using the & hack instead would still be more work. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-20 16:11:31 +00:00
Kristof Provost	231e83d342	pf: syncookie ioctl interface Kernel side implementation to allow switching between on and off modes, and allow this configuration to be retrieved. MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31139	2021-07-20 10:36:13 +02:00
Kristof Provost	8e1864ed07	pf: syncookie support Import OpenBSD's syncookie support for pf. This feature helps pf resist TCP SYN floods by only creating states once the remote host completes the TCP handshake rather than when the initial SYN packet is received. This is accomplished by using the initial sequence numbers to encode a cookie (hence the name) in the SYN+ACK response and verifying this on receipt of the client ACK. Reviewed by: kbowling Obtained from: OpenBSD MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31138	2021-07-20 10:36:13 +02:00
Mateusz Guzik	9009d36afd	pf: shrink struct pf_kstate Makes room for a pointer. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-19 14:54:49 +02:00
Mateusz Guzik	f9aa757d8d	pf: add a comment to pf_kstate concerning compat with pf_state_cmp Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-19 14:54:49 +02:00
Kristof Provost	ef950daa35	pf: match keyword support Support the 'match' keyword. Note that support is limited to adding queuing information, so without ALTQ support in the kernel setting match rules is pointless. For the avoidance of doubt: this is NOT full support for the match keyword as found in OpenBSD's pf. That could potentially be built on top of this, but this commit is NOT that. MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31115	2021-07-17 12:01:08 +02:00
Kristof Provost	c6bf20a2a4	pf: add DIOCGETSTATESV2 Add a new version of the DIOCGETSTATES call, which extends the struct to include the original interface information. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31097	2021-07-09 10:29:53 +02:00
Mateusz Guzik	19d6e29b87	pf: add pf_find_state_all_exists Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-08 14:00:55 +00:00
Kristof Provost	211cddf9e3	pf: rename pf_state to pf_kstate Indicate that this is a kernel-only structure, and make it easier to distinguish from others used to communicate with userspace. Reviewed by: mjg MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31096	2021-07-08 10:31:43 +02:00
Mateusz Guzik	a56888534d	iflib: use m_gethdr_raw Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31081	2021-07-07 11:05:46 +00:00
Mateusz Guzik	f649cff587	pf: padalign global locks found in pf.c Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-05 09:56:54 +00:00
Mateusz Guzik	dc1ab04e4c	pf: allow table stats clearing and reading with ruleset rlock Instead serialize against these operations with a dedicated lock. Prior to the change, when pushing 17 mln pps of traffic, calling DIOCRGETTSTATS in a loop would restrict throughput to about 7 mln. With the change there is no slowdown. Reviewed by: kp (previous version) Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-05 10:42:01 +02:00
Mateusz Guzik	f92c21a28c	pf: depessimize table handling Creating tables and zeroing their counters induces excessive IPIs (14 per table), which in turn kills single- and multi-threaded performance. Work around the problem by extending per-CPU counters with a general counter populated on "zeroing" requests -- it stores the currently found sum. Then requests to report the current value are the sum of per-CPU counters subtracted by the saved value. Sample timings when loading a config with 100k tables on a 104-way box: stock: pfctl -f tables100000.conf 0.39s user 69.37s system 99% cpu 1:09.76 total pfctl -f tables100000.conf 0.40s user 68.14s system 99% cpu 1:08.54 total patched: pfctl -f tables100000.conf 0.35s user 6.41s system 99% cpu 6.771 total pfctl -f tables100000.conf 0.48s user 6.47s system 99% cpu 6.949 total Reviewed by: kp (previous version) Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-07-05 10:42:01 +02:00
Mateusz Guzik	bad5f0b6c2	iflib: switch bare zone_mbuf use to m_free_raw Reviewed by: kbowling Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30961	2021-07-02 08:30:22 +00:00
Mateusz Guzik	55cc305dfc	pf: revert: Use counter(9) for pf_state byte/packet tracking stats are not shared and consequently per-CPU counters only waste memory. No slowdown was measured when passing over 20M pps. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-06-29 07:24:53 +00:00
Mateusz Guzik	803dfe3da0	pf: deduplicate V_pf_state_z handling with pfsync Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-06-29 07:24:53 +00:00
Mateusz Guzik	e6dd0e2e8d	pf: assert that sizeof(struct pf_state) <= 312 To prevent accidentally going over a threshold which makes UMA fit only 12 objects per page instead of 13. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-06-28 15:49:20 +00:00
Mateusz Guzik	d09388d013	pf: add pf_release_staten and use it in pf_unlink_state Saves one atomic op. Reviewed by: kp Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-06-28 15:49:20 +00:00
Marcin Wojtas	58632fa7a3	iflib: Add a new quirk ENETC NIC found in LS1028A has a bug where clearing TX pidx/cidx causes the ring to hang after being re-enabled. Add a new flag, if set iflib will preserve the indices during restart. Submitted by: Kornel Duleba <mindal@semihalf.com> Reviewed by: gallatin, erj Obtained from: Semihalf Sponsored by: Alstom Group Differential Revision: https://reviews.freebsd.org/D30728	2021-06-24 13:00:56 +02:00
Florian Florensa	f13da24715	net/bpf: Fix writing of buffer bigger than PAGESIZE When allocating the mbuf we used m_get2 which fails if len is greater than MJUMPAGESIZE; if that is the case, use m_getjcl instead. Reviewed by: kp@ PR: 205164 Pull Request: https://github.com/freebsd/freebsd-src/pull/131	2021-06-23 10:39:18 -06:00
Rozhuk Ivan	a75819461e	devctl: add ADDR_ADD and ADDR_DEL devctl event for IFNET Add devd event on network iface address add/remove. Can be used to automate actions on any address change. Reviewed by: imp@ (and minor style tweaks) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30840	2021-06-23 10:26:56 -06:00
Rozhuk Ivan	4fb3e0bb94	devctl: add RENAME devctl event for IFNET Add devd event on network iface rename. Reviewed by: imp@,asomers@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30839	2021-06-23 10:20:58 -06:00
George V. Neville-Neil	c6b2d024d7	Restore the vnet before returning an error. Obtained from: Kanndula, Dheeraj <Dheeraj.Kandula@netapp.com> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30741	2021-06-21 10:46:20 -04:00
Kristof Provost	d38630f619	pf: store L4 headers in pf_pdesc Rather than pointers to the headers store full copies. This brings us slightly closer to what OpenBSD does, and also makes more sense than storing pointers to stack variable copies of the headers. Reviewed by: donner, scottl MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30719	2021-06-14 14:22:06 +02:00
Bjoern A. Zeeb	a3c2c06bc9	Make LINT NOINET and NOIP kernel builds warning free. Apply #ifdef INET or #if defined(INET6) || defined(INET) to make universe NOINET and NOIP LINT kernels warning free as well again.	2021-06-06 14:03:06 +00:00
Kyle Evans	2d741f33bd	kern: ether_gen_addr: randomize on default hostuuid, too Currently, this will still hash the default (all zero) hostuuid and potentially arrive at a MAC address that has a high chance of collision if another interface of the same name appears in the same broadcast domain on another host without a hostuuid, e.g., some virtual machine setups. Instead of using the default hostuuid, just treat it as a failure and generate a random LA unicast MAC address. Reviewed by: bz, gbe, imp, kbowling, kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29788	2021-06-01 22:59:21 -05:00
Kristof Provost	ec7b47fc81	pf: Move provider declaration to pf.h This simplifies life a bit, by not requiring us to repeat the declaration for every file where we want static probe points. It also makes the gcc6 build happy.	2021-06-01 09:02:05 +02:00
Kristof Provost	d0fdf2b28f	pf: Track the original kif for floating states Track (and display) the interface that created a state, even if it's a floating state (and thus uses virtual interface 'all'). MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30245	2021-05-20 12:49:27 +02:00
Kristof Provost	0592a4c83d	pf: Add DIOCGETSTATESNV Add DIOCGETSTATESNV, an nvlist-based alternative to DIOCGETSTATES. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30243	2021-05-20 12:49:27 +02:00
Kristof Provost	1732afaa0d	pf: Add DIOCGETSTATENV Add DIOCGETSTATENV, an nvlist-based alternative to DIOCGETSTATE. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30242	2021-05-20 12:49:26 +02:00
Alexander V. Chernikov	76cfc6fa0d	Fix a use after free in update_rtm_from_rc(). update_rtm_from_rc() calls update_rtm_from_info() internally. The latter one may update provided prtm pointer with a new rtm. Reassign rtm from prtm after calling update_rtm_from_info() to avoid touching the freed rtm. PR: 255871 Submitted by: lylgood@foxmail.com MFC after: 3 days	2021-05-14 16:06:41 +00:00
Mark Johnston	ad22ba2b9f	if: Remove unnecessary validation in the SIOCSIFNAME handler A successful copyinstr() call guarantees that the returned string is nul-terminated. Furthermore, the removed check would harmlessly compare an uninitialized byte with '\0' if the new name is shorter than IFNAMESIZ - 1. Reported by: KMSAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-12 12:52:06 -04:00
John Baldwin	ed93deba11	Remove a write-only variable. While refactoring an earlier series of changes during review, the 'saved_data' variable stopped being used at the bottom of if_ioctl(). Suggested by: brooks Reviewed by: brooks, imp, kib Fixes: `d17e0940f7` Rework compat shims in ifioctl(). Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D30197	2021-05-11 14:56:23 -07:00
Alexander V. Chernikov	aad59c79f5	Fix panic when trying to delete non-existent gateway in multipath route. If a non-existent gateway was specified, the code responsible for calculating an updated nexthop group returned the same already-used nexthop group. After the route table update, the operation result contained the same old & new nexthop groups. Thus, the code responsible for decomposing the notification to the list of simple nexthop-level notifications was not able to find any differences. As a result, it hasn't updated any of the "simple" notification fields, resulting in empty rtentry pointer. This empty pointer was the direct reason of a panic. Fix the problem by returning ESRCH when the new nexthop group is the same as the old one after applying gateway filter. Reported by: Michael <michael.adm at gmail.com> PR: 255665 MFC after: 3 days	2021-05-07 20:41:31 +00:00
Kristof Provost	93abcf17e6	pf: Support killing 'matching' states Optionally also kill states that match (i.e. are the NATed state or opposite direction state entry for) the state we're killing. See also https://redmine.pfsense.org/issues/8555 Submitted by: Steven Brown Reviewed by: bcr (man page) Obtained from: https://github.com/pfsense/FreeBSD-src/pull/11/ MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30092	2021-05-07 22:13:31 +02:00
Kristof Provost	abbcba9cf5	pf: Allow states to be killed per 'gateway' This allows us to kill states created from a rule with route-to/reply-to set. This is particularly useful in multi-wan setups, where one of the WAN links goes down. Submitted by: Steven Brown Obtained from: https://github.com/pfsense/FreeBSD-src/pull/11/ MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30058	2021-05-07 22:13:31 +02:00
Kristof Provost	e989530a09	pf: Introduce DIOCKILLSTATESNV Introduce an nvlist based alternative to DIOCKILLSTATES. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30054	2021-05-07 22:13:30 +02:00
Kristof Provost	7606a45dcc	pf: Introduce DIOCCLRSTATESNV Introduce an nvlist variant of DIOCCLRSTATES. MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30052	2021-05-07 22:13:30 +02:00
John Baldwin	9c87db4b3c	Group all compat shim structures together to consolidate #ifdef's. Reviewed by: brooks, kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D29894	2021-05-05 13:59:09 -07:00
John Baldwin	01e9cbc4c5	Use thunks for compat ioctls using struct ifgroupreq. Reviewed by: brooks, kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D29893	2021-05-05 13:59:00 -07:00
John Baldwin	d61d98f4ed	Add freebsd32 compat shims for SIOC[GS]DRVSPEC. Reviewed by: brooks, kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D29892	2021-05-05 13:58:50 -07:00
John Baldwin	d17e0940f7	Rework compat shims in ifioctl(). Centralize logic for handling compat ioctls into two blocks of code at the start and end of the ioctl routine. This avoids the conversion logic being spread out both in multiple blocks in ifioctl as well as various helper functions. Reviewed by: brooks, kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D29891	2021-05-05 13:58:23 -07:00
Jose Luis Duran	0ea8a7f36d	ifconfig: Minor documentation fix Fix what appears to have been a small copy/paste typo in ifconfig(8)'s documentation (man page and header file). Not that it matters anymore. Reference: Table I-2 in IEEE Std 802.1Q-2014. PR: 255557 Submitted by: Jose Luis Duran <jlduran@gmail.com> MFC after: 1 week	2021-05-03 14:38:52 +03:00
Marcin Wojtas	cd945dc08a	iflib: Take iri_pad into account when processing small frames Drivers can specify padding of received frames with iri_pad field. This can be used to enforce ip alignment by hardware. Iflib ignored that padding when processing small frames, which rendered this feature inoperable. I found it while writing a driver for a NIC that can ip align received packets. Note that this doesn't change behavior of existing drivers as they all set iri_pad to 0. Submitted by: Kornel Duleba <mindal@semihalf.com> Reviewed by: gallatin Obtained from: Semihalf Sponsored by: Alstom Group Differential Revision: https://reviews.freebsd.org/D30009	2021-04-30 12:46:17 +02:00
Alexander V. Chernikov	41ce0e34ea	[fib algo] Update fib_gen counter under FIB_MOD_LOCK. MFC after: 3 days	2021-04-28 20:23:03 +00:00
Alexander V. Chernikov	f9668e42b4	Add rib_walk_from() wrapper for selective rib tree traversal. Provide wrapper for the rnh_walktree_from() rib callback. As currently `struct rib_head` is considered internal to the routing subsystem, this wrapper is necessary to maintain isolation from the external code. Differential Revision: https://reviews.freebsd.org/D29971 MFC after: 1 week	2021-04-28 08:09:45 +00:00
Alexander V. Chernikov	8a0d57baec	[fib algo] Delay algo init at fib growth to allow to reliably use rib KPI. Currently, most of the rib(9) KPI does not use rnh pointers, using fibnum and family parameters to determine the rib pointer instead. This works well except for the case when we initialize new rib pointers during fib growth. In that case, there is no mapping between fib/family and the new rib, as an entirely new rib pointer array is populated. Address this by delaying fib algo initialization till after switching to the new pointer array and updating the number of fibs. Set datapath pointer to the dummy function, so the potential callers won't crash the kernel in the brief moment when the rib exists, but no fib algo is attached. This change allows to avoid creating duplicates of existing rib functions, with altered signature. Differential Revision: https://reviews.freebsd.org/D29969 MFC after: 1 week	2021-04-27 22:10:08 +00:00
Alexander V. Chernikov	439d087d0b	[fib algo] always commit static routes synchronously. Modular fib lookup framework features logic that allows route update batching for the algorithms that cannot easily apply the routing change without rebuilding. As a result, dataplane lookups may return old data until the sync takes place. With the default sync timeout of 50ms, it is possible that new binary like ping(8) executed exactly after route(8) will still use the old fib data. To address some aspects of the problem, framework executes all rtable changes without RTF_GATEWAY synchronously. To fix the aforementioned problem, this diff extends sync execution for all RTF_STATIC routes (e.g. ones maintained by route(8)). This fixes a bunch of tests in the networking space. Reported by: ci, arichardson MFC after: 2 weeks	2021-04-27 08:31:40 +00:00
Alexander V. Chernikov	25682e6a49	Fix rtsock sockaddr alignment. `b31fbebeb3` introduced alloc_sockaddr_aligned() which, in fact, failed to produce aligned addresses. Reported by: Oskar Holmlund <oskar.holmlund at yahoo.com> MFC after: immediately	2021-04-27 08:04:19 +00:00
Alexander V. Chernikov	bc5ef45aec	Fix dtrace CTF for the rib_head. `33cb3cb2e3` introduced an `rib_head` structure field under the FIB_ALGO define. This may be problematic for the CTF, as some of the files including `route_var.h` do not have `fib_algo` defined. Make dtrace happy by making the field unconditional. Suggested by: markj	2021-04-27 07:47:53 +00:00
Kristof Provost	5f5bf88949	pfsync: Expose PFSYNCF_OK flag to userspace Add 'syncok' field to ifconfig's pfsync interface output. This allows userspace to figure out when pfsync has completed the initial bulk import. Reviewed by: donner MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29948	2021-04-26 14:31:17 +02:00
Kristof Provost	6fcc8e042a	pf: Allow multiple labels to be set on a rule Allow up to 5 labels to be set on each rule. This offers more flexibility in using labels. For example, it replaces the customer 'schedule' keyword used by pfSense to terminate states according to a schedule. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29936	2021-04-26 14:14:21 +02:00
Patrick Kelsey	ca7005f189	iflib: Improve mapping of TX/RX queues to CPUs iflib now supports mapping each (TX,RX) queue pair to the same CPU (default), to separate CPUs, or to a pair of physical and logical CPUs that share the same L2 cache. The mapping mechanism supports unequal numbers of TX and RX queues, with the excess queues always being mapped to consecutive physical CPUs. When the platform cannot distinguish between physical and logical CPUs, all are treated as physical CPUs. See the comment on get_cpuid_for_queue() for the entire matrix. The following device-specific tunables influence the mapping process: dev.<device>.<unit>.iflib.core_offset (existing) dev.<device>.<unit>.iflib.separate_txrx (existing) dev.<device>.<unit>.iflib.use_logical_cores (new) The following new, read-only sysctls provide visibility of the mapping results: dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu When an iflib driver allocates TX softirqs without providing reference RX IRQs, iflib now binds those TX softirqs to CPUs using the above mapping mechanism (that is, treats them as if they were TX IRQs). Previously, such bindings were left up to the grouptaskqueue code and thus fell outside of the iflib CPU mapping strategy. Reviewed by: kbowling Tested by: olivier, pkelsey MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D24094	2021-04-26 01:06:34 -04:00
Alexander V. Chernikov	7d222ce3c1	Fix NOINET[6],!VIMAGE builds after FIB_ALGO addition to GENERIC Reported by: jbeich PR: 255390	2021-04-21 05:53:42 +01:00
Alexander V. Chernikov	67372fb3e0	Fix NOINET[6] build after enabling FIB_ALGO in GENERIC. Submitted by: jbeich PR: 255389	2021-04-21 02:49:18 +01:00
Alexander V. Chernikov	c23385612d	[fib algo] Do not print algo attach/detach message on boot MFC after: 1 day	2021-04-25 08:58:06 +00:00
Alexander V. Chernikov	a81e2e7890	Make gcc happy by initializing error in rib_handle_ifaddr_info().	2021-04-25 08:44:59 +00:00
Stefan Eßer	6409e59427	Fix build with gcc Correctly declare function without arguments as f(void) instead of f().	2021-04-25 10:15:17 +02:00
Alexander V. Chernikov	5d1403a79a	[rtsock] Enforce netmask/RTF_HOST consistency. Traditionally we had 2 sources of information whether the added/deleted route request targets a network or a host route: netmask (RTA_NETMASK) and RTF_HOST flag. The former one is tricky: netmask can be empty or can explicitly specify the host netmask. Parsing netmask sockaddr requires per-family parsing and that's what rtsock code traditionally avoided. As a result, consistency was not enforced and it was possible to specify a network with the RTF_HOST flag and vice versa. Continue normalization efforts from D29826 and D29826 and ensure that RTF_HOST flag always reflects host/network data from netmask field. Differential Revision: https://reviews.freebsd.org/D29958 MFC after: 2 days	2021-04-24 22:41:27 +00:00
Mark Johnston	8e8f1cc9bb	Re-enable network ioctls in capability mode This reverts a portion of `274579831b` ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation	2021-04-23 09:22:49 -04:00
Andrew Gallatin	3183d0b680	iflib: initialize LRO unconditionally Changes to the LRO code have exposed a bug in iflib where devices which are not capable of doing LRO are still calling tcp_lro_flush_all(), even when they have not initialized the LRO context. This used to be mostly harmless, but the LRO code now sets the VNET based on the ifp in the lro context and will try to access it through a NULL ifp resulting in a panic at boot. To fix this, we unconditionally initialize LRO so that we have a valid LRO context when calling tcp_lro_flush_all(). One alternative is to check the device capabilities before calling tcp_lro_flush_all() or adding a new state flag in the ctx. However, it seems unwise to add an extra, mostly useless test for higher performance devices when we can just initialize LRO for all devices. Reviewed by: erj, hselasky, markj, olivier Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29928	2021-04-23 05:55:20 -04:00
Alexander V. Chernikov	33cb3cb2e3	Fix rib generation count for fib algo. Currently, PCB caching mechanism relies on the rib generation counter (rnh_gen) to invalidate cached nhops/LLE entries. With certain fib algorithms, it is now possible that the datapath lookup state applies RIB changes with some delay. In that scenario, PCB cache will invalidate on the RIB change, but the new lookup may result in the same nexthop being returned. When fib algo finally gets in sync with the RIB changes, PCB cache will not receive any notification and will end up caching the stale data. To fix this, introduce additional counter, rnh_gen_rib, which is used only when FIB_ALGO is enabled. This counter is incremented by the control plane. Each time when fib algo synchronises with the RIB, it updates rnh_gen to the current rnh_gen_rib value. Differential Revision: https://reviews.freebsd.org/D29812 Reviewed by: donner MFC after: 2 weeks	2021-04-20 22:02:41 +00:00
Alexander V. Chernikov	b31fbebeb3	Relax rtsock message restrictions. Address multiple issues with strict rtsock message validation. D28668 "normalisation" approach was based on the assumption that we always have at least "standard" sockaddr len. It turned out to be false - certain older applications like quagga or routed abuse sin[6]_len field and set it to the offset to the first fully-zero bit in the mask. It is impossible to normalise such sockaddrs without reallocation. With that in mind, change the approach to use a distinct memory buffer for the altered sockaddrs. This allows supporting the older software while maintaining the guarantee on the "standard" sockaddrs. PR: 255273,255089 Differential Revision: https://reviews.freebsd.org/D29826 MFC after: 3 days	2021-04-20 21:34:19 +00:00
Alexander V. Chernikov	758c9d54d4	Improve error reporting in rtsock.c MFC after: 3 days	2021-04-19 20:36:41 +00:00
Kristof Provost	42ec75f83a	pf: Optionally attempt to preserve rule counter values across ruleset updates Usually rule counters are reset to zero on every update of the ruleset. With keepcounters set pf will attempt to find matching rules between old and new rulesets and preserve the rule counters. MFC after: 4 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29780	2021-04-19 14:31:47 +02:00
Kristof Provost	4f1f67e888	pf: PFRULE_REFS should not be user-visible Split the PFRULE_REFS flag from the rule_flag field. PFRULE_REFS is a kernel-internal flag and should not be exposed to or read from userspace. MFC after: 4 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29778	2021-04-19 14:31:47 +02:00
Jonah Caplan	0e4025bffa	bridgestp: validate timer values in config BPDU IEEE Std 802.1D-2004 Section 17.14 defines permitted ranges for timers. Incoming BPDU messages should be checked against the permitted ranges. The rest of 17.14 appears to be enforced already. PR: 254924 Reviewed by: kp, donner Differential Revision: https://reviews.freebsd.org/D29782	2021-04-19 12:09:18 +02:00
Alexander V. Chernikov	0abb6ff590	fib algo: do not reallocate datapath index for datapath ptr update. Fib algo uses a per-family array indexed by the fibnum to store lookup function pointers and per-fib data. Each algorithm rebuild currently requires re-allocating this array to support atomic change of two pointers. As in reality most of the changes actually involve changing only data pointer, add a shortcut performing in-flight pointer update. MFC after: 2 weeks	2021-04-18 16:12:13 +01:00
Alexander V. Chernikov	e2f79d9e51	Fib algo: extend KPI by allowing algo to set datapath pointers. Some algorithms may require updating datapath and control plane algo pointers after the (batched) updates. Export fib_set_datapath_ptr() to allow setting the new datapath function or data pointer from the algo. Add fib_set_algo_ptr() to allow updating algo control plane pointer from the algo. Add fib_epoch_call() epoch(9) wrapper to simplify freeing old datapath state. Reviewed by: zec Differential Revision: https://reviews.freebsd.org/D29799 MFC after: 1 week	2021-04-18 16:12:12 +01:00
Alexander V. Chernikov	6b8ef0d428	Add batched update support for the fib algo. Initial fib algo implementation was built on a very simple set of principles w.r.t updates: 1) algorithm is either able to apply the change synchronously (DIR24-8) or requires full rebuild (bsearch, lradix). 2) framework falls back to rebuild on every error (memory allocation, nhg limit, other internal algo errors, etc). This change brings the new "intermediate" concept - batched updates. Algorithm can indicate that the particular update has to be handled in batched fashion (FLM_BATCH). The framework will write this update and other updates to the temporary buffer instead of pushing them to the algo callback. Depending on the update rate, the framework will batch 50..1024 ms of updates and submit them to a different algo callback. This functionality is handy for the slow-to-rebuild algorithms like DXR. Differential Revision: https://reviews.freebsd.org/D29588 Reviewed by: zec MFC after: 2 weeks	2021-04-14 23:54:11 +01:00
Tai-hwa Liang	d9b61e7153	if_firewire: fixing panic upon packet reception for VNET build netisr_dispatch_src() needs valid VNET pointer or firewire_input() will panic when receiving a packet. Reviewed by: glebius MFC after: 2 weeks	2021-04-13 22:59:58 +00:00
Kurosawa Takahiro	2aa21096c7	pf: Implement the NAT source port selection of MAP-E Customer Edge MAP-E (RFC 7597) requires special care for selecting source ports in NAT operation on the Customer Edge because a part of bits of the port numbers are used by the Border Relay to distinguish another side of the IPv4-over-IPv6 tunnel. PR: 254577 Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D29468	2021-04-13 10:53:18 +02:00
Alexander V. Chernikov	afbb64f1d8	Fix vlan creation for the older ifconfig(8) binaries. Reported by: allanjude MFC after: immediately	2021-04-11 18:13:09 +01:00
Alexander V. Chernikov	7f5f3fcc32	Fix direct route installation with net/bird. Slightly relax the gateway validation rules imposed by `2fe5a79425` by requiring only the first 8 bytes (everything before sdl_data) to be present in the AF_LINK gateway. Reported by: olivier	2021-04-10 16:31:16 +01:00
Alexander V. Chernikov	63dceebe68	Appease -Wsign-compare in radix.c Differential Revision: https://reviews.freebsd.org/D29661 Submitted by: zec MFC after: 2 weeks	2021-04-10 13:48:25 +00:00
Alexander V. Chernikov	caf2f62765	Allow to specify debugnet fib in sysctl/tunable. Differential Revision: https://reviews.freebsd.org/D29593 Reviewed by: donner MFC after: 2 weeks	2021-04-10 13:47:49 +00:00
Kristof Provost	d710367d11	pf: Implement nvlist variant of DIOCGETRULE MFC after: 4 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29559	2021-04-10 11:16:01 +02:00
Kristof Provost	5c62eded5a	pf: Introduce nvlist variant of DIOCADDRULE This will make future extensions of the API much easier. The intent is to remove support for DIOCADDRULE in FreeBSD 14. Reviewed by: markj (previous version), glebius (previous version) MFC after: 4 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29557	2021-04-10 11:16:00 +02:00
Alexander V. Chernikov	ee2cf2b360	Implement better rebuild-delay fib algo policy. The intent is to better handle time intervals with a large number of RIB updates (e.g. BGP peer going up or down), while still keeping a low sync delay for the remaining scenarios. The implementation is the following: updates are grouped into buckets of size 50ms. If the number of updates within the current bucket exceeds the threshold of 500 routes/sec (e.g. 10 updates per bucket interval), the update is delayed for another 50ms. This can be repeated until the maximum update delay (1 sec) is reached. All 3 variables are runtime tunables: * net.route.algo.fib_max_sync_delay_ms: 1000 * net.route.algo.bucket_change_threshold_rate: 500 * net.route.algo.bucket_time_ms: 50 Differential Revision: https://reviews.freebsd.org/D29588 MFC after: 2 weeks	2021-04-09 21:33:03 +01:00
Alexander V. Chernikov	9e5243d7b6	Enforce check for using the return result for ifa?_try_ref(). Suggested by: hps MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29504	2021-04-05 03:35:19 +01:00
Kristof Provost	4967f672ef	pf: Remove unused variable rt_listid from struct pf_krule Reviewed by: donner MFC after: 4 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29639	2021-04-08 13:24:35 +02:00
Mark Johnston	274579831b	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423	2021-04-07 14:32:56 -04:00
Vincenzo Maffione	361e950180	iflib: add support for netmap offsets Follow-up change to `a6d768d845`. This change adds iflib support for netmap offsets, enabling applications to use offsets on any driver backed by iflib.	2021-04-05 07:54:47 +00:00
Vincenzo Maffione	9bad2638cc	netmap: restore commit `a56e6334d1` The fix in `a56e6334d1` was accidentally reverted by commit `45c67e8f6b`.	2021-04-02 10:45:47 +00:00
Vincenzo Maffione	45c67e8f6b	netmap: several typo fixes No functional changes intended.	2021-04-02 07:01:20 +00:00
Konstantin Belousov	baacf70137	vxlan: correct interface MTU when using hw offloads Otherwise it breaks when offloading like checksum or TSO are used, because second (encapsulated) ip_output() processing passes fragments of the encapsulated packet down to the hardware interface. Diagnosed by: hselasky Reviewed by: np Sponsored by: Nvidia Networking / Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29501	2021-03-31 14:38:26 +03:00
Konstantin Belousov	e243367b64	mbuf: add a way to mark flowid as calculated from the internal headers In some settings offload might calculate hash from decapsulated packet. Reserve a bit in packet header rsstype to indicate that. Add m_adj_decap() that acts similarly to m_adj, but also either clear flowid if it is not marked as inner, or transfer it to the decapsulated header, clearing inner indicator. It depends on the internals of m_adj() that reuses the argument packet header for the result. Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets. Reviewed by: ae, hselasky, np Sponsored by: Nvidia Networking / Mellanox Technologies MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28773	2021-03-31 14:38:26 +03:00
Alexander V. Chernikov	0c2a0e0380	Fix typo in the `9fa8d1582b`. Reported by: cy	2021-03-29 23:42:48 +00:00
Alexander V. Chernikov	9fa8d1582b	Put bandaid for nhgrp_dump_sysctl() malloc KASSERT(). Recent rtsock changes widened epoch and covered nhgrp_dump_sysctl(), resulting in `netstat -4On` triggering a KASSERT. MFC after: 1 day	2021-03-29 23:12:11 +00:00
Alexander V. Chernikov	0f30a36ded	Rename variables inside nexthop group consider_resize() code. No functional changes. MFC after: 3 days	2021-03-29 23:06:13 +00:00
Alexander V. Chernikov	9095dc7da4	Fix nexthop group index array scaling. The current code has the limit of 127 nexthop groups due to the wrongly-checked bitmask_copy() return value. PR: 254303 Reported by: Aleks <a.ivanov at veesp.com> MFC after: 1 day	2021-03-29 23:00:17 +00:00
Vincenzo Maffione	660a47cb99	netmap: monitor: add a flag to distinguish packet direction The netmap monitor intercepts any TX/RX packets on the monitored port. However, before this change there was no way to tell whether an intercepted packet was being transmitted or received on the monitored port. A TXMON flag in the netmap slot has been added for this purpose.	2021-03-29 16:32:54 +00:00
Vincenzo Maffione	a6d768d845	netmap: add kernel support for the "offsets" feature This feature enables applications to ask netmap to transmit or receive packets starting at a user-specified offset from the beginning of the netmap buffer. This is meant to ease those packet manipulation operations such as pushing or popping packet headers, that may be useful to implement software switches, routers and other packet processors. To use the feature, drivers (e.g., iflib, vtnet, etc.) must have explicit support. This change does not add support for any driver, but introduces the necessary kernel changes. However, offsets support is already included for VALE ports and pipes.	2021-03-29 16:29:01 +00:00
you@x	21d0c01226	netmap: iflib: add nm_config callback This per-driver callback is invoked by netmap when it wants to align the number of TX/RX netmap rings and/or the number of TX/RX netmap slots to the actual state configured in the hardware. The alignment happens when netmap mode is switched on (with no active netmap file descriptors for that netmap port), or when collecting netmap port information. MFC after: 1 week	2021-03-29 09:31:18 +00:00
Alexander V. Chernikov	6f43c72b47	Zero `struct weightened_nhop` fields in nhgrp_get_addition_group(). `struct weightened_nhop` has spare 32bit between the fields due to the alignment (on amd64). Not zeroing these spare bits results in duplicating nhop groups in the kernel due to the way how comparison works. MFC after: 1 day	2021-03-20 08:26:03 +00:00
Alexander V. Chernikov	24cd2796cf	Fix !VNET build broken by `66f138563b`.	2021-03-25 00:31:08 +00:00
Alexander V. Chernikov	66f138563b	Plug nexthop group refcount leak. In the case of batch route deletion via rib_walk_del(), when some paths of a multipath route were deleted, the old multipath group was not freed. PR: 254496 Reported by: Zhenlei Huang <zlei.huang@gmail.com> MFC after: 1 day	2021-03-24 23:52:18 +00:00
Alexander V. Chernikov	c00e2f573b	Fix build for non-vnet non-multipath kernels broken by `a0308e48ec`.	2021-03-23 23:35:23 +00:00
Alexander V. Chernikov	a0308e48ec	Fix panic when destroying interface with ECMP routes. Reported by: Zhenlei Huang <zlei.huang at gmail.com> PR: 254496 MFC after: immediately	2021-03-23 22:03:20 +00:00
Adrian Chadd	25bfa44860	Add device and ifnet logging methods, similar to device_printf / if_printf * device_printf() is effectively a printf * if_printf() is effectively a LOG_INFO This allows subsystems to log device/netif stuff using different log levels, rather than having to invent their own way to prefix unit/netif names. Differential Revision: https://reviews.freebsd.org/D29320 Reviewed by: imp	2021-03-22 00:02:34 +00:00
Alexander V. Chernikov	2476178e6b	Fix kassert panic when inserting multipath routes from multiple threads. Reported by: Marco Zec <zec at fer.hr> MFC after: immediately	2021-03-21 18:15:29 +00:00
Kyle Evans	f187d6dfbf	base: remove if_wg(4) and associated utilities, manpage After lengthy discussions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.	2021-03-17 09:14:48 -05:00
Alexander V. Chernikov	e4ac3f7463	Fix fib algo rebuild delay calculation. Submitted by: Marco Zec <zec at fer.hr> MFC after: 3 days	2021-03-15 21:09:07 +00:00
Kyle Evans	74ae3f3e33	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. 
There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)	2021-03-14 23:52:04 -05:00
Gordon Bergling	5666643a95	Fix some common typos in comments - occured -> occurred - normaly -> normally - controling -> controlling - fileds -> fields - insterted -> inserted - outputing -> outputting MFC after: 1 week	2021-03-13 18:26:15 +01:00
Kristof Provost	cecfaf9bed	pf: Fully remove interrupt events on vnet cleanup swi_remove() removes the software interrupt handler but does not remove the associated interrupt event. This is visible when creating and removing a vnet jail in `procstat -t 12`. We can remove it manually with intr_event_destroy(). PR: 254171 MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29211	2021-03-12 12:12:43 +01:00
Wei Hu	a491581f3f	Hyper-V: hn: Enable vSwitch RSC support in hn netvsc driver Receive Segment Coalescing (RSC) in the vSwitch is a feature available in Windows Server 2019 hosts and later. It reduces the per packet processing overhead by coalescing multiple TCP segments when possible. This happens mostly when TCP traffic flows between different guests on the same host. This patch adds netvsc driver support for this feature. The patch also updates NVS version to 6.1 as needed for RSC enablement. MFC after: 2 weeks Sponsored by: Microsoft Differential Revision: https://reviews.freebsd.org/D29075	2021-03-12 04:35:16 +00:00
Kristof Provost	5e9dae8e14	pf: Factor out pf_krule_free() Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29194	2021-03-11 10:39:43 +01:00
Alexander V. Chernikov	b1d63265ac	Flush remaining routes from the routing table during VNET shutdown. Summary: This fixes rtentry leak for the cloned interfaces created inside the VNET. PR: 253998 Reported by: rashey at superbox.pl MFC after: 3 days Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown). Thus, any route table operations are too late to schedule. As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`. It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish. Test Plan: ``` set_skip:set_skip_group_lo -> passed [0.053s] tail -n 200 /var/log/messages \| grep rtentry ``` Reviewers: #network, kp, bz Reviewed By: kp Subscribers: imp, ae Differential Revision: https://reviews.freebsd.org/D29116	2021-03-10 21:10:14 +00:00
Kyle Evans	0dd691b412	iflib: allow clone detach if not yet init If we hit an error during init, then we'll unwind our state and attempt to detach the device -- don't block it. This was discovered by creating a wg0 with missing parameters; said failure ended up leaving this orphaned device in place and ended up panicking the system upon enumeration of the dev.* sysctl space. Reviewed by: gallatin, markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29145	2021-03-09 13:49:13 -06:00
Mark Johnston	ffe3def903	iflib: Make if_shared_ctx_t a pointer to const This structure is shared among multiple instances of a driver, so we should ensure that it doesn't somehow get treated as if there's a separate instance per interface. This is especially important for software-only drivers like wg. DEVICE_REGISTER() still returns a void * and so the per-driver sctx structures are not yet defined with the const qualifier. Reviewed by: gallatin, erj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29102	2021-03-08 12:39:06 -05:00
Tai-hwa Liang	092f3f0812	net: fixing a memory leak in if_deregister_com_alloc() Drain the callbacks upon if_deregister_com_alloc() such that the if_com_free[type] won't be nullified before if_destroy(). Taking fwip(4) as an example, before this fix, kldunload if_fwip will go through the following: 1. fwip_detach() 2. if_free() -> schedule if_destroy() through NET_EPOCH_CALL 3. fwip_detach() returns 4. firewire_modevent(MOD_UNLOAD) -> if_deregister_com_alloc() 5. kernel complains about: Warning: memory type fw_com leaked memory on destroy (1 allocations, 64 bytes leaked). 6. EPOCH runs if_destroy() -> if_free_internal() By this time, if_com_free[if_alloctype] is NULL since it's already nullified by if_deregister_com_alloc(); hence, firewire_free() won't have a chance to release the allocated fw_com. Reviewed by: hselasky, glebius MFC after: 2 weeks	2021-03-06 14:43:16 +00:00
Kristof Provost	29698ed904	pf: Mark struct pf_pdesc as kernel only This structure is only used by the kernel module internally. It's not shared with user space, so hide it behind #ifdef _KERNEL. Sponsored by: Rubicon Communications, LLC ("Netgate")	2021-03-05 09:21:06 +01:00
Kristof Provost	448732b8e2	altq: Increase maximum number of CBQ and HFSC classes In some configurations we need more classes than ALTQ supports by default. Increase the maximum number of classes we allow. This will only cost us a comparatively trivial amount of memory, so there's little reason not to do so. If ever we find we want even more we may want to consider turning these defines into a tunable, but for now do the easy thing. Reviewed by: donner@ MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29034	2021-03-04 20:58:22 +01:00
Kristof Provost	bb4a7d94b9	net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros Introduce convenience macros to retrieve the DSCP, ECN or traffic class bits from an IPv6 header. Use them where appropriate. Reviewed by: ae (previous version), rscheff, tuexen, rgrimes MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29056	2021-03-04 20:56:48 +01:00
Marcin Wojtas	09c3f04ff3	iflib: add support for admin completion queues For interfaces with admin completion queues, introduce a new devmethod IFDI_ADMIN_COMPLETION_HANDLE and a corresponding flag IFLIB_HAS_ADMINCQ. This provides an option for handling any admin cq logic, which cannot be run from an interrupt context. Said method is called from within iflib's admin task, making it safe to sleep. Reviewed by: mmacy Submitted by: Artur Rojek <ar@semihalf.com> Obtained from: Semihalf Sponsored by: Amazon, Inc. Differential Revision: https://reviews.freebsd.org/D28708	2021-03-03 00:40:47 +01:00
Kristof Provost	f5537cd069	bridgestp: Ensure we send STP on VLAN interfaces Reviewed by: donner@ MFC after: 1 week X-MFC-with: `711ed156b9` Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28916	2021-02-25 10:16:25 +01:00
Marcin Wojtas	ef567155d3	Fix powerpc build after `6dd69f0064` Commit `6dd69f0064` ("iflib: introduce isc_dma_width") failed to build on powerpc due to implicit type conversion error. Fix that. Submitted by: Artur Rojek <ar@semihalf.com> Obtained from: Semihalf Sponsored by: Amazon, Inc.	2021-02-25 02:35:41 +01:00
Marcin Wojtas	6dd69f0064	iflib: introduce isc_dma_width Some DMA controllers are unable to address the full host memory space and are instead limited to a subset of address range (e.g. 48-bit). Allow the driver to specify the maximum allowed DMA addressing width (in bits) for the NIC hardware, by introducing a new field in if_softc_ctx. If said field is omitted (set to 0), the lowaddr of DMA window bounds defaults to BUS_SPACE_MAXADDR. Submitted by: Artur Rojek <ar@semihalf.com> Obtained from: Semihalf Sponsored by: Amazon, Inc. Differential Revision: https://reviews.freebsd.org/D28706	2021-02-25 00:25:39 +01:00
Mark Johnston	b6999635b1	iflib: Avoid double counting in rxeof iflib_rxeof() was counting everything twice. This was introduced when pfil hooks were added to the iflib receive path. We want to count rx packets/bytes before the pfil hooks are executed, so remove the counter adjustments that are executed after. PR: 253583 Reviewed by: gallatin, erj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28900	2021-02-24 10:08:53 -05:00
Kristof Provost	38c0951386	bridge: Remove members when assigned to a new vnet When the bridge is moved to a different vnet we must remove all of its member interfaces (and span interfaces), because we don't know if those will be moved along with it. We don't want to hold references to interfaces not in our vnet. Reviewed by: donner@ MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28859	2021-02-23 13:54:07 +01:00
Kristof Provost	89fa9c34d7	bridge/stp: Ensure we enter NET_EPOCH whenever we can send traffic Reviewed by: donner@ MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28858	2021-02-23 13:54:07 +01:00
Kristof Provost	711ed156b9	bridge: Support STP on VLAN devices VLAN devices have type IFT_L2VLAN, so the STP code mistakenly believed they couldn't be used for STP. That's not the case, so add the IFT_L2VLAN to the check. Reviewed by: donner@ MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28857	2021-02-23 13:54:06 +01:00
Alexander V. Chernikov	5964172837	Simplify ifa/ifp refcounting in the routing stack. The routing stack control depends on quite a tree of functions to determine the proper attributes of a route such as a source address (ifa) or transmit ifp of a route. When actually inserting a route, the stack needs to ensure that ifa and ifp point to entities that are still valid. Validity means slightly more than just pointer validity - the stack needs a guarantee that the provided objects are not scheduled for deletion. Currently, callers either ignore it (most ifp parts, historically) or try to use refcounting (ifa parts). Even in the case of ifa refcounting it's not always implemented in a fully-safe manner. For example, some codepaths inside rt_getifa_fib() are referencing ifa while not holding any locks, resulting in the possibility of referencing a scheduled-for-deletion ifa. Instead of trying to fix all of the callers by enforcing proper refcounting, switch to a different model. As the rib_action() already requires epoch, do not require any stability guarantees other than the epoch-provided one. Use newly-added conditional versions of the refcounting functions (ifa_try_ref(), if_try_ref()) and fail if any of these fails. Reviewed by: donner MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28837	2021-02-22 23:37:59 +00:00
Alexander V. Chernikov	7563019bc6	Add if_try_ref() to simplify refcount handling inside epoch. When we have an ifp pointer and the code is running inside epoch, epoch guarantees the pointer will not be freed. However, the following case can still happen: * in thread 1 we drop to refcount=0 for ifp and schedule its deletion. * in thread 2 we use this ifp and reference it * destroy callout kicks in * unhappy user reports a bug This can happen with the current implementation of ifnet_byindex_ref(), as we're not holding any locks preventing ifnet deletion by a parallel thread. To address it, add if_try_ref(), allowing to return failure when referencing ifp with refcount=0. Additionally, guard the existing if_ref() with a KASSERT to provide a cleaner error in such scenarios. Finally, fix ifnet_byindex_ref() by using if_try_ref() and returning NULL if the latter fails. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28836	2021-02-22 23:37:59 +00:00
Alexander V. Chernikov	e5b394f2d0	Fix setting static entries for arp/ndp. rtsock message validation changes committed in `2fe5a79425` did not take llinfo messages into account. Add a special validation case for RTA_GATEWAY llinfo messages. MFC after: 2 days	2021-02-20 18:26:35 +00:00
Mark Johnston	0f9544d03e	iflib: Fix detach of pseudo interfaces In commit `38bfc6dee3` we added an IFDI_DETACH() call to iflib_pseudo_deregister() since it looked like it was missing. One is present in the error-handling path of iflib_pseudo_register(). However, the detach actually comes from the DEVICE_DETACH() method for the above-mentioned device_t, so now we're calling IFDI_DETACH() twice when destroying a pseudo interface. Fix the problem by not calling IFDI_DETACH() from the device detach routine. This way we can ensure that iflib de-initialization always happens in a consistent order. It also ensures that you can't do silly things like "devctl detach <pseudo ifnet>", which would previously detach the driver without tearing down the corresponding ifnet. PR: 253541 Reviewed by: erj MFC after: 1 week Fixes: `38bfc6dee3` Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28774	2021-02-19 17:10:41 -05:00
Alexander V. Chernikov	f9e1cd6c99	Fix arp/ndp deletion broken by `2fe5a79425`. Changes in `2fe5a79425` moved dst sockaddr masking from the routing control plane to the rtsock code. It broke arp/ndp deletion. It turns out, arp/ndp perform an RTM_GET request first to get an interface index necessary for the deletion. Then they simply stamp the reply with RTF_LLDATA and set the command to RTM_DELETE. As a result, the kernel receives a request with non-empty RTA_NETMASK and clears RTA_DST host bits before passing the message to the lla code. De facto, the only needed bits are RTA_DST, RTA_GATEWAY and the subset of rtm_flags. With that in mind, fix the interface by clearing RTA_NETMASK for every message with RTF_LLDATA. While here, clean up arp/ndp code a bit. MFC after: 1 day Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D28804	2021-02-19 21:17:17 +00:00
John Baldwin	2ccf971ace	iflib: Cast the result of iflib_netmap_txq_init() to void. This fixes a warning from GCC for kernels without netmap since the return value is never used. Reviewed by: vmaffione, erj Differential Revision: https://reviews.freebsd.org/D28598	2021-02-19 12:52:53 -08:00
Alexander V. Chernikov	a4513bace0	Fix NOINET6 build broken by `2fe5a79425`. Reported by: mjg	2021-02-16 21:49:48 +00:00
Alexander V. Chernikov	2fe5a79425	Fix dst/netmask handling in routing socket code. Traditionally routing socket code did almost zero checks on the input message except for the most basic size checks. This resulted in the unclear KPI boundary for the routing system code (`rtrequest` and now `rib_action()`) w.r.t message validity. Multiple potential problems and nuances exist: * Host bits in RTAX_DST sockaddr. Existing applications do send prefixes with hostbits uncleared. Even `route(8)` does this, as they hope the kernel would do the job of fixing it. Code inside `rib_action()` needs to handle it on its own (see `rt_maskedcopy()` ugly hack). * There are multiple ways of adding the host route: it can be DST without netmask or DST with /32(/128) netmask. Also, RTF_HOST has to be set correspondingly. Currently, these 2 options create 2 DIFFERENT routes in the kernel. * no sockaddr length/content checking for the "secondary" fields exists: nothing stops an rtsock application from sending sockaddr_in with length of 25 (instead of 16). Kernel will accept it, install to RIB as is and propagate to all rtsock consumers, potentially triggering bugs in their code. Same goes for sin_port, sin_zero, etc. The goal of this change is to make rtsock verify all sockaddr and prefix consistency. Said differently, `rib_action()` or internals should NOT need to change any of the sockaddrs supplied by `rt_addrinfo` structure due to incorrectness. To be more specific, this change implements the following: * sockaddr cleanup/validation check is added immediately after getting sockaddrs from rtm. * Per-family dst/netmask checks clear host bits in dst and zero all dst/netmask "secondary" fields. * The same netmask checking code converts /32(/128) netmasks to "host" route case (NULL netmask, RTF_HOST), removing the dualism. * Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only actually supported ones (inet, inet6, link). 
* Automatically convert `sockaddr_sdl` (AF_LINK) gateways to `sockaddr_sdl_short`. Reported by: Guy Yur <guyyur at gmail.com> Reviewed By: donner Differential Revision: https://reviews.freebsd.org/D28668 MFC after: 3 days	2021-02-16 20:30:04 +00:00
Alexander V. Chernikov	600eade2fb	Add ifa_try_ref() to simplify ifa handling inside epoch. More and more code migrates from lock-based protection to the NET_EPOCH umbrella. It requires some logic changes, including, notably, refcount handling. When we have an `ifa` pointer and we're running inside epoch we're guaranteed that this pointer will not be freed. However, the following case can still happen: * in thread 1 we drop to 0 refcount for ifa and schedule its deletion. * in thread 2 we use this ifa and reference it * destroy callout kicks in * unhappy user reports bug To address it, a new `ifa_try_ref()` function is added, which returns failure when we try to reference an `ifa` with 0 refcount. Additionally, the existing `ifa_ref()` is enforced with `KASSERT` to provide a cleaner error in such scenarios. Reviewed By: rstone, donner Differential Revision: https://reviews.freebsd.org/D28639 MFC after: 1 week	2021-02-16 20:14:50 +00:00
Allan Jude	922cf8ac43	Use iflib_if_init_locked() during media change instead of iflib_init_locked(). iflib_init_locked() assumes that iflib_stop() has been called, however, it is not called for media changes. iflib_if_init_locked() calls stop then init, so fixes the problem. PR: 253473 MFC after: 3 days Reviewed by: markj Sponsored by: Juniper Networks, Inc., Klara, Inc. Differential Revision: https://reviews.freebsd.org/D28667	2021-02-16 19:02:00 +00:00
Alexander V. Chernikov	64d5c27777	Remove now-unused RTF_RNH_LOCKED route flag. MFC after: 1 week	2021-02-15 20:49:59 +00:00
Alexander V. Chernikov	a375ec52a7	Fix ifa refcount leak during route addition. Reported by: rstone Reviewed by: rstone MFC after: 1 day	2021-02-13 00:06:14 +00:00
Alexander V. Chernikov	8ca99aecf7	Fix various NOINET* builds broken by `145bf6c0af`. Reported by: mjg, bdragon	2021-02-12 20:36:20 +00:00
Alexander V. Chernikov	8170a7d438	Fix interface route addition with net/bird. The case of adding interface route by specifying interface address as the gateway was missed during code refactoring. Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY is not set. Reviewed by: donner MFC after: 3 days	2021-02-12 19:45:35 +00:00
Alexander V. Chernikov	145bf6c0af	Fix blackhole/reject routes. Traditionally the *BSD routing stack required some interface data to be supplied for blackhole/reject routes. This led to a variety of hacks in routing daemons when inserting such routes. With the recent routing stack changes, gateway sockaddr without RTF_GATEWAY started to be treated differently, purely as link identifier. This change broke net/bird, which installs blackhole routes with 127.0.0.1 gateway without RTF_GATEWAY flags. Fix this by automatically constructing necessary gateway data at rtsock level if RTF_REJECT/RTF_BLACKHOLE is set. Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl> Reviewed by: donner MFC after: 1 week	2021-02-11 23:08:55 +00:00
Kristof Provost	6d2a10d96f	Widen ifnet_detach_sxlock coverage Widen the ifnet_detach_sxlock to cover the entire vnet sysuninit code. This ensures that we can't end up having the vnet_sysuninit free the UDP pcb while the detach code is running and trying to purge the UDP pcb. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28530	2021-02-11 16:12:29 +01:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertently. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	adc4ea97bd	Turn off forgotten multipath debug messages Reported by: mike tancsa<mike at sentex.net> MFC after: 3 days	2021-02-08 21:42:20 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we lose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Alexander V. Chernikov	eb0b1b33d5	Enable multipath routing by default. ROUTE_MPATH was added to the GENERIC kernel in r368648. According to the plan in D27428, it was enabled with `net.route.multipath` sysctl set to 0. Given enough time has passed, this change enables route multipath by default. The goal is to ship FreeBSD 13 with multipath turned on. Reviewed By: donner, olivier MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28423	2021-02-03 08:49:58 +00:00
Sai Rajesh Tallamraju	38bfc6dee3	iflib: Free resources in a consistent order during detach Memory and PCI resources are freed with no particular order. This could cause use-after-frees when detaching following a failed attach. For instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but iflib_tqg_detach() attempts to access this array. Similarly, adapter queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to access adapter queues to free PCI resources. MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D27634	2021-02-01 11:15:54 -05:00
Jonah Caplan	88be0e1120	bridge: fix STP roles and protos strings Add the missing commas that got lost in `e5539fb618`. PR: 252532 Reviewed by: kp@, donner@, freqlabs@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28425	2021-02-01 15:27:06 +01:00
Alexander V. Chernikov	78c93a1721	Use process fib for inet/inet6 fib_algo sysctls. This allows to set/query fib algo for non-default fibs. MFC after: 3 days	2021-01-31 10:50:08 +00:00
Alexander V. Chernikov	151ec796a2	Fix the design problem with delayed algorithm sync. Currently, if an immutable algorithm like bsearch or radix_lockless receives an rtable update notification, it schedules an algorithm rebuild. This rebuild is executed by the callout after ~50 milliseconds. It is possible that a script adding an interface address and then a route with the gateway bound to that address will fail. It can happen due to the fact that fib is not updated by the time the route addition request arrives. Fix this by allowing synchronous algorithm rebuilds based on certain conditions. By default, these conditions assume: 1) less than net.route.algo.fib_sync_limit=100 routes 2) routes without gateway. * Move algo instance build entirely under rib WLOCK. Rib lock is only used for control plane (except radix algo, but there are no rebuilds). * Add rib_walk_ext_locked() function to allow RIB iteration with rib lock already held. * Fix rare potential callout use-after-free for fds by binding fd callout to the relevant rib rmlock. In that case, callout_stop() under rib WLOCK guarantees no callout will be executed afterwards. MFC after: 3 days	2021-01-30 23:25:57 +00:00
Alexander V. Chernikov	dd9163003c	Add rib_subscribe_locked() and rib_unsubsribe_locked() to support subscriptions during RIB modifications. Add new subscriptions to the beginning of the lists instead of the end. This fixes the situation when a new subscription is created in the callback for the existing subscription, leading to the subscription notification handler picking it up. MFC after: 3 days	2021-01-30 23:25:57 +00:00
Alexander V. Chernikov	ab6d9aaed7	Move business logic from rebuild_fd_callout() into rebuild_fd(). This simplifies code a bit and allows for future non-callout callers to request rebuild. MFC after: 3 days	2021-01-30 23:25:57 +00:00
Alexander V. Chernikov	f8b7ebea49	Improve fib_algo debug messages. * Move per-prefix debug lines under LOG_DEBUG2 * Create fib instance counter to distinguish log messages between instances * Add more messages on rebuild reason. MFC after: 3 days	2021-01-30 23:25:56 +00:00
Alexander V. Chernikov	cb984c62d7	Fix multipath support for rib_lookup_info(). The initial plan was to remove rib_lookup_info() before FreeBSD 13. As several customers are still remaining, fix rib_lookup_info() for the multipath use case.	2021-01-29 23:14:24 +00:00
Alexander V. Chernikov	53729367d3	Fix subinterface vlan creation. D26436 introduced support for stacked vlans that changed the way vlans are configured. In particular, this change broke setups that have same-number vlans as subinterfaces. Vlan support was initially created assuming "vlanX" semantics. In this paradigm, automatic number assignment supported by cloning (ifconfig vlan create) was a natural fit. When "ifaceX.Y" support was added, allowing the same vlan number on multiple devices, cloning code became more complex, as there is no unified "vlan" namespace anymore. Such interfaces got the first spare index from "vlan" cloner. This, in turn, led to the following problem: ifconfig ix0.333 create -> index 1 ifconfig ix0.444 create -> index 2 ifconfig vlan2 create -> allocation failure This change fixes such allocations by using cloning indexes only for "vlanX" interfaces. Reviewed by: hselasky MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D27505	2021-01-29 21:43:20 +00:00
Gleb Smirnoff	3f43ada98c	Catch up with `6edfd179c8`: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG. Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.	2021-01-29 11:46:24 -08:00
Randall Stewart	1a714ff204	This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357	2021-01-28 11:53:05 -05:00
Kristof Provost	35dabb7b9c	altq: Fix typo in features sysctl description Reported by: Jose Luis Duran	2021-01-27 16:42:14 +01:00
Kristof Provost	27b2aa4938	altq: Remove unused arguments from altq_attach() Minor cleanup, no functional change. Reviewed by: donner@ Differential Revision: https://reviews.freebsd.org/D28304	2021-01-25 19:58:22 +01:00
Kristof Provost	e111d79806	Add FEATURE sysctls for ALTQ disciplines This will allow userspace to more easily figure out if ALTQ is built into the kernel and what disciplines are supported. Reviewed by: donner@ Differential Revision: https://reviews.freebsd.org/D28302	2021-01-25 19:58:22 +01:00
Vincenzo Maffione	f80efe5016	iflib: netmap: move per-packet operation out of fragments loop MFC after: 1 week	2021-01-24 21:38:59 +00:00
Vincenzo Maffione	aceaccab65	iflib: netmap: add support for NS_MOREFRAG The NS_MOREFRAG flag can be set in a netmap slot to represent a multi-fragment packet. Only the last fragment of a packet does not have the flag set. On TX rings, the flag may be set by the userspace application. The kernel will look at the flag and use it to properly set up the NIC TX descriptors. On RX rings, the kernel may set the flag if the packet received was split across multiple netmap buffers. The userspace application should look at the flag to know when the packet is complete. Submitted by: rajesh1.kumar_amd.com Reviewed by: vmaffione MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27799	2021-01-24 21:20:59 +00:00
Andrew Gallatin	0c864213ef	iflib: Fix a NULL pointer deref rxd_frag_to_sd() can have the pf_rv parameter as NULL with the current code. This patch fixes the NULL pointer dereference in that case, thus avoiding a possible panic. Submitted by: rajesh1.kumar at amd.com Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D28115	2021-01-21 09:47:06 -05:00
Alexander V. Chernikov	9d6567bc30	Fix panic on vnet creation if fib algo has been set to fixed value. Make fixed algo property per-VNET instead of global.	2021-01-17 20:32:25 +00:00
Alexander V. Chernikov	f9e0752e35	Create new in6_purgeifaddr() which purges bound ifa prefix if it gets unused. Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6 ifaddrs. in6_purgeaddr() does not trigger prefix removal if the number of linked ifas goes to 0, as this is a low-level function. As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but keeps corresponding IPv6 prefixes. Fix this by creating a higher-level wrapper which handles the unused prefix use case and use it in if_purgeifaddrs(). Differential revision: https://reviews.freebsd.org/D28128	2021-01-17 20:32:25 +00:00
Alexander V. Chernikov	81728a538d	Split rtinit() into multiple functions. rtinit[1]() is a function used to add or remove interface address prefix routes, similar to ifa_maintain_loopback_route(). It was intended to be family-agnostic. There is a problem with this approach in reality. 1) IPv6 code does not use it for the ifa routes. There is a separate layer, nd6_prelist_(), providing an interface for maintaining interface routes. Its part, responsible for the actual route table interaction, mimics rtenty() code. 2) rtinit tries to combine multiple actions in the same function: constructing proper route attributes and handling iterations over multiple fibs, for the non-zero net.add_addr_allfibs use case. It notably increases the code complexity. 3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag for p2p connections, host routes and p2p routes are handled in the same way. Additionally, mapping IFA flags to RTF flags makes the interface pretty messy. It makes rtinit() clash with ifa_maintain_loopback_route() for IPv4 interface aliases. 4) rtinit() is the last customer passing non-masked prefixes to rib_action(), complicating rib_action() implementation. 5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive" ifa messages in certain corner cases. To address all these points, the following has been done: * rtinit() has been split into multiple functions: - Route attribute construction was moved to the per-address-family functions, dealing with (2), (3) and (4). - function providing net.add_addr_allfibs handling and route rtsock notifications is the new routing table interface. - rtsock ifa notification has been moved out as well. The resulting set of functions is only responsible for the actual route notifications. 
Side effects: * /32 alias does not result in interface routes (/32 route and "host" route) * RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses Differential revision: https://reviews.freebsd.org/D28186	2021-01-16 22:42:41 +00:00
Alexander V. Chernikov	a6b7689718	Remove redundant rtinit() calls from tuntap. Removed code iterates over if_addrhead and tries to remove routes for each ifa. This is exactly what if_purgeaddrs() does, and if_purgeaddrs() is already called in the end. Reviewed by: glebius MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D28106	2021-01-13 10:03:15 +00:00
Ryan Libby	c86fa3b8d7	pf: quiet -Wredundant-decls for pf_get_ruleset_number In `e86bddea9f` sys/netpfil/pf/pf.h grew a declaration of pf_get_ruleset_number. Now delete the old declaration from sys/net/pfvar.h. Reviewed by: kp Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D28081	2021-01-10 21:53:15 -08:00
Alexander V. Chernikov	685de460bc	Use static initializers for fib algo to shift initialization to an earlier stage. This allows registering modules loaded at boot time. Reported by: olivier	2021-01-11 00:16:54 +00:00
Vincenzo Maffione	55f0ad5fde	netmap: restore hwofs and support it in iflib Restore the hwofs functionality temporarily disabled by `7ba6ecf216` to prevent issues with iflib. This patch brings the necessary changes to iflib to enable hwofs to allow interface restarts without disrupting netmap applications actively using its rings. After this change, it becomes possible for multiple non-cooperating netmap applications to use non-overlapping subsets of the available netmap rings without clashing with each other. PR: 252453 MFC after: 1 week	2021-01-10 22:51:15 +00:00
Vincenzo Maffione	8aa8484cbf	iflib: fix build failure in case DEV_NETMAP is not defined This addresses the build failure introduced by `3d65fd97e8`. MFC with: `3d65fd97e8`	2021-01-10 14:43:58 +00:00
Vincenzo Maffione	4ba9ad0dc3	iflib: add assert to prevent out-of-bounds array access The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs for each rxqset, and links each struct to a different free list. As a result, it must be isc_nrxqs >= isc_nfl (plus the completion queue, if present). Add an assertion to make this constraint explicit. MFC after: 2 weeks	2021-01-10 13:59:20 +00:00
Vincenzo Maffione	3d65fd97e8	netmap: iflib: enable/disable krings on any interface reinit Since `1d238b07d5`, krings are disabled before a reinit cycle triggered by iflib_netmap_register. However, this operation is actually necessary also for any interface reinit triggered by other causes (i.e., ifconfig commands). We achieve this goal by moving the krings enable/disable operation inside iflib_stop() and iflib_init_locked(). Once here, this change also removes some redundant operations from iflib_netmap_register(), that are already performed by iflib_stop(). PR: 252453 MFC after: 1 week	2021-01-10 12:04:08 +00:00
Vincenzo Maffione	3189ba6167	netmap: iflib: fix asserts in netmap_fl_refill() When netmap_fl_refill() is called at initialization time (e.g., during netmap_iflib_register()), nic_i must be 0, since the free list is reinitialized. At the end of the refill cycle, nic_i must still be zero, because exactly N descriptors (N is the ring size) are refilled. This patch therefore fixes the assertions to check on nic_i rather than on nm_i. The current netmap_reset() may in fact cause nm_i to be != 0 while the device is resetting: this may happen when multiple non-cooperating processes open different subsets of the available netmap rings. PR: 252518 MFC after: 1 week	2021-01-09 21:35:07 +00:00
Vincenzo Maffione	1d238b07d5	netmap: iflib: stop krings during interface reset When different processes open separate subsets of the available rings of a same netmap interface, a device reset may be performed while one of the processes is actively using some rings (e.g., caused by another process executing a nmport_open()). With this patch, such situation will cause the active process to get a POLLERR, so that it can have a chance to detect the situation. We also guarantee that no process is running a txsync or rxsync (ioctl or poll) while an iflib device reset is in progress. PR: 252453 MFC after: 1 week	2021-01-09 21:01:46 +00:00
Matt Macy	81be655266	iflib: ensure that tx interrupts enabled and cleanups Doing a 'dd' over iscsi will reliably cause stalls. Tx cleaning _should_ reliably happen as data is sent. However, currently if the transmit queue fills it will wait until the iflib timer (hz/2) runs. This change causes the tx taskq thread to be run if there are completed descriptors. While here: - make timer interrupt delay a sysctl - simplify txd_db_check handling - comment on INTR types Background on the change: Initially doorbell updates were minimized by only writing to the register on every fourth packet. If txq_drain would return without writing to the doorbell it scheduled a callout on the next tick to do the doorbell write to ensure that the write otherwise happened "soon". At that time a sysctl was added for users to avoid the potential added latency by simply writing to the doorbell register on every packet. This worked perfectly well for e1000 and ixgbe ... and appeared to work well on ixl. However, as it turned out there was a race in this approach that would lock up the ixl MAC. It was possible for a lower producer index to be written after a higher one. On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial response was to add a lock around doorbell writes - fixing the problem but adding an unacceptable amount of lock contention. The next iteration was to use transmit interrupts to drive delayed doorbell writes. If there were no packets in the queue all doorbell writes would be immediate; as the queue started to fill up we could delay doorbell writes further and further. At the start of drain if we've cleaned any packets we know we've moved the state machine along and we write the doorbell (an obvious missing optimization was to skip that doorbell write if db_pending is zero). This change required that tx interrupts be scheduled periodically as opposed to just when the hardware txq was full. However, that just leads to our next problem. 
Initially dedicated msix vectors were used for both tx and rx. However, it was often possible to use up all available vectors before we set up all the queues we wanted. By having rx and tx share a vector for a given queue we could halve the number of vectors used by a given configuration. The problem here is that with this change only e1000 passed the necessary value to have the fast interrupt drive tx when appropriate. Reported by: mav@ Tested by: mav@ Reviewed by: gallatin@ MFC after: 1 month Sponsored by: iXsystems Differential Revision: https://reviews.freebsd.org/D27683	2021-01-07 14:07:35 -08:00
Alexander V. Chernikov	d68cf57b7f	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011	2021-01-07 19:38:19 +00:00
Kristof Provost	5a3b9507d7	pf: Convert pfi_kkif to use counter_u64 Improve caching behaviour by using counter_u64 rather than variables shared between cores. The result of converting all counters to counter(9) (i.e. this full patch series) is a significant improvement in throughput. As tested by olivier@, on Intel Xeon E5-2697Av4 (16Cores, 32 threads) hardware with Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4) nics we see: x FreeBSD 20201223: inet packets-per-second + FreeBSD 20201223 with pf patches: inet packets-per-second +--------------------------------------------------------------------------+ \| + \| \| xx + \| \|xxx +++\| \|\|A\| \| \| \|A\|\| +--------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 5 9216962 9526356 9343902 9371057.6 116720.36 + 5 19427190 19698400 19502922 19546509 109084.92 Difference at 95.0% confidence 1.01755e+07 +/- 164756 108.584% +/- 2.9359% (Student's t, pooled s = 112967) Reviewed by: philip MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27763	2021-01-05 23:35:37 +01:00
Kristof Provost	26c841e2a4	pf: Allocate and free pfi_kkif in separate functions Factor out allocating and freeing pfi_kkif structures. This will be useful when we change the counters to be counter_u64, so we don't have to deal with that complexity in the multiple locations where we allocate pfi_kkif structures. No functional change. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27762	2021-01-05 23:35:37 +01:00
Kristof Provost	320c11165b	pf: Split pfi_kif into a user and kernel space structure No functional change. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27761	2021-01-05 23:35:37 +01:00
Kristof Provost	c3adacdad4	pf: Change pf_krule counters to use counter_u64 This improves the cache behaviour of pf and results in improved throughput. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27760	2021-01-05 23:35:37 +01:00
Kristof Provost	c7bdafe2f1	pf: Remove unused fields from pf_krule The u_* counters are used only to communicate with userspace, as userspace cannot use counter_u64. As pf_krule is not passed to userspace these fields are now obsolete. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27759	2021-01-05 23:35:36 +01:00
Kristof Provost	e86bddea9f	pf: Split pf_rule into kernel and user space versions No functional change intended. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27758	2021-01-05 23:35:36 +01:00
Kristof Provost	dc865dae89	pf: Migrate pf_rule and related structs to pf.h As part of the split between user and kernel mode structures we're moving all user space usable definitions into pf.h. No functional change intended. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27757	2021-01-05 23:35:36 +01:00
Kristof Provost	fbbf270eef	pf: Use counter_u64 in pf_src_node Reviewed by: philip MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27756	2021-01-05 23:35:36 +01:00
Kristof Provost	17ad7334ca	pf: Split pf_src_node into a kernel and userspace struct Introduce a kernel version of struct pf_src_node (pf_ksrc_node). This will allow us to improve the in-kernel data structure without breaking userspace compatibility. Reviewed by: philip MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27707	2021-01-05 23:35:36 +01:00
Alexander V. Chernikov	9c0ff6a8bb	Remove now-unused RT_GATEWAY* definitions. They were used to simplify nexthop transition, hence not needed anymore.	2021-01-04 21:45:46 +00:00
Hans Petter Selasky	747feea146	Streamline the infiniband code according to the ethernet code. Fix LINT-NOIP kernel build. Submitted by: rlibby @ Differential Revision: https://reviews.freebsd.org/D27861 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-31 10:07:02 +01:00
Hans Petter Selasky	ec52ff6d14	Streamline the infiniband code according to the ethernet code. Specifically implement the if_requestencap callback function for infiniband. Most of the changes are simply a cut and paste of the equivalent ethernet part. Reviewed by: melifaro @ Differential Revision: https://reviews.freebsd.org/D27631 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-29 18:01:57 +01:00
Hans Petter Selasky	19ecb5e8da	Fix for IPoIB over lagg(4). Need to update both link layer address and broadcast address when active link changes for IP over infiniband. This is because the broadcast address contains the so-called P-key, which is interface dependent. Reviewed by: kib @ Differential Revision: https://reviews.freebsd.org/D27658 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-29 17:35:06 +01:00
Ryan Libby	833dbf1e22	route: quiet -Wredundant-decls Remove declaration duplicated in `f5baf8bb12` Reviewed by: melifaro Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27790	2020-12-27 16:32:27 -08:00
Alexander V. Chernikov	f733d9701b	Fix default route handling in radix4_lockless algo. Improve nexthop debugging. Reported by: Florian Smeets <flo at smeets.xyz>	2020-12-26 22:51:02 +00:00
Alexander V. Chernikov	4e19e0d92a	Use light-weight versions of routing lookup functions in ng_netflow. Use recently-added combination of `fib[46]_lookup_rt()` which returns rtentry & raw nexthop with `rt_get_inet[6]_plen()` which returns address/prefix length of prefix inside `rt`. Add `nhop_select_func()` wrapper around inlined `nhop_select()` to allow callers external to the routing subsystem to select the proper nexthop from the multipath group without including internal headers. New calls do not require reference counting objects and reduce the amount of copied/processed rtentry data. Differential Revision: https://reviews.freebsd.org/D27675	2020-12-26 11:27:38 +00:00
Alexander V. Chernikov	f5baf8bb12	Add modular fib lookup framework. This change introduces framework that allows to dynamically attach or detach longest prefix match (lpm) lookup algorithms to speed up datapath route tables lookups. Framework takes care of handling initial synchronisation, route subscription, nhop/nhop groups reference and indexing, dataplane attachments and fib instance algorithm setup/teardown. Framework features automatic algorithm selection, allowing for picking the best matching algorithm on-the-fly based on the amount of routes in the routing table. Currently framework code is guarded under FIB_ALGO config option. An idea is to enable it by default in the next couple of weeks. The following algorithms are provided by default: IPv4: * bsearch4 (lockless binary search in a special IP array), tailored for small-fib (<16 routes) * radix4_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix4 (base system radix backend) * dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless data structure, optimized for large-fib (D27412) IPv6: * radix6_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix6 (base system radix backend) * dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless data structure, optimized for large-fib (D27412) Performance changes: Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604): IPv4: 8 routes: radix4: ~20mpps radix4_lockless: ~24.8mpps bsearch4: ~69mpps dpdk_lpm4: ~67 mpps 700k routes: radix4_lockless: 3.3mpps dpdk_lpm4: 46mpps IPv6: 8 routes: radix6_lockless: ~20mpps dpdk_lpm6: ~70mpps 100k routes: radix6_lockless: 13.9mpps dpdk_lpm6: 57mpps Forwarding benchmarks: + 10-15% IPv4 forwarding performance (small-fib, bsearch4) + 25% IPv4 forwarding performance (full-view, dpdk_lpm4) + 20% IPv6 forwarding performance (full-view, dpdk_lpm6) Control: Framework adds the following runtime sysctls: List algos * net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4 * net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6 Debug level (7=LOG_DEBUG, per-route) net.route.algo.debug_level: 5 Algo selection (currently only for fib 0): net.route.algo.inet.algo: bsearch4 net.route.algo.inet6.algo: radix6_lockless Support for manually changing algos in non-default fib will be added soon. Some sysctl names will be changed in the near future. Differential Revision: https://reviews.freebsd.org/D27401	2020-12-25 11:33:17 +00:00
Ryan Libby	2fb4a03d55	rtsock: quiet -Wunused-variable in LINT-NOIP kernels Fixup after r368769 / `d68fb8d978`. Reported by: mjg Reviewed by: melifaro Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27730	2020-12-24 12:34:18 -08:00
Mark Johnston	92be2847e8	rtsock: Avoid copying uninitialized padding bytes When copying sockaddrs out to userspace, we pad them to a multiple of the platform alignment (sizeof(long)). However, some sockaddr sizes, such as struct sockaddr_dl, are not an integer multiple of the alignment, so we may end up copying out uninitialized bytes. Fix this by always bouncing through a pre-zeroed sockaddr_storage. Reported by: KASAN Reviewed by: melifaro MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27729	2020-12-23 11:16:40 -05:00
Kristof Provost	1c00efe98e	pf: Use counter(9) for pf_state byte/packet tracking This improves cache behaviour by not writing to the same variable from multiple cores simultaneously. pf_state is only used in the kernel, so can be safely modified. Reviewed by: Lutz Donnerhacke, philip MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D27661	2020-12-23 12:03:21 +01:00
Kristof Provost	c3f69af03a	pf: Fix unaligned checksum updates The algorithm we use to update checksums only works correctly if the updated data is aligned on 16-bit boundaries (relative to the start of the packet). Import the OpenBSD fix for this issue. PR: 240416 Obtained from: OpenBSD MFC after: 1 week Reviewed by: tuexen (previous version) Differential Revision: https://reviews.freebsd.org/D27696	2020-12-23 12:03:20 +01:00
Hans Petter Selasky	ddce63fcb6	Remove not needed variable initialization. And switch from int to bool while at it. Reviewed by: melifaro@ Differential Revision: https://reviews.freebsd.org/D27725 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-23 12:04:46 +01:00
Konstantin Belousov	994e47023a	vxlan: stop checking CSUM_ENCAP_VXLAN when converting inner CSUM flags into normal, for decapsulation. The packet, if processed at this point, was already parsed to be UDP directed to a vxlan port. Connect-X 4+ does not provide easy method to infer which parser processed the packet, so driver cannot set the flag without a lot of efforts which are only to satisfy the formal requirements. Reviewed by: bryanv, np Sponsored by: Mellanox Technologies/NVidia Networking Differential revision: https://reviews.freebsd.org/D27449 MFC after: 1 week	2020-12-23 10:54:06 +02:00
Alexander V. Chernikov	d68fb8d978	Switch direct rt fields access in rtsock.c to newly-created field accessors. rtsock code was built around the assumption that each rtentry record in the system radix tree is a ready-to-use sockaddr. This assumption turned out to be not quite true: * masks have their length tweaked, so we have rtsock_fix_netmask() hack * IPv6 addresses have their scope embedded, so we have another explicit deembedding hack. Change the code to decouple rtentry internals from rtsock code using newly-created rtentry accessors. This will allow to eventually eliminate both of the hacks and change rtentry dst/mask format. Differential Revision: https://reviews.freebsd.org/D27451	2020-12-18 22:00:57 +00:00
Brooks Davis	f3f2ee76ad	style(9): Correct whitespace in struct definitions struct ifconf and struct ifreq use the odd style "struct<tab>foo". struct ifdrv seems to have tried to follow this but was committed with spaces in place of most tabs resulting in "struct<space><space>ifdrv". MFC after: 3 days	2020-12-11 01:00:07 +00:00
Gleb Smirnoff	5ee33a9076	Fixup r368446 with KERN_TLS.	2020-12-08 23:54:09 +00:00
Gleb Smirnoff	e1074ed6a0	The list of ports in configuration path shall be protected by locks, epoch shall be used only for fast path. Thus use LAGG_XLOCK() in lagg_[un]register_vlan. This fixes sleeping in epoch panic. PR: 240609	2020-12-08 16:46:00 +00:00
Gleb Smirnoff	87bf9b9cbe	Convert LAGG_RLOCK() to NET_EPOCH_ENTER(). No functional changes.	2020-12-08 16:36:46 +00:00
Mark Johnston	c065d4e5e9	iflib: Avoid leaking the freelist bitmaps upon driver detach Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com> MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D27342	2020-12-07 14:53:14 +00:00
Mark Johnston	102540192c	iflib: Detach tasks upon device registration failure In some error paths we would fail to detach from the iflib taskqueue groups. Also move the detach code into its own subroutine instead of duplicating it. Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com> MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D27342	2020-12-07 14:52:57 +00:00
Alexander V. Chernikov	df9053920f	Add IPv4/IPv6 rtentry prefix accessors. Multiple consumers like ipfw, netflow or new route lookup algorithms need to get the prefix data out of struct rtentry. Instead of providing direct access to the rtentry, create IPv4/IPv6 accessors to abstract struct rtentry internals and avoid including internal routing headers for external consumers. While here, move struct route_nhop_data to the public header, so external customers can actually use lookup functions returning rt&nhop data. Differential Revision: https://reviews.freebsd.org/D27416	2020-12-03 22:23:57 +00:00
Kristof Provost	7f883a9b5b	net: Revert vnet/epair cleanup race mitigation Revert the mitigation code for the vnet/epair cleanup race (done in r365457). r368237 introduced a more reliable fix. MFC after: 2 weeks Sponsored by: Modirum MDPay	2020-12-01 16:34:43 +00:00
Kristof Provost	e133271fc1	if: Fix panic when destroying vnet and epair simultaneously When destroying a vnet and an epair (with one end in the vnet) we often panicked. This was the result of the destruction of the epair, which destroys both ends simultaneously, happening while vnet_if_return() was moving the struct ifnet to its home vnet. This can result in a freed ifnet being re-added to the home vnet V_ifnet list. That in turn panics the next time the ifnet is used. Prevent this race by ensuring that vnet_if_return() cannot run at the same time as if_detach() or epair_clone_destroy(). PR: 238870, 234985, 244703, 250870 MFC after: 2 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27378	2020-12-01 16:23:59 +00:00
Alexander V. Chernikov	77df2c21cb	Renumber NHR_* flags after NHR_IFAIF removal in r368127. Suggested by: rpokala	2020-11-30 21:42:55 +00:00
Alexander V. Chernikov	d1d941c5b9	Remove RADIX_MPATH config option. ROUTE_MPATH is the new config option controlling new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D27244	2020-11-29 19:43:33 +00:00
Matt Macy	2338da0373	Import kernel WireGuard support Data path largely shared with the OpenBSD implementation by Matt Dunwoodie <ncon@nconroy.net> Reviewed by: grehan@freebsd.org MFC after: 1 month Sponsored by: Rubicon LLC, (Netgate) Differential Revision: https://reviews.freebsd.org/D26137	2020-11-29 19:38:03 +00:00
Alexander V. Chernikov	3b1654cb14	Introduce rib_walk_ext_internal() to allow iteration with rnh pointer. This solves the case when rib is not yet attached/detached to/from the system rib array. Differential Revision: https://reviews.freebsd.org/D27406	2020-11-29 13:54:49 +00:00
Alexander V. Chernikov	f47fa26065	Add nhop_ref_any() to unify referencing nhop or nexthop group. It allows code within routing subsystem to transparently reference nexthops and nexthop groups, similar to nhop_free_any(), abstracting ROUTE_MPATH details. Differential Revision: https://reviews.freebsd.org/D27410	2020-11-29 13:52:06 +00:00
Alexander V. Chernikov	b712e3e343	Refactor fib4/fib6 functions. No functional changes. * Make lookup path of fib<4|6>_lookup_debugnet() separate functions (fib<46>_lookup_rt()). These will be used in the control plane code requiring unlocked radix operations and actual prefix pointer. * Make lookup part of fib<4|6>_check_urpf() separate functions. This change simplifies the switch to alternative lookup implementations, which helps algorithmic lookups introduction. * While here, use static initializers for IPv4/IPv6 keys Differential Revision: https://reviews.freebsd.org/D27405	2020-11-29 13:41:49 +00:00
Alexander V. Chernikov	98d5c4e5c8	Add tracking for rib/nhops/nhgrp objects and provide cumulative number accessors. The resulting KPI can be used by routing table consumers to estimate the required scale for route table export. * Add tracking for rib routes * Add accessors for number of nexthops/nexthop objects * Simplify rib_unsubscribe: store rnh we're attached to instead of requiring it up again on destruction. This helps in the cases when rnh is not linked yet/already unlinked. Differential Revision: https://reviews.freebsd.org/D27404	2020-11-29 13:27:24 +00:00
Alexander V. Chernikov	ef6ef7e5da	Add nhgrp_get_idx() as a counterpart for nhop_get_idx(). It allows the routing-related code to reference nexthop groups by index instead of storing a pointer.	2020-11-28 15:46:40 +00:00
Alexander V. Chernikov	7a6dc73c98	Cleanup nexthops request flags: * remove NHR_IFAIF as it was used by previous version of nexthop KPI * update NHR_REF description	2020-11-28 15:11:59 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Kristof Provost	bca0e1d2ac	if: Fix non-VIMAGE build if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not enabled. MFC after: 2 weeks Sponsored by: Modirum MDPay	2020-11-25 17:15:24 +00:00
Kristof Provost	a779388f8b	if: Protect V_ifnet in vnet_if_return() When we terminate a vnet (i.e. jail) we move interfaces back to their home vnet. We need to protect our access to the V_ifnet CK_LIST. We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove()) waits for net epoch callback completion. That's not possible from NET_EPOCH. Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to move and, once we've released the lock, move them back to their home vnet. We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a LOR between ifnet_sx, in_multi_sx and iflib ctx lock. Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as we do the list manipulation, but do not hold it as we if_vmove(). Reviewed by: melifaro MFC after: 2 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27279	2020-11-25 15:07:22 +00:00
Kristof Provost	a60100fdfc	if: Remove ifnet_rwlock It no longer serves any purpose, as evidenced by the fact that we never take it without ifnet_sxlock. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27278	2020-11-25 10:56:38 +00:00
Alexander V. Chernikov	7511a63825	Refactor rib iterator functions. * Make rib_walk() order of arguments consistent with the rest of RIB api * Add rib_walk_ext() allowing to exec callback before/after iteration. * Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del * Rename rt_forach_fib_walk -> rib_foreach_table_walk * Move rib_foreach_table_walk{_del} to route/route_helpers.c * Slightly refactor rib_foreach_table_walk{_del} to make the implementation consistent and prepare for upcoming iterator optimizations. Differential Revision: https://reviews.freebsd.org/D27219	2020-11-22 20:21:10 +00:00
Mitchell Horne	70af7ce99a	Make net/ifq.h C++ friendly Don't use "new" as an identifier, and add explicit casts from void *. As a general policy, FreeBSD doesn't make any C++ compatibility guarantees for kernel headers like it does for userland, but it is a small effort to do so in this case, to the benefit of a downstream consumer (NetApp). Reviewed by: rscheff Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27286	2020-11-20 14:45:45 +00:00
Andrew Gallatin	8732245d29	LACP: When suppressing distributing, return ENOBUFS When links come and go, lacp goes into a "suppress distributing" mode where it drops traffic for 3 seconds. When in this mode, lagg/lacp historically drops traffic with ENETDOWN. That return value causes TCP to close any connection where it gets that value back from the lower parts of the stack. This means that any TCP connection with active traffic during a 3-second window when an LACP link comes or goes would get closed. TCP treats return values of ENOBUFS as transient errors, and re-schedules transmission later. So rather than returning ENETDOWN, let's return ENOBUFS instead. This allows TCP connections to be preserved. I've tested this by repeatedly bouncing links on a Netflix CDN server under a moderate (20Gb/s) load and observed ENOBUFS reported back to the TCP stack (as reported by a RACK TCP sysctl). Reviewed by: jhb, jtl, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27188	2020-11-18 14:55:49 +00:00
Mark Johnston	54bf96fb4f	iflib: Free full mbuf chains when draining transmit queues Submitted by: Sai Rajesh Tallamraju <stallamr@netapp.com> Reviewed by: gallatin, hselasky MFC after: 1 week Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D27179	2020-11-11 18:00:06 +00:00
Andrey V. Elsukov	2f4ffa9f72	Fix possible NULL pointer dereference. lagg(4) replaces if_output method of its child interfaces and expects that this method can be called only by child interfaces. But it is possible that lagg_port_output() could be called by children of child interfaces. In this case ifnet's if_lagg field is NULL. Add check that lp is not NULL. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2020-11-11 15:53:36 +00:00
Mitchell Horne	4a3fc6e22e	Fix definition of rn_addmask() Add the missing static keyword present in the declaration. Reviewed by: melifaro Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27024	2020-11-08 19:02:22 +00:00
Alexander V. Chernikov	2d39824195	Switch net.add_addr_allfibs default to 0. The goal of the fib support is to provide multiple independent routing tables, isolated from each other. net.add_addr_allfibs default tries to shift gears in the opposite direction, unconditionally inserting all addresses to all of the fibs. There are use cases when this is necessary, however this is not a default expected behaviour, especially compared to other implementations. Provide WARNING message for the setups with multiple fibs to notify potential users of the feature. Differential Revision: https://reviews.freebsd.org/D26076	2020-11-08 18:27:49 +00:00
Alexander V. Chernikov	76e6b37f6b	Temporarily revert setting net.add_addr_allfibs to 0. It accidentally sweeped in r367486. Revert to allow for proper commit message & warning.	2020-11-08 18:11:12 +00:00
Alexander V. Chernikov	770495f4c0	Fix build broken by r367484: add route_ifaddrs.c. Pointy hat to: melifaro Reported by: jenkins	2020-11-08 13:30:44 +00:00
Alexander V. Chernikov	bad6b23606	Move all ifaddr route creation business logic to net/route/route_ifaddr.c Differential Revision: https://reviews.freebsd.org/D26318	2020-11-08 11:12:00 +00:00
Konstantin Belousov	80ba361b2f	if_media.c SIOCGMEDIAX handler: improve loop Stop advancing counter past the current iteration number at the start of iteration. This removes the need of subtracting one when calculating index for copyout, and arguably fixes off-by-one reporting of copied out elements when copyout failed. Reviewed by: hselasky Sponsored by: Mellanox Technologies / NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27073	2020-11-03 14:33:04 +00:00
Konstantin Belousov	1fbbe9dbf5	net/if_media.c: improve IFMEDIA_DEBUG output. Use consistent output format for hex. Print both media and mask where relevant. Reviewed by: hselasky Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27034	2020-11-01 16:38:30 +00:00
Konstantin Belousov	e399f19dba	Cleanup of net/if_media.c: simplify cleanup loop in ifmedia_removeall(). Reviewed by: hselasky Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27034	2020-11-01 16:36:21 +00:00
Konstantin Belousov	899322fdfa	Cleanup of net/if_media.c: some style. Reviewed by: hselasky Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27034	2020-11-01 16:30:17 +00:00
Konstantin Belousov	2193fb16b5	Cleanup of net/if_media.c: switch to ANSI C function definitions. Reviewed by: hselasky Sponsored by: Mellanox Technologies/NVidia Networking MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27034	2020-11-01 16:25:35 +00:00
Mitchell Horne	ced0f52457	net: add ETHER_IS_IPV6_MULTICAST This can be used to detect if an ethernet address is specifically an IPv6 multicast address, defined in accordance to RFC 2464. ETHER_IS_MULTICAST is still preferred in the general case. Reviewed by: ae Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26611	2020-10-30 13:32:58 +00:00
John Baldwin	36e0a362ac	Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000	2020-10-29 23:28:39 +00:00
John Baldwin	521eac97f3	Support hardware rate limiting (pacing) with TLS offload. - Add a new send tag type for a send tag that supports both rate limiting (packet pacing) and TLS offload (mostly similar to D22669 but adds a separate structure when allocating the new tag type). - When allocating a send tag for TLS offload, check to see if the connection already has a pacing rate. If so, allocate a tag that supports both rate limiting and TLS offload rather than a plain TLS offload tag. - When setting an initial rate on an existing ifnet KTLS connection, set the rate in the TCP control block inp and then reset the TLS send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit send tag. This allocates the TLS send tag asynchronously from a task queue, so the TLS rate limit tag alloc is always sleepable. - When modifying a rate on a connection using KTLS, look for a TLS send tag. If the send tag is only a plain TLS send tag, assume we failed to allocate a TLS ratelimit tag (either during the TCP_TXTLS_ENABLE socket option, or during the send tag reset triggered by ktls_output_eagain) and ignore the new rate. If the send tag is a ratelimit TLS send tag, change the rate on the TLS tag and leave the inp tag alone. - Lock the inp lock when setting sb_tls_info for a socket send buffer so that the routines in tcp_ratelimit can safely dereference the pointer without needing to grab the socket buffer lock. - Add an IFCAP_TXTLS_RTLMT capability flag and associated administrative controls in ifconfig(8). TLS rate limit tags are only allocated if this capability is enabled. Note that TLS offload (whether unlimited or rate limited) always requires IFCAP_TXTLS[46]. Reviewed by: gallatin, hselasky Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26691	2020-10-29 00:23:16 +00:00
Vincenzo Maffione	be7a6b3d84	iflib: fix typo bug introduced by r367093 Code was supposed to call callout_reset_sbt_on() rather than callout_reset_sbt(). This resulted in passing a "cpu" value to a "flag" argument. A recipe for subtle errors. PR: 248652 Reported by: sg@efficientip.com MFC with: r367093	2020-10-28 21:06:17 +00:00
Vincenzo Maffione	17cec474c0	iflib: add per-tx-queue netmap timer The way netmap TX is handled in iflib when TX interrupts are not used (IFC_NETMAP_TX_IRQ not set) has some issues: - The netmap_tx_irq() function gets called by iflib_timer(), which gets scheduled with tick granularity (hz). This is not frequent enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end result is that the transmitting netmap application is not woken up fast enough to saturate the link with small packets. - The iflib_timer() function also calls isc_txd_credits_update() to ask for more TX completion updates. However, this violates the netmap requirement that only txsync can access the TX queue for datapath operations. Only netmap_tx_irq() may be called out of the txsync context. This change introduces per-tx-queue netmap timers, using microsecond granularity to ensure that netmap_tx_irq() can be called often enough to allow for maximum packet rate. The timer routine simply calls netmap_tx_irq() to wake up the netmap application. The latter will wake up and call txsync to collect TX completion updates. This change brings back line rate speed with small packets for ixgbe. For the time being, timer expiration is hardcoded to 90 microseconds, in order to avoid introducing a new sysctl. We may eventually implement an adaptive expiration period or use another deferred work mechanism in place of timers. Also, fix the timers usage to make sure that each queue is serviced by a different CPU. PR: 248652 Reported by: sg@efficientip.com MFC after: 2 weeks	2020-10-27 21:53:33 +00:00
Hans Petter Selasky	1355e2dc4f	More style fixes (partial revert of r366994). Suggested by: danfe@ Differential Revision: https://reviews.freebsd.org/D26254 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-24 13:07:50 +00:00
Hans Petter Selasky	1d3a22e765	Fix order of header files: sys/systm.h should come right after sys/param.h Suggested by: kib@ Differential Revision: https://reviews.freebsd.org/D26254 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-24 10:52:09 +00:00
Hans Petter Selasky	01630a496b	Run code through "clang-format -style=file" with some additional fixes. No functional change. Suggested by: kib@ and emaste@ Differential Revision: https://reviews.freebsd.org/D26254 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-24 10:23:21 +00:00
Navdeep Parhar	610d345953	if_vxlan(4): csum_flags_to_inner_flags takes the tunnel protocol as a parameter. No functional change.	2020-10-22 17:05:55 +00:00
Hans Petter Selasky	ce329aa256	Compile fix for MIPS, MIPS64, POWERPC and POWERPC64. Add missing include files. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-22 12:22:08 +00:00
Hans Petter Selasky	a92c4bb62a	Add support for IP over infiniband, IPoIB, to lagg(4). Currently only the failover protocol is supported due to limitations in the IPoIB architecture. Refer to the lagg(4) manual page for how to configure and use this new feature. A new network interface type, IFT_INFINIBANDLAG, has been added, similar to the existing IFT_IEEE8023ADLAG . ifconfig(8) has been updated to accept a new laggtype argument when creating lagg(4) network interfaces. This new argument is used to distinguish between ethernet and infiniband type of lagg(4) network interface. The laggtype argument is optional and defaults to ethernet. The lagg(4) command line syntax is backwards compatible. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-22 09:47:12 +00:00
Hans Petter Selasky	9d40cf60d6	Factor out generic IP over infiniband, IPoIB, definitions and code into net/if_infiniband.c and net/infiniband.h . No functional change intended. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-22 09:09:53 +00:00
Alexander V. Chernikov	c7cffd65c5	Add support for stacked VLANs (IEEE 802.1ad, AKA Q-in-Q). 802.1ad interfaces are created with ifconfig using the "vlanproto" parameter. Eg., the following creates a 802.1Q VLAN (id #42) over a 802.1ad S-VLAN (id #5) over a physical Ethernet interface (em0). ifconfig vlan5 create vlandev em0 vlan 5 vlanproto 802.1ad up ifconfig vlan42 create vlandev vlan5 vlan 42 inet 10.5.42.1/24 VLAN_MTU, VLAN_HWCSUM and VLAN_TSO capabilities should be properly supported. VLAN_HWTAGGING is only partially supported, as there is currently no IFCAP_VLAN_* denoting the possibility to set the VLAN EtherType to anything else than 0x8100 (802.1ad uses 0x88A8). Submitted by: Olivier Piras Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D26436	2020-10-21 21:28:20 +00:00
Alexander V. Chernikov	0c325f53f1	Implement flowid calculation for outbound connections to balance connections over multiple paths. Multipath routing relies on mbuf flowid data for both transit and outbound traffic. Current code fills mbuf flowid from inp_flowid for connection-oriented sockets. However, inp_flowid is currently not calculated for outbound connections. This change creates simple hashing functions and starts calculating hashes for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the system. Reviewed by: glebius (previous version),ae Differential Revision: https://reviews.freebsd.org/D26523	2020-10-18 17:15:47 +00:00
Marcin Wojtas	1148702e43	Add SADB_SAFLAGS_ESN flag This flag is going to be used by IKE daemon to signal if Extended Sequence Number feature is going to be used. Value for this flag was taken from OpenBSD source code `6b4cbaf181` Submitted by: Patryk Duda <pdk@semihalf.com> Reviewed by: ae Differential revision: https://reviews.freebsd.org/D22366 Obtained from: Semihalf Sponsored by: Stormshield	2020-10-16 11:22:29 +00:00
Richard Scheffenegger	868aabb470	Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow. This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to any priority between 0 and 7. Note that for untagged traffic, explicitly adding a priority will insert a special 802.1Q vlan header with vlan ID = 0 to carry the priority setting. Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26409	2020-10-09 12:06:43 +00:00
Konstantin Belousov	cefdb89514	Fix typo. Sponsored by: Mellanox Technologies/NVIDIA Networking MFC after: 3 days	2020-10-07 10:58:56 +00:00
Kristof Provost	4af1bd8157	bridge: call member interface ioctl() without NET_EPOCH We're not allowed to hold NET_EPOCH while sleeping, so when we call ioctl() handlers for member interfaces we cannot be in NET_EPOCH. We still need some protection of our CK_LISTs, so hold BRIDGE_LOCK instead. That requires changing BRIDGE_LOCK into a sleepable lock, and separating the BRIDGE_RT_LOCK, to protect bridge_rtnode lists. That lock is taken in the data path (while in NET_EPOCH), so it cannot be a sleepable lock. While here document the locking strategy. MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D26418	2020-10-06 19:19:56 +00:00
John Baldwin	56fb710f1b	Store the send tag type in the common send tag header. Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with their own identical headers that stored the type that the type-specific tag structures inherited from, so in practice it seems drivers need this in the tag anyway. This permits removing these extra header indirections (struct cxgbe_snd_tag and struct mlx5e_snd_tag). In addition, this permits driver-independent code to query the type of a tag, e.g. to know what type of tag is being queried via if_snd_query. Reviewed by: gallatin, hselasky, np, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26689	2020-10-06 17:58:56 +00:00
Alexander V. Chernikov	1b95005e95	Fix route flags update during RTM_CHANGE. Nexthop lookup was not considering rt_flags when doing structure comparison, which led to an original nexthop selection when changing flags. Fix the case by adding rt_flags field into comparison and rearranging nhop_priv fields to allow for efficient matching. Fix `route change X/Y flags` case - recent changes disallowed specifying RTF_GATEWAY flag without actual gateway. It turns out, route(8) fills in RTF_GATEWAY by default, unless -interface flag is specified. Fix regression by clearing RTF_GATEWAY flag instead of failing. Fix route flag reporting in RTM_CHANGE messages by explicitly updating rtm_flags after operation completion. Add IPv4/IPv6 tests for flag-only route changes.	2020-10-04 13:24:58 +00:00
Alexander V. Chernikov	9c584fa4bc	Remove ROUTE_MPATH-related warnings introduced in r366390. Reported by: mjg	2020-10-03 14:37:54 +00:00
Alexander V. Chernikov	fedeb08b6a	Introduce scalable route multipath. This change is based on the nexthop objects landed in D24232. The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection. Similar to the nexthops, nexthop groups are immutable. Dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t their weights. With this change, `rt_nhop` field of `struct rtentry` contains either nexthop or nexthop group. They are distinguished by the presence of NHF_MULTIPATH flag. All dataplane lookup functions return pointer to the nexthop object, leaving nexthop groups details inside routing subsystem. User-visible changes: The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1. All routes now come with weight, default weight is 1, maximum is 2^24-1. Current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes. Using functionality: * Recompile kernel with ROUTE_MPATH * set net.route.multipath to 1 route add -6 2001:db8::/32 2001:db8::2 -weight 10 route add -6 2001:db8::/32 2001:db8::3 -weight 20 netstat -6On Nexthop groups data Internet6: GrpIdx NhIdx Weight Slots Gateway Netif Refcnt 1 ------- ------- ------- --------------------------------------- --------- 1 13 10 1 2001:db8::2 vlan2 14 20 2 2001:db8::3 vlan2 Next steps: * Land outbound hashing for locally-originated routes ( D26523 ). * Fix net/bird multipath (net/frr seems to work fine) * Add ROUTE_MPATH to GENERIC * Set net.route.multipath=1 by default Tested by: olivier Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26449	2020-10-03 10:47:17 +00:00
Vincenzo Maffione	adf41f0788	netmap: fix constness warnings generated by "-Wcast-qual" Submitted by: milosz.kaniewski@gmail.com MFC after: 3 days	2020-10-03 09:33:29 +00:00
Ed Maste	c1aedfcbd9	add SIOCGIFDATA ioctl For interfaces that do not support SIOCGIFMEDIA (for which there are quite a few) the only fallback is to query the interface for if_data->ifi_link_state. While it's possible to get at if_data for an interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms. SIOCGIFDATA is a simple ioctl to retrieve this fast with very little resource use in comparison. This implementation mirrors that of other similar ioctls in FreeBSD. Submitted by: Roy Marples <roy@marples.name> Reviewed by: markj MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D26538	2020-09-28 16:54:39 +00:00
Alexander V. Chernikov	2259a03020	Rework part of routing code to reduce difference to D26449. * Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different times. * Start filling route weight in route change notifications * Pass flowid to UDP/raw IP route lookups * Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact that rtentry can contain multiple nexthops. Differential Revision: https://reviews.freebsd.org/D26497	2020-09-21 20:02:26 +00:00
Alexander V. Chernikov	1440f62266	Remove unused nhop_ref_any() function. Remove "opt_mpath.h" header where not needed. No functional changes.	2020-09-20 21:32:52 +00:00
Alexander V. Chernikov	c4bcfe98e2	Fix gw updates / flag updates during route changes. * Zero gw_sdl if switching to interface route - the assumption that underlying storage is zeroed is incorrect with route changes. * Apply proper flag mask to rte. Reported by: vangyzen	2020-09-20 12:31:48 +00:00
Navdeep Parhar	b092fd6c97	if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS. This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx and rx), TSO, and RSS if the NIC is capable of performing these operations on inner VXLAN traffic. A VXLAN interface inherits the capabilities of its vxlandev interface if one is specified or of the interface that hosts the vxlanlocal address. If other interfaces will carry traffic for that VXLAN then they must have the same hardware capabilities. On transmit, if_vxlan verifies that the outbound interface has the required capabilities and then translates the CSUM_ flags to their inner equivalents. This tells the hardware ifnet that it needs to operate on the inner frame and not the outer VXLAN headers. An event is generated when a VXLAN ifnet starts. This allows hardware drivers to configure their devices to expect VXLAN traffic on the specified incoming port. On receive, the hardware does RSS and checksum verification on the inner frame. if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is not very clear why it didn't do this already. Future work: Rx: it should be possible to avoid the first trip up the protocol stack to get the frame to if_vxlan just so it can decapsulate and requeue for a second trip up the stack. The hardware NIC driver could directly call an if_vxlan receive routine for VXLAN traffic instead. Rx: LRO. depends on what happens with the previous item. There will have to be a mechanism to indicate that it's time for if_vxlan to flush its LRO state. Reviewed by: kib@ Relnotes: Yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873	2020-09-18 02:37:57 +00:00
Navdeep Parhar	830edb4561	Add two new ifnet capabilities for hw checksumming and TSO for VXLAN traffic. These are similar to the existing VLAN capabilities. Reviewed by: kib@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873	2020-09-18 02:10:28 +00:00
Mitchell Horne	ceff9b9d25	if_media: definitions for 40GE LM4 ethernet media type Reviewed by: erj Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26276	2020-09-16 14:45:16 +00:00
Alexander V. Chernikov	2b32d93e55	Fix RADIX_MPATH build broken by r365521. Reported by: jenkins, Hartmann, O. <ohartmann at walstatt.org>	2020-09-10 07:05:31 +00:00
Alexander V. Chernikov	aa8f9f90ff	Update nexthop handling for route addition/deletion in preparation for mpath. Currently kernel requests deletion for the certain routes with specified gateway, but this gateway is not actually checked. With multipath routes, internal gateway checking becomes mandatory. Add the logic performing this check. Generalise RTF_PINNED routes to the generic route priorities, simplifying the logic. Add lookup_prefix() function to perform exact match search based on data in @info. Differential Revision: https://reviews.freebsd.org/D26356	2020-09-09 22:07:54 +00:00
Alexander V. Chernikov	cd6298d5c5	Retain marking net.fibs sysctl as a tunable. Suggested by: avg	2020-09-09 21:45:18 +00:00
Alexander V. Chernikov	4a8201c13a	Fix panic with net.fibs tunable set in loader.conf. Fix by removing forgotten CTLFLAG_RWTUN flag from the sysctl, loader variable will be read later in vnet_rtables_init(). Reported by: mav	2020-09-08 21:39:34 +00:00
Kristof Provost	a969635b83	net: mitigate vnet / epair cleanup races There's a race where dying vnets move their interfaces back to their original vnet, and if_epair cleanup (where deleting one interface also deletes the other end of the epair). This is commonly triggered by the pf tests, but also by cleanup of vnet jails. As we've not yet been able to fix the root cause of the issue work around the panic by not dereferencing a NULL softc in epair_qflush() and by not re-attaching DYING interfaces. This isn't a full fix, but makes a very common panic far less likely. PR: 244703, 238870 Reviewed by: lutz_donnerhacke.de MFC after: 4 days Differential Revision: https://reviews.freebsd.org/D26324	2020-09-08 14:54:10 +00:00
Alexander V. Chernikov	05aca418f4	Consistently use the same gateway when adding/deleting interface routes. Use the same link-level gateway when adding or deleting interface routes. This helps nexthop checking in the upcoming multipath changes. Differential Revision: https://reviews.freebsd.org/D26317	2020-09-07 10:13:54 +00:00
Ed Maste	4fa9815a3d	rtsock.c: remove extraneous space Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D26249	2020-09-05 16:13:36 +00:00
Alexander V. Chernikov	8f07963360	Fix regression for IPv6 loopback routes. After nexthop introduction, loopback routes for the interface addresses were created without embedding actual interface index in the gateway. The latter is needed to pass the IPv6 scope during transmission via loopback. Fix the regression by actually using passed gateway data with interface index. Differential Revision: https://reviews.freebsd.org/D26306	2020-09-03 22:24:52 +00:00
Mateusz Guzik	662c13053f	net: clean up empty lines in .c and .h files	2020-09-01 21:19:14 +00:00
Vincenzo Maffione	35d8a463e8	iflib: leave only 1 receive descriptor unused The pidx argument of isc_rxd_flush() indicates which is the last valid receive descriptor to be used by the NIC. However, current code has multiple issues: - Intel drivers write pidx to their RDT register, which means that NICs will only use the descriptors up to pidx-1 (modulo ring size N), and won't actually use the one pointed by pidx. This does not break reception, but it is anyway confusing and suboptimal (the NIC will actually see only N-2 descriptors as available, rather than N-1). Other drivers (if_vmx, if_bnxt, if_mgb) adhere to this semantic. - The semantic used by Intel (RDT is one descriptor past the last valid one) is used by most (if not all) NICs, and it is also used on the TX side (also in iflib). Since iflib is not currently using this semantic for RX, it must decrement fl->ifl_pidx (modulo N) before calling isc_rxd_flush(), and then the per-driver callback implementation must increment the index again (to match the real semantic). This is confusing and suboptimal. - The iflib refill function is also called at initialization. However, in case the ring size is smaller than 128 (e.g. if_mgb), the refill function will actually prepare all the receive descriptors (N), without leaving one unused, as most of NICs assume (e.g. to avoid RDT to overrun RDH). I can speculate that the code looks like this right now because this issue showed up during testing (e.g. with if_mgb), and it was easy to workaround by decrementing pidx before isc_rxd_flush(). The goal of this change is to simplify the code (removing a bunch of instructions from the RX fast path), and to make the semantic of isc_rxd_flush() consistent across drivers. To achieve this, we: - change the semantics of the pidx argument to the usual one (that is the index one past the last valid one), so that both iflib and drivers avoid the decrement/increment dance. - fix the initialization code to prepare at most N-1 descriptors. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26191	2020-09-01 20:41:47 +00:00
Alexander V. Chernikov	b8d2d479cd	Revert uma zone alignment cache inadvertently committed in r364950.	2020-08-29 12:04:13 +00:00
Alexander V. Chernikov	6498f66f7c	Fix build with RADIX_MPATH. Reported by: Hartmann, O <ohartmann@walstatt.org>	2020-08-29 11:04:24 +00:00
Alexander V. Chernikov	7c89a3b63f	Move fib_rte_to_nh_flags() from net/route_var.h to net/route/nhop_ctl.c. No functional changes. Initially this function was created to perform runtime flag conversions for the previous incarnation of fib lookup functions. As these functions got deprecated, move the function to the file with the only remaining caller. Lastly, rename it to convert_rt_to_nh_flags() to follow the naming notation.	2020-08-28 23:01:56 +00:00
Alexander V. Chernikov	a624ca3dff	Move net/route/shared.h definitions to net/route/route_var.h. No functional changes. net/route/shared.h was created in the initial phases of nexthop conversion. It was intended to serve the same purpose as route_var.h - share definitions of functions and structures between the routing subsystem components. At that time route_var.h was included by many files external to the routing subsystem, which largely defeats its purpose. As currently this is not the case anymore and amount of route_var.h includes is roughly the same as shared.h, retire the latter in favour of the former.	2020-08-28 22:50:20 +00:00
Alexander V. Chernikov	b122304f6a	Further split nhop creation and rtable operations. As nexthops are immutable, some operations such as route attribute changes require nexthop fetching, forking, modification and route switching. These operations are not atomic, so they may need to be retried multiple times in presence of multiple speakers changing the same route. This change introduces "synchronisation" primitive: route_update_conditional(), simplifying logic for route changes and upcoming multipath operations. Differential Revision: https://reviews.freebsd.org/D26216	2020-08-28 21:59:10 +00:00
Vincenzo Maffione	ae750d5cdf	iflib: netmap: publish all the receive buffer At initialization time, the netmap RX refill function used to prepare the NIC RX ring with N-1 buffers rather than N (with N equal to the number of descriptors in the NIC RX ring). This is not how netmap is supposed to work, as it would keep kring->nr_hwcur not in sync with the NIC "next index to refill" (i.e., fl->ifl_pidx). Instead we prepare N buffers, although we still publish (with isc_rxd_flush()) only the first N-1 buffers, to avoid the NIC producer pointer to overrun the NIC consumer pointer (for NICs where this is a real issue, e.g. Intel ones). MFC after: 2 weeks	2020-08-25 15:19:45 +00:00
Alexander V. Chernikov	592d300e34	Remove RT_LOCK mutex from rte. rtentry lock traditionally served 2 purposes: first was protecting refcounts, the second was assuring consistent field access/changes. Since route nexthop introduction, the need for the former disappeared and the need for the latter reduced. To be more precise, the following rte fields are mutable: rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info) rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal) rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info) rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK) rt_chain (deletion chain pointer, updated with RIB_WLOCK) All of them are updated under RIB_WLOCK, so the only remaining concern is the reading. rt_nhop and rt_weight (addressed in this review) are read under rib lock and stored in the rib_cmd_info, so the caller has no problem with consistency. rte_flags is currently read unlocked in rtsock reporting (however the scope is only RTF_UP flag, which is pretty static). rt_expire is currently read unlocked in rtsock reporting. rt_chain accesses are safe, as this is only used at route deletion. rt_expire and rte_flags reads will be dealt with in separate reviews soon. Differential Revision: https://reviews.freebsd.org/D26162	2020-08-24 20:23:34 +00:00
Vincenzo Maffione	de5b46107c	iflib: fix isc_rxd_flush call in netmap_fl_refill() The semantic of the pidx argument of isc_rxd_flush() is the last valid index in the free list, rather than the next index to be published. However, netmap was still using the old convention. While there, also refactor the netmap_fl_refill() to simplify a little bit and add an assertion. MFC after: 2 weeks	2020-08-24 11:44:20 +00:00
Alexander V. Chernikov	eb1c7adb70	Finish r364492 by renaming rt_flags to rte_flags for multipath code.	2020-08-22 20:02:40 +00:00
Alexander V. Chernikov	93bfd365d2	Rename rt_flags to rte_flags && reduce number of rt_nhop accesses. No functional changes. Most of the routing flags are stored in the nexthop instead of rtentry. Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code checking routing flags. In the new multipath code, rt->rt_nhop may actually point to nexthop group instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->... accesses. Differential Revision: https://reviews.freebsd.org/D26156	2020-08-22 19:30:56 +00:00
Mateusz Guzik	c93d310f87	Fix tinderbox build after r364465	2020-08-22 07:43:38 +00:00
Alexander V. Chernikov	f5247a232a	Make net.fibs growable. Allow to dynamically grow the amount of fibs in each vnet. This change alters current behavior. Currently, if one defines ROUTETABLES > 1 in the kernel config, each vnet will be created with the number of fibs defined in the kernel config. After this commit vnets will be created with fibs=1. Dynamic net.fibs is not compatible with net.add_addr_allfibs. The plan is to deprecate the latter and make net.add_addr_allfibs=0 default behaviour. Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26062	2020-08-21 21:34:52 +00:00
Warner Losh	773e541e8d	Use devctl.h instead of bus.h to reduce newbus pollution. There's no need for these parts of the kernel to know about newbus, so narrow what is included to devctl.h for device_notify_*. Suggested by: kib@	2020-08-21 00:03:24 +00:00
Bjoern A. Zeeb	2ae32ca2c8	For consistency and to avoid any problems getting past the 31bit boundary change the last two IF_Mbps(2500) and additionally one IF_Mbps(5000) to ULL as well. MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")	2020-08-17 13:51:25 +00:00
Qing Li	a5154bb2e5	Correct the mask byte order when checking for reserved bits. Reviewed by: gnn Approved by: gnn MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26071	2020-08-15 16:48:58 +00:00
Alexander V. Chernikov	bec053ffe0	Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25637	2020-08-15 11:37:44 +00:00
Alexander V. Chernikov	2f23f45b20	Simplify dom_<rtattach\|rtdetach>. Remove unused arguments from dom_rtattach/dom_rtdetach functions and make them return/accept 'struct rib_head' instead of 'void **'. Declare inet/inet6 implementations in the relevant _var.h headers similar to domifattach / domifdetach. Add rib_subscribe_internal() function to accept subscriptions to the rnh directly. Differential Revision: https://reviews.freebsd.org/D26053	2020-08-14 21:29:56 +00:00
Bryan Drewery	3869d41465	lagg: Avoid adding a port to a lagg device being destroyed. The lagg_clone_destroy() handles detach and waiting for ifconfig callers to drain already. This narrows the race for 2 panics that the tests triggered. Both were a consequence of adding a port to the lagg device after it had already detached from all of its ports. The link state task would run after lagg_clone_destroy() free'd the lagg softc. kernel:trap_fatal+0xa4 kernel:trap_pfault+0x61 kernel:trap+0x316 kernel:witness_checkorder+0x6d kernel:_sx_xlock+0x72 if_lagg.ko:lagg_port_state+0x3b kernel:if_down+0x144 kernel:if_detach+0x659 if_tap.ko:tap_destroy+0x46 kernel:if_clone_destroyif+0x1b7 kernel:if_clone_destroy+0x8d kernel:ifioctl+0x29c kernel:kern_ioctl+0x2bd kernel:sys_ioctl+0x16d kernel:amd64_syscall+0x337 kernel:trap_fatal+0xa4 kernel:trap_pfault+0x61 kernel:trap+0x316 kernel:witness_checkorder+0x6d kernel:_sx_xlock+0x72 if_lagg.ko:lagg_port_state+0x3b kernel:do_link_state_change+0x9b kernel:taskqueue_run_locked+0x10b kernel:taskqueue_run+0x49 kernel:ithread_loop+0x19c kernel:fork_exit+0x83 PR: 244168 Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D25284	2020-08-13 22:06:27 +00:00
Alexander V. Chernikov	6cbadc4234	Move rtzone handling code to net/route_ctl.c After moving the route control plane code from net/route.c, all rtzone users ended up being in net/route_ctl.c. Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well to have everything in a single place. While here, remove custom initializers from the zone. It was added originally to avoid setup/teardown of costly per-cpu counters. With these counters removed, the only remaining job was avoiding rte mutex setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex will soon be removed. With that in mind, there is no sense in keeping custom zone callbacks. Differential Revision: https://reviews.freebsd.org/D26051	2020-08-13 18:35:29 +00:00
Mitchell Horne	f7d79f6c6d	Correctly set error in rt_mpath_unlink It is possible for rn_delete() to return NULL. If this happens, then set *perror to ESRCH, as is done in the rest of the function. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D25871	2020-08-12 16:43:20 +00:00
Vincenzo Maffione	6d84e76a25	iflib: netmap: improve rxsync to support IFLIB_HAS_RXCQ For drivers with IFLIB_HAS_RXCQ set, there is a separate completion queue. In this case, the netmap rxsync routine needs to update rxq->ifr_cq_cidx in the same way it is updated by iflib_rxeof(). This improves the situation for vmx(4) and bnxt(4) drivers, which use iflib and have the IFLIB_HAS_RXCQ bit set. PR: 248494 MFC after: 3 weeks	2020-08-12 14:45:31 +00:00
Vincenzo Maffione	530960be8d	iflib: refactor netmap_fl_refill and fix off-by-one issue First, fix the initialization of the fl->ifl_rxd_idxs array, which was affected by an off-by-one bug. Once there, refactor the function to use better names for local variables, optimize the variable assignments, and merge the bus_dmamap_sync() inner loop with the outer one. PR: 248494 MFC after: 3 weeks	2020-08-12 14:17:38 +00:00
Alexander V. Chernikov	8a0917c35b	Do not enter epoch in add_route(), as it is already called in epoch. Reviewed by: glebius	2020-08-11 07:23:07 +00:00


5152 Commits