freebsd-dev

Author	SHA1	Message	Date
Eric Joyner	6a3f243b04	iflib: remove kobject class reference increment Commit message from Jake: In iflib_register, the context is initialized as a kobject using the device driver's "driver" kobject class. As part of this, the function mistakenly increments the ref counter. The ref counter is incremented twice, once in the code directly, and once again by kobj_class_compile. However, there is no associated decrement in the detach path. Because of this, the ref counter will never go back down to zero, and thus the kobject method table will never be released. Remove this unnecessary reference count increment. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: jhb@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21125	2019-08-01 17:28:36 +00:00
Randall Stewart	20abea6663	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
Ed Maste	1082be6554	ppp: correct echo-req magic number on big endian archs The magic number is a 32-bit quantity; use uint32_t to match hton's return type and avoid sending zeros (upper 32 bits) on big-endian architectures. PR: 184141 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-01 13:42:58 +00:00
Kyle Evans	0dbac71f19	if_tuntap(4): Add TUNGIFNAME This effectively just moves TAPGIFNAME into common ioctl territory. MFC after: 3 days	2019-07-25 22:23:34 +00:00
Eric Joyner	7f3f6aad3e	iflib: fix dangling device softc pointer Commit text by Jake: If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register function will free the ctx pointer. However, it does not reset the device softc pointer to NULL. This will result in memory corruption as a future access to the now invalid pointer will corrupt memory that is later allocated on top of the same memory location. The iflib_device_deregister function correctly resets the softc pointer by using device_set_softc(). This clears up the invalid dangling pointer and prevents memory corruption that could lead to a panic or undefined behavior if the device's driver failed to attach. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21003	2019-07-24 21:43:41 +00:00
Kirill Ponomarev	b7592822d5	Allow setting an MTU larger than 1500 bytes. Submitted by: Alexandr Fedorov <aleksandr.fedorov_itglobal_dot_com> Approved by: jhb, rgrimes Sponsored by: ITGlobal.com Differential Revision: https://reviews.freebsd.org/D19422	2019-07-24 16:10:20 +00:00
Chuck Tuffli	94c15665a5	Fix a typo in r349969 OUI_FRREBSD_NVME_HIGH should have been OUI_FREEBSD_NVME_HIGH Caught by: Gary Jennejohn	2019-07-14 03:49:48 +00:00
Chuck Tuffli	409a80e5a4	bhyve: Create EUI64 for NVMe namespaces Accept an IEEE Extended Unique Identifier (EUI-64) from the command line for each NVMe namespace. If one isn't provided, it will create one based on the CRC16 of: - the FreeBSD IEEE OUI - PCI bus, device/slot, function values - Namespace ID Reviewed by: imp, araujo, jhb, rgrimes Approved by: imp (mentor), jhb (maintainer) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19905	2019-07-13 12:48:28 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
John Baldwin	66d0c056be	Support IFCAP_NOMAP in vlan(4). Enable IFCAP_NOMAP for a vlan interface if it is supported by the underlying trunk device. Reviewed by: gallatin, hselasky, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:51:38 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	0dbdf04125	Need to wait for epoch callbacks to complete before detaching a network interface. This particularly manifests itself when an INP has multicast options attached during a network interface detach. Then the IPv4 and IPv6 leave group call which results from freeing the multicast address, may access a freed ifnet structure. These are the steps to reproduce: service mdnsd onestart # installed from ports ifconfig epair create ifconfig epair0a 0/24 up ifconfig epair0a destroy Tested by: pho @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-28 10:49:04 +00:00
Marius Strobl	c2c5d1e787	o In iflib_txq_drain(): - Remove desc_used, which is only ever written to. - Remove a dead store to reclaimed. - Don't recycle avail. - Sort variables according to style(9). These changes will make a subsequent commit easier to read. o In iflib_tx_credits_update(), don't bother checking whether the ift_txd_credits_update method pointer is NULL; _iflib_pre_assert() asserts upfront that this method has been assigned and functions like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}() and _task_fn_tx() were already unconditionally relying on the method being callable.	2019-06-26 15:28:21 +00:00
Leandro Lupori	e2edff4167	[PowerPC64] Don't mark module data as static Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler. Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), code generated tries to access data in the wrong location, causing kernel panic (data storage interrupt trap) when loading if_epair and ipfw. Issue was reproduced with kernel/module compiled using gcc8 and clang8. It affects both ELFv1 and ELFv2 ABI environments. PR: 232387 Submitted by: alfredo.junior_eldorado.org.br Reported by: Mark Millard Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D20461	2019-06-25 17:15:44 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters becomes fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m | grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
Marko Zec	188adcb7e4	V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h / ip_var.h since at least 2008, so make use of those definitions here. MFC after: 3 days	2019-06-19 08:49:24 +00:00
Marko Zec	6aee0bfa85	Evaluating htons() at compile time is more efficient than doing ntohs() at runtime. This change removes a dependency on a barrel shifter pass before branch resolution, while reducing the instruction stream size by 9 bytes on amd64. MFC after: 3 days	2019-06-19 08:39:19 +00:00
Marius Strobl	d49e83eac3	- Replace unused and only ever written to members of public iflib(9) structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES etc. are also only ever used for these write-only members if at all, so both these macros and members can just go). Using these spares may render it possible to merge certain iflib(9) fixes to stable/12. Otherwise, changes extending struct if_irq or struct if_shared_ctx in any way would break KBI as instances of these are allocated by the driver front-ends (by contrast, struct if_pkt_info as well as struct if_softc_ctx instances are provided by iflib(9) and, thus, may grow at least at the end without breaking KBI). - Make the pvi_name in struct pci_vendor_info const char * as device identifiers in hardware lookup tables aren't to be expected to ever change at runtime. - Similarly, make the pci_vendor_info_t of struct if_shared_ctx which is used to point to the struct pci_vendor_info arrays provided by the driver front-ends const. - Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only consuming the latter macro. - Make the name argument of iflib_io_tqg_attach(9) const, matching the taskqgroup_attach_cpu(9) this function wraps as well as e. g. iflib_config_gtask_init(9). - Remove the orphaned iflib_qset_lock_get() prototype. - Remove some extraneous empty lines.	2019-06-15 11:07:41 +00:00
Mark Johnston	ce7fb386d8	Restore the comment removed in r348745. LAGG_RLOCK() enters an epoch section, so the comment wasn't stale. Reported by: jhb MFC with: r348745	2019-06-06 17:20:35 +00:00
Mark Johnston	9995dfd364	Conditionalize an in_epoch() call on INVARIANTS. Its result is only used to determine whether to perform further INVARIANTS-only checks. Remove a stale comment while here. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2019-06-06 16:22:29 +00:00
Eric Joyner	668d6dbb4c	iflib: provide probe wrapper for vendor drivers From Jake: Vendor drivers that exist out-of-tree generally should return BUS_PROBE_VENDOR from their device probe functions. This helps ensure that a vendor replacement driver will supersede the in-kernel driver for a given device. Currently, if a vendor wants to implement a driver based on iflib, it will always report BUS_PROBE_DEFAULT. Add a wrapper function, iflib_device_probe_vendor() which can be used in place of iflib_device_probe(). This function will just return BUS_PROBE_VENDOR whenever iflib_device_probe() would return BUS_PROBE_DEFAULT. While vendor drivers can already implement such a wrapper themselves, providing it in the iflib.h header makes it easier for the vendor driver to do the right thing. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D20221	2019-05-29 22:24:10 +00:00
Kyle Evans	d8b985430c	if_bridge(4): Complete bpf auditing of local traffic over the bridge There were two remaining "gaps" in auditing local bridge traffic with bpf(4): Locally originated outbound traffic from a member interface is invisible to the bridge's bpf(4) interface. Inbound traffic locally destined to a member interface is invisible to the member's bpf(4) interface -- this traffic has no chance after bridge_input to otherwise pass it over, and it wasn't originally received on this interface. I call these "gaps" because they don't affect conventional bridge setups. Alas, being able to establish an audit trail of all locally destined traffic for setups that can function like this is useful in some scenarios. Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19757	2019-05-29 01:08:30 +00:00
Andrey V. Elsukov	de25327313	Rework r348303 to reduce the time of holding global BPF lock. It appeared that using NET_EPOCH_WAIT() while holding global BPF lock can lead to another panic: spin lock 0xfffff800183c9840 (turnstile lock) held by 0xfffff80018e2c5a0 (tid 100325) too long panic: spin lock held too long ... #0 sched_switch (td=0xfffff80018e2c5a0, newtd=0xfffff8000389e000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2133 #1 0xffffffff80bf9912 in mi_switch (flags=256, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439 #2 0xffffffff80c21db7 in sched_bind (td=<optimized out>, cpu=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2704 #3 0xffffffff80c34c33 in epoch_block_handler_preempt (global=<optimized out>, cr=0xfffffe00005a1a00, arg=<optimized out>) at /usr/src/sys/kern/subr_epoch.c:394 #4 0xffffffff803c741b in epoch_block (global=<optimized out>, cr=<optimized out>, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:416 #5 ck_epoch_synchronize_wait (global=0xfffff8000380cd80, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:465 #6 0xffffffff80c3475e in epoch_wait_preempt (epoch=0xfffff8000380cd80) at /usr/src/sys/kern/subr_epoch.c:513 #7 0xffffffff80ce970b in bpf_detachd_locked (d=0xfffff801d309cc00, detached_ifp=<optimized out>) at /usr/src/sys/net/bpf.c:856 #8 0xffffffff80ced166 in bpf_detachd (d=<optimized out>) at /usr/src/sys/net/bpf.c:836 #9 bpf_dtor (data=0xfffff801d309cc00) at /usr/src/sys/net/bpf.c:914 To fix this add the check to the catchpacket() that BPF descriptor was not detached just before we acquired BPFD_LOCK(). Reported by: slavash Tested by: slavash MFC after: 1 week	2019-05-28 11:45:00 +00:00
Andrey V. Elsukov	44a514745c	Fix possible NULL pointer dereference. bpf_mtap() can invoke catchpacket() for already detached descriptor. And this can lead to NULL pointer dereference, since bd_bif pointer was reset to NULL in bpf_detachd_locked(). To avoid this, use NET_EPOCH_WAIT() when descriptor is removed from interface's descriptors list. After the wait it is safe to modify descriptor's content. Submitted by: kib Reported by: slavash MFC after: 1 week	2019-05-27 12:41:41 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code allocating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Alexander V. Chernikov	563ab4e400	Fix gateway setup for the interface routes. Currently rinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rinit1() and nd6_prefix_onlink_rtrequest() by filling in proper gateway in the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328	2019-05-22 21:20:15 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Alexander V. Chernikov	2ad7ed6e4a	Fix rt_ifa selection during loopback route insertion process. Currently such routes are added with a link-level IFA, which is plain wrong. Only after the insertion they get fixed by the special link_rtrequest() ifa handler. This behaviour complicates routing code and makes ifa selection more complex. Streamline this process by explicitly moving link_rtrequest() logic to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all this logic in the loopback route case by explicitly specifying proper rt_ifa inside the ifa_maintain_loopback_route(). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20076	2019-05-19 21:49:56 +00:00
Kyle Evans	db226f0d8e	tuntap: Defer clearing if_softc until after if_detach r346670 added an sx to close a race between the ifioctl handler and interface destruction. Unfortunately, it clears if_softc immediately after the interface is closed, but before if_detach has been invoked. Any time before detachment, an interface that's part of a bridge may still receive traffic that's pushed through tunstart/tunstart_l2 and promptly lead to a panic because if_softc is now NULL. Fix it by deferring the clearing of if_softc until after the interface has detached and thus been removed from the bridge. if_softc still gets cleared in case another thread has already entered the ioctl handler before it's replaced with ifdead_ioctl. Reported by: markj MFC after: 3 days	2019-05-14 20:32:29 +00:00
Andrey V. Elsukov	82d7bf6b1b	Avoid possible recursion on BPF_LOCK() in bpfwrite(). Release BPF_LOCK() before invoking if_output() and if_input(). Also enter epoch section before releasing lock, this should prevent access to ifnet that may be freed on interface detach. Reported by: markj	2019-05-13 20:17:55 +00:00
Andrey V. Elsukov	af1f58df99	Do not leak memory used for binary filter.	2019-05-13 14:07:02 +00:00
Andrey V. Elsukov	699281b545	Rework locking in BPF code to remove rwlock from fast path. On high packets rate the contention on rwlock in bpf_tap() functions can lead to packets dropping. To avoid this, migrate this code to use epoch(9) KPI and ConcurrencyKit's lists. * all lists changed to use CK_LIST; * reference counting added to bpf_if and bpf_d; * now bpf_if references ifnet and releases this reference on destroy; * each bpf_d descriptor references bpf_if when it is attached; * new struct bpf_program_buffer introduced to keep BPF filter programs; * bpf_program_buffer, bpf_d and bpf_if structures are freed by epoch_call(); * bpf_freelist and ifnet_departure event are no longer needed, thus both are removed; Reviewed by: melifaro Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D20224	2019-05-13 13:45:28 +00:00
Kyle Evans	81b3b91e6b	tuntap: Improve style No functional change. tun_flags of the tuntap_driver was renamed to ident_flags to reflect the fact that it's a subset of the tun_flags that identifies a tuntap device. This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off the bits of tun_flags that are applicable to tuntap driver ident. This is a purely cosmetic change.	2019-05-11 04:18:06 +00:00
Eric Joyner	afb7737237	iflib: use default ntxd and nrxd when user value is not power of 2 From Jake: A user may set a sysctl to override the default number of Tx or Rx descriptors. However, certain calculations in the iflib core expect the number of descriptors to be a power of 2. Update _iflib_assert to verify that all of the shared context parameters for the number of descriptors are powers of 2. Modify iflib_reset_qvalues to check that the provided isc_nrxd value is a power of 2. If it's not, print a warning message and then use the default value. An alternative might be to try rounding the number down instead. However, this creates problems in case the rounded down value is below the minimum value that the driver would support. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19880	2019-05-10 00:41:42 +00:00
Kyle Evans	16760d8e28	tuntap: Don't down tap interfaces if LINK0 is set	2019-05-09 18:54:29 +00:00
Kyle Evans	a6fa049545	tuntap: Properly detach tap ifp	2019-05-09 14:06:24 +00:00
Marius Strobl	14e0010729	- Merge r338254 from cxgbe(4): Use fcmpset instead of cmpset when appropriate. - Revert r277226 of cxgbe(4), obsolete since r334320.	2019-05-09 11:34:46 +00:00
Gleb Smirnoff	6ca363eb7b	The existence of PCB route caching doesn't allow us to use the new fast route lookup KPI in ip_output() as it is already used in ip_forward(). However, when no PCB is provided we can use the fast KPI, gaining a performance advantage. The typical case in which ip_output() is called without a PCB pointer is a sendto(2) on an unconnected UDP socket. In practice, DNS servers do this. Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D19804	2019-05-08 23:39:24 +00:00
Marius Strobl	007b804fc7	Allow to build without INET and INET6 again after r347221. Submitted by: cam	2019-05-08 09:03:43 +00:00
Kyle Evans	251a32b5b2	tun/tap: merge and rename to `tuntap` tun(4) and tap(4) share the same general management interface and have a lot in common. Bugs exist in tap(4) that have been fixed in tun(4), and vice-versa. Let's reduce the maintenance requirements by merging them together and using flags to differentiate between the three interface types (tun, tap, vmnet). This fixes a couple of tap(4)/vmnet(4) issues right out of the gate: - tap devices may no longer be destroyed while they're open [0] - VIMAGE issues already addressed in tun by kp [0] emaste had removed an easy-panic-button in r240938 due to devdrn blocking. A naive glance over this leads me to believe that this isn't quite complete -- destroy_devl will only block while executing d_* functions, but doesn't block the device from being destroyed while a process has it open. The latter is the intent of the condvar in tun, so this is "fixed" (for certain definitions of the word -- it wasn't really broken in tap, it just wasn't quite ideal). ifconfig(8) also grew the ability to map an interface name to a kld, so that `ifconfig {tun,tap}0` can continue to autoload the correct module, and `ifconfig vmnet0 create` will now autoload the correct module. This is a low overhead addition. (MFC commentary) This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this, and how critical they are. Changes after this are likely easily MFC'd without taking this merge, but the merge will be easier. I have no plans to do this MFC as of now. Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill) Input also from: melifaro Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20044	2019-05-08 02:32:11 +00:00
Marius Strobl	3d10e9ed62	o Use iflib_fast_intr_rxtx() also for "legacy" interrupts, i. e. INTx and MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue _task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP TX throughput of EM-class devices on par with the MSI-X case and, thus, close to wirespeed/pre-iflib(4) times again. [1] Note that independently of the interrupt type, the UDP performance with these MACs still is abysmal and nowhere near to where it was before the conversion of em(4) to iflib(4). o In iflib_init_locked(), announce which free list failed to set up. o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt as the latter is valid with MSI-X only. o Instead of adding the missing - and apparently convoluted enough that a DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and make the checks fail gracefully rather than panic. This avoids invoking the checks at runtime over and over again in iflib_fast_intr_rxtx() and _task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes these functions more readable. o In iflib_rx_structures_setup(), only initialize LRO resources if device and driver have LRO capability in order to not waste memory. Also, free the LRO resources again if setting them up fails for one of the queues. However, don't bother invoking iflib_rx_sds_free() in that case because iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case of failure via iflib_rx_structures_free(), but there definitely is some asymmetry left to be fixed, though). o Similarly, free LRO resources again in iflib_rx_structures_free(). o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully instead of panicking (but only in case of INVARIANTS). This is a follow-up to r344132, as such driver bugs shouldn't be fatal. o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic() gracefully, too. o Bring yet more sanity to iflib_msix_init(): - If the device doesn't provide enough MSI-X vectors or not all vectors can be allocated so the expected number of queues in addition to admin interrupts can't be supported, try MSI next (and then INTx) as proper MSI-X vector distribution can't be assured in such cases. In essence, this change brings r254008 forward to iflib(4). Also, this is the fix alluded to in the commit message of r343934. - If the MSI-X allocation has failed, don't prematurely announce MSI is going to be used as the latter in fact may not be available either. - When falling back to MSI, only release the MSI-X table resource again if it was allocated in iflib_msix_init(), i. e. isn't supplied by the driver, in the first place. o In mp_ndesc_handler(), handle unknown type arguments gracefully, too. PR: 235031 (likely) [1] Reviewed by: shurd Differential Revision: https://reviews.freebsd.org/D20175	2019-05-07 08:28:35 +00:00
Marius Strobl	1722eeac95	- Remove the unused ifc_link_irq and ifc_mtx_name members of struct iflib_ctx. - Remove the only ever written to ift_db_mtx_name member of struct iflib_txq. - Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen and ifr_lro_enabled members of struct iflib_rxq. - Consistently spell DMA, RX and TX uppercase in comments, messages etc. instead of mixing with some lowercase variants. - Consistently use if_t instead of a mix of if_t and struct ifnet pointers. - Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and iflib_fl_setup() in line with reality. - Judging problem reports, people are wondering what on earth messages like: "TX(0) desc avail = 1024, pidx = 0" are trying to indicate. Thus, extend this string to be more like that of non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due to which the interface will be reset. - Take advantage of the M_HAS_VLANTAG macro. - Use false/true rather than FALSE/TRUE for variables of type bool. - Use FALLTHROUGH as advocated by style(9).	2019-05-06 20:56:41 +00:00
Matt Macy	e2621d9657	Allow iflib drivers to pass a pointer to their own ifmedia structure. Tested by: emaste@ Differential Revision: https://reviews.freebsd.org/D19946	2019-05-03 20:05:31 +00:00
Andrew Gallatin	35961dce98	Select lacp egress ports based on NUMA domain This change creates an array of port maps indexed by numa domain for lacp port selection. If we have lacp interfaces in more than one domain, then we select the egress port by indexing into the numa port maps and picking a port on the appropriate numa domain. This behavior is controlled by the new ifconfig use_numa flag and net.link.lagg.use_numa sysctl/tunable (both modeled after the existing use_flowid), which default to enabled. Reviewed by: bz, hselasky, markj (and scottl, earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20060	2019-05-03 14:43:21 +00:00
Ed Maste	ce3da455e9	iflib: remove assertion that isc_capabilities is nonzero It's atypical, but not invalid, for a driver to pass no capabilities. Submitted by: Gerald Aryeetey <aryeeteygerald_rogers.com> Reviewed by: shurd MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20142	2019-05-02 19:13:31 +00:00
Stephen Hurd	f154ece02e	iflib: Better control over queue core assignment By default, cores are now assigned to queues in a sequential manner rather than all NICs starting at the first core. On a four-core system with two NICs each using two queue pairs, the nic:queue -> core mapping has changed from this: 0:0 -> 0, 0:1 -> 1 1:0 -> 0, 1:1 -> 1 To this: 0:0 -> 0, 0:1 -> 1 1:0 -> 2, 1:1 -> 3 Additionally, a device can now be configured to use separate cores for TX and RX queues. Two new tunables have been added, dev.X.Y.iflib.separate_txrx and dev.X.Y.iflib.core_offset. If core_offset is set, the NIC is not part of the auto-assigned sequence. Reviewed by: marius MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D20029	2019-04-25 21:24:56 +00:00
Kyle Evans	e3a883c386	tap(4): Correct driver name... Reported by: rgrimes Pointy hat to: kevans MFC after: 3 days X-MFC-With: r346688	2019-04-25 18:26:34 +00:00
Kyle Evans	9ea63b2caa	tap(4): Add a MODULE_VERSION Otherwise tap(4) can be loaded by loader despite being compiled into the kernel, causing a panic as things try to double-initialize. PR: 220867 MFC after: 3 days	2019-04-25 18:22:22 +00:00
Kyle Evans	c83651445b	tun(4): Don't allow open of open or dying devices Previously, a pid check was used to prevent open of the tun(4); this works, but may not make the most sense as we don't prevent the owner process from opening the tun device multiple times. The potential race described near tun_pid should not be an issue: if a tun(4) is to be handed off, its fd has to have been sent via control message or some other mechanism that duplicates the fd to the receiving process so that it may set the pid. Otherwise, the pid gets cleared when the original process closes it and you have no effective handoff mechanism. Close up another potential issue with handing a tun(4) off by not clobbering state if the closer isn't the controller anymore. If we want some state to be cleared, we should do that a little more surgically. Additionally, nothing prevents a dying tun(4) from being "reopened" in the middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a bad time. Return EBUSY if we're marked for destruction, as well, and the consumer will need to deal with it. The associated character device will be destroyed in short order. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20033	2019-04-25 13:46:12 +00:00
Kyle Evans	d91262603b	tun/tap: close race between destroy/ioctl handler It seems that there should be a better way to handle this, but this seems to be the more common approach and it should likely get replaced in all of the places it happens... Basically, thread 1 is in the process of destroying the tun/tap while thread 2 is executing one of the ioctls that requires the tun/tap mutex and the mutex is destroyed before the ioctl handler can acquire it. This is only one of the races described/found in PR 233955. PR: 233955 Reviewed by: ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20027	2019-04-25 12:44:08 +00:00
Andrew Gallatin	6d49b41ee8	iflib: Add pfil hooks As with mlx5en, the idea is to drop unwanted traffic as early in receive as possible, before mbufs are allocated and anything is passed up the stack. This can save considerable CPU time when a machine is under a flooding style DOS attack. The major change here is to remove the unneeded abstraction where callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and are responsible for NULL'ing that mbuf themselves. Now this happens directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us to use the decision (and potentially mbuf) returned by the pfil hooks. The driver can now recycle mbufs to avoid re-allocation when packets are dropped. Reviewed by: marius (shurd and erj also provided feedback) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19645	2019-04-24 13:32:04 +00:00
Andrey V. Elsukov	aee793eec9	Add GRE-in-UDP encapsulation support as defined in RFC8086. This GRE-in-UDP encapsulation allows the UDP source port field to be used as an entropy field for load-balancing of GRE traffic in transit networks. Also, most multiqueue network cards are able to distribute incoming UDP datagrams to different NIC queues, while very few are able to do this for GRE packets. When an administrator enables UDP encapsulation with the command `ifconfig gre0 udpencap`, the driver creates a kernel socket that binds to the tunnel source address and, after udp_set_kernel_tunneling(), starts receiving all UDP packets destined to port 4754. Each kernel socket maintains a list of tunnels with different destination addresses. Thus, when several tunnels use the same source address, they are all handled by a single socket. The IP[V6]_BINDANY socket option is used to be able to bind the socket to the source address even if it is not yet available in the system. This may happen on system boot, when the gre(4) interface is created before the source address becomes available. The encapsulation and sending of packets is done directly from gre(4) into ip[6]_output() without using sockets. Reviewed by: eugen MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D19921	2019-04-24 09:05:45 +00:00
Kyle Evans	e8de0c3bda	tun(4): Defer clearing TUN_OPEN until much later tun destruction will not continue until TUN_OPEN is cleared. There are brief moments in tunclose where the mutex is dropped and we've already cleared TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle of cleaning up the tun still. tun_destroy should be blocked until these parts (address/route purges, mostly) are complete. PR: 233955 MFC after: 2 weeks	2019-04-23 17:28:28 +00:00
Andrew Gallatin	7687707dd4	Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory This commit adds new if_alloc_domain() and if_alloc_dev() methods to allocate ifnets. When called with a domain on a NUMA machine, if_alloc_domain() will record the NUMA domain in the ifnet, and it will allocate the ifnet struct from memory which is local to that NUMA node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain() which uses a driver supplied device_t to call if_alloc_domain() with the appropriate domain. Note that the new if_numa_domain field fits in an alignment pad in struct ifnet, and so does not alter the size of the structure. Reviewed by: glebius, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19930	2019-04-22 19:24:21 +00:00
Kyle Evans	1fd8c72c0a	iflib: Use new ether_gen_addr, restricting addresses to that subset Differential Revision: https://reviews.freebsd.org/D19587	2019-04-17 17:19:54 +00:00
Kyle Evans	3c3aa8c170	net: adjust randomized address bits Give devices that need a MAC a 16-bit allocation out of the FreeBSD Foundation OUI range. Change the name ether_fakeaddr to ether_gen_addr now that we're dealing with real MAC addresses with a real OUI rather than random locally-administered addresses. Reviewed by: bz, rgrimes Differential Revision: https://reviews.freebsd.org/D19587	2019-04-17 17:18:43 +00:00
Michael Tuexen	e6481fd4c4	When sending a routing message, don't allow the user to set the RTF_RNH_LOCKED flag in rtm_flags, since this flag is used only internally. Reported by: syzbot+65c676f5248a13753ea0@syzkaller.appspotmail.com Reviewed by: ae@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19898	2019-04-14 10:18:14 +00:00
Conrad Meyer	a8a16c7128	Replace read_random(9) with more appropriate arc4rand(9) KPIs Reviewed by: ae, delphij Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D19760	2019-04-04 01:02:50 +00:00
Mark Johnston	ca1163bd5f	Do not perform DAD on stf(4) interfaces. stf(4) interfaces are not multicast-capable so they can't perform DAD. They also did not set IFF_DRV_RUNNING when an address was assigned, so the logic in nd6_timer() would periodically flag such an address as tentative, resulting in interface flapping. Fix the problem by setting IFF_DRV_RUNNING when an address is assigned, and do some related cleanup: - In in6if_do_dad(), remove a redundant check for !UP || !RUNNING. There is only one caller in the tree, and it only looks at whether the return value is non-zero. - Have in6if_do_dad() return false if the interface is not multicast-capable. - Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface and the interface goes UP as a result. Note that this is not sufficient to fix the problem because the new address is marked as tentative and DAD is started before in6_ifattach() is called. However, setting no_dad is formally correct. - Change nd6_timer() to not flag addresses as tentative if no_dad is set. This is based on a patch from Viktor Dukhovni. Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org> Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19751	2019-03-30 18:00:44 +00:00
John Baldwin	841613dcdc	Use a dedicated malloc type for lagg(4)'s structures. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19719	2019-03-28 21:00:54 +00:00
Eric Joyner	225eae1bb7	iflib: return ENETDOWN when the network device is down From Jake: iflib_if_transmit returns ENOBUFS when the device is down, or when the link isn't active. This was changed in r308792 from return (0), so that the function correctly reports an error that it was unable to transmit. However, using ENOBUFS can cause some network applications to produce the following or similar errors: "ping: sendto: No buffer space available" This is a bit confusing as the real cause of the issue is that the network device is down. Replace the ENOBUFS return with ENETDOWN to indicate more clearly that the reason for the failure to send is that the network device is offline. This will cause the error message to be reported as "ping: sendto: Network is down" Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, sbruno@, bz@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19652	2019-03-28 20:46:45 +00:00
Eric Joyner	aac9c817af	iflib: hold the CTX lock in iflib_pseudo_register From Jake: The iflib_device_register function takes the CTX lock before calling IFDI_ATTACH_PRE, and releases it upon finishing the registration. Mirror this process in iflib_pseudo_register, so that we always hold the CTX lock during the attach process when registering a pseudo interface or a regular interface. This was caught by code inspection while attempting to analyze where the CTX lock was held. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19604	2019-03-28 20:43:47 +00:00
John Baldwin	2f59b04af1	Remove nested epochs from lagg(4). lagg_bcast_start appeared to have a bug in that it was using the last lagg port structure after exiting the epoch that was keeping that structure alive. However, upon further inspection, the epoch was already entered by the caller (lagg_transmit), so the epoch enter/exit in lagg_bcast_start was actually unnecessary. This commit generally removes uses of the net epoch via LAGG_RLOCK to protect the list of ports when the list of ports was already protected by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK. It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while accessing the lagg port structures. An ifp is still accessed via an unsafe reference after the epoch is exited, but that is true in the current code and will be fixed in a future change. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19718	2019-03-28 20:25:36 +00:00
Kyle Evans	93c9d31918	if_bridge(4): ensure all traffic passing over the bridge is accounted for Consider a bridge0 with em0 and em1 members. Traffic rx'd by em0 and transmitted by bridge0 through em1 gets accounted for in IPACKETS/IBYTES and bridge0 bpf -- assuming it's not unicast traffic destined for em1. Unicast traffic destined for em1 is not accounted for by any mechanism, and isn't pushed through bridge0's bpf machinery as any other packets that pass over the bridge are. Fix this and simplify GRAB_OUR_PACKETS by bailing out early if it was rx'd by the interface that it was addressed for. Everything else there is relevant for any traffic that came in from one member that's being directed at another member of the bridge. Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19614	2019-03-28 03:31:51 +00:00
Eric Joyner	10a1e981d4	iflib: mark isc_driver_version as constant From Jake: The iflib core never modifies the isc_driver_version string. Allow drivers to safely assign pointers to constant buffers by marking this parameter const. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, jhb@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19577	2019-03-19 23:44:26 +00:00
Eric Joyner	1b9d93948a	iflib: expose the Rx mbuf buffer size to drivers From Jake: iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based on the isc_max_frame_size value that drivers setup. This calculation is repeated by drivers when programming their hardware with the size of each Rx buffer. This can lead to a mismatch where the iflib mbuf size is different from the expected size of the buffer as programmed by the hardware. This can lead to unexpected results. If iflib ever wants to support mbuf sizes larger than one page, every driver must be updated to account for the new possible buffer sizes. Fix this by calculating the mbuf size prior to calling IFDI_INIT, and adding the iflib_get_rx_mbuf_sz function which will expose this value to drivers, so that they do not repeat the same calculation. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19489	2019-03-19 17:59:56 +00:00
Eric Joyner	3e8d1bae5f	iflib: prevent possible infinite loop in iflib_encap From Jake: iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an m_collapse and an m_defrag are attempted to shrink the mbuf cluster to fit within the DMA segment limitations. However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns EFBIG on the now defragmented mbuf, we will continuously re-call bus_dmamap_load_mbuf_sg over and over. This happens because m_head isn't NULL, and remap is >1, so we don't try to m_collapse or m_defrag again. The only way we exit the loop is if m_head is NULL. However, m_head can't be modified by the call to bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer. I believe this will be an incredibly rare occurrence, because it is unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second defragment with an EFBIG error. However, it still seems like a possibility that we should account for. Fix the exit check to ensure that if remap is >1, we will also exit, even if m_head is not NULL. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19468	2019-03-19 17:49:03 +00:00
Andrey V. Elsukov	c5be49da01	Convert allocation of bpf_if in bpfattach2 from M_NOWAIT to M_WAITOK and remove possible panic condition. It is already allowed to sleep in bpfattach[2], since BPF_LOCK was converted to SX lock in r332388. Also move KASSERT() to the top of function and make full initialization before bpf_if will be linked to BPF's list of interfaces. MFC after: 2 weeks	2019-03-19 10:29:32 +00:00
Vincenzo Maffione	d12354a56c	netmap: add support for multiple host rings Some applications forward from/to host rings most or all the traffic received or sent on a physical interface. In these cases it is desirable to have more than a pair of RX/TX host rings, and use multiple threads to speed up forwarding. This change adds support for multiple host rings. On registering a netmap port, the user can specify the number of desired receive and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings fields of the nmreq_register structure. MFC after: 2 weeks	2019-03-18 12:22:23 +00:00
Kyle Evans	4920f9a348	if_bridge(4): Drop pointless rtflush At this point, all routes should've already been dropped by removing all members from the bridge. This condition is in-fact KASSERT'd in the line immediately above where this nop flush was added.	2019-03-15 17:19:36 +00:00
Kyle Evans	6e6b93fe1d	Revert r345192: Too many trees in play for bridge(4) bits An accidental appendage was committed that has not undergone review yet.	2019-03-15 17:18:19 +00:00
Kyle Evans	4b4b284d95	if_bridge(4): Drop pointless rtflush At this point, all routes should've already been dropped by removing all members from the bridge. This condition is in-fact KASSERT'd in the line immediately above where this nop flush was added.	2019-03-15 17:13:05 +00:00
Kristof Provost	43d3127ca7	bridge: Fix STP-related panic After r345180 we need to have the appropriate vnet context set to delete an rtnode in bridge_rtnode_destroy(). That's usually the case, but not when it's called by the STP code (through bstp_notify_rtage()). We have to set the vnet context in bridge_rtable_expire() just as we do in the other STP callback bridge_state_change(). Reviewed by: kevans	2019-03-15 15:52:36 +00:00
Kyle Evans	a87407ff85	if_bridge(4): Fix module teardown bridge_rtnode_zone still has outstanding allocations at the time of destruction in the current model because all of the interface teardown happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been processed. The SYSUNINIT triggers destruction of the interfaces, which then attempts to free the memory from the zone that's already been destroyed, and we hit a panic. Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix the ordering. bridge_rtable_fini should also take care to flush any remaining routes that weren't taken care of when dynamic routes were flushed in bridge_stop. Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D19578	2019-03-15 13:19:52 +00:00
Kristof Provost	d6747eafa9	bridge: Fix panic if the STP root is removed If the spanning tree root interface is removed from the bridge we panic on the next 'ifconfig'. While the STP code is notified whenever a bridge member interface is removed from the bridge it does not clear the bs_root_port. This means bs_root_port can still point at a bridge_iflist which has been free()d. The next access to it will panic. Explicitly check if the interface we're removing in bstp_destroy() is the root, and if so re-assign the roles, which clears bs_root_port. Reviewed by: philip MFC after: 2 weeks	2019-03-15 11:21:20 +00:00
Kristof Provost	5904868691	pf: Use counter(9) in pf tables. The counters of pf tables are updated outside the rule lock. That means state updates might overwrite each other. Furthermore allocation and freeing of counters happens outside the lock as well. Use counter(9) for the counters, and always allocate the counter table element, so that the race condition cannot happen any more. PR: 230619 Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net> Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19558	2019-03-15 11:08:44 +00:00
Kyle Evans	521b05ea52	ether_fakeaddr: Use 'b' 's' 'd' for the prefix This has the advantage of being obvious to sniff out the designated prefix by eye and it has all the right bits set. Comment stolen from ffec. I've removed bryanv@'s pending question of using the FreeBSD OUI range -- no one has followed up on this with a definitive action, and there's no particular reason to shoot for it and the administrative overhead that comes with deciding exactly how to use it.	2019-03-14 19:48:43 +00:00
Kyle Evans	6b7e0c1cca	ether: centralize fake hwaddr generation We currently have two places with identical fake hwaddr generation -- if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other interfaces that may also need a fake addr. Reviewed by: bryanv, kp, philip Differential Revision: https://reviews.freebsd.org/D19573	2019-03-14 17:18:00 +00:00
Gleb Smirnoff	c93410229c	Most Ethernet drivers that potentially can run a pfil(9) hook with PFIL_MEMPTR flag are intentionally providing a memory address that isn't aligned to pointer alignment. This is done to align an IPv4 or IPv6 header that is expected to follow Ethernet header. When we return PFIL_REALLOCED we store a pointer to allocated mbuf at this address. With this change the KPI changes to store the pointer at aligned address, which usually yields in +2 bytes. Provide two inlines: pfil_packet_align() to get aligned pfil_packet_t for a misaligned one pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t Provide function pfil_realloc(), not used yet, that would convert a memory pfil_packet_t to an mbuf one. Reported by: hps Reviewed by: hps, gallatin	2019-03-10 17:20:09 +00:00
Gleb Smirnoff	b9fdb4b3a3	Properly handle a case when a first filter returns PFIL_REALLOCED, then second one returns PFIL_PASS.	2019-03-10 17:08:05 +00:00
Bjoern A. Zeeb	b25d74e06c	Improve ARP logging. r344504 added an extra ARP_LOG() call in case of an if_output() failure. It turns out IPv4 can be noisy. In order to not spam the console by default: (a) add a counter for these events so people can keep better track of how often it happens, and (b) add a sysctl to select the default ARP_LOG log level and set it to INFO avoiding the one (the new) DEBUG level by default. Claim a spare (1st one after 10 years since the stats were added) in order to not break netstat from FreeBSD 12->13 updates in the future. Reviewed by: karels Differential Revision: https://reviews.freebsd.org/D19490	2019-03-09 01:12:59 +00:00
Bjoern A. Zeeb	21231a7aa6	Update for IETF draft-ietf-6man-ipv6only-flag. All changes are hidden behind the EXPERIMENTAL option and are not compiled in by default. Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode manually without router advertisement options. This will allow developers to test software for the appropriate behaviour even on dual-stack networks or IPv6-Only networks without the option being set in RA messages. Update ifconfig to allow setting and displaying the flag. Update the checks for the filters to check for either the automatic or the manual flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add an input filter similar to the output filter. Add a check, when receiving the IPv6-Only RA flag to see if the receiving interface has any IPv4 configured. If it does, ignore the IPv6-Only flag. Add a per-VNET global sysctl, which is on by default, to not process the automatic RA IPv6-Only flag. This way an administrator (if this is compiled in) has control over the behaviour in case the node still relies on IPv4.	2019-03-06 23:31:42 +00:00
Eric Joyner	bc408c7d61	Remove references to CONTIGMALLOC_WORKS in iflib and em From Jake: "The iflib_fl_setup() function tries to pick various buffer sizes based on the max_frame_size value defined by the parent driver. However, this code was wrapped under CONTIGMALLOC_WORKS, which was never actually defined anywhere. This same code pattern was used in if_em.c, likely trying to match what iflib uses. Since CONTIGMALLOC_WORKS is not defined, remove this dead code from iflib_fl_setup and if_em.c Given that various iflib drivers appear to be using a similar calculation, it might be worth making this buffer size a value that the driver can peek at in the future." Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19199	2019-03-05 19:12:51 +00:00
Kristof Provost	5ea5849a7b	tun: VIMAGE fix for if_tun cloner The if_tun cloner is not virtualised, but if_clone_attach() does use a virtualised list of cloners. The result is that we can't find the if_tun cloner when we try to remove a renamed tun interface. Virtualise the cloner, and move the final cleanup into a sysuninit so that we're sure this happens after all of the vnet_sysuninits Note that we need unit numbers to be system-unique (rather than unique per vnet, as is done by if_clone_simple()). The unit number is used to create the corresponding /dev/tunX device node, and this node must match with the interface. Switch to if_clone_advanced() so that we have control over the unit numbers. Reproduction scenario: jail -c -n foo persist vnet jexec test ifconfig tun create jexec test ifconfig tun0 name wg0 jexec test ifconfig wg0 destroy PR: 235704 Reviewed by: bz, hrs, hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19248	2019-03-05 13:21:07 +00:00
Alexander Motin	c3c93809f6	bridge: Fix spurious warnings about capabilities Mask off the bits we don't care about when checking that capabilities of the member interfaces have been disabled as intended. Submitted by: Ryan Moeller <ryan@ixsystems.com> Reviewed by: kristof, mav MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D18924	2019-03-04 22:01:09 +00:00
Stephen Hurd	ca62461bc6	iflib: Improve return values of interrupt handlers. iflib was returning FILTER_HANDLED, in cases where FILTER_STRAY was more correct. This potentially caused issues with shared legacy interrupts. Driver filters returning FILTER_STRAY are now properly handled. Submitted by: Augustin Cavalier <waddlesplash@gmail.com> Reviewed by: marius, gallatin Obtained from: Haiku (a84bb9, 4947d1) MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D19201	2019-02-15 18:51:43 +00:00
Randall Stewart	fa91f84502	This commit adds the missing release mechanism for the ratelimiting code. The two modules (lagg and vlan) did have allocation routines, and even though they are indirect (and vector down to the underlying interfaces) they both need to have a free routine (that also vectors down to the actual interface). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D19032	2019-02-13 14:57:59 +00:00
Marius Strobl	a6611c938b	Fix the build with ALTQ after r344060.	2019-02-12 22:33:17 +00:00
Marius Strobl	f855ec814d	Make taskqgroup_attach{,_cpu}(9) work across architectures So far, intr_{g,s}etaffinity(9) take a single int for identifying a device interrupt. This approach doesn't work on all architectures supported, as a single int isn't sufficient to globally specify a device interrupt. In particular, with multiple interrupt controllers in one system as found on e. g. arm and arm64 machines, an interrupt number as returned by rman_get_start(9) may be only unique relative to the bus and, thus, interrupt controller, a certain device hangs off from. In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}() not work across architectures. Yet in turn, iflib(4) as gtaskqueue consumer so far doesn't fit architectures where interrupt numbers aren't globally unique. However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as employed by the gtaskqueue implementation to bind an interrupt to a particular CPU, using bus_bind_intr(9) instead is equivalent from a functional point of view, with bus_bind_intr(9) taking the device and interrupt resource arguments required for uniquely specifying a device interrupt. Thus, change the gtaskqueue implementation to employ bus_bind_intr(9) instead and intr_{g,s}etaffinity(9) to take the device and interrupt resource arguments required respectively. This change also moves struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL as userland likes to include <sys/_task.h> or indirectly drags it in - for better or worse also with _KERNEL defined -, which with device_t and struct resource dependencies otherwise is no longer as easily possible now. 
The userland inclusion problem probably can be improved a bit by introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the existing _WANT_PRISON etc., which is orthogonal to this change, though, and likely needs an exp-run. While at it: - Change the gt_cpu member in the grouptask structure to be of type int as used elsewhere for specifying CPUs (an int16_t may be too narrow sooner or later), - move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to the gtaskqueue implementation as it's only used and needed there, - change the GTASK_INIT macro to use "gtask" rather than "task" as argument given that it actually operates on a struct gtask rather than a struct task, and - let subr_gtaskqueue.c consistently use __func__ to print function names. Reported by: mmel Reviewed by: mmel Differential Revision: https://reviews.freebsd.org/D19139	2019-02-12 21:23:59 +00:00
Marius Strobl	95dcf343b7	Further correct and optimize the bus_dma(9) usage of iflib(4): o Correct the obvious bugs in the netmap(4) parts: - No longer check for the existence of DMA maps as bus_dma(9) is used unconditionally in iflib(4) since r341095. - Supply the correct DMA tag and map pairs to bus_dma(9) functions (see also the commit message of r343753). - In iflib_netmap_timer_adjust(), add synchronization of the TX descriptors before calling the ift_txd_credits_update method as the latter evaluates the TX descriptors possibly updated by the MAC. - In _task_fn_tx(), wrap the netmap(4)-specific bits in #ifdef DEV_NETMAP just as done in _task_fn_admin() and _task_fn_rx() respectively. o In iflib_fast_intr_rxtx(), synchronize the TX rather than the RX descriptors before calling the ift_txd_credits_update method (see also above). o There's no need to synchronize an RX buffer that is going to be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to do that as late as passing RX buffers to the MAC via the ift_rxd_refill method. Hence, combine that synchronization with the synchronization of new buffers into a common spot in _iflib_fl_refill(). o There's no need to synchronize the RX descriptors of a free list in preparation of the MAC updating their statuses with every invocation of rxd_frag_to_sd(); it's enough to do this once before handing control over to the MAC, i. e. before calling ift_rxd_flush method in _iflib_fl_refill(), which already performs the necessary synchronization. o Given that the ift_rxd_available method evaluates the RX descriptors which possibly have been altered by the MAC, synchronize as appropriate beforehand. Most notably this is now done in iflib_rxd_avail(), which in turn means that we don't need to issue the same synchronization yet again before calling the ift_rxd_pkt_get method in iflib_rxeof(). 
o In iflib_txd_db_check(), synchronize the TX descriptors before handing them over to the MAC for transmission via the ift_txd_flush method. o In iflib_encap(), move the TX buffer synchronization after the invocation of the ift_txd_encap() method. If the MAC driver fails to encapsulate the packet and we retry with a defragmented mbuf chain or finally fail, the cycles for TX buffer synchronization have been wasted. Synchronizing afterwards matches what non-iflib(4) drivers typically do and is sufficient as the MAC will not actually start with the transmission before - in this case - the ift_txd_flush method is called. Moreover, for the latter reason the synchronization of the TX descriptors in iflib_encap() can go as it's enough to synchronize them before passing control over to the MAC by issuing the ift_txd_flush() method (see above). o In iflib_txq_can_drain(), only synchronize TX descriptors if the ift_txd_credits_update method accessing these is actually called. Differential Revision: https://reviews.freebsd.org/D19081	2019-02-12 21:08:44 +00:00
Patrick Kelsey	8f2ac65690	Reduce the time it takes the kernel to install a new PF config containing a large number of queues In general, the time savings come from separating the active and inactive queue lists into separate interface and non-interface queue lists, and changing the rule and queue tag management from list-based to hash-based. In HFSC, a linear scan of the class table during each queue destroy was also eliminated. There are now two new tunables to control the hash size used for each tag set (default for each is 128): net.pf.queue_tag_hashsize net.pf.rule_tag_hashsize Reviewed by: kp MFC after: 1 week Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D19131	2019-02-11 05:17:31 +00:00
Marius Strobl	bfce461ee9	o As illustrated by e. g. figure 7-14 of the Intel 82599 10 GbE controller datasheet revision 3.3, in the context of Ethernet MACs the control data describing the packet buffers typically are named "descriptors". Each of these descriptors references one buffer, multiple of which a packet can be composed of. By contrast, in comments, messages and the names of structure members, iflib(4) refers to DMA resources employed for RX and TX buffers (rather than control data) as "desc(riptors)". This odd naming convention of iflib(4) made reviewing r343085 and identifying wrong and missing bus_dmamap_sync(9) calls in particular way harder than it already is. This convention may also explain why the netmap(4) part of iflib(4) pairs the DMA tags for control data with DMA maps of buffers and vice versa in calls to bus_dma(9) functions. Therefore, change iflib(4) to refer to buf(fers) when buffers and not the usual understanding of descriptors is meant. This change does not include corrections to the DMA resources used in the netmap(4) parts. However, it revises error messages to state which kind of allocation/creation failed. Specifically, the "Unable to allocate tx_buffer (map) memory" copy & pasted inappropriately on several occasions was replaced with proper messages. o Enhance some other error messages to indicate which half - RX or TX - they apply to instead of using identical text in both cases and generally canonicalize them. o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect reality; current code doesn't use {r,t}x_buffer structures. o In iflib_queues_alloc(): - Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls, - change the M_WAITOK from malloc(9) calls into M_NOWAIT. The return values are already checked, deferred DMA allocations not being an option at this point, BUS_DMA_NOWAIT has to be used anyway and prior malloc(9) calls in this function also specify M_NOWAIT. 
Reviewed by: shurd Differential Revision: https://reviews.freebsd.org/D19067	2019-02-04 20:46:57 +00:00
Gleb Smirnoff	3ca1c423aa	Teach pfil_ioctl() about VIMAGE. Submitted by: gallatin	2019-02-03 08:28:02 +00:00
Vincenzo Maffione	5faab77822	netmap: upgrade sync-kloop support Add SYNC_KLOOP_MODE option, and add support for direct mode, where application executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback. MFC after: 5 days	2019-02-02 22:39:29 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI has been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as the kernel uses the same command structures that the userland ioctl uses. In a nutshell, the [KA]PI is about declaring filtering points, declaring filters and linking and unlinking them together. The new [KA]PI makes it possible to reconfigure the pfil(9) configuration: change the order of hooks, rehook a filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to an existing list of filters. Now it is possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of the existing packet filters support that yet, however limited usage is already possible, e.g. the default ruleset can be moved to a single interface, as soon as interfaces provide their filtering points. Another future feature is the possibility to create pfil heads that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when a packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
John Baldwin	829c56fc08	Don't set IFCAP_TXRTLMT during lagg_clone_create(). lagg_capabilities() will set the capability once interfaces supporting the feature are added to the lagg. Setting it on a lagg without any interfaces is pointless as the if_snd_tag_alloc call will always fail in that case. Reviewed by: hselasky, gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19040	2019-01-31 21:35:37 +00:00
Gleb Smirnoff	f712b16127	Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock. The pfil(9) system is about to be converted to epoch(9) synchronization, so we need to [temporarily] go back to ipfw internal locking. Discussed with: ae	2019-01-31 21:04:50 +00:00
Marius Strobl	b97de13ae0	- Stop iflib(4) from leaking MSI messages on detachment by calling bus_teardown_intr(9) before pci_release_msi(9). - Ensure that iflib(4) and associated drivers pass correct RIDs to bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9) on the corresponding resources instead of using the RIDs initially passed to bus_alloc_resource_any(9) as the latter function may change those RIDs. Solely em(4) for the ioport resource (but not others) and bnxt(4) were using the correct RIDs by caching the ones returned by bus_alloc_resource_any(9). - Change the logic of iflib_msix_init() around to only map the MSI-X BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns > 0. Otherwise the "Unable to map MSIX table " message triggers for devices that simply don't support MSI-X and the user may think that something is wrong while in fact everything works as expected. - Put some (mostly redundant) debug messages emitted by iflib(4) and em(4) during attachment under bootverbose. The non-verbose output of em(4) seen during attachment now is close to the one prior to the conversion to iflib(4). - Replace various variants of spelling "MSI-X" (several in messages) with "MSI-X" as used in the PCI specifications. - Remove some trailing whitespace from messages emitted by iflib(4) and change them to consistently start with uppercase. - Remove some obsolete comments about releasing interrupts from drivers and correct a few others. Reviewed by: erj, Jacob Keller, shurd Differential Revision: https://reviews.freebsd.org/D18980	2019-01-30 13:21:26 +00:00
Marius Strobl	3db348b54a	- In _iflib_fl_refill(), don't mark an RX buffer as available in the corresponding bitmap before adding an mbuf has actually succeeded. Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the RX ring but not in its bitmap. One implication of such a hole was that in a subsequent call to _iflib_fl_refill() with the RX buffer accounting still indicating another reclaimable buffer, bit_ffc(3) nevertheless returned -1 in frag_idx which in turn caused havoc when used as an index. Thus, additionally assert that frag_idx is 0 or greater. Another possible consequence of a hole in the RX ring was a NULL- dereference when trying to use the unallocated mbuf, for example in iflib_rxd_pkt_get(). While at it, make the variable declarations in _iflib_fl_refill() conform to style(9) and remove redundant checks already performed by bit_ffc{,_at}(3). - In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3). Reported and tested by: pho	2019-01-26 21:35:51 +00:00
Andrew Gallatin	77102fd6a2	Fix an iflib driver unload panic introduced in r343085 The new loop to sync and unload descriptors was indexed by "i", rather than "j". The panic was caused by "i" being advanced rather than "j", and eventually becoming out of bounds. Reviewed by: kib MFC after: 3 days Sponsored by: Netflix	2019-01-25 15:02:18 +00:00
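
The class of bug described in 77102fd6a2 (an inner loop that advances the outer index "i" instead of its own index "j") can be illustrated with a minimal userland sketch. This is not the iflib(4) code: struct fake_queue, NQUEUES, NBUFS and unload_all() are hypothetical names invented for the illustration, and the "loaded" flag merely stands in for a DMA map that still needs to be synced and unloaded.

```c
#include <stddef.h>

#define NQUEUES 2	/* hypothetical sizes, for illustration only */
#define NBUFS   4

/* Hypothetical stand-in for a queue owning NBUFS buffer maps. */
struct fake_queue {
	int loaded[NBUFS];	/* 1 if the buffer's map is still loaded */
};

/*
 * "Unload" every buffer of every queue and return how many were unloaded.
 * The inner loop must test and advance its own index "j".  Advancing "i"
 * there instead would walk "i" past nqueues and index q[] out of bounds,
 * which is the kind of failure the commit message describes.
 */
static int
unload_all(struct fake_queue *q, size_t nqueues)
{
	size_t i, j;
	int count = 0;

	for (i = 0; i < nqueues; i++) {
		for (j = 0; j < NBUFS; j++) {	/* correct: "j++", not "i++" */
			if (q[i].loaded[j]) {
				q[i].loaded[j] = 0;
				count++;
			}
		}
	}
	return (count);
}
```

Keeping each loop's condition and increment on the same variable is what bounds the accesses to q[i].loaded[j]; a second call on the same array finds nothing left to unload.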


4214 Commits