freebsd-skq

Author	SHA1	Message	Date
Konstantin Belousov	247cf5664e	Add SIOCGIFDOWNREASON. The ioctl(2) is intended to provide more details about the cause of the down for the link. Eventually we might define a comprehensive list of codes for the situations. But interface also allows the driver to provide free-form null-terminated ASCII string to provide arbitrary non-formalized information. Sample implementation exists for mlx5(4), where the string is fetched from firmware controlling the port. Reviewed by: hselasky, rrs Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21527	2019-09-17 18:49:13 +00:00
Kyle Evans	40b1c921bd	SIOCSIFNAME: Do nothing if we're not actually changing Instead of throwing EEXIST, just succeed if the name isn't actually changing. We don't need to trigger departure or any of that because there's no change from consumers' perspective. PR: 240539 Reviewed by: brooks MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D21618	2019-09-12 15:36:48 +00:00
Hans Petter Selasky	180fecd5b6	Callout drain does not have to be followed by a callout stop call. Fix bogus code. MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-10 14:33:07 +00:00
Li-Wen Hsu	4835262b68	Fix build for the platforms where db_expr_t is not long Sponsored by: The FreeBSD Foundation	2019-09-10 08:51:11 +00:00
Conrad Meyer	0b12ab8111	Appease Clang false-positive Werrors in r352112 Reported by: bcran	2019-09-10 01:56:47 +00:00
Conrad Meyer	8b6acd2b51	ddb(4): Add 'show route <dest>' and 'show routetable [<af>]' These commands show the route resolved for a specified destination, or print out the entire routing table for a given address family (or all families, if none is explicitly provided). Discussed with: emaste Differential Revision: https://reviews.freebsd.org/D21510	2019-09-09 22:54:27 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Vincenzo Maffione	253b2ec199	netmap: import changes from upstream (SHA 137f537eae513) - Rework option processing. - Use larger integers for memory size values in the memory management code. MFC after: 2 weeks	2019-09-01 14:47:41 +00:00
Matt Joras	16cf6bdbb6	Wrap a vlan's parent's if_output in a separate function. When a vlan interface is created, its if_output is set directly to the parent interface's if_output. This is fine in the normal case but has an unfortunate consequence if you end up with a certain combination of vlan and lagg interfaces. Consider you have a lagg interface with a single laggport member. When an interface is added to a lagg its if_output is set to lagg_port_output, which blackholes traffic from the normal networking stack but not certain frames from BPF (pseudo_AF_HDRCMPLT). If you now create a vlan with the laggport member (not the lagg interface) as its parent, its if_output is set to lagg_port_output as well. While this is confusing conceptually and likely represents a misconfigured system, it is not itself a problem. The problem arises when you then remove the lagg interface. Doing this resets the if_output of the laggport member back to its original state, but the vlan's if_output is left pointing to lagg_port_output. This gives rise to the possibility that the system will panic when e.g. bpf is used to send any frames on the vlan interface. Fix this by creating a new function, vlan_output, which simply wraps the parent's current if_output. That way when the parent's if_output is restored there is no stale usage of lagg_port_output. Reviewed by: rstone Differential Revision: D21209	2019-08-30 20:19:43 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Kyle Evans	5c4eed8601	tuntap: belatedly add MODULE_VERSION for if_tun and if_tap When tun/tap were merged, appropriate MODULE_VERSION should have been added for things like modfind(2) to continue to do the right thing with the old names. Reported by: jhb	2019-08-19 19:01:59 +00:00
Vincenzo Maffione	b5b83671ea	if_tuntap: minor improvements Rewrite a loop to avoid duplicating the exit condition. Simplify mask processing in tunpoll(). Fix minor typos. Reviewed by: kevans, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21302	2019-08-19 17:23:22 +00:00
Eric Joyner	f4aa9b67eb	net: Update SFF-8024 definitions and strings with values from rev 4.6 This will let ifconfig -v's SFF eeprom read functionality recognize more module types. Signed-off-by: Eric Joyner <erj@freebsd.org> Reviewed by: gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21041	2019-08-17 00:10:56 +00:00
Eric Joyner	566144142e	iflib: add iflib_deregister to help cleanup on exit Commit message by Jake: The iflib_register function exists to allocate and setup some common structures used by both iflib_device_register and iflib_pseudo_register. There is no associated cleanup function used to undo the steps taken in this function. Both iflib_device_deregister and iflib_pseudo_deregister have some of the necessary steps scattered in their flow. However, most of the necessary cleanup is not done during the error path of iflib_device_register and iflib_pseudo_register. Some examples of missed cleanup include: the ifp pointer is not free'd during error cleanup the STATE and CTX locks are not destroyed during error cleanup the vlan event handlers are not removed during error cleanup media added to the ifmedia structure is not removed the kobject reference is never deleted Additionally, when initializing the kobject class reference counter is increased even though kobj_init already increases it. This results in the class never being free'd again because the reference count would never hit zero even after all driver instances are unloaded. To aid in proper cleanup, implement an iflib_deregister function that goes through the reverse steps taken by iflib_register. Call this function during the error cleanup for iflib_device_register and iflib_pseudo_register. Additionally call the function in the iflib_device_deregister and iflib_pseudo_deregister functions near the end of their flow. This helps reduce code duplication and ensures that proper steps are taken to cleanup allocations and references in both the regular and error cleanup flows. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21005	2019-08-16 23:33:44 +00:00
George V. Neville-Neil	d8dc4e350f	Properly validte arguments for route deletion Reported by: Liang Zhuo brightiup.zhuo@gmail.com MFC after: 1 week	2019-08-03 14:42:07 +00:00
Eric Joyner	197c679824	iflib: Prevent kernel panic caused by loading driver with a specific interrupt configuration If a device has only 1 MSI-X interrupt available and does not support either MSI or legacy interrupts, iflib_device_register() will fail, leak memory and MSI resources, and the driver will not load. Worse, if another iflib-using driver tries to unload afterwards, a kernel panic will occur because the previous failed iflib driver loead did not properly call "taskqgroup_detach()" during it's cleanup. This patch is band-aid for this situation -- don't try allocating MSI or legacy interrupts if a single MSI-X interrupt was allocated, but fail to load instead. As well, during the cleanup, properly call taskqgroup_detach() on the admin task to prevent panics when other iflib drivers unload. This whole interrupt allocation process actually needs re-doing to properly support devices with only a single MSI-X interrupt, devices that only support MSI-X, non-PCI devices, and multiple non-MSIX interrupts, as well. Signed-off-by: Eric Joyner <erj@freebsd.org> Reviewed by: marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D20747	2019-08-01 17:37:25 +00:00
Eric Joyner	6a3f243b04	iflib: remove kobject class reference increment Commit message from Jake: In iflib_register, the context is initialized as a kobject using the device driver's "driver" kobject class. As part of this, the function mistakenly increments the ref counter. The ref counter is incremented twice, once in the code directly, and once again by kobj_class_compile. However, there is no associated decrement in the detach path. Because of this, the ref counter will never go back down to zero, and thus the kobject method table will never be released. Remove this unnecessary reference count increment. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: jhb@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21125	2019-08-01 17:28:36 +00:00
Randall Stewart	20abea6663	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
Ed Maste	1082be6554	ppp: correct echo-req magic number on big endian archs The magic number is a 32-bit quantity; use uint32_t to match hton's return type and avoid sending zeros (upper 32 bits) on big-endian architectures. PR: 184141 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-01 13:42:58 +00:00
Kyle Evans	0dbac71f19	if_tuntap(4): Add TUNGIFNAME This effectively just moves TAPGIFNAME into common ioctl territory. MFC after: 3 days	2019-07-25 22:23:34 +00:00
Eric Joyner	7f3f6aad3e	iflib: fix dangling device softc pointer Commit text by Jake: If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register function will free the ctx pointer. However, it does not reset the device softc pointer to NULL. This will result in memory corruption as a future access to the now invalid pointer will corrupt memory that is later allocated on top of the same memory location. The iflib_device_deregister function correctly resets the softc pointer by using device_set_softc(). This clears up the invalid dangling pointer and prevents memory corruption that could lead to a panic or undefined behavior if the device's driver failed to attach. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21003	2019-07-24 21:43:41 +00:00
Kirill Ponomarev	b7592822d5	Allow set MTU more than 1500 bytes. Submitted by: Alexandr Fedorov <aleksandr.fedorov_itglobal_dot_com> Approved by: jhb, rgrimes Sponsored by: ITGlobal.com Differential Revision: https://reviews.freebsd.org/D19422	2019-07-24 16:10:20 +00:00
Chuck Tuffli	94c15665a5	Fix a typo in r349969 OUI_FRREBSD_NVME_HIGH should have been OUI_FREEBSD_NVME_HIGH Caught by: Gary Jennejohn	2019-07-14 03:49:48 +00:00
Chuck Tuffli	409a80e5a4	bhyve: Create EUI64 for NVMe namespaces Accept an IEEE Extended Unique Identifier (EUI-64) from the command line for each NVMe namespace. If one isn't provided, it will create one based on the CRC16 of: - the FreeBSD IEEE OUI - PCI bus, device/slot, function values - Namespace ID Reviewed by: imp, araujo, jhb, rgrimes Approved by: imp (mentor), jhb (maintainer) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19905	2019-07-13 12:48:28 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
John Baldwin	66d0c056be	Support IFCAP_NOMAP in vlan(4). Enable IFCAP_NOMAP for a vlan interface if it is supported by the underlying trunk device. Reviewed by: gallatin, hselasky, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:51:38 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	0dbdf04125	Need to wait for epoch callbacks to complete before detaching a network interface. This particularly manifests itself when an INP has multicast options attached during a network interface detach. Then the IPv4 and IPv6 leave group call which results from freeing the multicast address, may access a freed ifnet structure. These are the steps to reproduce: service mdnsd onestart # installed from ports ifconfig epair create ifconfig epair0a 0/24 up ifconfig epair0a destroy Tested by: pho @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-28 10:49:04 +00:00
Marius Strobl	c2c5d1e787	o In iflib_txq_drain(): - Remove desc_used, which is only ever written to. - Remove a dead store to reclaimed. - Don't recycle avail. - Sort variables according to style(9). These changes will make a subsequent commit easier to read. o In iflib_tx_credits_update(), don't bother checking whether the ift_txd_credits_update method pointer is NULL; _iflib_pre_assert() asserts upfront that this method has been assigned and functions like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}() and _task_fn_tx() were already unconditionally relying on the method being callable.	2019-06-26 15:28:21 +00:00
Leandro Lupori	e2edff4167	[PowerPC64] Don't mark module data as static Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler. Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), code generated tries to access data in the wrong location, causing kernel panic (data storage interrupt trap) when loading if_epair and ipfw. Issue was reproduced with kernel/module compiled using gcc8 and clang8. It affects both ELFv1 and ELFv2 ABI environments. PR: 232387 Submitted by: alfredo.junior_eldorado.org.br Reported by: Mark Millard Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D20461	2019-06-25 17:15:44 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
Marko Zec	188adcb7e4	V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h / ip_var.h since at least 2008, so make use of those definitions here. MFC after: 3 days	2019-06-19 08:49:24 +00:00
Marko Zec	6aee0bfa85	Evaluating htons() at compile time is more efficient than doing ntohs() at runtime. This change removes a dependency on a barrel shifter pass before branch resolution, while reducing the instruction stream size by 9 bytes on amd64. MFC after: 3 days	2019-06-19 08:39:19 +00:00
Marius Strobl	d49e83eac3	- Replace unused and only ever written to members of public iflib(9) structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES etc. are also only ever used for these write-only members if at all, so both these macros and members can just go). Using these spares may render it possible to merge certain iflib(9) fixes to stable/12. Otherwise, changes extending struct if_irq or struct if_shared_ctx in any way would break KBI as instances of these are allocated by the driver front-ends (by contrast, struct if_pkt_info as well as struct if_softc_ctx instances are provided by iflib(9) and, thus, may grow at least at the end without breaking KBI). - Make the pvi_name in struct pci_vendor_info const char * as device identifiers in hardware lookup tables aren't to be expected to ever change at runtime. - Similarly, make the pci_vendor_info_t of struct if_shared_ctx which is used to point to the struct pci_vendor_info arrays provided by the driver front-ends const. - Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only consuming the latter macro. - Make the name argument of iflib_io_tqg_attach(9) const, matching the taskqgroup_attach_cpu(9) this function wraps as well as e. g. iflib_config_gtask_init(9). - Remove the orphaned iflib_qset_lock_get() prototype. - Remove some extraneous empty lines.	2019-06-15 11:07:41 +00:00
Mark Johnston	ce7fb386d8	Restore the comment removed in r348745. LAGG_RLOCK() enters an epoch section, so the comment wasn't stale. Reported by: jhb MFC with: r348745	2019-06-06 17:20:35 +00:00
Mark Johnston	9995dfd364	Conditionalize an in_epoch() call on INVARIANTS. Its result is only used to determine whether to perform further INVARIANTS-only checks. Remove a stale comment while here. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2019-06-06 16:22:29 +00:00
Eric Joyner	668d6dbb4c	iflib: provide probe wrapper for vendor drivers From Jake: Vendor drivers that exist out-of-tree generally should return BUS_PROBE_VENDOR from their device probe functions. This helps ensure that a vendor replacement driver will supersede the in-kernel driver for a given device. Currently, if a vendor wants to implement a driver based on iflib, it will always report BUS_PROBE_DEFAULT. Add a wrapper function, iflib_device_probe_vendor() which can be used in place of iflib_device_probe(). This function will just return BUS_PROBE_VENDOR whenever iflib_device_probe() would return BUS_PROBE_DEFAULT. While vendor drivers can already implement such a wrapper themselves, providing it in the iflib.h header makes it easier for the vendor driver to do the right thing. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D20221	2019-05-29 22:24:10 +00:00
Kyle Evans	d8b985430c	if_bridge(4): Complete bpf auditing of local traffic over the bridge There were two remaining "gaps" in auditing local bridge traffic with bpf(4): Locally originated outbound traffic from a member interface is invisible to the bridge's bpf(4) interface. Inbound traffic locally destined to a member interface is invisible to the member's bpf(4) interface -- this traffic has no chance after bridge_input to otherwise pass it over, and it wasn't originally received on this interface. I call these "gaps" because they don't affect conventional bridge setups. Alas, being able to establish an audit trail of all locally destined traffic for setups that can function like this is useful in some scenarios. Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19757	2019-05-29 01:08:30 +00:00
Andrey V. Elsukov	de25327313	Rework r348303 to reduce the time of holding global BPF lock. It appeared that using NET_EPOCH_WAIT() while holding global BPF lock can lead to another panic: spin lock 0xfffff800183c9840 (turnstile lock) held by 0xfffff80018e2c5a0 (tid 100325) too long panic: spin lock held too long ... #0 sched_switch (td=0xfffff80018e2c5a0, newtd=0xfffff8000389e000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2133 #1 0xffffffff80bf9912 in mi_switch (flags=256, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439 #2 0xffffffff80c21db7 in sched_bind (td=<optimized out>, cpu=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2704 #3 0xffffffff80c34c33 in epoch_block_handler_preempt (global=<optimized out>, cr=0xfffffe00005a1a00, arg=<optimized out>) at /usr/src/sys/kern/subr_epoch.c:394 #4 0xffffffff803c741b in epoch_block (global=<optimized out>, cr=<optimized out>, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:416 #5 ck_epoch_synchronize_wait (global=0xfffff8000380cd80, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:465 #6 0xffffffff80c3475e in epoch_wait_preempt (epoch=0xfffff8000380cd80) at /usr/src/sys/kern/subr_epoch.c:513 #7 0xffffffff80ce970b in bpf_detachd_locked (d=0xfffff801d309cc00, detached_ifp=<optimized out>) at /usr/src/sys/net/bpf.c:856 #8 0xffffffff80ced166 in bpf_detachd (d=<optimized out>) at /usr/src/sys/net/bpf.c:836 #9 bpf_dtor (data=0xfffff801d309cc00) at /usr/src/sys/net/bpf.c:914 To fix this add the check to the catchpacket() that BPF descriptor was not detached just before we acquired BPFD_LOCK(). Reported by: slavash Tested by: slavash MFC after: 1 week	2019-05-28 11:45:00 +00:00
Andrey V. Elsukov	44a514745c	Fix possible NULL pointer dereference. bpf_mtap() can invoke catchpacket() for already detached descriptor. And this can lead to NULL pointer dereference, since bd_bif pointer was reset to NULL in bpf_detachd_locked(). To avoid this, use NET_EPOCH_WAIT() when descriptor is removed from interface's descriptors list. After the wait it is safe to modify descriptor's content. Submitted by: kib Reported by: slavash MFC after: 1 week	2019-05-27 12:41:41 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Alexander V. Chernikov	563ab4e400	Fix gateway setup for the interface routes. Currently rinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rinit1() and nd6_prefix_onlink_rtrequest() by filling in proper gateway in the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328	2019-05-22 21:20:15 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Alexander V. Chernikov	2ad7ed6e4a	Fix rt_ifa selection during loopback route insertion process. Currently such routes are added with a link-level IFA, which is plain wrong. Only after the insertion they get fixed by the special link_rtrequest() ifa handler. This behaviour complicates routing code and makes ifa selection more complex. Streamline this process by explicitly moving link_rtrequest() logic to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all this logic in the loopback route case by explicitly specifying proper rt_ifa inside the ifa_maintain_loopback_route().§ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20076	2019-05-19 21:49:56 +00:00
Kyle Evans	db226f0d8e	tuntap: Defer clearing if_softc until after if_detach r346670 added an sx to close a race between the ifioctl handler and interface destruction. Unfortunately, it clears if_softc immediately after the interface is closed, but before if_detach has been invoked. Any time before detachment, an interface that's part of a bridge may still receive traffic that's pushed through tunstart/tunstart_l2 and promptly lead to a panic because if_softc is now NULL. Fix it by deferring the clearing of if_softc until after the interface has detached and thus been removed from the bridge. if_softc still gets cleared in case another thread has already entered the ioctl handler before it's replaced with ifdead_ioctl. Reported by: markj MFC after: 3 days	2019-05-14 20:32:29 +00:00
Andrey V. Elsukov	82d7bf6b1b	Avoid possible recursion on BPF_LOCK() in bpfwrite(). Release BPF_LOCK() before invoking if_output() and if_input(). Also enter epoch section before releasing lock, this should prevent access to ifnet that may be freed on interface detach. Reported by: markj	2019-05-13 20:17:55 +00:00
Andrey V. Elsukov	af1f58df99	Do not leak memory used for binary filter.	2019-05-13 14:07:02 +00:00
Andrey V. Elsukov	699281b545	Rework locking in BPF code to remove rwlock from fast path. On high packets rate the contention on rwlock in bpf_tap() functions can lead to packets dropping. To avoid this, migrate this code to use epoch(9) KPI and ConcurrencyKit's lists. * all lists changed to use CK_LIST; * reference counting added to bpf_if and bpf_d; * now bpf_if references ifnet and releases this reference on destroy; * each bpf_d descriptor references bpf_if when it is attached; * new struct bpf_program_buffer introduced to keep BPF filter programs; * bpf_program_buffer, bpf_d and bpf_if structures are freed by epoch_call(); * bpf_freelist and ifnet_departure event are no longer needed, thus both are removed; Reviewed by: melifaro Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D20224	2019-05-13 13:45:28 +00:00
Kyle Evans	81b3b91e6b	tuntap: Improve style No functional change. tun_flags of the tuntap_driver was renamed to ident_flags to reflect the fact that it's a subset of the tun_flags that identifies a tuntap device. This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off the bits of tun_flags that are applicable to tuntap driver ident. This is a purely cosmetic change.	2019-05-11 04:18:06 +00:00
Eric Joyner	afb7737237	iflib: use default ntxd and nrxd when user value is not power of 2 From Jake: A user may set a sysctl to override the default number of Tx or Rx descriptors. However, certain calculations in the iflib core expect the number of descriptors to be a power of 2. Update _iflib_assert to verify that all of the shared context parameters for the number of descriptors are powers of 2. Modify iflib_reset_qvalues to check that the provided isc_nrxd value is a power of 2. If it's not, print a warning message and then use the default value. An alternative might be to try rounding the number down instead. However, this creates problems in case the rounded down value is below the minimum value that the driver would support. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19880	2019-05-10 00:41:42 +00:00

1 2 3 4 5 ...

4180 Commits