freebsd-nq

Author	SHA1	Message	Date
Luiz Otavio O Souza	c138424148	lagg: don't update link layer addresses on destroy When the lagg is being destroyed it is not necessary update the lladdr of all the lagg members every time we update the primary interface. Reviewed by: scottl Obtained from: pfSense MFC after: 1 week Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D31586	2021-08-19 10:49:32 +02:00
Randall Stewart	1a714ff204	This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357	2021-01-28 11:53:05 -05:00
Hans Petter Selasky	19ecb5e8da	Fix for IPoIB over lagg(4). Need to update both link layer address and broadcast address when active link changes for IP over infiniband. This is because the broadcast address contains the so-called P-key, which is interface dependent. Reviewed by: kib @ Differential Revision: https://reviews.freebsd.org/D27658 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-29 17:35:06 +01:00
Gleb Smirnoff	5ee33a9076	Fixup r368446 with KERN_TLS.	2020-12-08 23:54:09 +00:00
Gleb Smirnoff	e1074ed6a0	The list of ports in configuration path shall be protected by locks, epoch shall be used only for fast path. Thus use LAGG_XLOCK() in lagg_[un]register_vlan. This fixes sleeping in epoch panic. PR: 240609	2020-12-08 16:46:00 +00:00
Gleb Smirnoff	87bf9b9cbe	Convert LAGG_RLOCK() to NET_EPOCH_ENTER(). No functional changes.	2020-12-08 16:36:46 +00:00
Andrew Gallatin	8732245d29	LACP: When suppressing distributing, return ENOBUFS When links come and go, lacp goes into a "suppress distributing" mode where it drops traffic for 3 seconds. When in this mode, lagg/lacp historiclally drops traffic with ENETDOWN. That return value causes TCP to close any connection where it gets that value back from the lower parts of the stack. This means that any TCP connection with active traffic during a 3-second windown when an LACP link comes or goes would get closed. TCP treats return values of ENOBUFS as transient errors, and re-schedules transmission later. So rather than returning ENETDOWN, lets return ENOBUFS instead. This allows TCP connections to be preserved. I've tested this by repeatedly bouncing links on a Netlfix CDN server under a moderate (20Gb/s) load and overved ENOBUFS reported back to the TCP stack (as reported by a RACK TCP sysctl). Reviewed by: jhb, jtl, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27188	2020-11-18 14:55:49 +00:00
Andrey V. Elsukov	2f4ffa9f72	Fix possible NULL pointer dereference. lagg(4) replaces if_output method of its child interfaces and expects that this method can be called only by child interfaces. But it is possible that lagg_port_output() could be called by children of child interfaces. In this case ifnet's if_lagg field is NULL. Add check that lp is not NULL. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2020-11-11 15:53:36 +00:00
John Baldwin	36e0a362ac	Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000	2020-10-29 23:28:39 +00:00
Hans Petter Selasky	a92c4bb62a	Add support for IP over infiniband, IPoIB, to lagg(4). Currently only the failover protocol is supported due to limitations in the IPoIB architecture. Refer to the lagg(4) manual page for how to configure and use this new feature. A new network interface type, IFT_INFINIBANDLAG, has been added, similar to the existing IFT_IEEE8023ADLAG . ifconfig(8) has been updated to accept a new laggtype argument when creating lagg(4) network interfaces. This new argument is used to distinguish between ethernet and infiniband type of lagg(4) network interface. The laggtype argument is optional and defaults to ethernet. The lagg(4) command line syntax is backwards compatible. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-10-22 09:47:12 +00:00
John Baldwin	56fb710f1b	Store the send tag type in the common send tag header. Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with their own identical headers that stored the type that the type-specific tag structures inherited from, so in practice it seems drivers need this in the tag anyway. This permits removing these extra header indirections (struct cxgbe_snd_tag and struct mlx5e_snd_tag). In addition, this permits driver-independent code to query the type of a tag, e.g. to know what type of tag is being queried via if_snd_query. Reviewed by: gallatin, hselasky, np, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26689	2020-10-06 17:58:56 +00:00
Mateusz Guzik	662c13053f	net: clean up empty lines in .c and .h files	2020-09-01 21:19:14 +00:00
Bryan Drewery	3869d41465	lagg: Avoid adding a port to a lagg device being destroyed. The lagg_clone_destroy() handles detach and waiting for ifconfig callers to drain already. This narrows the race for 2 panics that the tests triggered. Both were a consequence of adding a port to the lagg device after it had already detached from all of its ports. The link state task would run after lagg_clone_destroy() free'd the lagg softc. kernel:trap_fatal+0xa4 kernel:trap_pfault+0x61 kernel:trap+0x316 kernel:witness_checkorder+0x6d kernel:_sx_xlock+0x72 if_lagg.ko:lagg_port_state+0x3b kernel:if_down+0x144 kernel:if_detach+0x659 if_tap.ko:tap_destroy+0x46 kernel:if_clone_destroyif+0x1b7 kernel:if_clone_destroy+0x8d kernel:ifioctl+0x29c kernel:kern_ioctl+0x2bd kernel:sys_ioctl+0x16d kernel:amd64_syscall+0x337 kernel:trap_fatal+0xa4 kernel:trap_pfault+0x61 kernel:trap+0x316 kernel:witness_checkorder+0x6d kernel:_sx_xlock+0x72 if_lagg.ko:lagg_port_state+0x3b kernel:do_link_state_change+0x9b kernel:taskqueue_run_locked+0x10b kernel:taskqueue_run+0x49 kernel:ithread_loop+0x19c kernel:fork_exit+0x83 PR: 244168 Reviewed by: markj MFC after: 2 weeks Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D25284	2020-08-13 22:06:27 +00:00
Ravi Pokala	2a73c8f5e1	Decode the "LACP Fast Timeout" LAGG option flag r286700 added the "lacp_fast_timeout" option to `ifconfig', but we forgot to include the new option in the string used to decode the option bits. Add "LACP_FAST_TIMO" to LAGG_OPT_BITS. Also, s/LAGG_OPT_LACP_TIMEOUT/LAGG_OPT_LACP_FAST_TIMO/g , to be clearer that the flag indicates "Fast Timeout" mode. Reported by: Greg Foster <gfoster at panasas dot com> Reviewed by: jpaetzel MFC after: 1 week Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D25239	2020-06-11 22:46:08 +00:00
Andrew Gallatin	bd673b9942	lagg: stop double-counting output errors and counting drops as errors Before this change, lagg double-counted errors from lagg members, and counted every drop by a lagg member as an error. Eg, if lagg sent a packet, and the underlying hardware driver dropped it, a counter would be incremented by both lagg and the underlying driver. This change attempts to fix that by incrementing lagg's counters only for errors that do not come from underlying drivers. Reviewed by: hselasky, jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24331	2020-04-13 23:06:56 +00:00
Andrew Gallatin	98085bae8c	make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refectors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix	2020-03-09 13:44:51 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Alexander Motin	84becee1ac	Update route MTUs for bridge, lagg and vlan interfaces. Those interfaces may implicitly change their MTU on addition of parent interface in addition to normal SIOCSIFMTU ioctl path, where the route MTUs are updated normally. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-01-22 20:36:45 +00:00
Gleb Smirnoff	2a4bd982d0	Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC	2020-01-15 06:05:20 +00:00
Gleb Smirnoff	97168be809	Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC	2020-01-15 05:45:27 +00:00
Mark Johnston	c23df8eafa	lagg: Further cleanup of the rr_limit option. Add an option flag so that arbitrary updates to a lagg's configuration do not clear sc_stride. Preseve compatibility for old ifconfig binaries. Update ifconfig to use the new flag and improve the casting used when parsing the option parameter. Modify the RR transmit function to avoid locklessly reading sc_stride twice. Ensure that sc_stride is always 1 or greater. Reviewed by: hselasky MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23092	2020-01-09 14:58:41 +00:00
Mark Johnston	c104c2990d	lagg: Clean up handling of the rr_limit option. - Don't allow an unprivileged user to set the stride. [1] - Only set the stride under the softc lock. - Rename the internal fields to accurately reflect their use. Keep ro_bkt to avoid changing the user API. - Simplify the implementation. The port index is just sc_seq / stride. - Document rr_limit in ifconfig.8. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> [1] MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22857	2019-12-22 21:56:47 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Randall Stewart	20abea6663	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
Mark Johnston	ce7fb386d8	Restore the comment removed in r348745. LAGG_RLOCK() enters an epoch section, so the comment wasn't stale. Reported by: jhb MFC with: r348745	2019-06-06 17:20:35 +00:00
Mark Johnston	9995dfd364	Conditionalize an in_epoch() call on INVARIANTS. Its result is only used to determine whether to perform further INVARIANTS-only checks. Remove a stale comment while here. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2019-06-06 16:22:29 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Andrew Gallatin	35961dce98	Select lacp egress ports based on NUMA domain This change creates an array of port maps indexed by numa domain for lacp port selection. If we have lacp interfaces in more than one domain, then we select the egress port by indexing into the numa port maps and picking a port on the appropriate numa domain. This is behavior is controlled by the new ifconfig use_numa flag and net.link.lagg.use_numa sysctl/tunable (both modeled after the existing use_flowid), which default to enabled. Reviewed by: bz, hselasky, markj (and scottl, earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20060	2019-05-03 14:43:21 +00:00
John Baldwin	841613dcdc	Use a dedicated malloc type for lagg(4)'s structures. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19719	2019-03-28 21:00:54 +00:00
John Baldwin	2f59b04af1	Remove nested epochs from lagg(4). lagg_bcast_start appeared to have a bug in that was using the last lagg port structure after exiting the epoch that was keeping that structure alive. However, upon further inspection, the epoch was already entered by the caller (lagg_transmit), so the epoch enter/exit in lagg_bcast_start was actually unnecessary. This commit generally removes uses of the net epoch via LAGG_RLOCK to protect the list of ports when the list of ports was already protected by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK. It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while accessing the lagg port structures. An ifp is still accessed via an unsafe reference after the epoch is exited, but that is true in the current code and will be fixed in a future change. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19718	2019-03-28 20:25:36 +00:00
Randall Stewart	fa91f84502	This commit adds the missing release mechanism for the ratelimiting code. The two modules (lagg and vlan) did have allocation routines, and even though they are indirect (and vector down to the underlying interfaces) they both need to have a free routine (that also vectors down to the actual interface). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D19032	2019-02-13 14:57:59 +00:00
John Baldwin	829c56fc08	Don't set IFCAP_TXRTLMT during lagg_clone_create(). lagg_capabilities() will set the capability once interfaces supporting the feature are added to the lagg. Setting it on a lagg without any interfaces is pointless as the if_snd_tag_alloc call will always fail in that case. Reviewed by: hselasky, gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19040	2019-01-31 21:35:37 +00:00
Marcelo Araujo	fbd8c33022	Allow changing lagg(4) MTU. Previously, changing the MTU would require destroying the lagg and creating a new one. Now it is allowed to change the MTU of the lagg interface and the MTU of the ports will be set to match. If any port cannot set the new MTU, all ports are reverted to the original MTU of the lagg. Additionally, when adding ports, the MTU of a port will be automatically set to the MTU of the lagg. As always, the MTU of the lagg is initially determined by the MTU of the first port added. If adding an interface as a port for some reason fails, that interface is reverted to its original MTU. Submitted by: Ryan Moeller <ryan@freqlabs.com> Reviewed by: mav Relnotes: Yes Sponsored by: iXsystems Inc. Differential Revision: https://reviews.freebsd.org/D17576	2018-10-30 09:53:57 +00:00
Jonathan T. Looney	13c6ba6d94	There are three places where we return from a function which entered an epoch section without exiting that epoch section. This is bad for two reasons: the epoch section won't exit, and we will leave the epoch tracker from the stack on the epoch list. Fix the epoch leak by making sure we exit epoch sections before returning. Reviewed by: ae, gallatin, mmacy Approved by: re (gjb, kib) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D17450	2018-10-09 13:26:06 +00:00
Andrew Gallatin	5ccac9f972	lagg: allow lacp to manage the link state Lacp needs to manage the link state itself. Unlike other lagg protocols, the ability of lacp to pass traffic depends not only on the lagg members having link, but also on the lacp protocol converging to a distributing state with the link partner. If we prematurely mark the link as up, then we will send a gratuitous arp (via arp_handle_ifllchange()) before the lacp interface is capable of passing traffic. When this happens, the gratuitous arp is lost, and our link partner may cache a stale mac address (eg, when the base mac address for the lagg bundle changes, due to a BIOS change re-ordering NIC unit numbers) Reviewed by: jtl, hselasky Sponsored by: Netflix	2018-08-13 14:13:25 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	0f8d79d977	CK: update consumers to use CK macros across the board r334189 changed the fields to have names distinct from those in queue.h in order to expose the oversights as compile time errors	2018-05-24 23:21:23 +00:00
Mark Johnston	db5a36bddf	Simplify lagg_input(). No functional change intended. MFC after: 2 weeks	2018-05-22 15:35:38 +00:00
Matt Macy	46d0f824be	net: fix set but not used	2018-05-19 05:27:49 +00:00
Matt Macy	d7c5a620e2	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366	2018-05-18 20:13:34 +00:00
Matt Macy	70398c2f86	epoch(9): Make epochs non-preemptible by default There are risks associated with waiting on a preemptible epoch section. Change the name to make them not be the default and document the issue under CAVEATS. Reported by: markj	2018-05-18 17:29:43 +00:00
Stephen Hurd	99031b8f7d	Replace rmlock with epoch in lagg Use the new epoch based reclamation API. Now the hot paths will not block at all, and the sx lock is used for the softc data. This fixes LORs reported where the rwlock was obtained when the sxlock was held. Submitted by: mmacy Reported by: Harry Schmalzbauer <freebsd@omnilan.de> Reviewed by: sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15355	2018-05-14 20:06:49 +00:00
Steven Hartland	fd3bb7aa46	Disabled the use of flowid for lagg by default Disabled the use of RSS hash from the network card aka flowid for lagg(4) interfaces by default as it's currently incompatible with the lacp and loadbalance protocols. The incompatibility is due to the fact that the flowid isn't know for the first packet of a new outbound stream which can result in the hash calculation method changing and hence a stream being incorrectly split across multiple interfaces during normal operation. This can be re-enabled by setting the following in loader.conf: net.link.lagg.default_use_flowid="1" Discussed with: kmacy Sponsored by: Multiplay	2018-01-04 20:05:47 +00:00
Sean Bruno	51352d9d81	Don't hold the RM lock during lagg_proto_addport() to avoid an LOR. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reviewed by: mav MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11711	2017-07-25 14:41:50 +00:00
Alexander Motin	41cf0d54a2	Call VLAN_CAPABILITIES() when LAGG capabilities change. This makes VLAN on top of LAGG to expose proper capabilities if they are changed after creation. MFC after: 1 week	2017-05-26 22:22:48 +00:00
Alexander Motin	8403ab7919	Improve applying unified capabilities to the lagg ports. Some NICs have some capabilities dependent, so that disabling one require disabling some other (TXCSUM/RXCSUM on em). This code tries to reach the consensus more insistently. PR: 219453 MFC after: 1 week	2017-05-26 20:15:33 +00:00
Alexander Motin	e3d90506c4	Remove some code, dead from the day one.	2017-05-25 23:19:09 +00:00
Alexander Motin	bbfc32a6b5	Relax r317696 locking to not drain taskqueue under the lock. MFC after: 11 days	2017-05-05 16:51:53 +00:00
Alexander Motin	e83177fba8	Fix r317696 build without debug. MFC after: 2 weeks	2017-05-03 02:54:11 +00:00

1 2 3 4

178 Commits