freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	ca1163bd5f	Do not perform DAD on stf(4) interfaces. stf(4) interfaces are not multicast-capable so they can't perform DAD. They also did not set IFF_DRV_RUNNING when an address was assigned, so the logic in nd6_timer() would periodically flag such an address as tentative, resulting in interface flapping. Fix the problem by setting IFF_DRV_RUNNING when an address is assigned, and do some related cleanup: - In in6if_do_dad(), remove a redundant check for !UP \|\| !RUNNING. There is only one caller in the tree, and it only looks at whether the return value is non-zero. - Have in6if_do_dad() return false if the interface is not multicast-capable. - Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface and the interface goes UP as a result. Note that this is not sufficient to fix the problem because the new address is marked as tentative and DAD is started before in6_ifattach() is called. However, setting no_dad is formally correct. - Change nd6_timer() to not flag addresses as tentative if no_dad is set. This is based on a patch from Viktor Dukhovni. Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org> Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19751	2019-03-30 18:00:44 +00:00
John Baldwin	841613dcdc	Use a dedicated malloc type for lagg(4)'s structures. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19719	2019-03-28 21:00:54 +00:00
Eric Joyner	225eae1bb7	iflib: return ENETDOWN when the network device is down From Jake: iflib_if_transmit returns ENOBUFS when the device is down, or when the link isn't active. This was changed in r308792 from return (0), so that the function correctly reports an error that it was unable to transmit. However, using ENOBUFS can cause some network applications to produce the following or similar errors: "ping: sendto: No buffer space available" This is a bit confusing as the real cause of the issue is that the network device is down. Replace the ENOBUFS return with ENETDOWN to indicate more clearly that the reason for the failure to send is due to the network device is offline. This will cause the error message to be reported as "ping: sendto: Network is down" Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, sbruno@, bz@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19652	2019-03-28 20:46:45 +00:00
Eric Joyner	aac9c817af	iflib: hold the CTX lock in iflib_pseudo_register From Jake: The iflib_device_register function takes the CTX lock before calling IFDI_ATTACH_PRE, and releases it upon finishing the registration. Mirror this process in iflib_pseudo_register, so that we always hold the CTX lock during the attach process when registering a pseudo interface or a regular interface. This was caught by code inspection while attempting to analyze where the CTX lock was held. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19604	2019-03-28 20:43:47 +00:00
John Baldwin	2f59b04af1	Remove nested epochs from lagg(4). lagg_bcast_start appeared to have a bug in that was using the last lagg port structure after exiting the epoch that was keeping that structure alive. However, upon further inspection, the epoch was already entered by the caller (lagg_transmit), so the epoch enter/exit in lagg_bcast_start was actually unnecessary. This commit generally removes uses of the net epoch via LAGG_RLOCK to protect the list of ports when the list of ports was already protected by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK. It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while accessing the lagg port structures. An ifp is still accessed via an unsafe reference after the epoch is exited, but that is true in the current code and will be fixed in a future change. Reviewed by: gallatin MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19718	2019-03-28 20:25:36 +00:00
Kyle Evans	93c9d31918	if_bridge(4): ensure all traffic passing over the bridge is accounted for Consider a bridge0 with em0 and em1 members. Traffic rx'd by em0 and transmitted by bridge0 through em1 gets accounted for in IPACKETS/IBYTES and bridge0 bpf -- assuming it's not unicast traffic destined for em1. Unicast traffic destined for em1 traffic is not accounted for by any mechanism, and isn't pushed through bridge0's bpf machinery as any other packets that pass over the bridge do. Fix this and simplify GRAB_OUR_PACKETS by bailing out early if it was rx'd by the interface that it was addressed for. Everything else there is relevant for any traffic that came in from one member that's being directed at another member of the bridge. Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19614	2019-03-28 03:31:51 +00:00
Eric Joyner	10a1e981d4	iflib: mark isc_driver_version as constant From Jake: The iflib core never modifies the isc_driver_version string. Allow drivers to safely assign pointers to constant buffers by marking this parameter const. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, jhb@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19577	2019-03-19 23:44:26 +00:00
Eric Joyner	1b9d93948a	iflib: expose the Rx mbuf buffer size to drivers From Jake: iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based on the isc_max_frame_size value that drivers setup. This calculation is repeated by drivers when programming their hardware with the size of each Rx buffer. This can lead to a mismatch where the iflib mbuf size is different from the expected size of the buffer as programmed by the hardware. This can lead to unexpected results. If iflib ever wants to support mbuf sizes larger than one page, every driver must be updated to account for the new possible buffer sizes. Fix this by calculating the mbuf size prior to calling IFDI_INIT, and adding the iflib_get_rx_mbuf_sz function which will expose this value to drivers, so that they do not repeat the same calculation. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19489	2019-03-19 17:59:56 +00:00
Eric Joyner	3e8d1bae5f	iflib: prevent possible infinite loop in iflib_encap From Jake: iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an m_collapse and an m_defrag are attempted to shrink the mbuf cluster to fit within the DMA segment limitations. However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns EFBIG on the now defragmented mbuf, we will continuously re-call bus_dmamap_load_mbuf_sg over and over. This happens because m_head isn't NULL, and remap is >1, so we don't try to m_collapse or m_defrag again. The only way we exit the loop is if m_head is NULL. However, m_head can't be modified by the call to bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer. I believe this will be an incredibly rare occurrence, because it is unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second defragment with an EFBIG error. However, it still seems like a possibility that we should account for. Fix the exit check to ensure that if remap is >1, we will also exit, even if m_head is not NULL. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19468	2019-03-19 17:49:03 +00:00
Andrey V. Elsukov	c5be49da01	Convert allocation of bpf_if in bpfattach2 from M_NOWAIT to M_WAITOK and remove possible panic condition. It is already allowed to sleep in bpfattach[2], since BPF_LOCK was converted to SX lock in r332388. Also move KASSERT() to the top of function and make full initialization before bpf_if will be linked to BPF's list of interfaces. MFC after: 2 weeks	2019-03-19 10:29:32 +00:00
Vincenzo Maffione	d12354a56c	netmap: add support for multiple host rings Some applications forward from/to host rings most or all the traffic received or sent on a physical interface. In this cases it is desirable to have more than a pair of RX/TX host rings, and use multiple threads to speed up forwarding. This change adds support for multiple host rings. On registering a netmap port, the user can specify the number of desired receive and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings fields of the nmreq_register structure. MFC after: 2 weeks	2019-03-18 12:22:23 +00:00
Kyle Evans	4920f9a348	if_bridge(4): Drop pointless rtflush At this point, all routes should've already been dropped by removing all members from the bridge. This condition is in-fact KASSERT'd in the line immediately above where this nop flush was added.	2019-03-15 17:19:36 +00:00
Kyle Evans	6e6b93fe1d	Revert r345192: Too many trees in play for bridge(4) bits An accidental appendage was committed that has not undergone review yet.	2019-03-15 17:18:19 +00:00
Kyle Evans	4b4b284d95	if_bridge(4): Drop pointless rtflush At this point, all routes should've already been dropped by removing all members from the bridge. This condition is in-fact KASSERT'd in the line immediately above where this nop flush was added.	2019-03-15 17:13:05 +00:00
Kristof Provost	43d3127ca7	bridge: Fix STP-related panic After r345180 we need to have the appropriate vnet context set to delete an rtnode in bridge_rtnode_destroy(). That's usually the case, but not when it's called by the STP code (through bstp_notify_rtage()). We have to set the vnet context in bridge_rtable_expire() just as we do in the other STP callback bridge_state_change(). Reviewed by: kevans	2019-03-15 15:52:36 +00:00
Kyle Evans	a87407ff85	if_bridge(4): Fix module teardown bridge_rtnode_zone still has outstanding allocations at the time of destruction in the current model because all of the interface teardown happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been processed. The SYSUNINIT triggers destruction of the interfaces, which then attempts to free the memory from the zone that's already been destroyed, and we hit a panic. Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix the ordering. bridge_rtable_fini should also take care to flush any remaining routes that weren't taken care of when dynamic routes were flushed in bridge_stop. Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D19578	2019-03-15 13:19:52 +00:00
Kristof Provost	d6747eafa9	bridge: Fix panic if the STP root is removed If the spanning tree root interface is removed from the bridge we panic on the next 'ifconfig'. While the STP code is notified whenever a bridge member interface is removed from the bridge it does not clear the bs_root_port. This means bs_root_port can still point at an bridge_iflist which has been free()d. The next access to it will panic. Explicitly check if the interface we're removing in bstp_destroy() is the root, and if so re-assign the roles, which clears bs_root_port. Reviewed by: philip MFC after: 2 weeks	2019-03-15 11:21:20 +00:00
Kristof Provost	5904868691	pf :Use counter(9) in pf tables. The counters of pf tables are updated outside the rule lock. That means state updates might overwrite each other. Furthermore allocation and freeing of counters happens outside the lock as well. Use counter(9) for the counters, and always allocate the counter table element, so that the race condition cannot happen any more. PR: 230619 Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net> Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19558	2019-03-15 11:08:44 +00:00
Kyle Evans	521b05ea52	ether_fakeaddr: Use 'b' 's' 'd' for the prefix This has the advantage of being obvious to sniff out the designated prefix by eye and it has all the right bits set. Comment stolen from ffec. I've removed bryanv@'s pending question of using the FreeBSD OUI range -- no one has followed up on this with a definitive action, and there's no particular reason to shoot for it and the administrative overhead that comes with deciding exactly how to use it.	2019-03-14 19:48:43 +00:00
Kyle Evans	6b7e0c1cca	ether: centralize fake hwaddr generation We currently have two places with identical fake hwaddr generation -- if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other interfaces that may also need a fake addr. Reviewed by: bryanv, kp, philip Differential Revision: https://reviews.freebsd.org/D19573	2019-03-14 17:18:00 +00:00
Gleb Smirnoff	c93410229c	Most Ethernet drivers that potentially can run a pfil(9) hook with PFIL_MEMPTR flag are intentionally providing a memory address that isn't aligned to pointer alignment. This is done to align an IPv4 or IPv6 header that is expected to follow Ethernet header. When we return PFIL_REALLOCED we store a pointer to allocated mbuf at this address. With this change the KPI changes to store the pointer at aligned address, which usually yields in +2 bytes. Provide two inlines: pfil_packet_align() to get aligned pfil_packet_t for a misaligned one pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t Provide function pfil_realloc(), not used yet, that would convert a memory pfil_packet_t to an mbuf one. Reported by: hps Reviewed by: hps, gallatin	2019-03-10 17:20:09 +00:00
Gleb Smirnoff	b9fdb4b3a3	Properly handle a case when a first filter returns PFIL_REALLOCED, then second one returns PFIL_PASS.	2019-03-10 17:08:05 +00:00
Bjoern A. Zeeb	b25d74e06c	Improve ARP logging. r344504 added an extra ARP_LOG() call in case of an if_output() failure. It turns out IPv4 can be noisy. In order to not spam the console by default: (a) add a counter for these events so people can keep better track of how often it happens, and (b) add a sysctl to select the default ARP_LOG log level and set it to INFO avoiding the one (the new) DEBUG level by default. Claim a spare (1st one after 10 years since the stats were added) in order to not break netstat from FreeBSD 12->13 updates in the future. Reviewed by: karels Differential Revision: https://reviews.freebsd.org/D19490	2019-03-09 01:12:59 +00:00
Bjoern A. Zeeb	21231a7aa6	Update for IETF draft-ietf-6man-ipv6only-flag. All changes are hidden behind the EXPERIMENTAL option and are not compiled in by default. Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode manually without router advertisement options. This will allow developers to test software for the appropriate behaviour even on dual-stack networks or IPv6-Only networks without the option being set in RA messages. Update ifconfig to allow setting and displaying the flag. Update the checks for the filters to check for either the automatic or the manual flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add an input filter similar to the output filter. Add a check, when receiving the IPv6-Only RA flag to see if the receiving interface has any IPv4 configured. If it does, ignore the IPv6-Only flag. Add a per-VNET global sysctl, which is on by default, to not process the automatic RA IPv6-Only flag. This way an administrator (if this is compiled in) has control over the behaviour in case the node still relies on IPv4.	2019-03-06 23:31:42 +00:00
Eric Joyner	bc408c7d61	Remove references to CONTIGMALLOC_WORKS in iflib and em From Jake: "The iflib_fl_setup() function tries to pick various buffer sizes based on the max_frame_size value defined by the parent driver. However, this code was wrapped under CONTIGMALLOC_WORKS, which was never actually defined anywhere. This same code pattern was used in if_em.c, likely trying to match what iflib uses. Since CONTIGMALLOC_WORKS is not defined, remove this dead code from iflib_fl_setup and if_em.c Given that various iflib drivers appear to be using a similar calculation, it might be worth making this buffer size a value that the driver can peek at in the future." Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19199	2019-03-05 19:12:51 +00:00
Kristof Provost	5ea5849a7b	tun: VIMAGE fix for if_tun cloner The if_tun cloner is not virtualised, but if_clone_attach() does use a virtualised list of cloners. The result is that we can't find the if_tun cloner when we try to remove a renamed tun interface. Virtualise the cloner, and move the final cleanup into a sysuninit so that we're sure this happens after all of the vnet_sysuninits Note that we need unit numbers to be system-unique (rather than unique per vnet, as is done by if_clone_simple()). The unit number is used to create the corresponding /dev/tunX device node, and this node must match with the interface. Switch to if_clone_advanced() so that we have control over the unit numbers. Reproduction scenario: jail -c -n foo persist vnet jexec test ifconfig tun create jexec test ifconfig tun0 name wg0 jexec test ifconfig wg0 destroy PR: 235704 Reviewed by: bz, hrs, hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19248	2019-03-05 13:21:07 +00:00
Alexander Motin	c3c93809f6	bridge: Fix spurious warnings about capabilities Mask off the bits we don't care about when checking that capabilities of the member interfaces have been disabled as intended. Submitted by: Ryan Moeller <ryan@ixsystems.com> Reviewed by: kristof, mav MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D18924	2019-03-04 22:01:09 +00:00
Stephen Hurd	ca62461bc6	iflib: Improve return values of interrupt handlers. iflib was returning FILTER_HANDLED, in cases where FILTER_STRAY was more correct. This potentially caused issues with shared legacy interrupts. Driver filters returning FILTER_STRAY are now properly handled. Submitted by: Augustin Cavalier <waddlesplash@gmail.com> Reviewed by: marius, gallatin Obtained from: Haiku (a84bb9, 4947d1) MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D19201	2019-02-15 18:51:43 +00:00
Randall Stewart	fa91f84502	This commit adds the missing release mechanism for the ratelimiting code. The two modules (lagg and vlan) did have allocation routines, and even though they are indirect (and vector down to the underlying interfaces) they both need to have a free routine (that also vectors down to the actual interface). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D19032	2019-02-13 14:57:59 +00:00
Marius Strobl	a6611c938b	Fix the build with ALTQ after r344060.	2019-02-12 22:33:17 +00:00
Marius Strobl	f855ec814d	Make taskqgroup_attach{,_cpu}(9) work across architectures So far, intr_{g,s}etaffinity(9) take a single int for identifying a device interrupt. This approach doesn't work on all architectures supported, as a single int isn't sufficient to globally specify a device interrupt. In particular, with multiple interrupt controllers in one system as found on e. g. arm and arm64 machines, an interrupt number as returned by rman_get_start(9) may be only unique relative to the bus and, thus, interrupt controller, a certain device hangs off from. In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}() not work across architectures. Yet in turn, iflib(4) as gtaskqueue consumer so far doesn't fit architectures where interrupt numbers aren't globally unique. However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as employed by the gtaskqueue implementation to bind an interrupt to a particular CPU, using bus_bind_intr(9) instead is equivalent from a functional point of view, with bus_bind_intr(9) taking the device and interrupt resource arguments required for uniquely specifying a device interrupt. Thus, change the gtaskqueue implementation to employ bus_bind_intr(9) instead and intr_{g,s}etaffinity(9) to take the device and interrupt resource arguments required respectively. This change also moves struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL as userland likes to include <sys/_task.h> or indirectly drags it in - for better or worse also with _KERNEL defined -, which with device_t and struct resource dependencies otherwise is no longer as easily possible now. The userland inclusion problem probably can be improved a bit by introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the existing _WANT_PRISON etc., which is orthogonal to this change, though, and likely needs an exp-run. While at it: - Change the gt_cpu member in the grouptask structure to be of type int as used elswhere for specifying CPUs (an int16_t may be too narrow sooner or later), - move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to the gtaskqueue implementation as it's only used and needed there, - change the GTASK_INIT macro to use "gtask" rather than "task" as argument given that it actually operates on a struct gtask rather than a struct task, and - let subr_gtaskqueue.c consistently use __func__ to print functions names. Reported by: mmel Reviewed by: mmel Differential Revision: https://reviews.freebsd.org/D19139	2019-02-12 21:23:59 +00:00
Marius Strobl	95dcf343b7	Further correct and optimize the bus_dma(9) usage of iflib(4): o Correct the obvious bugs in the netmap(4) parts: - No longer check for the existence of DMA maps as bus_dma(9) is used unconditionally in iflib(4) since r341095. - Supply the correct DMA tag and map pairs to bus_dma(9) functions (see also the commit message of r343753). - In iflib_netmap_timer_adjust(), add synchronization of the TX descriptors before calling the ift_txd_credits_update method as the latter evaluates the TX descriptors possibly updated by the MAC. - In _task_fn_tx(), wrap the netmap(4)-specific bits in #ifdef DEV_NETMAP just as done in _task_fn_admin() and _task_fn_rx() respectively. o In iflib_fast_intr_rxtx(), synchronize the TX rather than the RX descriptors before calling the ift_txd_credits_update method (see also above). o There's no need to synchronize an RX buffer that is going to be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to do that as late as passing RX buffers to the MAC via the ift_rxd_refill method. Hence, combine that synchronization with the synchronization of new buffers into a common spot in _iflib_fl_refill(). o There's no need to synchronize the RX descriptors of a free list in preparation of the MAC updating their statuses with every invocation of rxd_frag_to_sd(); it's enough to do this once before handing control over to the MAC, i. e. before calling ift_rxd_flush method in _iflib_fl_refill(), which already performs the necessary synchronization. o Given that the ift_rxd_available method evaluates the RX descriptors which possibly have been altered by the MAC, synchronize as appropriate beforehand. Most notably this is now done in iflib_rxd_avail(), which in turn means that we don't need to issue the same synchronization yet again before calling the ift_rxd_pkt_get method in iflib_rxeof(). o In iflib_txd_db_check(), synchronize the TX descriptors before handing them over to the MAC for transmission via the ift_txd_flush method. o In iflib_encap(), move the TX buffer synchronization after the invocation of the ift_txd_encap() method. If the MAC driver fails to encapsulate the packet and we retry with a defragmented mbuf chain or finally fail, the cycles for TX buffer synchronization have been wasted. Synchronizing afterwards matches what non-iflib(4) drivers typically do and is sufficient as the MAC will not actually start with the transmission before - in this case - the ift_txd_flush method is called. Moreover, for the latter reason the synchronization of the TX descriptors in iflib_encap() can go as it's enough to synchronize them before passing control over to the MAC by issuing the ift_txd_flush() method (see above). o In iflib_txq_can_drain(), only synchronize TX descriptors if the ift_txd_credits_update method accessing these is actually called. Differential Revision: https://reviews.freebsd.org/D19081	2019-02-12 21:08:44 +00:00
Patrick Kelsey	8f2ac65690	Reduce the time it takes the kernel to install a new PF config containing a large number of queues In general, the time savings come from separating the active and inactive queues lists into separate interface and non-interface queue lists, and changing the rule and queue tag management from list-based to hash-bashed. In HFSC, a linear scan of the class table during each queue destroy was also eliminated. There are now two new tunables to control the hash size used for each tag set (default for each is 128): net.pf.queue_tag_hashsize net.pf.rule_tag_hashsize Reviewed by: kp MFC after: 1 week Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D19131	2019-02-11 05:17:31 +00:00
Marius Strobl	bfce461ee9	o As illustrated by e. g. figure 7-14 of the Intel 82599 10 GbE controller datasheet revision 3.3, in the context of Ethernet MACs the control data describing the packet buffers typically are named "descriptors". Each of these descriptors references one buffer, multiple of which a packet can be composed of. By contrast, in comments, messages and the names of structure members, iflib(4) refers to DMA resources employed for RX and TX buffers (rather than control data) as "desc(riptors)". This odd naming convention of iflib(4) made reviewing r343085 and identifying wrong and missing bus_dmamap_sync(9) calls in particular way harder than it already is. This convention may also explain why the netmap(4) part of iflib(4) pairs the DMA tags for control data with DMA maps of buffers and vice versa in calls to bus_dma(9) functions. Therefore, change iflib(4) to refer to buf(fers) when buffers and not the usual understanding of descriptors is meant. This change does not include corrections to the DMA resources used in the netmap(4) parts. However, it revises error messages to state which kind of allocation/creation failed. Specifically, the "Unable to allocate tx_buffer (map) memory" copy & pasted inappropriately on several occasions was replaced with proper messages. o Enhance some other error messages to indicate which half - RX or TX - they apply to instead of using identical text in both cases and generally canonicalize them. o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect reality; current code doesn't use {r,t}x_buffer structures. o In iflib_queues_alloc(): - Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls, - change the M_WAITOK from malloc(9) calls into M_NOWAIT. The return values are already checked, deferred DMA allocations not being an option at this point, BUS_DMA_NOWAIT has to be used anyway and prior malloc(9) calls in this function also specify M_NOWAIT. Reviewed by: shurd Differential Revision: https://reviews.freebsd.org/D19067	2019-02-04 20:46:57 +00:00
Gleb Smirnoff	3ca1c423aa	Teach pfil_ioctl() about VIMAGE. Submitted by: gallatin	2019-02-03 08:28:02 +00:00
Vincenzo Maffione	5faab77822	netmap: upgrade sync-kloop support Add SYNC_KLOOP_MODE option, and add support for direct mode, where application executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback. MFC after: 5 days	2019-02-02 22:39:29 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI have been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as kernel uses same command structures that userland ioctl uses. In nutshell [KA]PI is about declaring filtering points, declaring filters and linking and unlinking them together. New [KA]PI makes it possible to reconfigure pfil(9) configuration: change order of hooks, rehook filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to existing list of filters. Now it possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of existing packet filters yet support that, however limited usage is already possible, e.g. default ruleset can be moved to single interface, as soon as interface would pride their filtering points. Another future feature is possiblity to create pfil heads, that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
John Baldwin	829c56fc08	Don't set IFCAP_TXRTLMT during lagg_clone_create(). lagg_capabilities() will set the capability once interfaces supporting the feature are added to the lagg. Setting it on a lagg without any interfaces is pointless as the if_snd_tag_alloc call will always fail in that case. Reviewed by: hselasky, gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19040	2019-01-31 21:35:37 +00:00
Gleb Smirnoff	f712b16127	Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock. The pfil(9) system is about to be converted to epoch(9) synchronization, so we need [temporarily] go back with ipfw internal locking. Discussed with: ae	2019-01-31 21:04:50 +00:00
Marius Strobl	b97de13ae0	- Stop iflib(4) from leaking MSI messages on detachment by calling bus_teardown_intr(9) before pci_release_msi(9). - Ensure that iflib(4) and associated drivers pass correct RIDs to bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9) on the corresponding resources instead of using the RIDs initially passed to bus_alloc_resource_any(9) as the latter function may change those RIDs. Solely em(4) for the ioport resource (but not others) and bnxt(4) were using the correct RIDs by caching the ones returned by bus_alloc_resource_any(9). - Change the logic of iflib_msix_init() around to only map the MSI-X BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns > 0. Otherwise the "Unable to map MSIX table " message triggers for devices that simply don't support MSI-X and the user may think that something is wrong while in fact everything works as expected. - Put some (mostly redundant) debug messages emitted by iflib(4) and em(4) during attachment under bootverbose. The non-verbose output of em(4) seen during attachment now is close to the one prior to the conversion to iflib(4). - Replace various variants of spelling "MSI-X" (several in messages) with "MSI-X" as used in the PCI specifications. - Remove some trailing whitespace from messages emitted by iflib(4) and change them to consistently start with uppercase. - Remove some obsolete comments about releasing interrupts from drivers and correct a few others. Reviewed by: erj, Jacob Keller, shurd Differential Revision: https://reviews.freebsd.org/D18980	2019-01-30 13:21:26 +00:00
Marius Strobl	3db348b54a	- In _iflib_fl_refill(), don't mark an RX buffer as available in the corresponding bitmap before adding an mbuf has actually succeeded. Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the RX ring but not in its bitmap. One implication of such a hole was that in a subsequent call to _iflib_fl_refill() with the RX buffer accounting still indicating another reclaimable buffer, bit_ffc(3) nevertheless returned -1 in frag_idx which in turn caused havoc when used as an index. Thus, additionally assert that frag_idx is 0 or greater. Another possible consequence of a hole in the RX ring was a NULL- dereference when trying to use the unallocated mbuf, for example in iflib_rxd_pkt_get(). While at it, make the variable declarations in _iflib_fl_refill() conform to style(9) and remove redundant checks already performed by bit_ffc{,_at}(3). - In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3). Reported and tested by: pho	2019-01-26 21:35:51 +00:00
Andrew Gallatin	77102fd6a2	Fix an iflib driver unload panic introduced in r343085 The new loop to sync and unload descriptors was indexed by "i", rather than "j". The panic was caused by "i" being advanced rather than "j", and eventually becoming out of bounds. Reviewed by: kib MFC after: 3 days Sponsored by: Netflix	2019-01-25 15:02:18 +00:00
Vincenzo Maffione	f79ba6d75b	netmap: improvements to the netmap kloop (CSB mode) Changelist: - Add the proper memory barriers in the kloop ring processing functions. - Fix memory barriers usage in the user helpers (nm_sync_kloop_appl_write, nm_sync_kloop_appl_read). - Fix nm_kr_txempty() helper to look at rhead rather than rcur. This is important since the kloop can read a value of rcur which is ahead of the value of rhead (see explanation in nm_sync_kloop_appl_write) - Remove obsolete ptnetmap_guest_write_kring_csb() and ptnet_guest_read_kring_csb(), and update if_ptnet(4) to use those. - Prepare in advance the arguments for netmap_sync_kloop_[tr]x_ring(), to make the kloop faster. - Provide kernel and user implementation for nm_ldld_barrier() and nm_ldst_barrier() MFC after: 2 weeks	2019-01-23 14:51:36 +00:00
Brooks Davis	bc6f170e30	Rework CASE_IOC_IFGROUPREQ() to require a case before the macro. This is more compatible with formatting tools and looks more normal. Reported by: jhb (on a different review) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18442	2019-01-22 17:39:26 +00:00
Patrick Kelsey	8f82136aec	onvert vmx(4) to being an iflib driver. Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add iflib_dma_alloc_align() to the iflib API. Performance is generally better with the tunable/sysctl dev.vmx.<index>.iflib.tx_abdicate=1. Reviewed by: shurd MFC after: 1 week Relnotes: yes Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D18761	2019-01-22 01:11:17 +00:00
Patrick Kelsey	7f3eb9dab3	Fix various resource leaks that can occur in the error paths of iflib_device_register() and iflib_pseudo_register(). Reviewed by: shurd MFC after: 1 week Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D18760	2019-01-22 00:56:44 +00:00
Konstantin Belousov	8a04b53dce	Improve iflib busdma(9) KPI use. - Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since callbacks are not supposed to be used. - Match tso/non-tso tags to corresponding tx map operations. Create separate tso maps for tx descriptors. In particular, do not use non-tso tag to load, unload, or destroy a map created with tso tag. - Add missed bus_dmamap_sync() calls. Submitted by: marius. Reported and tested by: pho Reviewed by: marius Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-01-16 05:44:14 +00:00
Gleb Smirnoff	cc31f9821c	Remove recursive NET_EPOCH_ENTER() from sysctl_ifmalist(), missed in r342872.	2019-01-11 00:45:22 +00:00
Gleb Smirnoff	7b7f772fa0	Bring the comment up to date.	2019-01-10 00:37:14 +00:00
Mark Johnston	72755d285f	Stop setting if_linkmib in vlan(4) ifnets. There are several reasons: - The structure being exported via IFDATA_LINKSPECIFIC doesn't appear to be a standard MIB. - The structure being exported is private to the kernel and always has been. - No other drivers in common use set the if_linkmib field. - Because IFDATA_LINKSPECIFIC can be used to overwrite the linkmib structure, a privileged user could use it to corrupt internal vlan(4) state. [1] PR: 219472 Reported by: CTurt <ecturt@gmail.com> [1] Reviewed by: kp (previous version) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18779	2019-01-09 16:47:16 +00:00
Gleb Smirnoff	a68cc38879	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros a quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin	2019-01-09 01:11:19 +00:00
Gleb Smirnoff	21ee61a6cf	Remove part of comment that doesn't match reality.	2019-01-09 00:38:16 +00:00
Stephen Hurd	cd28ea929a	Use iflib_if_init_locked() during resume instead of iflib_init_locked(). iflib_init_locked() assumes that iflib_stop() has been called, however, it is not called for suspend. iflib_if_init_locked() calls stop then init, so fixes the problem. This was causing errors after a resume from suspend. PR: 224059 Reported by: zeising MFC after: 1 week Sponsored by: Limelight Networks	2019-01-07 23:46:54 +00:00
Matt Macy	5881181d8c	mp_ring: avoid items offset difference between iflib and mp_ring on architectures without 64-bit atomics Reported by: Augustin Cavalier <waddlesplash@gmail.com>	2019-01-03 23:06:05 +00:00
Konstantin Belousov	85f3b801e9	Fix typo, use boolean operator instead of bit-wise. Reviewed by: marius, shurd MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-01-03 01:01:03 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with cocinnelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Stephen Hurd	7124b5ba04	Fix !tx_abdicate error from r336560 r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is not set (the default case). However, it appears that rather than the drainage check being made conditional on tx_abdicate being set, it was duplicated so it occured twice if tx_abdicate was set and once if it was not. Now when !tx_abdicate, drainage is only checked if the doorbell isn't pending. Reported by: lev MFC after: 1 week Sponsored by: Limelight Networks	2018-12-11 17:46:01 +00:00
Vincenzo Maffione	1d9ec3ee50	netmap.h: include stdatomic.h The stdatomic.h header exports atomic_thread_fence(), that can be used to implement the nm_stst_barrier() macro needed by netmap. MFC after: 3 days	2018-12-05 15:38:52 +00:00
Vincenzo Maffione	b6e66be22b	netmap: align codebase to the current upstream (760279cfb2730a585) Changelist: - Replace netmap passthrough host support with a more general mechanism to call TXSYNC/RXSYNC from an in-kernel event-loop. No kernel threads are used to use this feature: the application is required to spawn a thread (or a process) and issue a SYNC_KLOOP_START (NIOCCTRL) command in the thread body. The kernel loop is executed by the ioctl implementation, which returns to userspace only when a different thread calls SYNC_KLOOP_STOP or the netmap file descriptor is closed. - Update the if_ptnet driver to cope with the new data structures, and prune all the obsolete ptnetmap code. - Add support for "null" netmap ports, useful to allocate netmap_if, netmap_ring and netmap buffers to be used by specialized applications (e.g. hypervisors). TXSYNC/RXSYNC on these ports have no effect. - Various fixes and code refactoring. Sponsored by: Sunny Valley Networks Differential Revision: https://reviews.freebsd.org/D18015	2018-12-05 11:57:16 +00:00
Eric van Gyzen	9dae3b521e	altq: manual cleanup after r341507 Remove a file that became practically empty. Fix indentation. Like r341507, I do not plan to MFC, but anyone else can.	2018-12-04 23:53:42 +00:00
Eric van Gyzen	325fab802e	altq: remove ALTQ3_COMPAT code This code has apparently never compiled on FreeBSD since its introduction in 2004 (r130365). It has certainly not compiled since 2006, when r164033 added #elsif [sic] preprocessor directives. The code was left in the tree to reduce the diff from upstream (KAME). Since that upstream is no longer relevant, remove the long-dead code. This commit is the direct result of: unifdef -m -UALTQ3_COMPAT sys/net/altq/* A later commit will do some manual cleanup. I do not plan to MFC this. If that would help you, go for it.	2018-12-04 23:46:43 +00:00
Andrey V. Elsukov	851073551d	Adapt the fix in r341008 to correctly work with EBR. IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it doesn check that concurrent code is running in epoch section. But while we are in epoch section, we should be able to do access to ifnet's fields, even it was unlinked. Thus do not change if_addr and if_hw_addr fields in ifnet_detach_internal() to NULL, since rtsock code can do access to these fields and this is allowed while it is running in epoch section. This should fix the race, when ifnet_detach_internal() unlinks ifnet after we checked it for IFF_DYING in sysctl_dumpentry. Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the NULL check for ifp->if_description, since free(9) can correctly handle NULL pointer. MFC after: 1 week	2018-11-30 10:36:14 +00:00
Andrew Gallatin	fbec776de0	Use busdma unconditionally in iflib - Remove the complex mechanism to choose between using busdma and raw pmap_kextract at runtime. The reduced complexity makes the code easier to read and maintain. - Fix a bug in the small packet receive path where clusters were repeatedly mapped but never unmapped. We now store the cluster's bus address and avoid re-mapping the cluster each time a small packet is received. This patch fixes bugs I've seen where ixl(4) will not even respond to ping without seeing DMAR faults. I see a small improvement (14%) on packet forwarding tests using a Haswell based Xeon E5-2697 v3. Olivier sees a small regression (-3% to -6%) with lower end hardware. Reviewed by: mmacy Not objected to by: sbruno MFC after: 8 weeks Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D17901	2018-11-27 20:01:05 +00:00
Andrey V. Elsukov	a716ad4a35	Fix possible panic during ifnet detach in rtsock. The panic can happen, when some application does dump of routing table using sysctl interface. To prevent this, set IFF_DYING flag in if_detach_internal() function, when ifnet under lock is removed from the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent ifnet detach during routes enumeration. In case, if some interface was detached in the time before we take the lock, add the check, that ifnet is not DYING. This prevents access to memory that could be freed after ifnet is unlinked. PR: 227720, 230498, 233306 Reviewed by: bz, eugen MFC after: 1 week Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D18338	2018-11-27 09:04:06 +00:00
Mark Johnston	d25f8522be	Plug routing sysctl leaks. Various structures exported by sysctl_rtsock() contain padding fields which were not being zeroed. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: ae MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18333	2018-11-26 13:42:18 +00:00
Oleg Bulyzhin	cac302483e	Unbreak kernel build with VLAN_ARRAY defined. MFC after: 1 week	2018-11-21 13:34:21 +00:00
Andrey V. Elsukov	ad43bf348b	Allow configuration of several ipsec interfaces with the same tunnel endpoints. This can be used to configure several IPsec tunnels between two hosts with different security associations. Obtained from: Yandex LLC MFC after: 2 weeks Sponsored by: Yandex LLC	2018-11-16 14:21:57 +00:00
Mike Karels	456e896d6d	Fix flags collision causing inability to enable CBQ in ALTQ The CBQ BORROW flag conflicts with the RMCF_CODEL flag; the two sets of definitions actually define the same things. The symptom is that a kernel with CBQ support and not CODEL fails to load a QoS policy with the obscure error "pfctl: DIOCADDALTQ: Cannot allocate memory." If ALTQ_DEBUG is enabled, the error becomes a little clearer: "rmc_newclass: CODEL not configured for CBQ!" is printed by the kernel. There really shouldn't be two sets of macros that have to be defined consistently, but the include structure isn't right for exporting CBQ flags to altq_rmclass.h. Re-align the definitions, and add CTASSERTs in the kernel to ensure that the definitions are consistent. PR: 215716 Reviewed by: pkelsey MFC after: 2 weeks Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D17758	2018-11-16 03:42:29 +00:00
Stephen Hurd	0efb1a464f	Clear RX completion queue state veriables in iflib_stop() iflib_stop() was not resetting the rxq completion queue state variables. This meant that for any driver that has receive completion queues, after a reinit, iflib would start asking what's available on the rx side starting at whatever the completion queue index was prior to the stop, instead of at 0. Submitted by: pkelsey Reported by: pkelsey MFC after: 3 days Sponsored by: Limelight Networks	2018-11-14 20:36:18 +00:00
Stephen Hurd	8d4ceb9cc4	Prevent POLA violation with TSO/CSUM offload Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding CSUM_IP6?_TCP / CSUM_IP flags are also set. Rather than requireing drivers to bake-in an understanding that TSO implies checksum offloads, make it explicit. This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to ensure it's zeroed for TSO. Reported by: Jacob Keller <jacob.e.keller@intel.com> MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17801	2018-11-14 15:23:39 +00:00
Stephen Hurd	4d261ce278	Fix leaks caused by ifc_nhwtxqs never being initialized r333502 removed initialization of ifc_nhwtxqs, and it's not clear there's a need to copy it into the struct iflib_ctx at all. Use ctx->ifc_sctx->isc_ntxqs instead. Further, iflib_stop() did not clear the last ring in the case where isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl. Reported by: pkelsey Reviewed by: pkelsey MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17979	2018-11-14 15:16:45 +00:00
Gleb Smirnoff	b79aa45e0e	For compatibility KPI functions like if_addr_rlock() that used to have mutexes but now are converted to epoch(9) use thread-private epoch_tracker. Embedding tracker into ifnet(9) or ifnet derived structures creates a non reentrable function, that will fail miserably if called simultaneously from two different contexts. A thread private tracker will provide a single tracker that would allow to call these functions safely. It doesn't allow nested call, but this is not expected from compatibility KPIs. Reviewed by: markj	2018-11-13 22:58:38 +00:00
Stephen Hurd	a42546df88	Fix rxcsum issue introduced in r338838 r338838 attempted to fix issues with rxcsum and rxcsum6. However, the rxcsum bits were set as though if_setcapenablebit() was being called, not if_togglecapenable() which is in use. As a result, it was not possible to disable rxcsum when rxcsum6 was supported. PR: 233004 Reported by: lev Reviewed by: lev MFC after: 3 days Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17881	2018-11-07 19:31:48 +00:00
Kristof Provost	fbbf436d56	pfsync: Handle syncdev going away If the syncdev is removed we no longer need to clean up the multicast entry we've got set up for that device. Pass the ifnet detach event through pf to pfsync, and remove our multicast handle, and mark us as no longer having a syncdev. Note that this callback is always installed, even if the pfsync interface is disabled (and thus it's not a per-vnet callback pointer). MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17502	2018-11-02 16:57:23 +00:00
Kristof Provost	25c6ab1b78	Notify that the ifnet will go away, even on vnet shutdown pf subscribes to ifnet_departure_event events, so it can clean up the ifg_pf_kif and if_pf_kif pointers in the ifnet. During vnet shutdown interfaces could go away without sending the event, so pf ends up cleaning these up as part of its shutdown sequence, which happens after the ifnet has already been freed. Send the ifnet_departure_event during vnet shutdown, allowing pf to clean up correctly. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17500	2018-11-02 16:50:17 +00:00
Kristof Provost	5f6cf24e2d	pfsync: Make pfsync callbacks per-vnet The callbacks are installed and removed depending on the state of the pfsync device, which is per-vnet. The callbacks must also be per-vnet. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17499	2018-11-02 16:47:07 +00:00
Kristof Provost	790194cd47	pf: Limit the fragment entry queue length to 64 per bucket. So we have a global limit of 1024 fragments, but it is fine grained to the region of the packet. Smaller packets may have less fragments. This costs another 16 bytes of memory per reassembly and devides the worst case for searching by 8. Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D17734	2018-11-02 15:32:04 +00:00
Kristof Provost	fd2ea405e6	pf: Split the fragment reassembly queue into smaller parts Remember 16 entry points based on the fragment offset. Instead of a worst case of 8196 list traversals we now check a maximum of 512 list entries or 16 array elements. Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D17733	2018-11-02 15:26:51 +00:00
Bjoern A. Zeeb	c955c6cd08	With more excessive use of modules, more kernel parts working with VIMAGE, and feature richness and global state increasing the 8k of vnet module space are no longer sufficient for people and loading multiple modules, e.g., pf(4) and ipl(4) or ipsec(4) will fail on the second module. Increase the module space to 8 * PAGE_SIZE which should be enough to hold multiple firewalls, ipsec, multicast (as in the old days was a problem), epair, carp, and any kind of other vnet enabled modules. Sadly this is a global byte array part of the vnet_set, so we cannot dynamically change its size; otherwise a TUNABLE would have been a better solution. PR: 228854 Reported by: Ernie Luzar, Marek Zarychta Discussed with: rgrimes on current MFC after: 3 days	2018-10-30 20:45:15 +00:00
Bjoern A. Zeeb	201100c58b	Initial implementation of draft-ietf-6man-ipv6only-flag. This change defines the RA "6" (IPv6-Only) flag which routers may advertise, kernel logic to check if all routers on a link have the flag set and accordingly update a per-interface flag. If all routers agree that it is an IPv6-only link, ether_output_frame(), based on the interface flag, will filter out all ETHERTYPE_IP/ARP frames, drop them, and return EAFNOSUPPORT to upper layers. The change also updates ndp to show the "6" flag, ifconfig to display the IPV6_ONLY nd6 flag if set, and rtadvd to allow announcing the flag. Further changes to tcpdump (contrib code) are availble and will be upstreamed. Tested the code (slightly earlier version) with 2 FreeBSD IPv6 routers, a FreeBSD laptop on ethernet as well as wifi, and with Win10 and OSX clients (which did not fall over with the "6" flag set but not understood). We may also want to (a) implement and RX filter, and (b) over time enahnce user space to, say, stop dhclient from running when the interface flag is set. Also we might want to start IPv6 before IPv4 in the future. All the code is hidden under the EXPERIMENTAL option and not compiled by default as the draft is a work-in-progress and we cannot rely on the fact that IANA will assign the bits as requested by the draft and hence they may change. Dear 6man, you have running code. Discussed with: Bob Hinden, Brian E Carpenter	2018-10-30 20:08:48 +00:00
Marcelo Araujo	fbd8c33022	Allow changing lagg(4) MTU. Previously, changing the MTU would require destroying the lagg and creating a new one. Now it is allowed to change the MTU of the lagg interface and the MTU of the ports will be set to match. If any port cannot set the new MTU, all ports are reverted to the original MTU of the lagg. Additionally, when adding ports, the MTU of a port will be automatically set to the MTU of the lagg. As always, the MTU of the lagg is initially determined by the MTU of the first port added. If adding an interface as a port for some reason fails, that interface is reverted to its original MTU. Submitted by: Ryan Moeller <ryan@freqlabs.com> Reviewed by: mav Relnotes: Yes Sponsored by: iXsystems Inc. Differential Revision: https://reviews.freebsd.org/D17576	2018-10-30 09:53:57 +00:00
Eugene Grosbein	232485a17e	Prevent stf(4) from panicing due to unprotected access to INADDR_HASH. PR: 220078 MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12457 Tested-by: Cassiano Peixoto and others	2018-10-27 04:45:28 +00:00
Eric Joyner	46fa0c2552	Revert r339634. That commit is causing kernel panics in em(4), so this will be reverted until those are fixed. Reported by: ae@, pho@, et al Sponsored by: Intel Corporation	2018-10-23 17:06:36 +00:00
Andrey V. Elsukov	8796e291f8	Add the check that current VNET is ready and access to srchash is allowed. This change is similar to r339646. The callback that checks for appearing and disappearing of tunnel ingress address can be called during VNET teardown. To prevent access to already freed memory, add check to the callback and epoch_wait() call to be sure that callback has finished its work. MFC after: 20 days	2018-10-23 13:11:45 +00:00
Andrey V. Elsukov	221022e190	Add the check that current VNET is ready and access to srchash is allowed. ipsec_srcaddr() callback can be called during VNET teardown, since ingress address checking subsystem isn't VNET specific. And thus callback can make access to already freed memory. To prevent this, use V_ipsec_idhtbl pointer as indicator of VNET readiness. And make epoch_wait() after resetting it to NULL in vnet_ipsec_uninit() to be sure that ipsec_srcaddr() is finished its work. Reported by: kp MFC after: 20 days	2018-10-23 13:03:03 +00:00
Andrey V. Elsukov	e6b383b2f5	Remove softc from idhash when interface is destroyed. MFC after: 20 days	2018-10-23 12:50:28 +00:00
Vincenzo Maffione	2a7db7a63d	netmap: align codebase to the current upstream (sha 8374e1a7e6941) Changelist: - Move large parts of VALE code to a new file and header netmap_bdg.[ch]. This is useful to reuse the code within upcoming projects. - Improvements and bug fixes to pipes and monitors. - Introduce nm_os_onattach(), nm_os_onenter() and nm_os_onexit() to handle differences between FreeBSD and Linux. - Introduce some new helper functions to handle more host rings and fake rings (netmap_all_rings(), netmap_real_rings(), ...) - Added new sysctl to enable/disable hw checksum in emulated netmap mode. - nm_inject: add support for NS_MOREFRAG Approved by: gnn (mentor) Differential Revision: https://reviews.freebsd.org/D17364	2018-10-23 08:55:16 +00:00
Eric Joyner	940f62d616	iflib: drain enqueued tasks before detaching from taskqgroup The taskqgroup_detach function does not check if task is already enqueued when detaching it. This may lead to kernel panic if enqueued task starts after context state lock is destroyed. Ensure that the already enqueued admin tasks are executed before detaching them. The issue was discovered during validation of D16429. Unloading of if_ixlv followed by immediate removal of VFs with iovctl -D may lead to panic on NODEBUG kernel. As well, check if iflib is in detach before enqueueing new admin or iov tasks, to prevent new tasks from executing while the taskqgroup tasks are being drained. Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com> Reviewed by: shurd@, erj@ Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D17404	2018-10-23 04:37:29 +00:00
Hans Petter Selasky	88d57e9eac	Resolve deadlock between epoch(9) and various network interface SX-locks, during if_purgeaddrs(), by not allowing to hold the epoch read lock over typical network IOCTL code paths. This is a regression issue after r334305. Reviewed by: ae (network) Differential revision: https://reviews.freebsd.org/D17647 MFC after: 1 week Sponsored by: Mellanox Technologies	2018-10-22 13:25:26 +00:00
Andrey V. Elsukov	cc958ed28c	Follow the fix in r339532 (by glebius): Fix exiting an epoch(9) we never entered. May happen only with MAC. MFC after: 1 month	2018-10-21 18:30:27 +00:00
Andrey V. Elsukov	2c87fdf059	Rework if_ipsec(4) to use epoch(9) instead of rmlock. * use CK_LIST and FNV hash to keep chains of softc; * read access to softc is protected by epoch(); * write access is protected by ipsec_ioctl_sx. Changing of softc fields is allowed only when softc is unlinked from CK_LIST chains. * linking/unlinking of softc is allowed only when ipsec_ioctl_sx is exclusive locked. * the plain LIST of all softc is replaced by hash table that uses ingress address of tunnels as a key. * added support for appearing/disappearing of ingress address handling. Now it is allowed configure non-local ingress IP address, and thus the problem with if_ipsec(4) configuration that happens on boot, when ingress address is not yet configured, is solved. MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17190	2018-10-21 18:24:20 +00:00
Andrey V. Elsukov	df49ca9fd3	Add handling for appearing/disappearing of ingress addresses to if_me(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; MFC after: 1 month Sponsored by: Yandex LLC	2018-10-21 18:18:37 +00:00
Andrey V. Elsukov	19873f4780	Add handling for appearing/disappearing of ingress addresses to if_gre(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17214	2018-10-21 18:13:45 +00:00
Andrey V. Elsukov	009d82ee0f	Add handling for appearing/disappearing of ingress addresses to if_gif(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; * remove the note about ingress address from BUGS section. MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134	2018-10-21 18:06:15 +00:00
Andrey V. Elsukov	8251c68d5c	Add KPI that can be used by tunneling interfaces to handle IP addresses appearing and disappearing on the host system. Such handling is need, because tunneling interfaces must use addresses, that are configured on the host as ingress addresses for tunnels. Otherwise the system can send spoofed packets with source address, that belongs to foreign host. The KPI uses ifaddr_event_ext event to implement addresses tracking. Tunneling interfaces register event handlers and then they are notified by the kernel, when an address disappears or appears. ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event() in the ip_encap.c MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134	2018-10-21 17:55:26 +00:00
Kristof Provost	5191a3aea6	vlan: Fix panic with lagg and vlan vlan_lladdr_fn() is called from taskqueue, which means there's no vnet context set. We can end up trying to send ARP messages (through the iflladdr_event event), which requires a vnet context. PR: 227654 MFC after: 3 days	2018-10-21 16:51:35 +00:00
Andrey V. Elsukov	64d63b1e03	Add ifaddr_event_ext event. It is similar to ifaddr_event, but the handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL, and the pointer to ifaddr. Also ifaddr_event now is implemented using ifaddr_event_ext handler. MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17100	2018-10-21 15:02:06 +00:00
Gleb Smirnoff	0a27163f8d	Fix exiting an epoch(9) we never entered. May happen only with MAC.	2018-10-21 12:39:00 +00:00
Hans Petter Selasky	c32a9d66fb	Fix deadlock when destroying VLANs. Synchronizing the epoch before freeing the multicast addresses while holding the VLAN_XLOCK() might lead to a deadlock. Use deferred freeing of the VLAN multicast addresses to resolve deadlock. Backtrace: Thread1: epoch_block_handler_preempt() ck_epoch_synchronize_wait() epoch_wait_preempt() vlan_setmulti() vlan_ioctl() in6m_release_task() gtaskqueue_run_locked() gtaskqueue_thread_loop() fork_exit() fork_trampoline() Thread2: sleepq_switch() sleepq_wait() _sx_xlock_hard() _sx_xlock() in6_leavegroup() in6_purgeaddr() if_purgeaddrs() if_detach_internal() if_detach() vlan_clone_destroy() if_clone_destroyif() if_clone_destroy() ifioctl() kern_ioctl() sys_ioctl() amd64_syscall() fast_syscall_common() syscall() Differential revision: https://reviews.freebsd.org/D17496 Reviewed by: slavash, mmacy Approved by: re (kib) Sponsored by: Mellanox Technologies	2018-10-15 10:29:29 +00:00
Eric Joyner	77c1fcec91	ixl/iavf(4): Change ixlv to iavf and update it to use iflib(9) Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4). This commit also re-adds the VF driver to GENERIC since it now compiles and functions. The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is now intended to be used with future products, not just with Fortville/Fort Park VFs. A man page update that documents these drivers is forthcoming in a separate commit. Reviewed by: sbruno@, kbowling@ Tested by: jeffrey.e.pieper@intel.com Approved by: re (gjb@) Relnotes: yes Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D16429	2018-10-12 22:40:54 +00:00
Jonathan T. Looney	13c6ba6d94	There are three places where we return from a function which entered an epoch section without exiting that epoch section. This is bad for two reasons: the epoch section won't exit, and we will leave the epoch tracker from the stack on the epoch list. Fix the epoch leak by making sure we exit epoch sections before returning. Reviewed by: ae, gallatin, mmacy Approved by: re (gjb, kib) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D17450	2018-10-09 13:26:06 +00:00
Michael Tuexen	580e30a33e	Use strlcpy() instead of strncpy(). Approved by: re (kib@) CID: 1395980, 1395981 X-MFC with: r339012 MFC after: 1 week	2018-10-03 07:35:16 +00:00
Michael Tuexen	1687b1ab24	For changing the MTU on tun/tap devices, it should not matter whether it is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command. Without this patch, for IPv6 the new MTU is not used when creating routes. Especially, when initiating TCP connections after increasing the MTU, the old MTU is still used to compute the MSS. Thanks to ae@ and bz@ for helping to improve the patch. Reviewed by: ae@, bz@ Approved by: re (kib@) MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17180	2018-09-29 13:01:23 +00:00
Warner Losh	329e817fcc	Reapply, with minor tweaks, r338025, from the original commit: Remove unused and easy to misuse PNP macro parameter Inspired by r338025, just remove the element size parameter to the MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to have correct pointer (or array) type. Since all invocations of the macro already had this property and the emitted PNP data continues to include the element size, there is no functional change. Mostly done with the coccinelle 'spatch' tool: $ cat modpnpsize0.cocci @normaltables@ identifier b,c; expression a,d,e; declarer MODULE_PNP_INFO; @@ MODULE_PNP_INFO(a,b,c,d, -sizeof(d[0]), e); @singletons@ identifier b,c,d; expression a; declarer MODULE_PNP_INFO; @@ MODULE_PNP_INFO(a,b,c,&d, -sizeof(d), 1); $ rg -l MODULE_PNP_INFO -- sys \| \ xargs spatch --in-place --sp-file modpnpsize0.cocci (Note that coccinelle invokes diff(1) via a PATH search and expects diff to tolerate the -B flag, which BSD diff does not. So I had to link gdiff into PATH as diff to use spatch.) Tinderbox'd (-DMAKE_JUST_KERNELS). Approved by: re (glen)	2018-09-26 17:12:14 +00:00
Matt Macy	b08d611de8	fix vlan locking to permit sx acquisition in ioctl calls - update vlan(9) to handle changes earlier this year in multicast locking Tested by: np@, darkfiberu at gmail.com PR: 230510 Reviewed by: mjoras@, shurd@, sbruno@ Approved by: re (gjb@) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16808	2018-09-21 01:37:08 +00:00
Stephen Hurd	0c919c2370	Fix capabilities handling for iflib drivers Various capabilities were not being handled correctly in the SIOCSIFCAP handler. Specifically: IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via ifconfig since it does ioctl() per command-line flag rather than combine them into a single call. IFCAP_VLAN_HWCSUM could not be modified via the ioctl() Setting any combination of the three IFCAP_WOL flags would set only IFCAP_WOL_MCAST \| IFCAP_WOL_MAGIC. For example, setting only IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC being enabled, but IFCAP_WOL_UCAST would not be enabled. Because if_vlancap() was called before if_togglecapenable(), vlan flags were sometimes not applied correctly. Interfaces were being unnecessarily stopped and restarted for WoL PR: 231151 Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp> Reported by: Shirkdog <mshirk@daemon-security.com> Reviewed by: galladin Approved by: re (gjb) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17158	2018-09-20 19:35:35 +00:00
Andrey V. Elsukov	c6851ad04c	Restore outbound packets capturing for if_gre(4). It was missed in r335048. Also clear M_MCAST and M_BCAST flags for encapsulated datagram, since it will have new IP header. Approved by: re (kib)	2018-09-17 10:10:14 +00:00
Ruslan Bukin	86c5937532	Don't mark module data as static on RISC-V. Similar to arm64, riscv compiler uses PC-relative loads/stores, and with static data compiler does not emit relocations. In result, kernel module linker has nothing to fix and data accessed from the wrong location. Approved by: re (gjb) Sponsored by: DARPA, AFRL	2018-09-12 08:05:33 +00:00
Stephen Hurd	64e6fc1379	Clean up iflib sysctls Remove sysctls: txq_drain_encapfail - now a duplicate of encap_txd_encap_fail intr_link - was never incremented intr_msix - was never incremented rx_zero_len - was never incremented The following were not incremented in all code-paths that apply: m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees, encap_txd_encap_fail. Fixes: Replace the broken collapse_pkthdr() implementation with an MPASS(). fl_refills and fl_refills_large were not incremented when using netmap. Reviewed by: gallatin Approved by: re (marius) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16733	2018-09-06 18:51:52 +00:00
Bjoern A. Zeeb	d4672c61d3	Rather than duplicating the functionality of a macro after r322866 use the already existing one. No functional changes. Reviewed by: karels, ae Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D17004	2018-09-03 22:10:49 +00:00
Stephen Hurd	bc0e855bd9	Fix compile error due to missing parenthesis in r338372 Approved by: re (gjb)	2018-08-29 16:21:34 +00:00
Stephen Hurd	a520f8b6fe	Fix potential data corruption in iflib The MP ring may have txq pointers enqueued. Previously, these were passed to m_free() when IFC_QFLUSH was set. This patch checks for the value and doesn't call m_free(). Reviewed by: gallatin Approved by: re (gjb) Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16882	2018-08-29 15:55:25 +00:00
Mark Murray	19fa89e938	Remove the Yarrow PRNG algorithm option in accordance with due notice given in random(4). This includes updating of the relevant man pages, and no-longer-used harvesting parameters. Ensure that the pseudo-unit-test still does something useful, now also with the "other" algorithm instead of Yarrow. PR: 230870 Reviewed by: cem Approved by: so(delphij,gtetlow) Approved by: re(marius) Differential Revision: https://reviews.freebsd.org/D16898	2018-08-26 12:51:46 +00:00
Navdeep Parhar	47c39432b1	Unbreak VLANs after r337943. ether_set_pcp should not be called from ether_output_frame for VLAN interfaces -- the vid + pcp will be inserted during vlan_transmit in that case. r337943 sets the VLAN's ifnet's if_pcp to a proper PCP value and this led to double encapsulation (once with vid 0 and second time with vid+pcp). PR: 230794 Reviewed by: kib@ Approved by: re@ (gjb@) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D16887	2018-08-24 21:48:13 +00:00
Patrick Kelsey	249cc75fd1	Extended pf(4) ioctl interface and pfctl(8) to allow bandwidths of 2^32 bps or greater to be used. Prior to this, bandwidth parameters would simply wrap at the 2^32 boundary. The computations in the HFSC scheduler and token bucket regulator have been modified to operate correctly up to at least 100 Gbps. No other algorithms have been examined or modified for correct operation above 2^32 bps (some may have existing computation resolution or overflow issues at rates below that threshold). pfctl(8) will now limit non-HFSC bandwidth parameters to 2^32 - 1 before passing them to the kernel. The extensions to the pf(4) ioctl interface have been made in a backwards-compatible way by versioning affected data structures, supporting all versions in the kernel, and implementing macros that will cause existing code that consumes that interface to use version 0 without source modifications. If version 0 consumers of the interface are used against a new kernel that has had bandwidth parameters of 2^32 or greater configured by updated tools, such bandwidth parameters will be reported as 2^32 - 1 bps by those old consumers. All in-tree consumers of the pf(4) interface have been updated. To update out-of-tree consumers to the latest version of the interface, define PFIOC_USE_LATEST ahead of any includes and use the code of pfctl(8) as a guide for the ioctls of interest. PR: 211730 Reviewed by: jmallett, kp, loos MFC after: 2 weeks Relnotes: yes Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D16782	2018-08-22 19:38:48 +00:00
Eric Joyner	fe2bf351fe	if_media: Add new 2.5G/5G/25G/40G/50G/100G/200G/400G media types Upcoming Ethernet hardware will support new media types that aren't in the kernel yet, so they are added here. These mostly include new 25G/50G/100G media types; and this commit introduces new 200G/400G speeds and media. Reviewed by: hselasky@, jhb@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D16731	2018-08-22 18:19:56 +00:00
Matt Macy	77ad07b6a3	fix copy/paste error when clearing ifma flag CID: 1395119 Reported by: vangyzen	2018-08-21 22:59:22 +00:00
Conrad Meyer	b8e771e97a	Back out r338035 until Warner is finished churning GSoC PNP patches I was not aware Warner was making or planning to make forward progress in this area and have since been informed of that. It's easy to apply/reapply when churn dies down.	2018-08-19 00:46:22 +00:00
Conrad Meyer	faa319436f	Remove unused and easy to misuse PNP macro parameter Inspired by r338025, just remove the element size parameter to the MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to have correct pointer (or array) type. Since all invocations of the macro already had this property and the emitted PNP data continues to include the element size, there is no functional change. Mostly done with the coccinelle 'spatch' tool: $ cat modpnpsize0.cocci @normaltables@ identifier b,c; expression a,d,e; declarer MODULE_PNP_INFO; @@ MODULE_PNP_INFO(a,b,c,d, -sizeof(d[0]), e); @singletons@ identifier b,c,d; expression a; declarer MODULE_PNP_INFO; @@ MODULE_PNP_INFO(a,b,c,&d, -sizeof(d), 1); $ rg -l MODULE_PNP_INFO -- sys \| \ xargs spatch --in-place --sp-file modpnpsize0.cocci (Note that coccinelle invokes diff(1) via a PATH search and expects diff to tolerate the -B flag, which BSD diff does not. So I had to link gdiff into PATH as diff to use spatch.) Tinderbox'd (-DMAKE_JUST_KERNELS).	2018-08-19 00:22:21 +00:00
Navdeep Parhar	32a52e9e39	if_vlan(4): A VLAN always has a PCP and its ifnet's if_pcp should be set to the PCP value in use instead of IFNET_PCP_NONE. MFC after: 1 week Sponsored by: Chelsio Communications	2018-08-17 01:03:23 +00:00
Navdeep Parhar	32d2623ae2	Add the ability to look up the 3b PCP of a VLAN interface. Use it in toe_l2_resolve to fill up the complete vtag and not just the vid. Reviewed by: kib@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D16752	2018-08-16 23:46:38 +00:00
Matt Macy	f9be038601	Fix in6_multi double free This is actually several different bugs: - The code is not designed to handle inpcb deletion after interface deletion - add reference for inpcb membership - The multicast address has to be removed from interface lists when the refcount goes to zero OR when the interface goes away - decouple list disconnect from refcount (v6 only for now) - ifmultiaddr can exist past being on interface lists - add flag for tracking whether or not it's enqueued - deferring freeing moptions makes the incpb cleanup code simpler but opens the door wider still to races - call inp_gcmoptions synchronously after dropping the the inpcb lock Fundamentally multicast needs a rewrite - but keep applying band-aids for now. Tested by: kp Reported by: novel, kp, lwhsu	2018-08-15 20:23:08 +00:00
Andrew Gallatin	5ccac9f972	lagg: allow lacp to manage the link state Lacp needs to manage the link state itself. Unlike other lagg protocols, the ability of lacp to pass traffic depends not only on the lagg members having link, but also on the lacp protocol converging to a distributing state with the link partner. If we prematurely mark the link as up, then we will send a gratuitous arp (via arp_handle_ifllchange()) before the lacp interface is capable of passing traffic. When this happens, the gratuitous arp is lost, and our link partner may cache a stale mac address (eg, when the base mac address for the lagg bundle changes, due to a BIOS change re-ordering NIC unit numbers) Reviewed by: jtl, hselasky Sponsored by: Netflix	2018-08-13 14:13:25 +00:00
Kristof Provost	91e0f2d200	pf: Increase default hash table size Now that we (by default) limit the number of states to 100.000 it makse sense to also adjust the default size of the hash table. Based on the benchmarking results in https://github.com/ocochard/netbenches/blob/master/Atom_C2758_8Cores-Chelsio_T540-CR/pf-states_hashsize/results/fbsd12-head.r332390/README.md 128K entries offers a good compromise between performance and memory use. Users may still overrule this setting with the net.pf.states_hashsize and net.pf.source_nodes_hashsize loader(8) tunables.	2018-08-05 13:54:37 +00:00
Patrick Kelsey	8f410865b8	Mark the send queue ready so ALTQ is available.	2018-08-04 01:45:17 +00:00
Andrew Turner	b6ea4c5a2a	As with DPCPU_DEFINE_STATIC make VNET_DEFINE_STATIC non-static on arm64 in modules. It also fails in the same way, we are unable to relocate static variables as the compiler uses PC-relative loads with nothing for the kernel linker to relocate. Sponsored by: DARPA, AFRL	2018-07-30 15:05:07 +00:00
Andrew Turner	cd2106eaea	Ensure the DPCPU and VNET module spaces are aligned to hold a pointer. Previously they may have been aligned to a char, leading to misaligned DPCPU and VNET variables. Sponsored by: DARPA, AFRL	2018-07-30 14:25:17 +00:00
Andrew Turner	bc61d94997	As with DPCPU_DEFINE make it a compile error to use static with VNET_DEFINE. There is the VNET_DEFINE_STATIC macro for that.	2018-07-30 12:44:44 +00:00
Patrick Kelsey	b8ca475604	ALTQ support for iflib. Reviewed by: jmallett, mmacy Differential Revision: https://reviews.freebsd.org/D16433	2018-07-25 22:46:36 +00:00
Marius Strobl	c9a49a4fd8	Since r336611, n is only used for INET in iflib_parse_header(). Reported by: rpokala	2018-07-24 23:40:27 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Andrew Turner	fceba23f93	As with DPCPU create VNET_DEFINE_STATIC for when a variable needs to be declaired static. This will allow us to change the definition on arm64 as it has the same issues described in r336349. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:31:16 +00:00
Eugene Grosbein	804771f553	epair(4): make sure we do not duplicate MAC addresses in case of reused if_index. PR: 229957 Tested by: O. Hartmann <ohartmann@walstatt.org> Approved by: avg (mentor)	2018-07-23 07:11:58 +00:00
Marius Strobl	7474544bac	Use the maximum of isc_tx_{nsegments,tso_segments_max} for MAX_TX_DESC. Since r336313, TSO support for LEM-class devices is removed again as it was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus, inappropriate watermarks were used for this class. This is really only a band-aid, though, because so far iflib(9) doesn't fully take into account that DMA engines can support different maxima of segments for transfers of TSO and non-TSO packets. For example, the DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC used isc_tx_tso_segments_max only. For most in-tree consumers that doesn't make a difference as the maxima are the same for both kinds of transfers (that is, apart from the fact that TSO may require up to 2 sentinel descriptors but also not with every MAC supported). However, isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default with ixl(4).	2018-07-22 17:51:11 +00:00
Marius Strobl	8b8d90931d	- Given that the controlling expression of the receive loop in iflib_rxeof() tests for avail > 0, avail can never be 0 within that loop. Thus, move decrementing avail and budget_left into the loop and before the code which checks for additional descriptors having become available in case all the previous ones have been processed but there still is budget left so the latter code works as expected. [1] - In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m and n respectively. [2, 3] - In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing it. [4] - Remove a duplicate assignment of segs in iflib_encap(). Reported by: Coverity CID: 1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]	2018-07-22 17:45:44 +00:00
Stephen Hurd	fe51d4cdfe	Add knob to control tx ring abdication. r323954 changed the mp ring behaviour when 64-bit atomics were available to abdicate the TX ring rather than having one become a consumer thereby running to completion on TX. The consumer of the mp ring was then triggered in the tx task rather than blocking the TX call. While this significantly lowered the number of RX drops in small-packet forwarding, it also negatively impacts TX performance. With this change, the default behaviour is reverted, causing one TX ring to become a consumer during the enqueue call. A new sysctl, dev.X.Y.iflib.tx_abdicate is added to control this behaviour. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16302	2018-07-20 17:45:26 +00:00
Stephen Hurd	dd7fbcf193	Improve netmap TX handling when TX IRQs are not used/supported Use the timer to poll for TX completions when there are outstanding TX slots. Track when the last driver timer was called to prevent overcalling it. Also clean up some kring vs NIC ring usage. Reviewed by: marius, Johannes Lundberg <johalun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16300	2018-07-20 17:24:45 +00:00
Andrey V. Elsukov	acf673edf0	Move invoking of callout_stop(&lle->lle_timer) into llentry_free(). This deduplicates the code a bit, and also implicitly adds missing callout_stop() to in[6]_lltable_delete_entry() functions. PR: 209682, 225927 Submitted by: hselasky (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D4605	2018-07-17 11:33:23 +00:00
Marius Strobl	7f87c0406d	Assorted TSO fixes for em(4)/iflib(9) and dead code removal: - Ever since the workaround for the silicon bug of TSO4 causing MAC hangs was committed in r295133, CSUM_TSO always got disabled unconditionally by em(4) on the first invocation of em_init_locked(). However, even with that problem fixed, it turned out that for at least e. g. 82579 not all necessary TSO workarounds are in place, still causing MAC hangs even at Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled in r323292 (r323293 for stable/10) for the EM-class by default, allowing users to turn it on if it happens to work with their particular EM MAC in a Gigabit-only environment. In head, the TSO workaround for speeds other than Gigabit was lost with the conversion to iflib(9) in r311849 (possibly along with another one or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4 got enabled by default again, causing device hangs. Therefore, change the default for this hardware class back to have TSO4 off, allowing users to turn it on manually if it happens to work in their environment as we do in stable/{10,11}. An alternative would be to add a whitelist of EM-class devices where TSO4 actually is reliable with the workarounds in place, but given that the advantage of TSO at Gigabit speed is rather limited - especially with the overhead of these workarounds -, that's really not worth it. [1] This change includes the addition of an isc_capabilities to struct if_softc_ctx so iflib(9) can also handle interface capabilities that shouldn't be enabled by default which is used to handle the default-off capabilities of e1000 as suggested by shurd@ and moving their handling from em_setup_interface() to em_if_attach_pre() accordingly. - Although 82543 support TSO4 in theory, the former lem(4) didn't have support for TSO4, presumably because TSO4 is even more broken in the LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class devices was enabled as part of the conversion to iflib(9) in r311849, causing device hangs. So revert back to the pre-r311849 behavior of not supporting TSO4 for LEM-class at all, which includes not creating a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2] - In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET (65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO DMA must have a maxsize of the maximum TSO size plus the size of a VLAN header for software VLAN tagging. The iflib(9) converted em(4), thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET in em_setup_interface() (apparently, left-over from pre-iflib(9) times). So remove the later and correct iflib(9) to correctly cap the maximum TSO size reported to the stack at IP_MAXPACKET. While at it, let iflib(9) use if_sethwtsomax(). This change includes the addition of isc_tso_max{seg,}size DMA engine constraints for the TSO DMA tag to struct if_shared_ctx and letting iflib_txsd_alloc() automatically adjust the maxsize of that tag in case IFCAP_VLAN_MTU is supported as requested by shurd@. - Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9) so adjustment is automatically done in case IFCAP_VLAN_MTU is supported. As a consequence, this adjustment now is also done in case of bnxt(4) which missed it previously. - Move the reduction of the maximum TSO segment count reported to the stack by the number of m_pullup(9) calls (which in the worst case, can add another mbuf and, thus, the requirement for another DMA segment each) in the transmit path for performance reasons from em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now done in iflib_parse_header() rather than in the no longer existing em_xmit(). Moreover, this optimization applies to all drivers using iflib(9) and not just em(4); all in-tree iflib(9) consumers still have enough room to handle full size TSO packets. Also, reduce the adjustment to the maximum number of m_pullup(9)'s now performed in iflib_parse_header(). - Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9) in r311849 and r335338 respectively, these drivers didn't enable IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on by default but also lagg(4) was fixed in that regard in r203548. So just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling in {em,ixl,ixlv}_setup_interface(). - Nuke other redundant IFCAP_ setting in {em,ixl,ixlv}_setup_interface() which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre() now. - Remove some redundant/dead setting of scctx->isc_tx_csum_flags in em_if_attach_pre(). - Remove some IFCAP_* duplicated either directly or indirectly (e. g. via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS. - Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as iflib(9) adds that capability unconditionally. - Remove some unused macros from em(4). - Bump __FreeBSD_version as some of the above changes require the modules of drivers using iflib(9) to be recompiled. Okayed by: sbruno@ at 201806 DevSummit Transport Working Group [1] Reviewed by: sbruno (earlier version), erj PR: 219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2] Differential Revision: https://reviews.freebsd.org/D15720	2018-07-15 19:04:23 +00:00
Kristof Provost	b895839daa	pf: Fix typo in r336221 Reported by: olivier@	2018-07-12 18:07:28 +00:00
Kristof Provost	f4cdb8aedd	pf: Increate default state table size The typical system now has a lot more memory than when pf was new, and is also expected to handle more connections. Increase the default size of the state table. Note that users can overrule this using 'set limit states' in pf.conf. From OpenBSD: The year is 2018. Mercury, Bowie, Cash, Motorola and DEC all left us. Just pf still has a default state table limit of 10000. Had! Now it's a tiny little bit more, 100k. lead guitar: me ok chorus: phessler theo claudio benno background school girl laughing: bob Obtained from: OpenBSD	2018-07-12 16:35:35 +00:00
Andrey V. Elsukov	98a8fdf6da	Deduplicate the code. Add generic function if_tunnel_check_nesting() that does check for allowed nesting level for tunneling interfaces and also does loop detection. Use it in gif(4), gre(4) and me(4) interfaces. Differential Revision: https://reviews.freebsd.org/D16162	2018-07-09 11:03:28 +00:00
Sean Bruno	2da1967762	struct ifmediareq *ifmrp is only used in the COMPAT_FREEBSD32 parts of ifioctl(). Move it inside the proper #ifdef. This was throwing a valid "Assigned but unused" warning with gcc. Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16063	2018-07-07 13:35:06 +00:00
Will Andrews	cc535c95ca	Revert r335833. Several third-parties use at least some of these ioctls. While it would be better for regression testing if they were used in base (or at least in the test suite), it's currently not worth the trouble to push through removal. Submitted by: antoine, markj	2018-07-04 03:36:46 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Will Andrews	c1887e9f09	pf: remove unused ioctls. Several ioctls are unused in pf, in the sense that no base utility references them. Additionally, a cursory review of pf-based ports indicates they're not used elsewhere either. Some of them have been unused since the original import. As far as I can tell, they're also unused in OpenBSD. Finally, removing this code removes the need for future pf work to take them into account. Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D16076	2018-07-01 01:16:03 +00:00
Andrey V. Elsukov	6e081509db	Add NULL pointer check. encap_lookup_t method can be invoked by IP encap subsytem even if none of gif/gre/me interfaces are exist. Hash tables are allocated on demand, when first interface is created. So, make NULL pointer check before doing access to hash table. PR: 229378	2018-06-28 11:39:27 +00:00
Andrey V. Elsukov	ca3cd72b17	Move BPFIF_* macro definitions into .c file, where struct bpf_if is declared. They are only used in this file and there is no need to export them via bpfdesc.h.	2018-06-19 10:34:45 +00:00
Eric Joyner	dfae03b5b5	iflib: Style fixes MFC after: 1 week	2018-06-18 17:27:43 +00:00
Marius Strobl	e4defe55a8	Assorted fixes to MSI-X/MSI/INTx setup in iflib(9): - In iflib_msix_init(), VMMs with broken MSI-X activation are trying to be worked around by manually enabling PCIM_MSIXCTRL_MSIX_ENABLE before calling pci_alloc_msix(9). Apart from constituting a layering violation, this has the problem of leaving PCIM_MSIXCTRL_MSIX_ENABLE enabled when falling back to MSI or INTx when e. g. MSI-X is black- listed and initially also when disabled via hw.pci.enable_msix. The later in turn was incorrectly worked around in r325166. Since r310806, pci(4) itself has code to deal with broken MSI-X handling of VMMs, so all of these workarounds in iflib(9) can go, fixing non-working interrupts when falling back to MSI/INTx. In any case, possibly further adjustments to broken MSI-X activation of VMMs like enabling r310806 by default in VM environments need to be placed into pci(4), not iflib(9). [1] - Also remove the pci_enable_busmaster(9) call from iflib_msix_init(), which is already more properly invoked from iflib_device_attach(). - When falling back to MSI/INTx, release the MSI-X BAR resource again. - When falling back to INTx, ensure scctx->isc_vectors is set to 1 and not to something higher from a device with more than one MSI message supported. - Make the nearby ring_state(s) stuff (static) const. Discussed with: jhb at BSDCan 2018 [1] Reviewed by: imp, jhb Differential Revision: https://reviews.freebsd.org/D15729	2018-06-17 20:33:02 +00:00

1 2 3 4 5 ...

4206 Commits