freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	c1be839971	pf: Add a new zone for per-table entry counters. Right now we optionally allocate 8 counters per table entry, so in addition to memory consumed by counters, we require 8 pointers worth of space in each entry even when counters are not allocated (the default). Instead, define a UMA zone that returns contiguous per-CPU counter arrays for use in table entries. On amd64 this reduces sizeof(struct pfr_kentry) from 216 to 160. The smaller size also results in better slab efficiency, so memory usage for large tables is reduced by about 28%. Reviewed by: kp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24843	2020-05-16 00:28:12 +00:00
Kyle Evans	c79cee7136	kernel: provide panicky version of __unreachable __builtin_unreachable doesn't raise any compile-time warnings/errors on its own, so problems with its usage can't be easily detected. While it would be nice for this situation to change and compilers to at least add a warning for trivial cases where local state means the instruction can't be reached, this isn't the case at the moment and likely will not happen. This commit adds an __assert_unreachable, whose intent is incredibly clear: it asserts that this instruction is unreachable. On INVARIANTS builds, it's a panic(), and on non-INVARIANTS it expands to __unreachable(). Existing users of __unreachable() are converted to __assert_unreachable, to improve debuggability if this assumption is violated. Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D23793	2020-05-13 18:07:37 +00:00
Warner Losh	83b4342743	Kill trailing newline while I'm here...	2020-05-12 23:46:52 +00:00
Alexander V. Chernikov	4a6ee281d9	Remove unused rnh_close callback from rtable & cleanup depends. rnh_close callbackes was used by the in[6]_clsroute() handlers, doing cleanup in the route cloning code. Route cloning was eliminated somewhere around r186119. Last callback user was eliminated in r186215, 11 years ago. Differential Revision: https://reviews.freebsd.org/D24793	2020-05-11 06:09:18 +00:00
Alexander V. Chernikov	d223372545	Remove rtalloc1(_fib) KPI. Last user of rtalloc1() KPI has been eliminated in rS360631. As kernel is now fully switched to use new routing KPI defined in rS359823, remove old lookup functions. Differential Revision: https://reviews.freebsd.org/D24776	2020-05-10 09:34:48 +00:00
Alexander V. Chernikov	656442a718	Embed dst sockaddr into rtentry and remove rte packet counter Currently each rtentry has dst&gateway allocated separately from another zone, bloating cache accesses. Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning, leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64). Fields needed for the datapath are destination sockaddr and rt_nhop. So far it doesn't look like there is other routable addressing protocol other than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes. With that in mind, embed dst into struct rtentry, making the first 24 bytes of rtentry within 128 bytes. That is enough to make IPv6 address within first 128 bytes. It is still pretty easy to add code for supporting separately-allocated dst, however it doesn't make a lot of sense in having such code without a use case. As rS359823 moved the gateway to the nexthop structure, the dst embedding change removes the need for any additional allocations done by rt_setgate(). Lastly, as a part of cleanup, remove counter(9) allocation code, as this field is not used in packet processing anymore. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24669	2020-05-08 21:06:10 +00:00
Alexander V. Chernikov	682b902daf	Add rib_lookup() sockaddr lookup wrapper and make ifa_ifwithroute use it. Create rib_lookup() wrapper around per-af dataplane lookup functions. This will help in the cases of having control plane af-agnostic code. Switch ifa_ifwithroute() to use this function instead of rtalloc1(). Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24731	2020-05-07 08:11:36 +00:00
Alexander V. Chernikov	d94be4ccaa	Switch DDB show route to direct rnh_matchaddr() call instead of rtalloc1(). Eliminate the last rtalloc1() call to finish transition to the new routing KPI defined in r359823. Differential Revision: https://reviews.freebsd.org/D24663	2020-05-04 15:07:57 +00:00
Alexander V. Chernikov	4f08f052ad	Simplify address parsing in DDB show route command. Use db_get_line() to overcome parser limitation. Differential Revision: https://reviews.freebsd.org/D24662	2020-05-04 15:00:19 +00:00
Alexander V. Chernikov	9e02229580	Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields. After converting routing subsystem customers to use nexthop objects defined in r359823, some fields in struct rtentry became unused. This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry along with the code initializing and updating these fields. Cleanup of the remaining fields will be addressed by D24669. This commit also changes the implementation of the RTM_CHANGE handling. Old implementation tried to perform the whole operation under radix WLOCK, resulting in slow performance and hacks like using RTF_RNH_LOCKED flag. New implementation looks up the route nexthop under radix RLOCK, creates new nexthop and tries to update rte nhop pointer. Only last part is done under WLOCK. In the hypothetical scenarious where multiple rtsock clients repeatedly issue RTM_CHANGE requests for the same route, route may get updated between read and update operation. This is addressed by retrying the operation multiple (3) times before returning failure back to the caller. Differential Revision: https://reviews.freebsd.org/D24666	2020-05-04 14:31:45 +00:00
Mark Johnston	814fa34dfb	Increase the iflib txq callout mutex name length to 32 bytes. With a length of 16, the name ("<if name>:TX(<qid>):callout") typically gets truncated. PR: 245712 Reported by: ghuckriede@blackberry.com MFC after: 1 week	2020-04-30 15:39:04 +00:00
Alexander V. Chernikov	8c61eb2107	Convert more rtentry field accesses into nhop fields accesses. Continue routing subsystem conversion to nhop objects defined in r359823. Use fields from nhop structure instead of "struct rtentry" fields. This is one of the last changes prior to removing rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry. Differential Revision: https://reviews.freebsd.org/D24609	2020-04-29 21:54:09 +00:00
Alexander V. Chernikov	74787ef47b	Add nhop to the ifa_rtrequest() callback. With the upcoming multipath changes described in D24141, rt->rt_nhop can potentially point to a nexthop group instead of an individual nhop. To simplify caller handling of such cases, change ifa_rtrequest() callback to pass changed nhop directly. Differential Revision: https://reviews.freebsd.org/D24604	2020-04-29 19:28:56 +00:00
Alexander V. Chernikov	fc88ecd31a	Move route-specific ddb commands to route/route_ddb.c Currently functionality resides in rtsock.c, which is a controlling interface, partially external to the routing subsystem. Additionally, DDB-supporting functionality is > 100SLOC, which deserves a separate file. Given that, move this functionality to a newly-created net/route/ subdir. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D24561	2020-04-28 20:00:17 +00:00
Alexander V. Chernikov	e7d8af4f65	Move route_temporal.c and route_var.h to net/route. Nexthop objects implementation, defined in r359823, introduced sys/net/route directory intended to hold all routing-related code. Move recently-introduced route_temporal.c and private route_var.h header there. Differential Revision: https://reviews.freebsd.org/D24597	2020-04-28 19:14:09 +00:00
Alexander V. Chernikov	fe6da72759	Move struct rtentry definition to nhop_var.h. One of the goals of the new routing KPI defined in r359823 is to entirely hide`struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver features much faster. This is one of the last changes, effectively moving struct rtentry definition to a net/route_var.h header, internal to the routing subsystem. Differential Revision: https://reviews.freebsd.org/D24580	2020-04-28 18:42:30 +00:00
Alexander V. Chernikov	4043ee3cd7	Convert rtalloc_mpath_fib() users to the new KPI. New fib[46]_lookup() functions support multipath transparently. Given that, switch the last rtalloc_mpath_fib() calls to dib4_lookup() and eliminate the function itself. Note: proper flowid generation (especially for the outbound traffic) is a bigger topic and will be handled in a separate review. This change leaves flowid generation intact. Differential Revision: https://reviews.freebsd.org/D24595	2020-04-28 08:06:56 +00:00
Alexander V. Chernikov	1b0051bada	Eliminate now-unused parts of old routing KPI. r360292 switched most of the remaining routing customers to a new KPI, leaving a bunch of wrappers for old routing lookup functions unused. Remove them from the tree as a part of routing cleanup. Differential Revision: https://reviews.freebsd.org/D24569	2020-04-28 07:25:34 +00:00
Eric Joyner	45818bf1a0	iflib: Stop interface before (un)registering VLAN This patch is intended to solve a specific problem that iavf(4) encounters, but what it does can be extended to solve other issues. To summarize the iavf(4) issue, if the PF driver configures VLAN anti-spoof, then the VF driver needs to make sure no untagged traffic is sent if a VLAN is configured, and vice-versa. This can be an issue when a VLAN is being registered or unregistered, e.g. when a packet may be on the ring with a VLAN in it, but the VLANs are being unregistered. This can cause that tagged packet to go out and cause an MDD event. To fix this, include a new interface-dependent function that drivers can implement named IFDI_NEEDS_RESTART(). Right now, this function is called in iflib_vlan_unregister/register() to determine whether the interface needs to be stopped and started when a VLAN is registered or unregistered. The default return value of IFDI_NEEDS_RESTART() is true, so this fixes the MDD problem that iavf(4) encounters, since the interface rings are flushed during a stop/init. A future change to iavf(4) will implement that function just in case the default value changes, and to make it explicit that this interface reset is required when a VLAN is added or removed. Reviewed by: gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D22086	2020-04-27 22:02:44 +00:00
Alexander V. Chernikov	55f57ca9ac	Convert debugnet to the new routing KPI. Introduce new fib[46]_lookup_debugnet() functions serving as a special interface for the crash-time operations. Underlying implementation will try to return lookup result if datastructures are not corrupted, avoding locking. Convert debugnet to use fib4_lookup_debugnet() and switch it to use nexthops instead of rtentries. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D24555	2020-04-26 18:42:38 +00:00
Kristof Provost	fffd27e5f3	bridge: epoch-ification Run the bridge datapath under epoch, rather than under the BRIDGE_LOCK(). We still take the BRIDGE_LOCK() whenever we insert or delete items in the relevant lists, but we use epoch callbacks to free items so that it's safe to iterate the lists without the BRIDGE_LOCK. Tests on mercat5/6 shows this increases bridge throughput significantly, from 3.7Mpps to 18.6Mpps. Reviewed by: emaste, philip, melifaro MFC after: 2 months Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24250	2020-04-26 16:22:35 +00:00
Alexander V. Chernikov	57e70e4471	Fix userland build broken by r360292.	2020-04-25 09:25:06 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is build on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with outher parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nextop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remains the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Alexander V. Chernikov	aaad3c4fca	Convert rtentry field accesses into nhop field accesses. One of the goals of the new routing KPI defined in r359823 is to entirely hide`struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This commit is mostly mechanical change to eliminate direct struct rtentry field accesses. The only notable difference is AF_LINK gateway encoding. AF_LINK gw is used in routing stack for operations with interface routes and host loopback routes. In the former case it indicates _some_ non-NULL gateway, as the interface is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting. In the latter case the interface index inside gateway was used by the IPv6 datapath to verify address scope for link-local interfaces. Kernel uses struct sockaddr_dl for this type of gateway. This structure allows for specifying rich interface data, such as mac address and interface name. However, this results in relatively large structure size - 52 bytes. Routing stack fils in only 2 fields - sdl_index and sdl_type, which reside in the first 8 bytes of the structure. In the new KPI, struct nhop_object tries to be cache-efficient, hence embodies gateway address inside the structure. In the AF_LINK case it stores stortened version of the structure - struct sockaddr_dl_short, which occupies 16 bytes. After D24340 changes, the data inside AF_LINK gateway will not be used in the kernel at all, leaving rtsock as the only potential concern. The difference in rtsock reporting: (old) got message of size 240 on Thu Apr 16 03:12:13 2020 RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 (new) got message of size 200 on Sun Apr 19 09:46:32 2020 RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 Note 40 bytes different (52-16 + alignment). However, gateway is still a valid AF_LINK gateway with proper data filled in. It is worth noting that these particular messages (interface routes) are mostly ignored by routing daemons: * bird/quagga/frr uses RTM_NEWADDR and ignores prefix route addition messages. * quagga/frr ignores routes without gateway More detailed overview on how rtsock messages are used by the routing daemons to reconstruct the kernel view, can be found in D22974. Differential Revision: https://reviews.freebsd.org/D24519	2020-04-23 08:04:20 +00:00
Kristof Provost	fac24ad7f0	bridge: Simplify mac address generation Unconditionally use ether_gen_addr() to generate bridge mac addresses. This function is now less likely to generate duplicate mac addresses across jails. The old hand rolled hostid based code adds no value. Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D24432	2020-04-18 08:00:58 +00:00
Kristof Provost	3f8bc99c4c	ethersubr: Make the mac address generation more robust If we create two (vnet) jails and create a bridge interface in each we end up with the same mac address on both bridge interfaces. These very often conflicts, resulting in same mac address in both jails. Mitigate this problem by including the jail name in the mac address. Reviewed by: kevans, melifaro MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D24383	2020-04-18 07:50:30 +00:00
Alexander V. Chernikov	ae4b62595e	Unbreak build by reverting if_bridge part of r360047. Pointy hat to: melifaro	2020-04-17 18:22:37 +00:00
Alexander V. Chernikov	6745294280	Finish r191148: replace rtentry with route in if_bridge if_output() callback. Generic if_output() callback signature was modified to use struct route instead of struct rtentry in r191148, back in 2009. Quoting commit message: Change if_output to take a struct route as its fourth argument in order to allow passing a cached struct llentry * down to L2 Fix bridge_output() to match this signature and update the remaining comment in if_var.h. Reviewed by: kp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24394	2020-04-17 17:05:58 +00:00
Adrian Chadd	2538a4b1b4	Remove an duplicate definition of nhops_dump_sysctl() One of the source files included both nhop.h and shared.h, leading to this clashing. Tested with: mips-gcc cross toolchain	2020-04-16 23:28:47 +00:00
Alexander V. Chernikov	5cdc484410	Fix userland build broken by r360014.	2020-04-16 17:53:23 +00:00
Alexander V. Chernikov	539642a29d	Add nhop parameter to rti_filter callback. One of the goals of the new routing KPI defined in r359823 is to entirely hide`struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. Additionally, with the followup multipath changes, single rtentry can point to multiple nexthops. With that in mind, convert rti_filter callback used when traversing the routing table to accept pair (rt, nhop) instead of nexthop. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24440	2020-04-16 17:20:18 +00:00
Alexander V. Chernikov	dd4776f0cc	Reorganise nd6 notification code to avoid direct rtentry field access. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. Doing so will allow to improve routing subsystem internals and deliver features more easily. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification code to avoid accessing most of the rtentry fields. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24404	2020-04-14 22:48:33 +00:00
Alexander V. Chernikov	1968b7eb21	Postpone multipath seed init till SI_SUB_LAST, as it is needed only after some useland program installs multiple paths to the same destination. While here, make multipath init conditional. Discussed with: cem,ian	2020-04-14 07:38:34 +00:00
Andrew Gallatin	bd673b9942	lagg: stop double-counting output errors and counting drops as errors Before this change, lagg double-counted errors from lagg members, and counted every drop by a lagg member as an error. Eg, if lagg sent a packet, and the underlying hardware driver dropped it, a counter would be incremented by both lagg and the underlying driver. This change attempts to fix that by incrementing lagg's counters only for errors that do not come from underlying drivers. Reviewed by: hselasky, jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24331	2020-04-13 23:06:56 +00:00
Alexander V. Chernikov	a666325282	Introduce nexthop objects and new routing KPI. This is the foundational change for the routing subsytem rearchitecture. More details and goals are available in https://reviews.freebsd.org/D24141 . This patch introduces concept of nexthop objects and new nexthop-based routing KPI. Nexthops are objects, containing all necessary information for performing the packet output decision. Output interface, mtu, flags, gw address goes there. For most of the cases, these objects will serve the same role as the struct rtentry is currently serving. Typically there will be low tens of such objects for the router even with multiple BGP full-views, as these objects will be shared between routing entries. This allows to store more information in the nexthop. New KPI: struct nhop_object fib4_lookup(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, uint32_t flowid); struct nhop_object fib6_lookup(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, uint32_t flowid); These 2 function are intended to replace all all flavours of <in_\|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous fib[46]-generation functions. Upon successful lookup, they return nexthop object which is guaranteed to exist within current NET_EPOCH. If longer lifetime is desired, one can specify NHR_REF as a flag and get a referenced version of the nexthop. Reference semantic closely resembles rtentry one, allowing sed-style conversion. Additionally, another 2 functions are introduced to support uRPF functionality inside variety of our firewalls. Their primary goal is to hide the multipath implementation details inside the routing subsystem, greatly simplifying firewalls implementation: int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); All functions have a separate scopeid argument, paving way to eliminating IPv6 scope embedding and allowing to support IPv4 link-locals in the future. Structure changes: * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size. * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz. Old KPI: During the transition state old and new KPI will coexists. As there are another 4-5 decent-sized conversion patches, it will probably take a couple of weeks. To support both KPIs, fields not required by the new KPI (most of rtentry) has to be kept, resulting in the temporary size increase. Once conversion is finished, rtentry will notably shrink. More details: * architectural overview: https://reviews.freebsd.org/D24141 * list of the next changes: https://reviews.freebsd.org/D24232 Reviewed by: ae,glebius(initial version) Differential Revision: https://reviews.freebsd.org/D24232	2020-04-12 14:30:00 +00:00
Alexander V. Chernikov	62d95afacb	Fix build by adding forgotten header to radix_mpath.c after r359797.	2020-04-11 09:38:45 +00:00
Alexander V. Chernikov	4684d3cbcb	Remove per-AF radix_mpath initializtion functions. Split their functionality by moving random seed allocation to SYSINIT and calling (new) generic multipath function from standard IPv4/IPv5 RIB init handlers. Differential Revision: https://reviews.freebsd.org/D24356	2020-04-11 07:37:08 +00:00
Alexander V. Chernikov	aef2d5fb0e	Split rtrequest1_fib() into smaller manageable chunks. No functional changes. * Move route addition / route deletion code from rtrequest1_fib() to add_route() and del_route() respectively. * Rename rtrequest1_fib_change() to change_route() for consistency. * Shrink the scope of ugly info #defines. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24349	2020-04-10 16:27:27 +00:00
Kristof Provost	dd00a42a6b	bridge: Change lists to CK_LIST as a peparation for epochification Prepare the ground for a rework of the bridge locking approach. We will use an epoch-based approach in the datapath and making it safe to iterate over the interface, span and rtnode lists without holding the BRIDGE_LOCK. Replace the relevant lists by their ConcurrencyKit equivalents. No functional change in this commit. Reviewed by: emaste, ae, philip (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24249	2020-04-05 17:15:20 +00:00
Ed Maste	aeb665b538	remove extraneous double ;s in sys/	2020-03-30 16:04:25 +00:00
Mark Johnston	59d50fe5ef	Simplify taskqgroup inititialization. taskqgroup initialization was broken into two steps: 1. allocate the taskqgroup structure, at SI_SUB_TASKQ; 2. initialize taskqueues, start taskqueue threads, enqueue "binder" tasks to bind threads to specific CPUs, at SI_SUB_SMP. Step 2 tries to handle the case where tasks have already been attached to a queue, by migrating them to their intended queue. In particular, tasks can't be enqueued before step 2 has completed. This breaks NFS mountroot on systems using an iflib-based driver when EARLY_AP_STARTUP is not defined, since mountroot happens before SI_SUB_SMP in this case. Simplify initialization: do all initialization except for CPU binding at SI_SUB_TASKQ. This means that until CPU binding is completed, group tasks may be executed on a CPU other than that to which they were bound, but this should not be a problem for existing users of the taskqgroup KPIs. Reported by: sbruno Tested by: bdragon, sbruno MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24188	2020-03-30 14:22:52 +00:00
Conrad Meyer	704101dd3d	Fix PNP matching for iflib NIC drivers The previous descriptor string specified that all fields were significant for match. However, the only significant fields for in-tree drivers are vendor:devid, and the fictitious zero values constructed by PVID() did not match real subvendor, subdevice, revision, and/or class values, resulting in no automatic probe. If a future iflib driver needs to match on other criteria, the descriptor string can be updated accordingly. (E.g., "V32" and ~0 for unspecified values in PVID().) Reported by: mav Sponsored by: Dell EMC Isilon	2020-03-24 19:20:10 +00:00
Ed Maste	ed6611cc8c	iflib: simplify MPASS assertion Submitted by: andrew	2020-03-24 17:54:34 +00:00
Ed Maste	68af0153a7	iflib: split compound assertion ThunderX cluster systems are panicking on boot with a failed assertion MPASS(gtask != NULL && gtask->gt_taskqueue != NULL). Split the assertion so that it's clear which part is failing.	2020-03-24 17:25:56 +00:00
Patrick Kelsey	876996910a	Remove extraneous code from iflib ifsd_cidx is never used, and the line removed from rxd_frag_to_sd() is just dead code. Reviewed by: erj, gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23951	2020-03-14 20:13:42 +00:00
Patrick Kelsey	3caff1885f	Remove refill budget from iflib Reviewed by: gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23948	2020-03-14 19:58:50 +00:00
Patrick Kelsey	b38136097a	Allow iflib drivers to specify the buffer size used for each receive queue Reviewed by: erj, gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23947	2020-03-14 19:56:46 +00:00
Patrick Kelsey	e503049067	Remove freelist contiguous-indexes assertion from rxd_frag_to_sd() The vmx driver is an example of an iflib driver that might report packets using non-contiguous descriptors (with unused descriptors either between received packets or between the fragments of a received packet), so this assertion needs to be removed. For such drivers, the freelist producer and consumer indexes don't relate directly to driver ring slots (the driver deals directly with freelist buffer indexes supplied by iflib during refill, and reports them with each fragment during packet reception), but do continue to be used by iflib for accounting, such as determining the number of ring slots that are refillable. PR: 243126, 243392, 240628 Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer Reviewed by: gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23946	2020-03-14 19:55:05 +00:00
Patrick Kelsey	4f2beb721b	Fix iflib zero-length fragment handling The dmamap for zero-length fragments should not be unloaded, as doing so breaks the the cluster-reuse logic in _iflib_fl_refill(). All zero-length fragments are now handled by the assemble_segments() path so that the cluster-reuse logic there does not have to be replicated in the small-single-fragment-packet path of iflib_rxd_pkt_get(). Packets consisting entirely of zero-length fragments (which result in a NULL mbuf pointer) are now properly tolerated. This allows drivers (such as the vmx driver) to pass such packets to iflib when a descriptor error occurs during packet reception, the advantage being that the refill of descriptors associated with the error packet are handled via the existing iflib machinery without having to duplicate parts of that machinery in the driver to handle that error case. Reviewed by: avg, erj, gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23945	2020-03-14 19:51:55 +00:00
Patrick Kelsey	9e9b738ac5	Fix iflib freelist state corruption This fixes a bug in iflib freelist management that breaks the required correspondence between freelist indexes and driver ring slots. PR: 243126, 243392, 240628 Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer Reviewed by: avg, gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23943	2020-03-14 19:43:44 +00:00
Andrew Gallatin	98085bae8c	make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refectors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix	2020-03-09 13:44:51 +00:00
Bjoern A. Zeeb	3818c25a1d	Implement optional table entry limits for if_llatbl. Implement counting of table entries linked on a per-table base with an optional (if set > 0) limit of the maximum number of table entries. For that the public lltable_link_entry() and lltable_unlink_entry() functions as well as the internal function pointers change from void to having an int return type. Given no consumer currently sets the new llt_maxentries this can be committed on its own. The moment we make use of the table limits, the callers of the link function must check the return value as it can change and entries might not be added. Adjustments for IPv6 (and possibly IPv4) will follow. Sponsored by: Netflix (originally) Reviewed by: melifaro MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22713	2020-03-04 17:17:02 +00:00
Brooks Davis	8ad798ae9a	Expose ifr_buffer_get_(buffer\|length) outside if.c. This is a preparatory commit for D23933. Reviewed by: jhb	2020-03-03 18:05:11 +00:00
Alexander V. Chernikov	ea2773323c	Fix dynamic redrects by adding forgotten RTF_HOST flag. Improve tests to verify the generated route flags. Reported by: jtl MFC after: 2 weeks	2020-03-03 15:33:43 +00:00
Kyle Evans	072afcdffc	if_edsc: generate an arbitrary MAC address When generating an cloned interface instance in edsc_clone_create(), generate a MAC address from the FF OUI with ether_gen_addr(). This allows us to have unique local-link addresses. Previously, the MAC address was zero. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D23842	2020-03-02 02:45:57 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Randall Stewart	d7313dc6f5	This commit expands tcp_ratelimit to be able to handle cards like the mlx-c5 and c6 that require a "setup" routine before the tcp_ratelimit code can declare and use a rate. I add the setup routine to if_var as well as fix tcp_ratelimit to call it. I also revisit the rates so that in the case of a mlx card of type c5/6 we will use about 100 rates concentrated in the range where the most gain can be had (1-200Mbps). Note that I have tested these on a c5 and they work and perform well. In fact in an unloaded system they pace right to the correct rate (great job mlx!). There will be a further commit here from Hans that will add the respective changes to the mlx driver to support this work (which I was testing with). Sponsored by: Netflix Inc. Differential Revision: ttps://reviews.freebsd.org/D23647	2020-02-26 13:48:33 +00:00
Kristof Provost	33b1fe11a2	bridge: Move locking defines into if_bridge.c The locking defines for if_bridge used to live in if_bridgevar.h, but they're only ever used by the bridge implementation itself (in if_bridge.c). Moving them into the .c file. Reported by: philip, emaste Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23808	2020-02-26 08:47:18 +00:00
Gleb Smirnoff	e87c494015	Although most of the NIC drivers are epoch ready, due to peer pressure switch over to opt-in instead of opt-out for epoch. Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch when processing its packets. Now this will create recursive entrance in epoch in >90% network drivers, but will guarantee safeness of the transition. Mark several tested drivers as IFF_KNOWSEPOCH. Reviewed by: hselasky, jeff, bz, gallatin Differential Revision: https://reviews.freebsd.org/D23674	2020-02-24 21:07:30 +00:00
Bjoern A. Zeeb	10108cb673	Partially revert VNET change and expand VNET structure. Revert parts of r353274 replacing vnet_state with a shutdown flag. Not having the state flag for the current SI_SUB_* makes it harder to debug kernel or module panics related to VNET bringup or teardown. Not having the state also does not allow us to check for other dependency levels between components, e.g. for moving interfaces. Expand the VNET structure with the new boolean flag indicating that we are doing a shutdown of a given vnet and update the vnet magic cookie for the change. Update libkvm to compile with a bool in the kernel struct. Bump __FreeBSD_version for (external) module builds to more easily detect the change. Reviewed by: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23097	2020-02-17 11:08:50 +00:00
Hans Petter Selasky	bacb11c9ed	Fix kernel panic while trying to read multicast stream. When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set for all mbufs being input by the IGMP/MLD6 code. Else there will be a NULL-pointer dereference in the netisr code when trying to set the VNET based on the incoming mbuf. Add an assert to catch this when queueing mbufs on a netisr to make debugging of similar cases easier. Found by: Vladislav V. Prodan PR: 244002 Reviewed by: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies	2020-02-17 09:46:32 +00:00
Hans Petter Selasky	f98977b521	Use NET_TASK_INIT() and NET_GROUPTASK_INIT() for drivers that process incoming packets in taskqueue context. This patch extends r357772. Tested by: yp@mm.st Sponsored by: Mellanox Technologies	2020-02-12 09:19:47 +00:00
Hans Petter Selasky	fb1a29b45e	Make sure the so-called end of receive interrupts don't starve in iflib. When the receive ring cannot be filled with mbufs, due to lack of memory, no more interrupts may be generated to fill the receive ring later on. Make sure to have a watchdog, to try refilling the receive ring from time to time, hopefully when more mbufs are available. Differential Revision: https://reviews.freebsd.org/D23315 MFC after: 1 week Reviewed by: gallatin@ Sponsored by: Mellanox Technologies	2020-02-12 08:30:07 +00:00
Gleb Smirnoff	6c3e93cb5a	Use NET_TASK_INIT() and NET_GROUPTASK_INIT() for drivers that process incoming packets in taskqueue context. Reviewed by: hselasky Differential Revision: https://reviews.freebsd.org/D23518	2020-02-11 18:57:07 +00:00
Konstantin Belousov	5d1277ca9a	if_media.h: Add 50G KR4 ethernet media type. Submitted by: Adam Peace <adam.e.peace@gmail.com> Reviewed by: hselasky Differential revision: https://reviews.freebsd.org/D23620	2020-02-11 18:03:45 +00:00
Konstantin Belousov	48ad3b215c	if_media.c: staticize and constify ifmedia description structures used under IFMEDIA_DEBUG. The reason for this change is to make it clear the scope of the in-kernel usage of IFM_TYPE_DESCRIPTIONS and IFM_SUBTYPE_ETHERNET_DESCRIPTIONS macros. Also it is somewhat better C. Reviewed by: hselasky Sponsored by: Mellanox Technologies Differential revision: https://reviews.freebsd.org/D23620	2020-02-11 17:45:01 +00:00
Konstantin Belousov	a249895df8	if_media.c: use __FBSDID(). Reviewed by: hselasky Sponsored by: Mellanox Technologies Differential revision: https://reviews.freebsd.org/D23620	2020-02-11 17:41:45 +00:00
Pedro F. Giffuni	a1b769b32d	typo: stray spaces. No functional change	2020-02-07 15:16:04 +00:00
Jeff Roberson	cd0be8b2ed	Temporarily force IFF_NEEDSEPOCH until drivers have been resolved. Recent network epoch changes have left some drivers unexpectedly broken and there is not yet a consensus on the correct fix. This is patch is a minor performance impact until we can agree on the correct path forward. Reviewed by: core, network, imp, glebius, hselasky Differential Revision: https://reviews.freebsd.org/D23515	2020-02-06 20:47:50 +00:00
Pedro F. Giffuni	abfc5e8591	ethernet: Add a couple more Ethertypes. Powerlink and Sercos III are used in automation. Both have been standardized and In the case of Ethernet Powerlink there is a BSD-licensed stack.	2020-02-05 19:11:07 +00:00
Pedro F. Giffuni	2a1481fbbf	typo: Registration. Pointed by: Dikshie Fauzie	2020-02-03 02:02:13 +00:00
Pedro F. Giffuni	ad2b6d4e9b	ethernet: Minor cleanup. Consistently use uppercase for ethertype hex numbers.	2020-02-03 01:08:15 +00:00
Pedro F. Giffuni	b33c19776b	style(9): Fix spaces after #define. No functional change.	2020-02-02 19:02:07 +00:00
Pedro F. Giffuni	682397c263	ethernet: add some more Ethertypes. Sort ETHERTYPE_FCOE, from r357414.	2020-02-02 18:33:20 +00:00
Pedro F. Giffuni	badbcf06e0	ethernet: add some more Ethertypes. Add some types based on other BSDs and also add EtherCat and PROFINET, which are IEC standards. There is a public list (CSV format) at: https://standards.ieee.org/products-services/regauth/ MFC after: 2 weeks	2020-02-02 18:27:37 +00:00
Kristof Provost	eb03a44325	vlan: Fix panic when vnet jail with a vlan interface is destroyed During vnet cleanup vnet_if_uninit() checks that no more interfaces remain in the vnet. Any interface borrowed from another vnet is returned by vnet_if_return(). Other interfaces (i.e. cloned interfaces) should have been destroyed by their cloner at this point. The if_vlan VNET_SYSUNINIT had priority SI_ORDER_FIRST, which means it had equal priority as vnet_if_uninit(). In other words: it was possible for it to be called after vnet_if_uninit(), which would lead to assertion failures. Set the priority to SI_ORDER_ANY, like other cloners to ensure that vlan interfaces are destroyed before we enter vnet_if_uninit(). The sys/net/if_vlan test provoked this.	2020-01-31 22:54:44 +00:00
Hans Petter Selasky	977b947223	Revert r357293. The netisr uses rm_ locks not rms_ locks as noted by jeff@ . Sponsored by: Mellanox Technologies	2020-01-31 10:51:13 +00:00
Hans Petter Selasky	780c568fec	Widen EPOCH(9) usage in netisr. Software interrupt handlers are allowed to sleep. In swi_net() there is a read lock behind NETISR_RLOCK() which in turn ends up calling msleep() which means the whole of swi_net() cannot be protected by an EPOCH(9) section. By default the NETISR_LOCKING feature is disabled. This issue was introduced by r357004. This is a preparation step for replacing the functionality provided by r357004. Found by: kib@ Sponsored by: Mellanox Technologies	2020-01-30 12:04:02 +00:00
Alexander V. Chernikov	4be465ab46	Plug parent iface refcount leak on <ifname>.X vlan creation. PR: kern/242270 Submitted by: Andrew Boyer <aboyer at pensando.io> MFC after: 2 weeks	2020-01-29 18:41:35 +00:00
Kristof Provost	b02fd8b790	epair: Do not abuse params to register the second interface if_epair used the 'params' argument to pass a pointer to the b interface through if_clone_create(). This pointer can be controlled by userspace, which means it could be abused to trigger a panic. While this requires PRIV_NET_IFCREATE privileges those are assigned to vnet jails, which means that vnet jails could panic the system. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> MFC after: 3 days	2020-01-28 22:44:24 +00:00
Gleb Smirnoff	e87ad0ab37	Since now drivers that support pfil run their interrupts in the network epoch, stop entering it in pfil_run_hooks(). Assert the epoch there.	2020-01-23 01:49:22 +00:00
Gleb Smirnoff	5b64c645d7	Stop entering the network epoch in ether_input(), unless driver is marked with IFF_NEEDSEPOCH.	2020-01-23 01:47:43 +00:00
Gleb Smirnoff	0921628ddc	Introduce flag IFF_NEEDSEPOCH that marks Ethernet interfaces that supposedly may call into ether_input() without network epoch. They all need to be reviewed before 13.0-RELEASE. Some may need be fixed. The flag is not planned to be used in the kernel for a long time.	2020-01-23 01:41:09 +00:00
Gleb Smirnoff	af614b8e04	tap(4) calls ether_input() in context of write(2). Enter network epoch here. The tun(4) side doesn't need this, as netisr code will take care.	2020-01-23 01:38:51 +00:00
Gleb Smirnoff	0b8df657a4	Enter network epoch in iflib rxeof task. In upcoming changes ether_input() is going to be changed not to enter the network epoch. It is going to be responsibility of network interrupt. In case of iflib - its taskqueue.	2020-01-23 01:27:58 +00:00
Gleb Smirnoff	6ed3e18711	Mark swi_net() as INTR_TYPE_NET and stop entering epoch there.	2020-01-23 01:25:32 +00:00
Alexander Motin	84becee1ac	Update route MTUs for bridge, lagg and vlan interfaces. Those interfaces may implicitly change their MTU on addition of parent interface in addition to normal SIOCSIFMTU ioctl path, where the route MTUs are updated normally. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-01-22 20:36:45 +00:00
Alexander V. Chernikov	34a5582c47	Bring back redirect route expiration. Redirect (and temporal) route expiration was broken a while ago. This change brings route expiration back, with unified IPv4/IPv6 handling code. It introduces net.inet.icmp.redirtimeout sysctl, allowing to set an expiration time for redirected routes. It defaults to 10 minutes, analogues with net.inet6.icmp6.redirtimeout. Implementation uses separate file, route_temporal.c, as route.c is already bloated with tons of different functions. Internally, expiration is implemented as an per-rnh callout scheduled when route with non-zero rt_expire time is added or rt_expire is changed. It does not add any overhead when no temporal routes are present. Callout traverses entire routing tree under wlock, scheduling expired routes for deletion and calculating the next time it needs to be run. The rationale for such implemention is the following: typically workloads requiring large amount of routes have redirects turned off already, while the systems with small amount of routes will not inhibit large overhead during tree traversal. This changes also fixes netstat -rn display of route expiration time, which has been broken since the conversion from kread() to sysctl. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23075	2020-01-22 13:53:18 +00:00
Alexander V. Chernikov	16c2f24169	Document requirements for the 'struct route' variations. MFC after: 2 weeks	2020-01-21 12:00:34 +00:00
Eugene Grosbein	2888eb4091	ifa_maintain_loopback_route: adjust debugging output Correction after r333476: - write this as LOG_DEBUG again instead of LOG_INFO; - get back function name into the message; - error may be ESRCH if an address is removed in process (by carp f.e.), not only ENOENT; - expression complexity grows, so try making it more readable. MFC after: 1 week	2020-01-18 04:48:05 +00:00
Gleb Smirnoff	66c6c556b6	Change argument order of epoch_call() to more natural, first function, then its argument. Reviewed by: imp, cem, jhb	2020-01-17 06:10:24 +00:00
Gleb Smirnoff	9758a507e9	gif_transmit() must always be called in the network epoch.	2020-01-15 06:18:32 +00:00
Gleb Smirnoff	2a4bd982d0	Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC	2020-01-15 06:05:20 +00:00
Gleb Smirnoff	97168be809	Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC	2020-01-15 05:45:27 +00:00
Gleb Smirnoff	3264dcadc9	- Move global network epoch definition to epoch.h, as more different subsystems tend to need to know about it, and including if_var.h is huge header pollution for them. Polluting possible non-network users with single symbol seems much lesser evil. - Remove non-preemptible network epoch. Not used yet, and unlikely to get used in close future.	2020-01-15 03:34:21 +00:00
Vincenzo Maffione	2ec213aba4	netmap: disable passthrough with no hypervisor support The netmap passthrough subsystem requires proper support in the hypervisor. In particular, two PCI device ids (from the Red Hat PCI vendor id 0x1b36) need to be assigned to the two netmap virtual devices. We then disable these devices until the ids have not been assigned, in order to avoid conflicts with other virtual devices emulated by upstream QEMU. PR: 241774 MFC after: 3 days	2020-01-13 21:47:23 +00:00
Alexander V. Chernikov	ead85fe415	Add fibnum, family and vnet pointer to each rib head. Having metadata such as fibnum or vnet in the struct rib_head is handy as it eases building functionality in the routing space. This change is required to properly bring back route redirect support. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23047	2020-01-09 17:21:00 +00:00
Mark Johnston	c23df8eafa	lagg: Further cleanup of the rr_limit option. Add an option flag so that arbitrary updates to a lagg's configuration do not clear sc_stride. Preseve compatibility for old ifconfig binaries. Update ifconfig to use the new flag and improve the casting used when parsing the option parameter. Modify the RR transmit function to avoid locklessly reading sc_stride twice. Ensure that sc_stride is always 1 or greater. Reviewed by: hselasky MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23092	2020-01-09 14:58:41 +00:00
Kyle Evans	c7bab2a7ca	if_vmove: return proper error status if_vmove can fail if it lost a race and the vnet's already been moved. The callers (and their callers) can generally cope with this, but right now success is assumed. Plumb out the ENOENT from if_detach_internal if it happens so that the error's properly reported to userland. Reviewed by: bz, kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22780	2020-01-09 03:52:50 +00:00
Alexander V. Chernikov	e02d3fe70c	Fix rtsock route message generation for interface addresses. Reviewed by: olivier MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D22974	2020-01-07 21:16:30 +00:00
Eric Joyner	f6afed726b	iflib: Prevent watchdog from resetting idle queues While changing link state in iflib_link_state_change(), queues are marked as IFLIB_QUEUE_IDLE to disable watchdog. Currently, iflib_timer() watchdog does not check for previous queue status before marking it as IFLIB_QUEUE_HUNG. This patch adds check of queue status before marking it as hung. Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com> PR: 239240 Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com> Reported by: ultima@ Reviewed by: gallatin@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21712	2020-01-02 23:35:06 +00:00
Alexander V. Chernikov	5fcb2832e3	Plug loopback idaddr refcount leak. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22980	2020-01-02 09:08:45 +00:00
Gleb Smirnoff	8d5c56dab1	In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCESS if a firewall forbids sending. Noticed by: ae	2020-01-01 17:31:43 +00:00
Alexander V. Chernikov	d930203192	Fix NOINET6 build broken by r356236. MFC after: 2 weeks	2019-12-31 17:57:12 +00:00
Alexander V. Chernikov	c83dda362e	Split gigantic rtsock route_output() into smaller functions. Amount of changes to the original code has been intentionally minimised to ease diffing. The changes are mostly mechanical, with the following exceptions: * lltable handler is now called directly based of RTF_LLINFO flag presense. * "report" logic for updating rtm in RTM_GET/RTM_DELETE has been simplified, fixing several potential use-after-free cases in rt_addrinfo. * llable asserts has been replaced with error-returning, preventing kernel crashes when lltable gw af family is invalid (root required). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22864	2019-12-31 17:26:53 +00:00
Mark Johnston	6b5d8e30f1	Plug some ifaddr refcount leaks. - Only take an ifaddr ref in in rt_exportinfo() if the caller explicitly requests it. Take care to release it in this case. - Don't unconditionally take a ref in rtrequest1_fib(). rt_getifa_fib() will acquire a reference, in which case we would previously acquire two references. - Stop taking a reference in rtinit1() before calling rtrequest1_fib(). rtrequest1_fib() will acquire a reference for the RTM_ADD case. PR: 242746 Reviewed by: melifaro (previous version) Tested by: ghuckriede@blackberry.com MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22912	2019-12-27 01:12:54 +00:00
Mark Johnston	c104c2990d	lagg: Clean up handling of the rr_limit option. - Don't allow an unprivileged user to set the stride. [1] - Only set the stride under the softc lock. - Rename the internal fields to accurately reflect their use. Keep ro_bkt to avoid changing the user API. - Simplify the implementation. The port index is just sc_seq / stride. - Document rr_limit in ifconfig.8. Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> [1] MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22857	2019-12-22 21:56:47 +00:00
Cy Schubert	57e22627f9	MFV r353141 (by phillip): Update libpcap from 1.9.0 to 1.9.1. MFC after: 2 weeks	2019-12-21 21:01:03 +00:00
Mark Johnston	3f197b134c	Deduplicate code between if_delgroup() and if_delgroups(). Fix some style in if_addgroup(). No functional change intended. Reviewed by: hselasky MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22892	2019-12-20 20:15:34 +00:00
Mark Johnston	718ef55ec7	Fix a memory leak in if_delgroups() introduced in r334118. PR: 242712 Submitted by: ghuckriede@blackberry.com MFC after: 3 days	2019-12-20 17:21:57 +00:00
Gleb Smirnoff	185c3d2b93	Convert routing statistics to VNET_PCPUSTAT. Submitted by: ocochard Reviewed by: melifaro, glebius Differential Revision: https://reviews.freebsd.org/D22834	2019-12-17 02:02:26 +00:00
John Baldwin	65d2f9c12b	Use a void * argument to callout handlers instead of timeout_t casts. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22684	2019-12-05 18:47:29 +00:00
Bjoern A. Zeeb	3232273f42	Allow kernel to compile without BPF. r297816 added some bpf magic for VIMAGE unconditionally which no longer allows kernels to compile without bpf (but with other networking). Add the missing ifdef checks and allow a kernel to compile without bpf again. PR: 242136 Reported by: dave mischler.com MFC after: 2 weeks	2019-11-24 23:21:47 +00:00
Conrad Meyer	7993a104a1	Add explicit SI_SUB_EPOCH Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP (EARLY_AP_STARTUP). Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH. epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND, but likely should allocate before SI_SUB_SMP. Prior to this change, consumers (well, epoch itself, and net/if.c) just open-coded the SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile. Reviewed by: mmacy Differential Revision: https://reviews.freebsd.org/D22503	2019-11-22 23:23:40 +00:00
Vincenzo Maffione	a56e6334d1	netmap: check if we already ran mmap before we attempt it Submitted by: neel@neelc.org Reviewed by: vmaffione MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22390	2019-11-19 21:29:49 +00:00
Bjoern A. Zeeb	d9a61c960c	if_llatbl: change htable_unlink_entry() to early exist if no work to do Adjust the logic in htable_unlink_entry() to the one in htable_link_entry() saving a block indent and making it more clear in which case we do not do any work. No functional change. MFC after: 3 weeks Sponsored by: Netflix	2019-11-15 23:12:19 +00:00
Bjoern A. Zeeb	9744f55686	if_llatbl: cleanup Remove function prototypes which are not needed (no use before function definition for these file static functions). MFC after: 3 weeks Sponsored by: Netflix	2019-11-15 11:00:03 +00:00
Gleb Smirnoff	9352fab6ab	In if_siocaddmulti() enter VNET. Reported & tested by: garga	2019-11-13 16:28:53 +00:00
Bjoern A. Zeeb	8b5f9bb755	lltabl: remove dead code Remove the long (8? years ago) #if 0 marked function lltable_drain() and while here also remove the unused function llentry_alloc() which has call paths tools keep finding and are never used. Sponsored by: Netflix	2019-11-13 11:21:02 +00:00
Gleb Smirnoff	8a9a28c455	sysctl_rtsock() has all necessary locking and doesn't need Giant to run. While here add description.	2019-11-07 19:06:18 +00:00
Andrey V. Elsukov	a961401ee0	Enqueue lladdr_task to update link level address of vlan, when its parent interface has changed. During vlan reconfiguration without destroying interface, it is possible, that parent interface will be changed. This usually means, that link layer address of vlan will be different. Therefore we need to update all associated with vlan's addresses permanent llentries - NDP for IPv6 addresses, and ARP for IPv4 addresses. This is done via lladdr_task execution. To avoid extra work, before execution do the check, that L2 address is different. No objection from: #network Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D22243	2019-11-07 15:00:37 +00:00
Eric Joyner	74954211d6	net: add ETHER_IS_ZERO macro similar to ETHER_IS_BROADCAST Some places in network code may need to verify that an ethernet address is not the 'zero' address. Provide a standard macro ETHER_IS_ZERO for this purpose, similar to the ETHER_IS_BROADCAST macro already available. This patch also removes previous ETHER_IS_ZERO definitions in several USB ethernet drivers, in favor of this centrally-located macro. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@ Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21240	2019-11-05 00:12:21 +00:00
Eric Joyner	db8e8f1ede	iflib: properly release memory allocated for DMA DMA memory allocations using the bus_dma.h interface are not properly released in all cases for both Tx and Rx. This causes ~448 bytes of M_DEVBUF allocations to be leaked. First, the DMA maps for Rx are not properly destroyed. A slight attempt is made in iflib_fl_bufs_free to destroy the maps if we're detaching. However, this function may not be reliably called during detach. Indeed, there is a comment "asking" if this should be moved out. Fix this by moving the bus_dmamap_destroy call into iflib_rx_sds_free, where we already sync and unload the DMA. Second, the DMA tag associated with the ifr_ifdi descriptor DMA is not released properly anywhere. Add a call to iflib_dma_free in iflib_rx_structures_free. Third, use of NULL as a canary value on the map pointer returned by bus_dmamap_create is not valid. On some platforms, notably x86, this value may be NULL. In this case, we fail to properly release the related resources. Remove the NULL checks on map values in both iflib_fl_bufs_free and iflib_txsd_destroy. With all of these fixes applied, the leaks to M_DEVBUF are squelched, and iflib drivers now seem to properly cleanup when detaching. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D22203	2019-11-04 23:06:57 +00:00
Vincenzo Maffione	9dd7582c12	netmap: fix build issue in netmap_user.h The issue was a comparison of integers of different signs on 32 bit architectures. Reported by: jenkins MFC after: 1 week	2019-10-31 22:16:20 +00:00
Vincenzo Maffione	c7c7805531	add valectl to the system commands The valectl(4) program is used to manage vale(4) switches. Add it to the system commands so that it can be used right away. This program was previously called vale-ctl, and stored in tools/tools/netmap Reviewed by: hrs, bcr, lwhsu, kevans MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22146	2019-10-31 21:01:34 +00:00
Eric Joyner	244e7cffa5	iflib: cleanup memory leaks on driver detach From Jake: The iflib stack failed to release all of the memory allocated under M_IFLIB during device detach. Specifically, the ifmp_ring, the ift_ifdi Tx DMA info, and the ifr_ifdi Rx DMA info were not being released. Release this memory so that iflib won't leak memory when a device detaches. Since we're freeing the ift_ifdi pointer during iflib_txq_destroy we need to call this only after iflib_dma_free in iflib_tx_structures_free. Additionally, also ensure that we destroy the callout mutex associated with each Tx queue when we free it. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D22157	2019-10-30 20:45:12 +00:00
Gleb Smirnoff	0839aa5c04	There is a long standing problem with multicast programming for NICs and IPv6. With IPv6 we may call if_addmulti() in context of processing of an incoming packet. Usually this is interrupt context. While most of the NIC drivers are able to reprogram multicast filters without sleeping, some of them can't. An example is e1000 family of drivers. With iflib conversion the problem was somewhat hidden. Iflib processes packets in private taskqueue, so going to sleep doesn't trigger an assertion. However, the sleep would block operation of the driver and following incoming packets would fill the ring and eventually would start being dropped. Enabling epoch for the full time of a packet processing again started to trigger assertions for e1000. Fix this problem once and for all using a general taskqueue to call if_ioctl() method in all cases when if_addmulti() is called in a non sleeping context. Note that nobody cares about returned value. Reviewed by: hselasky, kib Differential Revision: https://reviews.freebsd.org/D22154	2019-10-29 17:36:06 +00:00
Eric Joyner	1558015e3e	iflib: call ether_ifdetach and netmap_detach before stop From Jake: Calling ether_ifdetach after iflib_stop leads to a potential race where a stale ifp pointer can remain in the route entry list for IPv6 traffic. This will potentially cause a page fault or other system instability if the ifp pointer is accessed. Move both iflib_netmap_detach and ether_ifdetach to be called prior to iflib_stop. This avoids the race above, and helps ensure that other ifp references are removed before stopping the interface. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, jhb@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D22071	2019-10-23 23:20:49 +00:00
Conrad Meyer	65366903c3	Prevent a panic when a driver provides bogus debugnet parameters This is just a bandaid; we should fix the driver(s) too. Introduced in r353685. PR: 241403 X-MFC-With: r353685 Reported by: np and others	2019-10-23 16:48:22 +00:00
Kyle Evans	f7810883d4	tuntap(4): Fix NOINET build after r353741 Shuffle headers around to more appropriate #ifdef OPTION blocks (INET vs. INET6) -- double checked LINT-{NOINET,NOINET6,NOIP}, all seem good. Reported by: cem	2019-10-23 02:15:15 +00:00
Kyle Evans	200abb43c0	tuntap(4): properly declare if_tun and if_tap modules Simply adding MODULE_VERSION does not do the trick, because the modules haven't been declared. This should actually fix modfind/kldstat, which r351229 aimed and failed to do. This should make vm-bhyve do the right thing again when using the ports version, rather than the latest version not in ports. MFC after: 3 days	2019-10-22 00:18:16 +00:00
Gleb Smirnoff	19e09f447f	Remove obsoleted KPIs that were used to access interface address lists.	2019-10-21 18:17:03 +00:00
Kyle Evans	3d5013337a	tuntap(4): restrict scope of net.link.tap.user_open slightly net.link.tap.user_open has historically allowed non-root users to do devfs cloning and open /dev/tap* nodes based on permissions. Loosen this up to make it only allow users to do devfs cloning -- we no longer check it in tunopen. This allows tap devices to be created that can actually be opened by a user, rather than swiftly restricting them to root because the magic sysctl has not been set. The sysctl has not yet been completely deprecated, because more thought is needed for how to handle the devfs cloning case. There is not an easy suitable replacement for the sysctl there, and more care needs to be placed in determining whether that's OK or not. PR: 200185	2019-10-21 14:38:11 +00:00
Kyle Evans	6025077704	tuntap(4): use cdevpriv w/ dtor for last close instead of d_close cdevpriv dtors will be called when the reference count on the associated struct file drops to 0, while d_close can be unreliable for cleaning up state at "last close" for a number of reasons. As far as tunclose/tundtor is concerned the difference is minimal, so make the switch.	2019-10-20 22:55:47 +00:00
Kyle Evans	6869d530c7	tuntap(4): Use make_dev_s to avoid si_drv1 race This allows us to avoid some dance in tunopen for dealing with the possibility of dev->si_drv1 being NULL as it's set prior to the devfs node being created in all cases. There's still the possibility that the tun device hasn't been fully initialized, since that's done after the devfs node was created. Alleviate this by returning ENXIO if we're not to that point of tuncreate yet. This work is what sparked r353128, full initialization of cloned devices w/ specified make_dev_args.	2019-10-20 22:39:40 +00:00
Kyle Evans	486c0b2269	tuntap(4): break out after setting TUN_DSTADDR This is now the only flag we set in this loop, terminate early.	2019-10-20 21:06:25 +00:00
Kyle Evans	6041d76e0c	tuntap(4): Drop TUN_IASET This flag appears to have been effectively unused since introduction to if_tun(4) -- drop it now.	2019-10-20 21:03:48 +00:00
Michael Tuexen	4ad8cb6813	Fix compile issues when building a kernel without the VIMAGE option. Thanks to cem@ for discussing the issue which resulted in this patch. Reviewed by: cem@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22089	2019-10-19 20:48:53 +00:00
Conrad Meyer	0634308df2	Fix debugnet(4) link/build fallout on some configurations Introduced in r353685 (sys/conf/files), r353694 (debugnet.c db_printf). Submitted by: kevans Reported by: cy X-MFC-With: r353685, r353694	2019-10-18 22:03:36 +00:00
Vincenzo Maffione	f8bc74e2f4	tap: add support for virtio-net offloads This patch is part of an effort to make bhyve networking (in particular TCP) faster. The key strategy to enhance TCP throughput is to let the whole packet datapath work with TSO/LRO packets (up to 64KB each), so that the per-packet overhead is amortized over a large number of bytes. This capability is supported in the guest by means of the vtnet(4) driver, which is able to handle TSO/LRO packets leveraging the virtio-net header (see struct virtio_net_hdr and struct virtio_net_hdr_mrg_rxbuf). A bhyve VM exchanges packets with the host through a network backend, which can be vale(4) or if_tap(4). While vale(4) supports TSO/LRO packets, if_tap(4) does not. This patch extends if_tap(4) with the ability to understand the virtio-net header, so that a tapX interface can process TSO/LRO packets. A couple of ioctl commands have been added to configure and probe the virtio-net header. Once the virtio-net header is set, the tapX interface acquires all the IFCAP capabilities necessary for TSO/LRO. Reviewed by: kevans Differential Revision: https://reviews.freebsd.org/D21263	2019-10-18 21:53:27 +00:00
Gleb Smirnoff	fda454099b	Make rt_getifa_fib() static.	2019-10-18 15:20:24 +00:00
Conrad Meyer	dda17b3672	Implement NetGDB(4) NetGDB(4) is a component of a system using a panic-time network stack to remotely debug crashed FreeBSD kernels over the network, instead of traditional serial interfaces. There are three pieces in the complete NetGDB system. First, a dedicated proxy server must be running to accept connections from both NetGDB and gdb(1), and pass bidirectional traffic between the two protocols. Second, the NetGDB client is activated much like ordinary 'gdb' and similarly to 'netdump' in ddb(4) after a panic. Like other debugnet(4) clients (netdump(4)), the network interface on the route to the proxy server must be online and support debugnet(4). Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any other TCP remote) to connect to the proxy server. The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and uses a 1:1 relationship between GDB packets and sequences of debugnet packets (fragmented by MTU). There is no encryption utilized to keep debugging sessions private, so this is only appropriate for local segments or trusted networks. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed some with: emaste, markj Relnotes: sure Differential Revision: https://reviews.freebsd.org/D21568	2019-10-17 21:33:01 +00:00
Conrad Meyer	e9c6962599	debugnet(4): Add optional full-duplex mode It remains unattached to any client protocol. Netdump is unaffected (remaining half-duplex). The intended consumer is NetGDB. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed with: markj Differential Revision: https://reviews.freebsd.org/D21541	2019-10-17 20:25:15 +00:00
Gleb Smirnoff	b2807792f1	Revert two parts of r353292 that enter epoch when processing vlan capabilities. It could be that entering epoch isn't necessary here, but better take a conservative approach. Submitted by: kp	2019-10-17 20:18:07 +00:00
Conrad Meyer	fde2cf65ce	debugnet(4): Infer non-server connection parameters Loosen requirements for connecting to debugnet-type servers. Only require a destination address; the rest can theoretically be inferred from the routing table. Relax corresponding constraints in netdump(4) and move ifp validation to debugnet connection time. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21482	2019-10-17 20:10:32 +00:00
Conrad Meyer	8270d35eca	Add ddb(4) 'netdump' command to netdump a core without preconfiguration Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine to the generic debugnet code. The imagined use is both netdump, shown here, and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions, to parse out IPv4 addresses. 'Netdump' uses the generic debugnet routine to load a configuration and start a dump, without any netdump configuration prior to panic. Loosely derived from work by: John Reimer <john.reimer AT emc.com> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21460	2019-10-17 19:49:20 +00:00
Conrad Meyer	6d567ec2da	debugnet: Respond to broadcast ARP requests The in-tree netdump code has always ignored non-directed ARP requests, and that seems to work most of the time for netdump. In my work and testing on NetGDB, it seems like sometimes the remote FreeBSD conversant (the non-panic system) will send broadcast-destination ARP requests to the debugnet kernel; without this change, those are dropped and the remote will see EHOSTDOWN "Host is down" errors from the userspace interface of the network stack. Discussed with: markj	2019-10-17 17:48:32 +00:00
Conrad Meyer	d39756c142	debugnet(4): Check hardware-validated UDP checksums Similar to INET checksums, lazily validate UDP checksums when the driver has already performed the check for us. Like debugnet(4) INET checksums, validation in software is left as future work. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21745	2019-10-17 17:19:16 +00:00
Conrad Meyer	7790c8c199	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
Philip Paeps	579b70db89	ether: add older ethertype definitions for QinQ Older network equipment used the ethertypes 0x9100, 0x9200, and 0x9300 for outer VLANs, before standardisation introduced 0x88a8. Submitted by: Lutz Donnerhacke <lutz_donnerhacke.de> Differential Revision: https://reviews.freebsd.org/D21846	2019-10-17 00:34:53 +00:00
Gleb Smirnoff	b46d70fd88	do_link_state_change() is executed in taskqueue context and in general is allowed to sleep. Don't enter the epoch for the whole duration. If some event handlers need the epoch, they should handle that theirselves. Discussed with: hselasky	2019-10-16 16:32:58 +00:00
Hans Petter Selasky	270b83b9d1	The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the same after the network stack was epochified. Merge the two into one function and cleanup all uses of ifnet_byindex_locked(). While at it: - Add branch prediction macros. - Make sure the ifnet pointer is only deferred once, also when code optimisation is disabled. Sponsored by: Mellanox Technologies	2019-10-15 12:08:09 +00:00
Hans Petter Selasky	93cfeb0ed9	Exclude the network link eventhandler from epochification after r353292. This fixes the following assert when "options RATELIMIT" is used: panic() malloc() sysctl_add_oid() tcp_rl_ifnet_link() do_link_state_change() taskqueue_run_locked() Sponsored by: Mellanox Technologies	2019-10-15 11:20:16 +00:00
Gleb Smirnoff	416a1d1e70	if_delmulti() is never called without ifp argument, assert this instead of doing a useless search through interfaces.	2019-10-14 21:18:37 +00:00
Michael Tuexen	69104ebe0f	Add missing include which breaks builds without VIMAGE. The bug was introduced by me in r353480. Reported by: Michael Butler MFC after: 3 days	2019-10-13 19:58:37 +00:00
Michael Tuexen	d6e23cf0cf	Use an event handler to notify the SCTP about IP address changes instead of calling an SCTP specific function from the IP code. This is a requirement of supporting SCTP as a kernel loadable module. This patch was developed by markj@, I tweaked a bit the SCTP related code. Submitted by: markj@ MFC after: 3 days	2019-10-13 18:17:08 +00:00
Gleb Smirnoff	6dcec895d9	vlan_config() isn't always called in epoch context. Reported by: kp	2019-10-13 15:15:09 +00:00
Gleb Smirnoff	45c1d51c39	Don't use if_maddr_rlock() in sppp(4), use epoch(9) directly instead.	2019-10-10 23:54:37 +00:00
Gleb Smirnoff	73c96bbeac	Don't use if_maddr_rlock() in tuntap(4), use epoch(9) directly instead.	2019-10-10 23:51:14 +00:00
Gleb Smirnoff	4b24e5b1ef	Interface output method must be executed in network epoch, so if_addr_rlock() isn't needed here.	2019-10-10 23:50:32 +00:00
Gleb Smirnoff	fb3fc771f6	Add two extra functions that basically give count of addresses on interface. Such function could been implemented on top of the if_foreach_llm?addr(), but several drivers need counting, so avoid copy-n-paste inside the drivers.	2019-10-10 23:44:56 +00:00
Gleb Smirnoff	826857c833	Provide new KPI for network drivers to access lists of interface addresses. The KPI doesn't reveal neither how addresses are stored, how the access to them is synchronized, neither reveal struct ifaddr and struct ifmaddr. Reviewed by: gallatin, erj, hselasky, philip, stevek Differential Revision: https://reviews.freebsd.org/D21943	2019-10-10 23:42:55 +00:00
Gleb Smirnoff	caeeeaa7c5	ifnet_byindex_ref() requires network epoch.	2019-10-09 16:21:50 +00:00
Gleb Smirnoff	1e80e4f26c	Remove epoch assertion from if_setlladdr(). Originally this function was protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(), likely by a mistake. Later it was switched to NET_EPOCH_ENTER(). Then I incorrectly added NET_EPOCH_ASSERT() here. In reality ifp->if_addr never goes away and never changes its length. So, doing bcopy() in it is always "safe", meaning it won't dereference a wrong pointer or write into someone's else memory. Of course doing two bcopy() in parallel would result in a mess of two addresses, but net epoch doesn't protect against that, neither IF_ADDR_RLOCK() did. So for now, just remove the assertion and leave for later a proper fix. Reported by: markj	2019-10-08 17:55:45 +00:00
Gleb Smirnoff	e9dc46cc30	In DIAGNOSTIC block of if_delmulti_ifma_flags() enter the network epoch. This quickly plugs the regression from r353292. The locking of multicast definitely needs a broader review today... Reported by: pho, dhw	2019-10-08 16:45:56 +00:00
Hans Petter Selasky	a362cf527e	Fix regression issue after r353274: Make sure the vnet_shutdown field is not set until after all VNET_SYSUNINIT()'s in the SI_SUB_VNET_DONE subsystem have been executed. Especially the vnet_if_return() functions requires that if_move() is still operational. Reported by: lwhsu@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-08 11:06:24 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Hans Petter Selasky	4715738b12	Compile time assert a valid subsystem for all VNET init and uninit functions. Using VNET init and uninit functions outside the given range has undefined behaviour. MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-07 14:24:59 +00:00
Hans Petter Selasky	204e2f30d9	Factor out VNET shutdown check into an own vnet structure field. Remove the now obsolete vnet_state field. This greatly simplifies the detection of VNET shutdown and avoids code duplication. Discussed with: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-07 14:15:41 +00:00
Kyle Evans	291287667c	tuntap(4): loosen up tunclose restrictions Realistically, this cannot work. We don't allow the tun to be opened twice, so it must be done via fd passing, fork, dup, some mechanism like these. Applications demonstrably do not enforce strict ordering when they're handing off tun devices, so the parent closing before the child will easily leave the tun/tap device in a bad state where it can't be destroyed and a confused user because they did nothing wrong. Concede that we can't leave the tun/tap device in this kind of state because of software not playing the TUNSIFPID game, but it is still good to find and fix this kind of thing to keep ifconfig(8) up-to-date and help ensure good discipline in tun handling. MFC after: 3 days	2019-10-04 13:43:07 +00:00
Kyle Evans	59997c3c46	if_tuntap: create /dev aliases when a tuntap device gets renamed Currently, if you do: $ ifconfig tun0 create $ ifconfig tun0 name wg0 $ ls -l /dev \| egrep 'wg\|tun' You will see tun0, but no wg0. In fact, it's slightly more annoying to make the association between the new name and the old name in order to open the device (if it hadn't been opened during the rename). Register an eventhandler for ifnet_arrival_events and catch interface renames. We can determine if the ifnet is a tun easily enough from the if_dname, which matches the cevsw.d_name from the associated tuntap_driver. Some locking dance is required because renames don't require the device to be opened, so it could go away in the middle of handling the ioctl, but as soon as we've verified this isn't the case we can attempt to busy the tun and either bail out if the tun device is dying, or we can proceed with the rename. We only create these aliases on a best-effort basis. Renaming a tun device to "usbctl", which doesn't exist as an ifnet but does as a /dev, is clearly not that disastrous, but we can't and won't create a /dev for that.	2019-10-03 17:54:00 +00:00
Kyle Evans	c4cad1549e	if_tuntap: add a busy/unbusy mechanism, replace destroy OPEN check A future commit will create device aliases when a tuntap device is renamed so that it's still easily found in /dev after the rename. Said mechanism will want to keep the tun alive long enough to either realize that it's about to go away or complete the alias creation, even if the alias is about to get destroyed. While we're introducing it, using it to prevent open devices from going away makes plenty of sense and keeps the logic on waking up tun_destroy clean, so we don't have multiple places trying to cv_broadcast unless it's still in use elsewhere.	2019-10-03 17:46:27 +00:00
Mark Johnston	4166913371	Add IFLIB_SINGLE_IRQ_RX_ONLY. As of r347221 the iflib legacy interrupt mode setup assumes that drivers perform both receive and transmit processing from the interrupt handler. This assumption is invalid in the vmxnet3 driver, so introduce the IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the interrupt handler. PR: 239118 Reported and tested by: Juraj Lutter <otis@sk.freebsd.org> Obtained from: marius Reviewed by: gallatin MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21831	2019-09-30 15:59:07 +00:00
Andrew Gallatin	6554362c66	kTLS support for TLS 1.3 TLS 1.3 requires a few changes because 1.3 pretends to be 1.2 with a record type of application data. The "real" record type is then included at the end of the user-supplied plaintext data. This required adding a field to the mbuf_ext_pgs struct to save the record type, and passing the real record type to the sw_encrypt() ktls backend functions. Reviewed by: jhb, hselasky Sponsored by: Netflix Differential Revision: D21801	2019-09-27 19:17:40 +00:00
Gleb Smirnoff	bf7700e44f	style(9): remove extraneous empty lines	2019-09-25 20:46:09 +00:00
Gleb Smirnoff	dd902d015a	Add debugging facility EPOCH_TRACE that checks that epochs entered are properly nested and warns about recursive entrances. Unlike with locks, there is nothing fundamentally wrong with such use, the intent of tracer is to help to review complex epoch-protected code paths, and we mean the network stack here. Reviewed by: hselasky Sponsored by: Netflix Pull Request: https://reviews.freebsd.org/D21610	2019-09-25 18:26:31 +00:00
Eric Joyner	53b5b9b049	iflib: Remove redundant VLAN events deregistration From Piotr: r351152 introduced iflib_deregister() function calling EVENTHANDLER_DEREGISTER() to unregister VLAN events. This patch removes duplicate of EVENTHANDLER_DEREGISTER() calls placed in iflib_device_deregister() as this function is now calling iflib_deregister(). This is to avoid deregistering same event twice. This patch also adds check in iflib_vlan_register() to prevent registering VLAN while being in detach. Patch co-authored by Krzysztof Galazka <krzysztof.galazka@intel.com>, erj <erj@FreeBSD.org> and Jacob Keller <jacob.e.keller@intel.com>. Signed-off-by: Piotr Pietruszewski <piotr.pietruszewski@intel.com> Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com> Reviewed by: gallatin@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21711	2019-09-24 17:03:31 +00:00
Konstantin Belousov	247cf5664e	Add SIOCGIFDOWNREASON. The ioctl(2) is intended to provide more details about the cause of the down for the link. Eventually we might define a comprehensive list of codes for the situations. But interface also allows the driver to provide free-form null-terminated ASCII string to provide arbitrary non-formalized information. Sample implementation exists for mlx5(4), where the string is fetched from firmware controlling the port. Reviewed by: hselasky, rrs Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21527	2019-09-17 18:49:13 +00:00
Kyle Evans	40b1c921bd	SIOCSIFNAME: Do nothing if we're not actually changing Instead of throwing EEXIST, just succeed if the name isn't actually changing. We don't need to trigger departure or any of that because there's no change from consumers' perspective. PR: 240539 Reviewed by: brooks MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D21618	2019-09-12 15:36:48 +00:00
Hans Petter Selasky	180fecd5b6	Callout drain does not have to be followed by a callout stop call. Fix bogus code. MFC after: 1 week Sponsored by: Mellanox Technologies	2019-09-10 14:33:07 +00:00
Li-Wen Hsu	4835262b68	Fix build for the platforms where db_expr_t is not long Sponsored by: The FreeBSD Foundation	2019-09-10 08:51:11 +00:00
Conrad Meyer	0b12ab8111	Appease Clang false-positive Werrors in r352112 Reported by: bcran	2019-09-10 01:56:47 +00:00
Conrad Meyer	8b6acd2b51	ddb(4): Add 'show route <dest>' and 'show routetable [<af>]' These commands show the route resolved for a specified destination, or print out the entire routing table for a given address family (or all families, if none is explicitly provided). Discussed with: emaste Differential Revision: https://reviews.freebsd.org/D21510	2019-09-09 22:54:27 +00:00
Mark Johnston	fee2a2fa39	Change synchonization rules for vm_page reference counting. There are several mechanisms by which a vm_page reference is held, preventing the page from being freed back to the page allocator. In particular, holding the page's object lock is sufficient to prevent the page from being freed; holding the busy lock or a wiring is sufficent as well. These references are protected by the page lock, which must therefore be acquired for many per-page operations. This results in false sharing since the page locks are external to the vm_page structures themselves and each lock protects multiple structures. Transition to using an atomically updated per-page reference counter. The object's reference is counted using a flag bit in the counter. A second flag bit is used to atomically block new references via pmap_extract_and_hold() while removing managed mappings of a page. Thus, the reference count of a page is guaranteed not to increase if the page is unbusied, unmapped, and the object's write lock is held. As a consequence of this, the page lock no longer protects a page's identity; operations which move pages between objects are now synchronized solely by the objects' locks. The vm_page_wire() and vm_page_unwire() KPIs are changed. The former requires that either the object lock or the busy lock is held. The latter no longer has a return value and may free the page if it releases the last reference to that page. vm_page_unwire_noq() behaves the same as before; the caller is responsible for checking its return value and freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is introduced for use in pmap_extract_and_hold(). It fails if the page is concurrently being unmapped, typically triggering a fallback to the fault handler. vm_page_wire() no longer requires the page lock and vm_page_unwire() now internally acquires the page lock when releasing the last wiring of a page (since the page lock still protects a page's queue state). In particular, synchronization details are no longer leaked into the caller. The change excises the page lock from several frequently executed code paths. In particular, vm_object_terminate() no longer bounces between page locks as it releases an object's pages, and direct I/O and sendfile(SF_NOCACHE) completions no longer require the page lock. In these latter cases we now get linear scalability in the common scenario where different threads are operating on different files. __FreeBSD_version is bumped. The DRM ports have been updated to accomodate the KPI changes. Reviewed by: jeff (earlier version) Tested by: gallatin (earlier version), pho Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20486	2019-09-09 21:32:42 +00:00
Vincenzo Maffione	253b2ec199	netmap: import changes from upstream (SHA 137f537eae513) - Rework option processing. - Use larger integers for memory size values in the memory management code. MFC after: 2 weeks	2019-09-01 14:47:41 +00:00
Matt Joras	16cf6bdbb6	Wrap a vlan's parent's if_output in a separate function. When a vlan interface is created, its if_output is set directly to the parent interface's if_output. This is fine in the normal case but has an unfortunate consequence if you end up with a certain combination of vlan and lagg interfaces. Consider you have a lagg interface with a single laggport member. When an interface is added to a lagg its if_output is set to lagg_port_output, which blackholes traffic from the normal networking stack but not certain frames from BPF (pseudo_AF_HDRCMPLT). If you now create a vlan with the laggport member (not the lagg interface) as its parent, its if_output is set to lagg_port_output as well. While this is confusing conceptually and likely represents a misconfigured system, it is not itself a problem. The problem arises when you then remove the lagg interface. Doing this resets the if_output of the laggport member back to its original state, but the vlan's if_output is left pointing to lagg_port_output. This gives rise to the possibility that the system will panic when e.g. bpf is used to send any frames on the vlan interface. Fix this by creating a new function, vlan_output, which simply wraps the parent's current if_output. That way when the parent's if_output is restored there is no stale usage of lagg_port_output. Reviewed by: rstone Differential Revision: D21209	2019-08-30 20:19:43 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Kyle Evans	5c4eed8601	tuntap: belatedly add MODULE_VERSION for if_tun and if_tap When tun/tap were merged, appropriate MODULE_VERSION should have been added for things like modfind(2) to continue to do the right thing with the old names. Reported by: jhb	2019-08-19 19:01:59 +00:00
Vincenzo Maffione	b5b83671ea	if_tuntap: minor improvements Rewrite a loop to avoid duplicating the exit condition. Simplify mask processing in tunpoll(). Fix minor typos. Reviewed by: kevans, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D21302	2019-08-19 17:23:22 +00:00
Eric Joyner	f4aa9b67eb	net: Update SFF-8024 definitions and strings with values from rev 4.6 This will let ifconfig -v's SFF eeprom read functionality recognize more module types. Signed-off-by: Eric Joyner <erj@freebsd.org> Reviewed by: gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21041	2019-08-17 00:10:56 +00:00
Eric Joyner	566144142e	iflib: add iflib_deregister to help cleanup on exit Commit message by Jake: The iflib_register function exists to allocate and setup some common structures used by both iflib_device_register and iflib_pseudo_register. There is no associated cleanup function used to undo the steps taken in this function. Both iflib_device_deregister and iflib_pseudo_deregister have some of the necessary steps scattered in their flow. However, most of the necessary cleanup is not done during the error path of iflib_device_register and iflib_pseudo_register. Some examples of missed cleanup include: the ifp pointer is not free'd during error cleanup the STATE and CTX locks are not destroyed during error cleanup the vlan event handlers are not removed during error cleanup media added to the ifmedia structure is not removed the kobject reference is never deleted Additionally, when initializing the kobject class reference counter is increased even though kobj_init already increases it. This results in the class never being free'd again because the reference count would never hit zero even after all driver instances are unloaded. To aid in proper cleanup, implement an iflib_deregister function that goes through the reverse steps taken by iflib_register. Call this function during the error cleanup for iflib_device_register and iflib_pseudo_register. Additionally call the function in the iflib_device_deregister and iflib_pseudo_deregister functions near the end of their flow. This helps reduce code duplication and ensures that proper steps are taken to cleanup allocations and references in both the regular and error cleanup flows. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: shurd@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21005	2019-08-16 23:33:44 +00:00
George V. Neville-Neil	d8dc4e350f	Properly validte arguments for route deletion Reported by: Liang Zhuo brightiup.zhuo@gmail.com MFC after: 1 week	2019-08-03 14:42:07 +00:00
Eric Joyner	197c679824	iflib: Prevent kernel panic caused by loading driver with a specific interrupt configuration If a device has only 1 MSI-X interrupt available and does not support either MSI or legacy interrupts, iflib_device_register() will fail, leak memory and MSI resources, and the driver will not load. Worse, if another iflib-using driver tries to unload afterwards, a kernel panic will occur because the previous failed iflib driver loead did not properly call "taskqgroup_detach()" during it's cleanup. This patch is band-aid for this situation -- don't try allocating MSI or legacy interrupts if a single MSI-X interrupt was allocated, but fail to load instead. As well, during the cleanup, properly call taskqgroup_detach() on the admin task to prevent panics when other iflib drivers unload. This whole interrupt allocation process actually needs re-doing to properly support devices with only a single MSI-X interrupt, devices that only support MSI-X, non-PCI devices, and multiple non-MSIX interrupts, as well. Signed-off-by: Eric Joyner <erj@freebsd.org> Reviewed by: marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D20747	2019-08-01 17:37:25 +00:00
Eric Joyner	6a3f243b04	iflib: remove kobject class reference increment Commit message from Jake: In iflib_register, the context is initialized as a kobject using the device driver's "driver" kobject class. As part of this, the function mistakenly increments the ref counter. The ref counter is incremented twice, once in the code directly, and once again by kobj_class_compile. However, there is no associated decrement in the detach path. Because of this, the ref counter will never go back down to zero, and thus the kobject method table will never be released. Remove this unnecessary reference count increment. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: jhb@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21125	2019-08-01 17:28:36 +00:00
Randall Stewart	20abea6663	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
Ed Maste	1082be6554	ppp: correct echo-req magic number on big endian archs The magic number is a 32-bit quantity; use uint32_t to match hton's return type and avoid sending zeros (upper 32 bits) on big-endian architectures. PR: 184141 MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-08-01 13:42:58 +00:00
Kyle Evans	0dbac71f19	if_tuntap(4): Add TUNGIFNAME This effectively just moves TAPGIFNAME into common ioctl territory. MFC after: 3 days	2019-07-25 22:23:34 +00:00
Eric Joyner	7f3f6aad3e	iflib: fix dangling device softc pointer Commit text by Jake: If a driver's IFDI_ATTACH_PRE function fails, the iflib_device_register function will free the ctx pointer. However, it does not reset the device softc pointer to NULL. This will result in memory corruption as a future access to the now invalid pointer will corrupt memory that is later allocated on top of the same memory location. The iflib_device_deregister function correctly resets the softc pointer by using device_set_softc(). This clears up the invalid dangling pointer and prevents memory corruption that could lead to a panic or undefined behavior if the device's driver failed to attach. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D21003	2019-07-24 21:43:41 +00:00
Kirill Ponomarev	b7592822d5	Allow set MTU more than 1500 bytes. Submitted by: Alexandr Fedorov <aleksandr.fedorov_itglobal_dot_com> Approved by: jhb, rgrimes Sponsored by: ITGlobal.com Differential Revision: https://reviews.freebsd.org/D19422	2019-07-24 16:10:20 +00:00
Chuck Tuffli	94c15665a5	Fix a typo in r349969 OUI_FRREBSD_NVME_HIGH should have been OUI_FREEBSD_NVME_HIGH Caught by: Gary Jennejohn	2019-07-14 03:49:48 +00:00
Chuck Tuffli	409a80e5a4	bhyve: Create EUI64 for NVMe namespaces Accept an IEEE Extended Unique Identifier (EUI-64) from the command line for each NVMe namespace. If one isn't provided, it will create one based on the CRC16 of: - the FreeBSD IEEE OUI - PCI bus, device/slot, function values - Namespace ID Reviewed by: imp, araujo, jhb, rgrimes Approved by: imp (mentor), jhb (maintainer) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19905	2019-07-13 12:48:28 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
John Baldwin	66d0c056be	Support IFCAP_NOMAP in vlan(4). Enable IFCAP_NOMAP for a vlan interface if it is supported by the underlying trunk device. Reviewed by: gallatin, hselasky, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:51:38 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	0dbdf04125	Need to wait for epoch callbacks to complete before detaching a network interface. This particularly manifests itself when an INP has multicast options attached during a network interface detach. Then the IPv4 and IPv6 leave group call which results from freeing the multicast address, may access a freed ifnet structure. These are the steps to reproduce: service mdnsd onestart # installed from ports ifconfig epair create ifconfig epair0a 0/24 up ifconfig epair0a destroy Tested by: pho @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-28 10:49:04 +00:00
Marius Strobl	c2c5d1e787	o In iflib_txq_drain(): - Remove desc_used, which is only ever written to. - Remove a dead store to reclaimed. - Don't recycle avail. - Sort variables according to style(9). These changes will make a subsequent commit easier to read. o In iflib_tx_credits_update(), don't bother checking whether the ift_txd_credits_update method pointer is NULL; _iflib_pre_assert() asserts upfront that this method has been assigned and functions like iflib_{fast_intr_rxtx,netmap_timer_adjust,txq_can_drain}() and _task_fn_tx() were already unconditionally relying on the method being callable.	2019-06-26 15:28:21 +00:00
Leandro Lupori	e2edff4167	[PowerPC64] Don't mark module data as static Fixes panic when loading ipfw.ko and if_epair.ko built with modern compiler. Similar to arm64 and riscv, when using a modern compiler (!gcc4.2), code generated tries to access data in the wrong location, causing kernel panic (data storage interrupt trap) when loading if_epair and ipfw. Issue was reproduced with kernel/module compiled using gcc8 and clang8. It affects both ELFv1 and ELFv2 ABI environments. PR: 232387 Submitted by: alfredo.junior_eldorado.org.br Reported by: Mark Millard Reviewed by: jhibbits Differential Revision: https://reviews.freebsd.org/D20461	2019-06-25 17:15:44 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
Marko Zec	188adcb7e4	V_ip6_forwarding and V_ipforwarding have been defined in ip6_var.h / ip_var.h since at least 2008, so make use of those definitions here. MFC after: 3 days	2019-06-19 08:49:24 +00:00
Marko Zec	6aee0bfa85	Evaluating htons() at compile time is more efficient than doing ntohs() at runtime. This change removes a dependency on a barrel shifter pass before branch resolution, while reducing the instruction stream size by 9 bytes on amd64. MFC after: 3 days	2019-06-19 08:39:19 +00:00
Marius Strobl	d49e83eac3	- Replace unused and only ever written to members of public iflib(9) structs with placeholders (in the latter case, IFLIB_MAX_TX_BYTES etc. are also only ever used for these write-only members if at all, so both these macros and members can just go). Using these spares may render it possible to merge certain iflib(9) fixes to stable/12. Otherwise, changes extending struct if_irq or struct if_shared_ctx in any way would break KBI as instances of these are allocated by the driver front-ends (by contrast, struct if_pkt_info as well as struct if_softc_ctx instances are provided by iflib(9) and, thus, may grow at least at the end without breaking KBI). - Make the pvi_name in struct pci_vendor_info const char * as device identifiers in hardware lookup tables aren't to be expected to ever change at runtime. - Similarly, make the pci_vendor_info_t of struct if_shared_ctx which is used to point to the struct pci_vendor_info arrays provided by the driver front-ends const. - Remove the ETH_ADDR_LEN macro from iflib.h; this was duplicating ETHER_ADDR_LEN of <net/ethernet.h> with iflib(9) actually only consuming the latter macro. - Make the name argument of iflib_io_tqg_attach(9) const, matching the taskqgroup_attach_cpu(9) this function wraps as well as e. g. iflib_config_gtask_init(9). - Remove the orphaned iflib_qset_lock_get() prototype. - Remove some extraneous empty lines.	2019-06-15 11:07:41 +00:00
Mark Johnston	ce7fb386d8	Restore the comment removed in r348745. LAGG_RLOCK() enters an epoch section, so the comment wasn't stale. Reported by: jhb MFC with: r348745	2019-06-06 17:20:35 +00:00
Mark Johnston	9995dfd364	Conditionalize an in_epoch() call on INVARIANTS. Its result is only used to determine whether to perform further INVARIANTS-only checks. Remove a stale comment while here. Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 1 week	2019-06-06 16:22:29 +00:00
Eric Joyner	668d6dbb4c	iflib: provide probe wrapper for vendor drivers From Jake: Vendor drivers that exist out-of-tree generally should return BUS_PROBE_VENDOR from their device probe functions. This helps ensure that a vendor replacement driver will supersede the in-kernel driver for a given device. Currently, if a vendor wants to implement a driver based on iflib, it will always report BUS_PROBE_DEFAULT. Add a wrapper function, iflib_device_probe_vendor() which can be used in place of iflib_device_probe(). This function will just return BUS_PROBE_VENDOR whenever iflib_device_probe() would return BUS_PROBE_DEFAULT. While vendor drivers can already implement such a wrapper themselves, providing it in the iflib.h header makes it easier for the vendor driver to do the right thing. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: erj@, gallatin@, marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D20221	2019-05-29 22:24:10 +00:00
Kyle Evans	d8b985430c	if_bridge(4): Complete bpf auditing of local traffic over the bridge There were two remaining "gaps" in auditing local bridge traffic with bpf(4): Locally originated outbound traffic from a member interface is invisible to the bridge's bpf(4) interface. Inbound traffic locally destined to a member interface is invisible to the member's bpf(4) interface -- this traffic has no chance after bridge_input to otherwise pass it over, and it wasn't originally received on this interface. I call these "gaps" because they don't affect conventional bridge setups. Alas, being able to establish an audit trail of all locally destined traffic for setups that can function like this is useful in some scenarios. Reviewed by: kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19757	2019-05-29 01:08:30 +00:00
Andrey V. Elsukov	de25327313	Rework r348303 to reduce the time of holding global BPF lock. It appeared that using NET_EPOCH_WAIT() while holding global BPF lock can lead to another panic: spin lock 0xfffff800183c9840 (turnstile lock) held by 0xfffff80018e2c5a0 (tid 100325) too long panic: spin lock held too long ... #0 sched_switch (td=0xfffff80018e2c5a0, newtd=0xfffff8000389e000, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2133 #1 0xffffffff80bf9912 in mi_switch (flags=256, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439 #2 0xffffffff80c21db7 in sched_bind (td=<optimized out>, cpu=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2704 #3 0xffffffff80c34c33 in epoch_block_handler_preempt (global=<optimized out>, cr=0xfffffe00005a1a00, arg=<optimized out>) at /usr/src/sys/kern/subr_epoch.c:394 #4 0xffffffff803c741b in epoch_block (global=<optimized out>, cr=<optimized out>, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:416 #5 ck_epoch_synchronize_wait (global=0xfffff8000380cd80, cb=<optimized out>, ct=<optimized out>) at /usr/src/sys/contrib/ck/src/ck_epoch.c:465 #6 0xffffffff80c3475e in epoch_wait_preempt (epoch=0xfffff8000380cd80) at /usr/src/sys/kern/subr_epoch.c:513 #7 0xffffffff80ce970b in bpf_detachd_locked (d=0xfffff801d309cc00, detached_ifp=<optimized out>) at /usr/src/sys/net/bpf.c:856 #8 0xffffffff80ced166 in bpf_detachd (d=<optimized out>) at /usr/src/sys/net/bpf.c:836 #9 bpf_dtor (data=0xfffff801d309cc00) at /usr/src/sys/net/bpf.c:914 To fix this add the check to the catchpacket() that BPF descriptor was not detached just before we acquired BPFD_LOCK(). Reported by: slavash Tested by: slavash MFC after: 1 week	2019-05-28 11:45:00 +00:00
Andrey V. Elsukov	44a514745c	Fix possible NULL pointer dereference. bpf_mtap() can invoke catchpacket() for already detached descriptor. And this can lead to NULL pointer dereference, since bd_bif pointer was reset to NULL in bpf_detachd_locked(). To avoid this, use NET_EPOCH_WAIT() when descriptor is removed from interface's descriptors list. After the wait it is safe to modify descriptor's content. Submitted by: kib Reported by: slavash MFC after: 1 week	2019-05-27 12:41:41 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Alexander V. Chernikov	563ab4e400	Fix gateway setup for the interface routes. Currently rinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rinit1() and nd6_prefix_onlink_rtrequest() by filling in proper gateway in the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328	2019-05-22 21:20:15 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Alexander V. Chernikov	2ad7ed6e4a	Fix rt_ifa selection during loopback route insertion process. Currently such routes are added with a link-level IFA, which is plain wrong. Only after the insertion they get fixed by the special link_rtrequest() ifa handler. This behaviour complicates routing code and makes ifa selection more complex. Streamline this process by explicitly moving link_rtrequest() logic to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all this logic in the loopback route case by explicitly specifying proper rt_ifa inside the ifa_maintain_loopback_route().§ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20076	2019-05-19 21:49:56 +00:00
Kyle Evans	db226f0d8e	tuntap: Defer clearing if_softc until after if_detach r346670 added an sx to close a race between the ifioctl handler and interface destruction. Unfortunately, it clears if_softc immediately after the interface is closed, but before if_detach has been invoked. Any time before detachment, an interface that's part of a bridge may still receive traffic that's pushed through tunstart/tunstart_l2 and promptly lead to a panic because if_softc is now NULL. Fix it by deferring the clearing of if_softc until after the interface has detached and thus been removed from the bridge. if_softc still gets cleared in case another thread has already entered the ioctl handler before it's replaced with ifdead_ioctl. Reported by: markj MFC after: 3 days	2019-05-14 20:32:29 +00:00
Andrey V. Elsukov	82d7bf6b1b	Avoid possible recursion on BPF_LOCK() in bpfwrite(). Release BPF_LOCK() before invoking if_output() and if_input(). Also enter epoch section before releasing lock, this should prevent access to ifnet that may be freed on interface detach. Reported by: markj	2019-05-13 20:17:55 +00:00
Andrey V. Elsukov	af1f58df99	Do not leak memory used for binary filter.	2019-05-13 14:07:02 +00:00
Andrey V. Elsukov	699281b545	Rework locking in BPF code to remove rwlock from fast path. On high packets rate the contention on rwlock in bpf_tap() functions can lead to packets dropping. To avoid this, migrate this code to use epoch(9) KPI and ConcurrencyKit's lists. * all lists changed to use CK_LIST; * reference counting added to bpf_if and bpf_d; * now bpf_if references ifnet and releases this reference on destroy; * each bpf_d descriptor references bpf_if when it is attached; * new struct bpf_program_buffer introduced to keep BPF filter programs; * bpf_program_buffer, bpf_d and bpf_if structures are freed by epoch_call(); * bpf_freelist and ifnet_departure event are no longer needed, thus both are removed; Reviewed by: melifaro Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D20224	2019-05-13 13:45:28 +00:00
Kyle Evans	81b3b91e6b	tuntap: Improve style No functional change. tun_flags of the tuntap_driver was renamed to ident_flags to reflect the fact that it's a subset of the tun_flags that identifies a tuntap device. This maps more easily (visually) to the TUN_DRIVER_IDENT_MASK that masks off the bits of tun_flags that are applicable to tuntap driver ident. This is a purely cosmetic change.	2019-05-11 04:18:06 +00:00
Eric Joyner	afb7737237	iflib: use default ntxd and nrxd when user value is not power of 2 From Jake: A user may set a sysctl to override the default number of Tx or Rx descriptors. However, certain calculations in the iflib core expect the number of descriptors to be a power of 2. Update _iflib_assert to verify that all of the shared context parameters for the number of descriptors are powers of 2. Modify iflib_reset_qvalues to check that the provided isc_nrxd value is a power of 2. If it's not, print a warning message and then use the default value. An alternative might be to try rounding the number down instead. However, this creates problems in case the rounded down value is below the minimum value that the driver would support. Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: marius@ MFC after: 1 week Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D19880	2019-05-10 00:41:42 +00:00
Kyle Evans	16760d8e28	tuntap: Don't down tap interfaces if LINK0 is set	2019-05-09 18:54:29 +00:00
Kyle Evans	a6fa049545	tuntap: Properly detach tap ifp	2019-05-09 14:06:24 +00:00
Marius Strobl	14e0010729	- Merge r338254 from cxgbe(4): Use fcmpset instead of cmpset when appropriate. - Revert r277226 of cxgbe(4), obsolete since r334320.	2019-05-09 11:34:46 +00:00
Gleb Smirnoff	6ca363eb7b	Existense of PCB route caching doesn't allow us to use new fast route lookup KPI in ip_output() like it is already used in ip_forward(). However, when there is no PCB provided we can use fast KPI, gaining performance advantage. Typical case when ip_output() is called without a PCB pointer is a sendto(2) on a not connected UDP socket. In practice DNS servers do this. Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D19804	2019-05-08 23:39:24 +00:00
Marius Strobl	007b804fc7	Allow to build without INET and INET6 again after r347221. Submitted by: cam	2019-05-08 09:03:43 +00:00
Kyle Evans	251a32b5b2	tun/tap: merge and rename to `tuntap` tun(4) and tap(4) share the same general management interface and have a lot in common. Bugs exist in tap(4) that have been fixed in tun(4), and vice-versa. Let's reduce the maintenance requirements by merging them together and using flags to differentiate between the three interface types (tun, tap, vmnet). This fixes a couple of tap(4)/vmnet(4) issues right out of the gate: - tap devices may no longer be destroyed while they're open [0] - VIMAGE issues already addressed in tun by kp [0] emaste had removed an easy-panic-button in r240938 due to devdrn blocking. A naive glance over this leads me to believe that this isn't quite complete -- destroy_devl will only block while executing d_* functions, but doesn't block the device from being destroyed while a process has it open. The latter is the intent of the condvar in tun, so this is "fixed" (for certain definitions of the word -- it wasn't really broken in tap, it just wasn't quite ideal). ifconfig(8) also grew the ability to map an interface name to a kld, so that `ifconfig {tun,tap}0` can continue to autoload the correct module, and `ifconfig vmnet0 create` will now autoload the correct module. This is a low overhead addition. (MFC commentary) This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this, and how critical they are. Changes after this are likely easily MFC'd without taking this merge, but the merge will be easier. I have no plans to do this MFC as of now. Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill) Input also from: melifaro Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20044	2019-05-08 02:32:11 +00:00
Marius Strobl	3d10e9ed62	o Use iflib_fast_intr_rxtx() also for "legacy" interrupts, i. e. INTx and MSI. Unlike as with iflib_fast_intr_ctx(), the former will also enqueue _task_fn_tx() in addition to _task_fn_rx() if appropriate, bringing TCP TX throughput of EM-class devices on par with the MSI-X case and, thus, close to wirespeed/pre-iflib(4) times again. [1] Note that independently of the interrupt type, the UDP performance with these MACs still is abysmal and nowhere near to where it was before the conversion of em(4) to iflib(4). o In iflib_init_locked(), announce which free list failed to set up. o In _task_fn_tx() when running netmap(4), issue ifdi_intr_enable instead of the ifdi_tx_queue_intr_enable method in case of a "legacy" interrupt as the latter is valid with MSI-X only. o Instead of adding the missing - and apparently convoluted enough that a DBG_COUNTER_INC was put into a wrong spot in _task_fn_rx() - checks for ifdi_{r,t}x_queue_intr_enable being available in the MSI-X case also to iflib_fast_intr_rxtx(), factor these out to iflib_device_register() and make the checks fail gracefully rather than panic. This avoids invoking the checks at runtime over and over again in iflib_fast_intr_rxtx() and _task_fn_{r,t}x() - even if it's just in case of INVARIANTS - and makes these functions more readable. o In iflib_rx_structures_setup(), only initialize LRO resources if device and driver have LRO capability in order to not waste memory. Also, free the LRO resources again if setting them up fails for one of the queues. However, don't bother invoking iflib_rx_sds_free() in that case because iflib_rx_structures_setup() doesn't call iflib_rxsd_alloc() either (and iflib_{device,pseudo}_register() will issue iflib_rx_sds_free() in case of failure via iflib_rx_structures_free(), but there definitely is some asymmetry left to be fixed, though). o Similarly, free LRO resources again in iflib_rx_structures_free(). o In iflib_irq_set_affinity(), handle get_core_offset() errors gracefully instead of panicing (but only in case of INVARIANTS). This is a follow- up to r344132, as such driver bugs shouldn't be fatal. o Likewise, handle unknown iflib_intr_type_t in iflib_irq_alloc_generic() gracefully, too. o Bring yet more sanity to iflib_msix_init(): - If the device doesn't provide enough MSI-X vectors or not all vectors can be allocate so the expected number of queues in addition to admin interrupts can't be supported, try MSI next (and then INTx) as proper MSI-X vector distribution can't be assured in such cases. In essence, this change brings r254008 forward to iflib(4). Also, this is the fix alluded to in the commit message of r343934. - If the MSI-X allocation has failed, don't prematurely announce MSI is going to be used as the latter in fact may not be available either. - When falling back to MSI, only release the MSI-X table resource again if it was allocated in iflib_msix_init(), i. e. isn't supplied by the driver, in the first place. o In mp_ndesc_handler(), handle unknown type arguments gracefully, too. PR: 235031 (likely) [1] Reviewed by: shurd Differential Revision: https://reviews.freebsd.org/D20175	2019-05-07 08:28:35 +00:00
Marius Strobl	1722eeac95	- Remove the unused ifc_link_irq and ifc_mtx_name members of struct iflib_ctx. - Remove the only ever written to ift_db_mtx_name member of struct iflib_txq. - Remove the unused or only ever written to ifr_size, ifr_cq_pidx, ifr_cq_gen and ifr_lro_enabled members of struct iflib_rxq. - Consistently spell DMA, RX and TX uppercase in comments, messages etc. instead of mixing with some lowercase variants. - Consistently use if_t instead of a mix of if_t and struct ifnet pointers. - Bring the function comments of _iflib_fl_refill(), iflib_rx_sds_free() and iflib_fl_setup() in line with reality. - Judging problem reports, people are wondering what on earth messages like: "TX(0) desc avail = 1024, pidx = 0" are trying to indicate. Thus, extend this string to be more like that of non-iflib(4) Ethernet MAC drivers, notifying about a watchdog timeout due to which the interface will be reset. - Take advantage of the M_HAS_VLANTAG macro. - Use false/true rather than FALSE/TRUE for variables of type bool. - Use FALLTHROUGH as advocated by style(9).	2019-05-06 20:56:41 +00:00
Matt Macy	e2621d9657	Allow iflib drivers to pass a pointer to their own ifmedia structure. Tested by: emaste@ Differential Revision: https://reviews.freebsd.org/D19946	2019-05-03 20:05:31 +00:00
Andrew Gallatin	35961dce98	Select lacp egress ports based on NUMA domain This change creates an array of port maps indexed by numa domain for lacp port selection. If we have lacp interfaces in more than one domain, then we select the egress port by indexing into the numa port maps and picking a port on the appropriate numa domain. This is behavior is controlled by the new ifconfig use_numa flag and net.link.lagg.use_numa sysctl/tunable (both modeled after the existing use_flowid), which default to enabled. Reviewed by: bz, hselasky, markj (and scottl, earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20060	2019-05-03 14:43:21 +00:00
Ed Maste	ce3da455e9	iflib: remove assertion that isc_capabilities is nonzero It's atypical, but not invalid, for a driver to pass no capabilities. Submitted by: Gerald Aryeetey <aryeeteygerald_rogers.com> Reviewed by: shurd MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20142	2019-05-02 19:13:31 +00:00
Stephen Hurd	f154ece02e	iflib: Better control over queue core assignment By default, cores are now assigned to queues in a sequential manner rather than all NICs starting at the first core. On a four-core system with two NICs each using two queue pairs, the nic:queue -> core mapping has changed from this: 0:0 -> 0, 0:1 -> 1 1:0 -> 0, 1:1 -> 1 To this: 0:0 -> 0, 0:1 -> 1 1:0 -> 2, 1:1 -> 3 Additionally, a device can now be configured to use separate cores for TX and RX queues. Two new tunables have been added, dev.X.Y.iflib.separate_txrx and dev.X.Y.iflib.core_offset. If core_offset is set, the NIC is not part of the auto-assigned sequence. Reviewed by: marius MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D20029	2019-04-25 21:24:56 +00:00
Kyle Evans	e3a883c386	tap(4): Correct driver name... Reported by: rgrimes Pointy hat to: kevans MFC after: 3 days X-MFC-With: r346688	2019-04-25 18:26:34 +00:00
Kyle Evans	9ea63b2caa	tap(4): Add a MODULE_VERSION Otherwise tap(4) can be loaded by loader despite being compiled into the kernel, causing a panic as things try to double-initialize. PR: 220867 MFC after: 3 days	2019-04-25 18:22:22 +00:00
Kyle Evans	c83651445b	tun(4): Don't allow open of open or dying devices Previously, a pid check was used to prevent open of the tun(4); this works, but may not make the most sense as we don't prevent the owner process from opening the tun device multiple times. The potential race described near tun_pid should not be an issue: if a tun(4) is to be handed off, its fd has to have been sent via control message or some other mechanism that duplicates the fd to the receiving process so that it may set the pid. Otherwise, the pid gets cleared when the original process closes it and you have no effective handoff mechanism. Close up another potential issue with handing a tun(4) off by not clobbering state if the closer isn't the controller anymore. If we want some state to be cleared, we should do that a little more surgically. Additionally, nothing prevents a dying tun(4) from being "reopened" in the middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a bad time. Return EBUSY if we're marked for destruction, as well, and the consumer will need to deal with it. The associated character device will be destroyed in short order. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20033	2019-04-25 13:46:12 +00:00
Kyle Evans	d91262603b	tun/tap: close race between destroy/ioctl handler It seems that there should be a better way to handle this, but this seems to be the more common approach and it should likely get replaced in all of the places it happens... Basically, thread 1 is in the process of destroying the tun/tap while thread 2 is executing one of the ioctls that requires the tun/tap mutex and the mutex is destroyed before the ioctl handler can acquire it. This is only one of the races described/found in PR 233955. PR: 233955 Reviewed by: ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20027	2019-04-25 12:44:08 +00:00
Andrew Gallatin	6d49b41ee8	iflib: Add pfil hooks As with mlx5en, the idea is to drop unwanted traffic as early in receive as possible, before mbufs are allocated and anything is passed up the stack. This can save considerable CPU time when a machine is under a flooding style DOS attack. The major change here is to remove the unneeded abstraction where callers of rxd_frag_to_sd() get back a pointer to the mbuf ring, and are responsible for NULL'ing that mbuf themselves. Now this happens directly in rxd_frag_to_sd(), and it returns an mbuf. This allows us to use the decision (and potentially mbuf) returned by the pfil hooks. The driver can now recycle mbufs to avoid re-allocation when packets are dropped. Reviewed by: marius (shurd and erj also provided feedback) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19645	2019-04-24 13:32:04 +00:00
Andrey V. Elsukov	aee793eec9	Add GRE-in-UDP encapsulation support as defined in RFC8086. This GRE-in-UDP encapsulation allows the UDP source port field to be used as an entropy field for load-balancing of GRE traffic in transit networks. Also most of multiqueue network cards are able distribute incoming UDP datagrams to different NIC queues, while very little are able do this for GRE packets. When an administrator enables UDP encapsulation with command `ifconfig gre0 udpencap`, the driver creates kernel socket, that binds to tunnel source address and after udp_set_kernel_tunneling() starts receiving of all UDP packets destined to 4754 port. Each kernel socket maintains list of tunnels with different destination addresses. Thus when several tunnels use the same source address, they all handled by single socket. The IP[V6]_BINDANY socket option is used to be able bind socket to source address even if it is not yet available in the system. This may happen on system boot, when gre(4) interface is created before source address become available. The encapsulation and sending of packets is done directly from gre(4) into ip[6]_output() without using sockets. Reviewed by: eugen MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D19921	2019-04-24 09:05:45 +00:00
Kyle Evans	e8de0c3bda	tun(4): Defer clearing TUN_OPEN until much later tun destruction will not continue until TUN_OPEN is cleared. There are brief moments in tunclose where the mutex is dropped and we've already cleared TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle of cleaning up the tun still. tun_destroy should be blocked until these parts (address/route purges, mostly) are complete. PR: 233955 MFC after: 2 weeks	2019-04-23 17:28:28 +00:00
Andrew Gallatin	7687707dd4	Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory This commit adds new if_alloc_domain() and if_alloc_dev() methods to allocate ifnets. When called with a domain on a NUMA machine, ifalloc_domain() will record the NUMA domain in the ifnet, and it will allocate the ifnet struct from memory which is local to that NUMA node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain which uses a driver supplied device_t to call ifalloc_domain() with the appropriate domain. Note that the new if_numa_domain field fits in an alignment pad in struct ifnet, and so does not alter the size of the structure. Reviewed by: glebius, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19930	2019-04-22 19:24:21 +00:00
Kyle Evans	1fd8c72c0a	iflib: Use new ether_gen_addr, restricting addresses to that subset Differential Revision: https://reviews.freebsd.org/D19587	2019-04-17 17:19:54 +00:00
Kyle Evans	3c3aa8c170	net: adjust randomized address bits Give devices that need a MAC a 16-bit allocation out of the FreeBSD Foundation OUI range. Change the name ether_fakeaddr to ether_gen_addr now that we're dealing real MAC addresses with a real OUI rather than random locally-administered addresses. Reviewed by: bz, rgrimes Differential Revision: https://reviews.freebsd.org/D19587	2019-04-17 17:18:43 +00:00
Michael Tuexen	e6481fd4c4	When sending a routing message, don't allow the user to set the RTF_RNH_LOCKED flag in rtm_flags, since this flag is used only internally. Reported by: syzbot+65c676f5248a13753ea0@syzkaller.appspotmail.com Reviewed by: ae@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19898	2019-04-14 10:18:14 +00:00

... 3 4 5 6 7 ...

4557 Commits