freebsd-nq

Author	SHA1	Message	Date
Matt Macy	4f6c66cc9c	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409	2018-05-23 21:02:14 +00:00
Matt Macy	d7c5a620e2	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366	2018-05-18 20:13:34 +00:00
Hans Petter Selasky	5cd5781c75	Remove redundant integer cast in ibcore. The "ref_count" field already has integer type. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-19 13:51:33 +00:00
Hans Petter Selasky	d4eeed42ba	Make sure VNET is set when calling sa6_recoverscope() in ibcore. Else panic will occur when VIMAGE is enabled. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 13:32:52 +00:00
Hans Petter Selasky	aa5962f908	Define values instead of using hardcoding. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 13:30:38 +00:00
Hans Petter Selasky	c131a22379	Recover IPv6 scope ID for multicast link-local addresses as well as unicast link-local addresses. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 13:28:12 +00:00
Hans Petter Selasky	cc79d31d6f	Embed the IPv6 scope ID before calling rtalloc1() in ibcore. Else rtalloc1() will resolve to the loopback interface. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 13:25:40 +00:00
Hans Petter Selasky	d0a9dbc779	Make sure the IPv6 scope ID gets properly masked in ibcore. When exchanging CM messages the IPv6 scope ID should be ignored for link local addresses when doing comparisons. Make sure the scope ID is always set to zero for link local addresses. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 12:58:51 +00:00
Hans Petter Selasky	03ae76a693	Fix for use-after-free when using delayed work structures in ibcore. It is not enough to cancel delayed work structures before freeing. Always cancel delayed work synchronously before freeing! MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-07 12:56:04 +00:00
Hans Petter Selasky	1456d97c01	Optimize ibcore RoCE address handle creation from user-space. Creating a UD address handle from user-space or from the kernel-space, when the link layer is ethernet, requires resolving the remote L3 address into a L2 address. Doing this from the kernel is easy because the required ARP(IPv4) and ND6(IPv6) address resolving APIs are readily available. In userspace such an interface does not exist and kernel help is required. It should be noted that in an IP-based GID environment, the GID itself does not contain all the information needed to resolve the destination IP address. For example information like VLAN ID and SCOPE ID, is not part of the GID and must be fetched from the GID attributes. Therefore a source GID should always be referred to as a GID index. Instead of going through various racy steps to obtain information about the GID attributes from user-space, this is now all done by the kernel. This patch optimises the L3 to L2 address resolving using the existing create address handle uverbs interface, retrieving back the L2 address as an additional user-space information structure. This commit combines the following Linux upstream commits: IB/core: Let create_ah return extended response to user IB/core: Change ib_resolve_eth_dmac to use it in create AH IB/mlx5: Make create/destroy_ah available to userspace IB/mlx5: Use kernel driver to help userspace create ah IB/mlx5: Report that device has udata response in create_ah MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 14:34:52 +00:00
Hans Petter Selasky	bf8641fedb	Get correct network device when accepting incoming RDMA connections in ibcore. This patch ensures the GID index is always used as a basis of resolving incoming RDMA connections, as compared to the GID value itself. Background: On a per infiniband port basis, the GID identifier is not a unique identifier! This assumption falls apart when VLAN ID, IPv6 scope ID and RoCE type, as supported by RoCE v2, is taken into account. This additional information is stored in the so-called GID attributes and is needed to correctly identify the destination network interface for an incoming connection. Different VLANs are allowed to define the same IPv4 addresses and especially for the default IPv6 link-local addresses or when using so-called containers or jails, this is true. The VNET information for the destination network interface is needed in order to perform the L2 address lookup in the right Virtual Network Stack context. Consequently old functions previously used by RoCE v1, like rdma_addr_find_smac_by_sgid() are impossible to support, because there can be multiple identical GIDs associated with the same infiniband port, and the answer to such a request becomes undefined. This function has been removed. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 14:24:30 +00:00
Hans Petter Selasky	676833dadb	Pass valid if_index to rdma_addr_find_l2_eth_by_grh() in ibcore when possible. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 14:22:36 +00:00
Hans Petter Selasky	197461919d	Add support for loopback in ibcore. Implement the missing pieces in addr_resolve() to support loopback addresses. IB core will test for the IFF_LOOPBACK flag in the network interface and treat these devices in a special way. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 13:57:37 +00:00
Hans Petter Selasky	57b6d9f6ef	Make sure to register the VLAN GIDs using the VLAN network interface and not the parent one in ibcore. Else looking up the VLAN GIDs will fail for VLAN IPs. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 12:39:34 +00:00
Hans Petter Selasky	891538abb5	Need to check for IPv6 linklocal address inside rdma_resolve_addr() in ibcore. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 12:04:34 +00:00
Hans Petter Selasky	6d36a2c769	Map type of service, TOS, to IB or VLAN service level 1:1 in ibcore. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:59:54 +00:00
Hans Petter Selasky	5b94bd8a69	Select RoCEv2 by default in ibcore. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:58:37 +00:00
Hans Petter Selasky	703ea406d5	Make deletion of RoCE GID entries synchronous in ibcore. When a network device is departing, the RoCE GID entries should be cleared before the default L2 link layer address is freed. Else a NULL pointer access may happen. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:57:26 +00:00
Hans Petter Selasky	42fa341d9c	Add support for IPv6 link local GIDs equal to the default GID for VLANs in ibcore. IPv6 link local addresses are usually derived from the netdev MAC address. This is applicable to VLAN devices and its lower netdevice as well. In such cases the IPv6 link local address is a duplicate of the default GID. Now that link local IPv6 addresses based GIDs are supported, allow adding such GID entries in the GID table. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:55:29 +00:00
Hans Petter Selasky	be5c019762	Do not add RoCEv2 default GID in ibcore when IPv6 is disabled to honor the networking stack's IPv6 disabled setting. Else the offload HCA can start using IPv6 packets for QPs. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:52:39 +00:00
Hans Petter Selasky	09938b2185	Add missing FreeBSD tags and SVN properties to ibcore. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-03-05 11:49:45 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Hans Petter Selasky	e5806bf466	Compile fix for LINT-NOIP kernel target. Sponsored by: Mellanox Technologies	2017-11-24 12:05:49 +00:00
Hans Petter Selasky	95ef56abc2	Build fix for kernel LINT target. Sponsored by: Mellanox Technologies	2017-11-24 09:12:13 +00:00
Hans Petter Selasky	4051f0c8ed	Temporary fix to avoid hitting kernel assert. Don't dirty VM pages at this point. Sponsored by: Mellanox Technologies	2017-09-16 16:32:36 +00:00
Hans Petter Selasky	c4b28ce0b9	Remove no longer needed linux_poll_wakeup() calls. This is now handled by "wake_up()" in the LinuxKPI. Accessing the file pointer directly might cause use after free issues. Sponsored by: Mellanox Technologies	2017-09-16 16:31:30 +00:00
Hans Petter Selasky	526f596179	Embedding the scope ID is no longer needed for IPv6. Sponsored by: Mellanox Technologies	2017-09-16 16:28:48 +00:00
Hans Petter Selasky	e02ecc60b8	Set length field of socket address. Sponsored by: Mellanox Technologies	2017-09-16 16:28:19 +00:00
Hans Petter Selasky	6af166cc7b	Fix for refcount leak. Sponsored by: Mellanox Technologies	2017-09-16 16:27:24 +00:00
Hans Petter Selasky	d18b4113ed	Make sure the socket address length field gets set. Sponsored by: Mellanox Technologies	2017-09-16 16:26:46 +00:00
Hans Petter Selasky	bf00eaa12c	Improve ibcore address resolving: - Add more sanity checks. - Preserve source port number when resolving address. - Remove no longer needed scope ID hacks for IPv6. Sponsored by: Mellanox Technologies	2017-09-16 16:24:38 +00:00
Hans Petter Selasky	bca9d05fdb	Merge ^/head r319973 through 321382.	2017-07-23 15:22:06 +00:00
Hans Petter Selasky	9f715dc162	ibcore: Delete old files and add new ones missed in the initial commit for this projects branch. Sponsored by: Mellanox Technologies	2017-07-03 08:32:52 +00:00
Hans Petter Selasky	0faccd643c	ibcore: Fix for accessing invalid network device. Sponsored by: Mellanox Technologies	2017-07-03 08:29:49 +00:00
Hans Petter Selasky	ddc6ee3792	ibcore: Make sure all GID entries are deleted upon a network device departure. Sponsored by: Mellanox Technologies	2017-07-03 08:28:35 +00:00
Mark Johnston	4eb18346d1	Avoid including list.h in LinuxKPI headers. list.h includes a number of FreeBSD headers as a workaround for the LIST_HEAD name collision. To reduce pollution, avoid including list.h in commonly used headers when it is not explicitly needed. Reviewed by: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11249	2017-06-18 16:43:57 +00:00
Hans Petter Selasky	478d300572	Initial RoCE/infiniband kernel update to Linux v4.9. This patch currently supports: - ibcore as a kernel module only - krping as a kernel module only - ipoib as a kernel module only Sponsored by: Mellanox Technologies	2017-06-15 12:47:48 +00:00
Mark Johnston	e804572d2b	Fix indentation. MFC after: 1 week	2017-06-14 16:55:23 +00:00
Gleb Smirnoff	779f106aa1	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parralelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock hold if, we are accepting, or without unp_link_lock in case if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basicly did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770	2017-06-08 21:30:34 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Hans Petter Selasky	82d0140707	Add full VNET support to the inet_get_local_port_range() function in the LinuxKPI. MFC after: 1 week Sponsored by: Mellanox Technologies	2017-03-22 15:46:31 +00:00
Hans Petter Selasky	404027276b	Add basic support for VIMAGE to the LinuxKPI and ibcore. Support is implemented by mapping Linux's "struct net" into FreeBSD's "struct vnet". Currently only vnet0 is supported by ibcore. MFC after: 1 week Sponsored by: Mellanox Technologies	2017-03-16 09:59:35 +00:00
Navdeep Parhar	017296dbb6	cxgbe/iw_cxgbe: fix various double-close panics with iWARP sockets. Sockets representing the TCP endpoints for iWARP connections are allocated by the ibcore module. Before this revision they were closed either by the ibcore module or the iw_cxgbe hardware driver depending on the state transitions during connection teardown. This is error prone and there were cases where both iw_cxgbe and ibcore closed the socket leading to double-free panics. The fix is to let ibcore close the sockets it creates and never do it in the driver. - Use sodisconnect instead of soclose (preceded by solinger = 0) in the driver to tear down an RDMA connection abruptly. This does what's intended without releasing the socket's fd reference. - Close the socket in ibcore when the iWARP iw_cm_id is destroyed. This works for all kinds of sockets: clients that initiate connections, listeners, and sockets accepted off of listeners. Reviewed by: Steve Wise @ Open Grid Computing, hselasky@ MFC after: 3 days Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D9796	2017-02-28 19:27:41 +00:00
Navdeep Parhar	3d4f452402	Avoid NULL dereference in a couple of sysctl handlers in ibcore. iw_cxgbe sets ib_device->dma_device to NULL (since r311880). Reviewed by: hselasky@ Sponsored by: Chelsio Communications	2017-02-23 07:48:58 +00:00
Navdeep Parhar	a5234e8ccb	Do not free an uninitialized pointer on soaccept failure in the iWARP connection manager. Sponsored by: Chelsio Communications	2016-08-26 08:25:28 +00:00
Hans Petter Selasky	7dc445f8d3	Add support for setting blocking and non-blocking mode on /dev/rdma_cm by returning success on FIONBIO and FIOASYNC IOCTLs. The actual flags handling is done by the kern_ioctl() function. Reported by: Alex Bowden <alex.bowden@outlook.com> Sponsored by: Mellanox Technologies MFC after: 1 week	2016-08-18 08:49:02 +00:00
Mark Johnston	4e071758a7	MFV be9130cc9: "IB/cma: Check for GID on listening devices first" This is an optimization that improves IB connection setup times. Discussed with: hselasky Obtained from: Linux MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2016-08-01 20:29:09 +00:00
Mark Johnston	82f1d3ea2f	MFV 29f27e847: "IB/cma: Use cached gids" This addresses a regression from an earlier upstream change which caused cma_acquire_dev() to bypass the port GID cache and instead query the HCA for each entry in its GID table. These queries can become extremely slow on multiport devices, which has a negative impact on connection setup times. Discussed with: hselasky Obtained from: Linux MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2016-08-01 20:27:11 +00:00
Navdeep Parhar	9e2d05841e	Fix bug in iwcm that caused a panic in iw_cm_wq when krping is run repeatedly in a tight loop. Approved by: re (gjb@) Obtained from: hselasky@ (part of larger changes in D5791)	2016-06-14 20:58:05 +00:00
George V. Neville-Neil	fd9e88e0f9	Fix up the Infiniband code to handle the new arpresolve.	2016-06-02 20:53:43 +00:00

91 Commits