Commit Graph

498 Commits

Author SHA1 Message Date
Conrad Meyer
bab54619e9 SDP: Add a dbg() on QP events
Sponsored by:	Dell EMC Isilon
2019-08-17 03:07:41 +00:00
Conrad Meyer
1ac512e80b SDP: Also log a nice status string in RX WC error dbg()
Sponsored by:	Dell EMC Isilon
2019-08-17 03:06:46 +00:00
Conrad Meyer
92d90f6142 SDP: Include nice string names for raw event numbers in a dbg()
Sponsored by:	Dell EMC Isilon
2019-08-17 03:05:09 +00:00
Conrad Meyer
6669d5459b SDP: SYSCTL_DECL SDP-wide sysctl node in header
This allows use of the shared _net_inet_sdp in more than one compilation
unit.  (Nothing in-tree uses this today, but some of Isilon's out-of-tree
SDP enhancements add sysctls below the node.)

Sponsored by:	Dell EMC Isilon
2019-08-17 03:03:26 +00:00
Slava Shwartsman
bb43866c38 Fix prio vs. nonprio tagged traffic in RDMACM
In current RDMACM implementation RDMACM server will not find a GID
index when the request was prio-tagged and the sever is non
prio-tagged and vise-versa.
According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should
be considered as untagged. Treat RDMACM request the same.

Reviewed by:    hselasky, kib
MFC after:      3 Days
Sponsored by:   Mellanox Technologies
2019-06-04 06:21:31 +00:00
Conrad Meyer
e12be3218a Include eventhandler.h in more compilation units
This was enumerated with exhaustive search for sys/eventhandler.h includes,
cross-referenced against EVENTHANDLER_* usage with the comm(1) utility.  Manual
checking was performed to avoid redundant includes in some drivers where a
common os_bsd.h (for example) included sys/eventhandler.h indirectly, but it is
possible some of these are redundant with driver-specific headers in ways I
didn't notice.

(These CUs did not show up as missing eventhandler.h in tinderbox.)

X-MFC-With:	r347984
2019-05-21 01:18:43 +00:00
Hans Petter Selasky
013f1e1435 Add new rates to ibcore.
Add the new rates that were added to the Infiniband specification as part of
HDR and 2x support.

Submitted by:	slavash@
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2019-05-08 10:55:47 +00:00
Hans Petter Selasky
afbe61616f Handle IB_EVENT_DEVICE_FATAL event in ipoib.
Perform flush if IB_EVENT_DEVICE_FATAL was received.

Submitted by:	slavash@
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2019-05-08 10:51:49 +00:00
Hans Petter Selasky
cb678cb911 Fix endless loop in ipoib_poll().
ib_req_notify_cq may return negative value which will indicate a
failure. In the case of uncorrectable error, we will end up in an
endless loop. Fix that, by going to another loop with poll_more
only if there is anything left to poll.

Submitted by:	slavash@
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2019-05-08 10:42:05 +00:00
Hans Petter Selasky
38f38e9fda Make sure to error out when arming the CQ fails in ibcore.
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2019-05-08 10:32:45 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Mark Johnston
2f2ddd68a5 Support MSG_DONTWAIT in send*(2).
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead.  Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.

Submitted by:	Greg V <greg@unrelenting.technology>
Reviewed by:	tuexen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D18728
2019-01-04 17:31:50 +00:00
Slava Shwartsman
186058782a ipoib: Notify on modify QP failure only when relevant
Modify QP can fail and it can be acceptable, like when moving from RST to
ERR state, all the rest are not acceptable and a message to the log
should be printed.

The current code prints on all failures and many messages like:
"Failed to modify QP to ERROR state" appear, even when supported by the
state machine of the QP object.

Linux commit:
5dc78ad1904db597bdb4427f3ead437aae86f54c

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:27:17 +00:00
Slava Shwartsman
a061f0eb65 ipoib: increase the non-cm queue length
When a packet needs fragmentation, it might generate more than 3 fragments.
With the queue length 3, all fragments are generated faster than the
queue is drained, which effectively drops fourth and later fragments on
the floor.

Submitted by:   kib@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:26:47 +00:00
Slava Shwartsman
d705eff259 ipoib: Don't do a light flush when MTU is unchanged.
When changing the MTU of ibX network interfaces, check that the MTU was really
changed before requesting an update of the multicast rules. Else we might go
into an infinite loop joining and leaving ibX multicast groups towards the
opensm master interface.

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:26:17 +00:00
Slava Shwartsman
099ad46e81 ipoib: correct setting MTU from inside ipoib(4).
It is not enough to set ifnet->if_mtu to change the interface MTU.
System saves the MTU for route in the radix tree, and route cache keeps
the interface MTU as well. Since addition of the multicast group causes
recalculation of MTU, even bringing the interface up changes MTU from
4042 to 1500, which makes the system configuration inconsistent. Worse,
ip_output() prefers route MTU over interface MTU, so large packets are
not fragmented and dropped on floor.

Fix it for ipoib(4) using the same approach (or hack) as was applied
for it_tun/if_tap in r339012.  Thanks to bz@ for giving the hint.

Submitted by:   kib@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:25:47 +00:00
Slava Shwartsman
e13619b68b ibcore: Fix clearing of bound device interface.
Binding to a loopback device is not allowed. Make sure the destination
device address is global by clearing the bound device interface.
Only do this conditionally, else link local addresses won't work.

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:25:13 +00:00
Slava Shwartsman
a9c20af23d ibcore: ip6_dev_find() needs to know the scope ID.
Else the wrong network device can be returned for link-local addresses.

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:24:43 +00:00
Slava Shwartsman
5ace00dffe ibcore: Fix sleeping in atomic when RoCE is used
A couple of places in the CM do

    spin_lock_irq(&cm_id_priv->lock);
    ...
    if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg))

However when the underlying transport is RoCE, this leads to a sleeping function
being called with the lock held - the callchain is

    cm_alloc_response_msg() ->
      ib_create_ah_from_wc() ->
        ib_init_ah_from_wc() ->
          rdma_addr_find_l2_eth_by_grh() ->
            rdma_resolve_ip()

and rdma_resolve_ip() starts out by doing

    req = kzalloc(sizeof *req, GFP_KERNEL);

not to mention rdma_addr_find_l2_eth_by_grh() doing

    wait_for_completion(&ctx.comp);

to wait for the task that rdma_resolve_ip() queues up.

Fix this by moving the AH creation out of the lock.

Linux commit:
c76161181193985087cd716fdf69b5cb6cf9ee85

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:24:12 +00:00
Slava Shwartsman
0231669450 ibcore: Add missing unref of netdevice.
Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:23:44 +00:00
Slava Shwartsman
ae8534ca28 ibcore: Fix loopback with rdma-cm.
Trying to validate loopback fails because rtalloc1() resolves system
local addresses to the loopback network interface, lo0. Fix this by
explicitly checking for loopback during validation of the source
and destination network address. If the source address belongs to
a local network interface and is equal to the destination address,
there is no need to run the destination address through rtalloc1().

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:23:14 +00:00
Slava Shwartsman
f3cf3b7e84 ibcore: Make sure all VNETs are scanned for VLAN interfaces.
The master network interface and the VLANs may reside in different VNETs.
Make sure that all VNETs are searched when scanning for GID entries.

Submitted by:   netapp
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:22:43 +00:00
Slava Shwartsman
af60974508 ibcore: Always check return value from ib_init_ah_from_wc().
This prevents code from accepting RoCEv1 connections when
only ROCEv2 is enabled and vice versa.

Linux commit:
0c4386ec77cfcd0ccbdbe8c2e67dd3a49b2a4c7f

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:22:07 +00:00
Slava Shwartsman
9fc9810098 ibcore: Add missing check for failure.
Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:21:20 +00:00
Slava Shwartsman
33d7f9b8fb ibcore: Fix an array index check
The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
(80) elements. Hence compare the array index with that value instead
of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
reports the following:

Overrunning array class->method_table of 80 8-byte elements at element index 127
(byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class)
(which evaluates to 127).

Linux commit:
2fe2f378dd45847d2643638c07a7658822087836

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:20:51 +00:00
Slava Shwartsman
fe521b4443 ibcore: Check ib_find_pkey() return value.
Linux commit:
d3a2418ee36a59bc02e9d454723f3175dcf4bfd9

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:20:22 +00:00
Slava Shwartsman
4aa5230dc5 ibcore: Add support for IB_SPEED_HDR in sysfs rate printout.
Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:19:52 +00:00
Slava Shwartsman
475c8de7bf ibcore: Don't access invalid port.
The port number in the listen_id_priv has been observed to be zero which
means no port has been selected. The current code lacks a check for invalid
port number.

Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:19:21 +00:00
Slava Shwartsman
4b9b52a1bd ibcore: Discard unused error codes.
Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:18:50 +00:00
Slava Shwartsman
f628150b6c ibcore: Make sure GID index variable gets initialized.
Submitted by:   hselasky@
Approved by:    hselasky (mentor)
MFC after:      1 week
Sponsored by:   Mellanox Technologies
2018-12-05 13:18:20 +00:00
Mark Johnston
79db6fe7aa Plug some networking sysctl leaks.
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland.  Fix a number of such handlers.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	tuexen
MFC after:	3 days
Security:	kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18301
2018-11-22 20:49:41 +00:00
Hans Petter Selasky
7877f59352 Introduce and use sgid_index in CM requests in ibcore.
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.

Now that we have the GID index available, introduce SGID index in
incoming CM requests and refer to the netdevice of it.

While at it fix some incorrect uses of init_net and make sure
the rdma_create_id() function stores the VNET it is passed.

Based on linux commit:
cee104334c98dd04e9dd4d9a4fa4784f7f6aada9

MFC after:		3 days
Approved by:		re (gjb)
Sponsored by:		Mellanox Technologies
2018-09-09 07:20:15 +00:00
Hans Petter Selasky
86156495f4 Implement get network interface by params function in ipoib.
Also fix the validate_ipv4_net_dev() and validate_ipv6_net_dev() functions
which had source and destination addresses swapped, and didn't set the
scope ID for IPv6 link-local addresses.

This allows applications like krping to work using IPoIB devices.

MFC after:		3 days
Approved by:		re (gjb)
Sponsored by:		Mellanox Technologies
2018-09-07 18:05:09 +00:00
Slava Shwartsman
b4df6efb4e ibcore: Fix endless loop in searching for matching VLAN device
In r337943 ifnet's if_pcp was set to the PCP value in use
instead of IFNET_PCP_NONE.
Current ibcore code assumes that if_pcp is IFNET_PCP_NONE with
VLAN interfaces so it can identify prio-tagged traffic.
Fix that by explicitly verifying that that the if_type is IFT_ETHER
and not IFT_L2VLAN.

MFC after:      3 days
Approved by:    re (Marius), hselasky (mentor), kib (mentor)
Sponsored by:   Mellanox Technologies
2018-09-06 12:26:57 +00:00
Hans Petter Selasky
fb11674d30 Only NULL check the VNET pointer when VIMAGE is enabled in ibcore.
Else a NULL VNET pointer should be ignored. This fixes address resolving
when VIMAGE is disabled.

MFC after:		3 days
Sponsored by:		Mellanox Technologies
2018-07-31 11:23:44 +00:00
Hans Petter Selasky
cda1e10ca0 Use __FBSDID() for RCS tags in ibcore.
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:47:14 +00:00
Hans Petter Selasky
8ac2952539 Remove blank line.
MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:44:16 +00:00
Hans Petter Selasky
1bebf70fe7 Add support for IPv6 multicast in ibcore.
This change allows us to join IPv6 multicast networks.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:37:16 +00:00
Hans Petter Selasky
c68cb3a60a Add support for RoCEv2 multicast in ibcore.
When creating address handle from multicast GID, set MAC according to
the appropriate formula instead of searching for it in the GID table:
- For IPv4 multicast GID use ip_eth_mc_map().
- For IPv6 multicast GID use ipv6_eth_mc_map().

Linux commit:
9636a56fa864464896bf7d1272c701f2b9a57737

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:36:04 +00:00
Hans Petter Selasky
e3e6820f69 Honor return status of ib_init_ah_from_mcmember() in ibcore.
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler().  Honor it and return error event if ah attribute
initialization failed.

Linux commit:
6d337179f28cc50ddd7e224f677b4cda70b275fc

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:34:29 +00:00
Hans Petter Selasky
0f2b4041fe Honor port_num while resolving GID for IB link layer in ibcore.
ah_attr contains the port number to which cm_id is bound. However, while
searching for GID table for matching GID entry, the port number is
ignored.

This could cause the wrong GID to be used when the ah_attr is converted to
an AH.

Linux commit:
563c4ba3bd2b8b0b21c65669ec2226b1cfa1138b

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:33:20 +00:00
Hans Petter Selasky
56acbf0932 Set IPv4 TOS and IPv6 traffic class field for RoCEv2 traffic in ibcore.
The current implementation assumes a static mapping between
the TOS bits and the priority code point, PCP bits.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:32:09 +00:00
Hans Petter Selasky
02825401aa Fix for loopback detection in address resolve logic in ibcore.
When a loopback address is detected use the network interface which
has the loopback flag set to trigger loopback logic in address resolve.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:30:32 +00:00
Hans Petter Selasky
f736cb92cf Check port number supplied by user verbs cmds in ibcore.
The ib_uverbs_create_ah() ind ib_uverbs_modify_qp() calls receive
the port number from user input as part of its attributes and assumes
it is valid. Down on the stack, that parameter is used to access kernel
data structures.  If the value is invalid, the kernel accesses memory
it should not.  To prevent this, verify the port number before using it.

Linux commit:
5ecce4c9b17bed4dc9cb58bfb10447307569b77b
a62ab66b13a0f9bcb17b7b761f6670941ed5cd62
5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:29:14 +00:00
Hans Petter Selasky
56326b4755 Depend on IPv6 stack to resolve link local address for RoCEv2 in ibcore.
RoCEv1 does not use the IPv6 stack to resolve the link local DGID since it
uses GID address. It forms the DMAC directly from the DGID.

Linux commit:
56d0a7d9a0f045ee27a001762deac28c7d28e2e4

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:27:31 +00:00
Hans Petter Selasky
df9bb49b66 Fix kernel crash during fail to initialize device in ibcore.
This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().

This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().

This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.

Linux commit:
4be3a4fa51f432ef045546d16f25c68a1ab525b9

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:26:09 +00:00
Hans Petter Selasky
855ad7cf39 Check AF family prior resolving address and introduce safer rdma_addr_size() variants in ibcore.
Garbage supplied by user will cause to UCMA module provide zero
memory size for memcpy(), because it wasn't checked, it will
produce unpredictable results in rdma_resolve_addr().

There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB.  When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.

Fix this by introducing new variants
    int rdma_addr_size_in6(struct sockaddr_in6 *addr);
    int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);

that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in.  We can use
these new variants to check what size userspace has passed in before
copying any addresses.

Linux commit:
2975d5de6428ff6d9317e9948f0968f7d42e5d74
09abfe7b5b2f442a85f4c4d59ecf582ad76088d7
84652aefb347297aa08e91e283adf7b18f77c2d5

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:24:39 +00:00
Hans Petter Selasky
41decb450f Check for a cm_id->device in all user calls that need it in ibcore.
This was done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than  half of the call sites.

The 11 remaining call sites to ucma_get_ctx() were manually audited.

Linux commit:
4b658d1bbc16605330694bb3ef2570c465ef383d
8b77586bd8fe600d97f922c79f7222c46f37c118

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:22:26 +00:00
Hans Petter Selasky
89a0812fa6 Restore initialisation of ctx->uid in ucma_create_id() in ibcore.
This fixes a regression issue after r336373.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:21:05 +00:00
Hans Petter Selasky
083cdbac35 Fix kernel panic while using XRC_TGT QP type in ibcore.
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.

Linux commit:
f45765872e7aae7b81feb3044aaf9886b21885ef

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:18:16 +00:00
Hans Petter Selasky
765bd83cca Fix NULL pointer dereference during device removal in ibcore.
As part of ib_uverbs_remove_one which might be triggered upon
reset flow, we trigger IB_EVENT_DEVICE_FATAL event to userspace
application.
If device was removed after uverbs fd was opened but before
ib_uverbs_get_context was called, the event file will be accessed
before it was allocated, result in NULL pointer dereference:

Linux commit:
870201f95fcbd19538aef630393fe9d583eff82e

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:16:54 +00:00
Hans Petter Selasky
ed222171a9 Fix access to non-initialized CM_ID object in ibcore.
The attempt to join multicast group without ensuring that CMA device
exists will lead to the following crash reported by syzkaller.

Linux commit:
7688f2c3bbf55e52388e37ac5d63ca471a7712e1

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:15:50 +00:00
Hans Petter Selasky
91f0e6e1d8 Avoid that ib_drain_qp() triggers an out-of-bounds stack access in ibcore.
Linux commit:
a1ae7d0345edd593d6725d3218434d903a0af95d

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:14:20 +00:00
Hans Petter Selasky
7477a89ae2 Ensure that CM_ID exists prior to access it in ibcore.
Prior to access UCMA commands, the context should be initialized
and connected to CM_ID with ucma_create_id(). In case user skips
this step, he can provide non-valid ctx without CM_ID and cause
to multiple NULL dereferences.

Also there are situations where the create_id can be raced with
other user access, ensure that the context is only shared to
other threads once it is fully initialized to avoid the races.

Linux commit:
e8980d67d6017c8eee8f9c35f782c4bd68e004c9

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:13:11 +00:00
Hans Petter Selasky
f4546fa376 Add support for prio-tagged traffic for RDMA in ibcore.
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.

The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.

Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This is logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.

The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:11:53 +00:00
Hans Petter Selasky
682bd3fe12 Set default GID type as RoCE when resolving RoCE route in ibcore.
cma_iboe_set_mgid() is updated to reflect the RoCEv2 GID check.

Linux commit:
5c181bda77f409d89ad513528eccac5f3a416474

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:09:17 +00:00
Hans Petter Selasky
b70db3278f Set RoCEv2 MGID according to spec in ibcore.
RoCEv2 Annex states that for RoCEv2 over IPv4, the corresponding
IPv4 address is encoded into the GID according to the following rule:
GID= :ffff:<IPv4 address>

Remove the 0xff0e prefix for RoCEv2 packets with IPv4 and leave it
zeroed and change rdma_is_multicast_addr() to consider the new logic.

Linux commit:
be1d325a335840a86c133a56c6a911c368bac0fd
1c3aea2bc8f0b2e5b57375ead40457ff75a3a2ec

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:07:36 +00:00
Hans Petter Selasky
71698382e3 For multicast functions in ibcore, verify that LIDs are multicast LIDs.
The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).

Add check to verify that the MLID value is in the correct address
range.

RoCE Annex (A16.9.10/11) declares that during attach (detach) QP to a
multicast group, if the QP is associated with a RoCE port, the
multicast group MLID is unused and is ignored.

During attach or detach multicast, when the QP is associated with a
port, it is enough to check the port's link layer and validate the
LID only if it is Infiniband. Otherwise, avoid validating the
multicast LID.

Linux commit:
8561eae60ff9417a50fa1fb2b83ae950dc5c1e21
5236333592244557a19694a51337df6ac018f0a7

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:04:36 +00:00
Hans Petter Selasky
fed17c5852 Fix for RDMA loopback over VLAN in ibcore.
Implement a more generic solution for detecting loopback.
The problem was that the default netdevice was resolved
for loopback also when VLAN was used. Use real network
device instead of loopback device for bound device
interface.

How to test:
ucmatose -b 127.0.0.1 -p 20090
ucmatose -s 5.6.5.1 -p 20090

Note that RDMA treats the IPv4 and IPv6 loopback
addresses like any address.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 09:02:29 +00:00
Hans Petter Selasky
f9899e4567 Add native FreeBSD support for multicast in ibcore.
This change adds support for registering multicast addresses,
both IPv4 and IPv6.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 08:59:34 +00:00
Hans Petter Selasky
7a658d6545 If the MGID/MLID pair is not on the list return an error in ibcore.
A list of MGID/MLID pairs is built when doing a multicast attach.  When
the multicast detach is called, the list is searched, and regardless of
the search outcome, the driver detach is called.

If an MGID/MLID pair is not on the list, driver detach should not be
called, and an error should be returned.  Calling the driver without
removing an MGID/MLID pair from the list can leave the core and driver
out of sync.

Linux commit:
20c7840a77ddcb2ed2fbd66e8197db2868495751

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 08:54:40 +00:00
Hans Petter Selasky
ef24f05897 Add lock to multicast handlers in ibcore.
When two handlers used the same object in the old schema, we blocked
the process in the kernel. The new schema just returns -EBUSY. This
could lead to different behaviour in applications between the old
schema and the new schema. In most cases, using such handlers
concurrently could lead to crashing the process. For example, if
thread A destroys a QP and thread B modifies it, we could have the
destruction happens before the modification. In this case, we are
accessing freed memory which could lead to crashing the process.
This is true for most cases. However, attaching and detaching
a multicast address from QP concurrently is safe. Therefore, we
preserve the original behaviour by adding a lock there.

Linux commit:
f48b726920d96dcd1860df06143bdea7d6d7dcc3

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 08:52:29 +00:00
Hans Petter Selasky
8b767bd7d7 Only update source address when resolving is successful in ibcore.
When resolving an IP address in ibcore, only update the source address
upon normal completion. The ibcore address resolve function does not
care about the scope ID value of the IPv6 link-local addresses and expects
this information has already been extracted into the bound_dev_if field.
Because the same IPv6 link-local address can exist on multiple interfaces
the ibcore address resolver gets confused and returns ENETUNREACH.

Instead of updating both source address and bound_dev_if just keep the
address set to any address until resolving completes. For the sake of code
symmetry a similar change has been applied to the IPv4 address resolve path.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 08:48:30 +00:00
Hans Petter Selasky
e0cba2d201 Process address resolve requests at least one time per second in ibcore.
When setting a large address resolve timeout it was observed that the
address resolving would succeed at the timeout and not when the address
was available. Make sure the address resolving requests are processed no
slower than one time every second.

While at it use "int" for jiffies instead of "unsigned long" to match
FreeBSD ticks.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-07-17 08:34:49 +00:00
Hans Petter Selasky
3ea947d615 Revert r335094 and properly fix OFED build after r335053.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-06-14 07:55:10 +00:00
Matt Macy
67b3b4d245 fix OFED build after r335053 2018-06-13 23:30:54 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Brooks Davis
38d958a647 Improve copy-and-pasted versions of SIOCGIFADDR.
The original implementation used a reference to ifr_data and a cast to
do the equivalent of accessing ifr_addr. This was copied multiple
times since 1996.

Approved by:	kib
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14873
2018-03-27 20:51:49 +00:00
Hans Petter Selasky
5cd5781c75 Remove redundant integer cast in ibcore. The "ref_count" field already
has integer type.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-19 13:51:33 +00:00
Hans Petter Selasky
d4eeed42ba Make sure VNET is set when calling sa6_recoverscope() in ibcore.
Else panic will occur when VIMAGE is enabled.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 13:32:52 +00:00
Hans Petter Selasky
aa5962f908 Define values instead of using hardcoding.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 13:30:38 +00:00
Hans Petter Selasky
c131a22379 Recover IPv6 scope ID for multicast link-local addresses as well as
unicast link-local addresses.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 13:28:12 +00:00
Hans Petter Selasky
cc79d31d6f Embed the IPv6 scope ID before calling rtalloc1() in ibcore.
Else rtalloc1() will resolve to the loopback interface.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 13:25:40 +00:00
Hans Petter Selasky
d0a82c2414 Add IB_SPEED_HDR definition in ibcore.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 13:01:00 +00:00
Hans Petter Selasky
d0a9dbc779 Make sure the IPv6 scope ID gets properly masked in ibcore.
When exchanging CM messages the IPv6 scope ID should be ignored
for link local addresses when doing comparisons. Make sure the
scope ID is always set to zero for link local addresses.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 12:58:51 +00:00
Hans Petter Selasky
03ae76a693 Fix for use-after-free when using delayed work structures in ibcore.
It is not enough to cancel delayed work structures before freeing.
Always cancel delayed work synchronously before freeing!

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-07 12:56:04 +00:00
Hans Petter Selasky
1456d97c01 Optimize ibcore RoCE address handle creation from user-space.
Creating a UD address handle from user-space or from the kernel-space,
when the link layer is ethernet, requires resolving the remote L3
address into a L2 address. Doing this from the kernel is easy because
the required ARP(IPv4) and ND6(IPv6) address resolving APIs are readily
available. In userspace such an interface does not exist and kernel
help is required.

It should be noted that in an IP-based GID environment, the GID itself
does not contain all the information needed to resolve the destination
IP address. For example information like VLAN ID and SCOPE ID, is not
part of the GID and must be fetched from the GID attributes. Therefore
a source GID should always be referred to as a GID index. Instead of
going through various racy steps to obtain information about the
GID attributes from user-space, this is now all done by the kernel.

This patch optimises the L3 to L2 address resolving using the existing
create address handle uverbs interface, retrieving back the L2 address
as an additional user-space information structure.

This commit combines the following Linux upstream commits:

IB/core: Let create_ah return extended response to user
IB/core: Change ib_resolve_eth_dmac to use it in create AH
IB/mlx5: Make create/destroy_ah available to userspace
IB/mlx5: Use kernel driver to help userspace create ah
IB/mlx5: Report that device has udata response in create_ah

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 14:34:52 +00:00
Hans Petter Selasky
bf8641fedb Get correct network device when accepting incoming RDMA connections in ibcore.
This patch ensures the GID index is always used as a basis of resolving
incoming RDMA connections, as compared to the GID value itself.

Background:
On a per infiniband port basis, the GID identifier is not a unique identifier!
This assumption falls apart when VLAN ID, IPv6 scope ID and RoCE type,
as supported by RoCE v2, is taken into account. This additional
information is stored in the so-called GID attributes and is needed to
correctly identify the destination network interface for an incoming
connection.

Different VLANs are allowed to define the same IPv4 addresses and especially
for the default IPv6 link-local addresses or when using so-called containers
or jails, this is true.

The VNET information for the destination network interface is needed in
order to perform the L2 address lookup in the right Virtual Network Stack
context.

Consequently old functions previously used by RoCE v1, like
rdma_addr_find_smac_by_sgid() are impossible to support, because
there can be multiple identical GIDs associated with the same
infiniband port, and the answer to such a request becomes undefined.
This function has been removed.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 14:24:30 +00:00
Hans Petter Selasky
676833dadb Pass valid if_index to rdma_addr_find_l2_eth_by_grh() in ibcore when possible.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 14:22:36 +00:00
Hans Petter Selasky
197461919d Add support for loopback in ibcore.
Implement the missing pieces in addr_resolve() to support loopback
addresses. IB core will test for the IFF_LOOPBACK flag in the network
interface and treat these devices in a special way.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 13:57:37 +00:00
Hans Petter Selasky
57b6d9f6ef Make sure to register the VLAN GIDs using the VLAN network interface
and not the parent one in ibcore. Else looking up the VLAN GIDs will
fail for VLAN IPs.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 12:39:34 +00:00
Hans Petter Selasky
891538abb5 Need to check for IPv6 linklocal address inside rdma_resolve_addr() in ibcore.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 12:04:34 +00:00
Hans Petter Selasky
6d36a2c769 Map type of service, TOS, to IB or VLAN service level 1:1 in ibcore.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:59:54 +00:00
Hans Petter Selasky
5b94bd8a69 Select RoCEv2 by default in ibcore.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:58:37 +00:00
Hans Petter Selasky
703ea406d5 Make deletion of RoCE GID entries synchronous in ibcore.
When a network device is departing, the RoCE GID entries should be
cleared before the default L2 link layer address is freed. Else a NULL
pointer access may happen.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:57:26 +00:00
Hans Petter Selasky
42fa341d9c Add support for IPv6 link local GIDs equal to the default GID for
VLANs in ibcore.

IPv6 link local addresses are usually derived from the netdev MAC
address. This is applicable to VLAN devices and its lower netdevice as
well. In such cases the IPv6 link local address is a duplicate of the
default GID.

Now that link local IPv6 addresses based GIDs are supported, allow
adding such GID entries in the GID table.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:55:29 +00:00
Hans Petter Selasky
be5c019762 Do not add RoCEv2 default GID in ibcore when IPv6 is disabled to honor the
networking stack's IPv6 disabled setting. Else the offload HCA can start using
IPv6 packets for QPs.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:52:39 +00:00
Hans Petter Selasky
09938b2185 Add missing FreeBSD tags and SVN properties to ibcore.
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-03-05 11:49:45 +00:00
Hans Petter Selasky
33ec1ccbae Import the mthca kernel side infiniband driver from Linux 4.9 and fix
compilation under FreeBSD. The mthca driver was temporarily removed as
part of the Linux 4.9 RoCE/infinband upgrade.

Top commit in Linux source tree:
69973b830859bc6529a7a0468ba0d80ee5117826

Sponsored by:	Mellanox Technologies
2018-02-13 17:04:34 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Hans Petter Selasky
e5806bf466 Compile fix for LINT-NOIP kernel target.
Sponsored by:	Mellanox Technologies
2017-11-24 12:05:49 +00:00
Hans Petter Selasky
fd9e423c88 Make sure all tasks are cancelled synchronously in ipoib to avoid
use after free.

Sponsored by:	Mellanox Technologies
2017-11-24 09:55:20 +00:00
Hans Petter Selasky
68732efcd9 Build fix for ipoib when CONFIG_INFINIBAND_IPOIB_CM is defined.
Sponsored by:	Mellanox Technologies
2017-11-24 09:52:56 +00:00
Hans Petter Selasky
95ef56abc2 Build fix for kernel LINT target.
Sponsored by:	Mellanox Technologies
2017-11-24 09:12:13 +00:00
Hans Petter Selasky
82725ba9bf Merge ^/head r325999 through r326131. 2017-11-23 14:28:14 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Hans Petter Selasky
dd00abf2d7 Make sure the ib_wr_opcode enum is signed by adding a negative dummy element.
Different compilers may optimise the enum type in different ways. This ensures
coherency when range checking the value of enums in ibcore.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-14 14:51:37 +00:00
Hans Petter Selasky
3b8c4b3619 Make sure a valid VNET is set before trying to access the V_ip6_v6only
variable. Access the variable directly instead of going through the sysctl()
interface in the kernel.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-14 14:43:35 +00:00
Hans Petter Selasky
8dee9a7a44 Remove no longer supported mthca driver.
Sponsored by:	Mellanox Technologies
2017-11-13 10:59:38 +00:00
Hans Petter Selasky
f819030092 Merge ^/head r325505 through r325662. 2017-11-10 14:46:50 +00:00
Hans Petter Selasky
1529133ab3 Mark ipoib device as initialized on device open.
Set the IPOIB_FLAG_INITIALIZED on dev_open and clear it on dev_stop to
avoid a race between ipoib load and the underlying device driver.

The device module must dispatch the IB_EVENT_PORT_ACTIVE event before ipoib
module is loaded. Otherwise, the flush will fail since no one set the
IPOIB_FLAG_INITIALIZED.

Submitted by:	Slava Shwartsman <slavash@mellanox.com>
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-10 08:58:42 +00:00
Hans Petter Selasky
902a165f67 Make sure sin_zero is zero in ibcore. Else socket address maching using
bcmp() might fail.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-09 19:30:10 +00:00
Hans Petter Selasky
cee98bee7d Make sure the IPv6 scope ID gets zeroed when exchanging CMA messages in ibcore.
Else the IPv6 address matching might fail. This change adds support for both
embedded and non-embedded IPv6 scope IDs when passing a IPv6 link-local socket
address to RDMA. Prior to this change only global IPv6 addresses would work
with RDMA.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-09 19:27:29 +00:00
Hans Petter Selasky
860bbba0bb Multiple fixes for using IPv6 link-local addresses with RDMA in ibcore.
1) Fail to resolve RDMA address if rtalloc1() returns the loopback
device, lo0, as the gateway interface. Currently RDMA loopback is
not supported.

2) Use ip_dev_find() and ip6_dev_find() to lookup network interfaces
with matching IPv4 and IPv6 addresses, respectivly.

3) In addr_resolve() make sure the "ifa" pointer is always set, also when
the "ifp" is NULL. Else a NULL pointer access might happen trying to
read from the "ifa" pointer later on.

4) In rdma_addr_find_dmac_by_grh() make sure the "bound_dev_if" field
gets set properly instead of passing the scope ID through the IPv6
socket address structure. This is more in line with upstream OFED
in Linux.

5) In rdma_addr_find_smac_by_sgid() there is no need to pass the
scope ID for IPv6. Either it is stored in the "bound_dev_if" field
or ip6_dev_find() will find the correct network device regardless
of the scope ID.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-11-09 19:22:43 +00:00
Hans Petter Selasky
c2c014f24c Merge ^/head r323559 through r325504. 2017-11-07 08:39:14 +00:00
Hans Petter Selasky
d05554bb99 The remote DMA TCP portspace selector, RDMA_PS_TCP, is used for both
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device present. This resolves
interopability issues between iWarp and RoCE in ibcore

Reviewed by:		np @
Differential Revision:	https://reviews.freebsd.org/D12563
Sponsored by:		Mellanox Technologies
MFC after:		3 days
2017-10-20 08:20:15 +00:00
Hans Petter Selasky
aacb037742 Make sure the IPv6 scope ID gets zeroed inside the GID. Else searching for a
valid GID entry based on IPv6 addresses can fail.

Sponsored by:	Mellanox Technologies
MFC after:	1 week
2017-10-10 12:36:41 +00:00
Hans Petter Selasky
4051f0c8ed Temporary fix to avoid hitting kernel assert.
Don't dirty VM pages at this point.

Sponsored by:	Mellanox Technologies
2017-09-16 16:32:36 +00:00
Hans Petter Selasky
c4b28ce0b9 Remove no longer needed linux_poll_wakeup() calls. This is now handled by
"wake_up()" in the LinuxKPI. Accessing the file pointer directly might cause
use after free issues.

Sponsored by:	Mellanox Technologies
2017-09-16 16:31:30 +00:00
Hans Petter Selasky
526f596179 Embedding the scope ID is no longer needed for IPv6.
Sponsored by:	Mellanox Technologies
2017-09-16 16:28:48 +00:00
Hans Petter Selasky
e02ecc60b8 Set length field of socket address.
Sponsored by:	Mellanox Technologies
2017-09-16 16:28:19 +00:00
Hans Petter Selasky
6af166cc7b Fix for refcount leak.
Sponsored by:	Mellanox Technologies
2017-09-16 16:27:24 +00:00
Hans Petter Selasky
d18b4113ed Make sure the socket address length field gets set.
Sponsored by:	Mellanox Technologies
2017-09-16 16:26:46 +00:00
Hans Petter Selasky
bf00eaa12c Improve ibcore address resolving:
- Add more sanity checks.
- Preserve source port number when resolving address.
- Remove no longer needed scope ID hacks for IPv6.

Sponsored by:	Mellanox Technologies
2017-09-16 16:24:38 +00:00
Hans Petter Selasky
c69c74b892 Adapt the existing SDP ULP code to the new ibcore APIs.
Requested by:	Sobczak, Bartosz <bartosz.sobczak@intel.com>
Sponsored by:	Mellanox Technologies
2017-09-16 16:16:00 +00:00
Hans Petter Selasky
65c5a7a879 Remove unsafe access to the LinuxKPI file structure from ibcore.
selwakeup() is now done by the wake_up() family of functions.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-09-09 06:34:20 +00:00
Mark Johnston
97310754e5 Fix indentation.
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-09-07 19:15:31 +00:00
Hans Petter Selasky
b40951b8cd Change reject message type when destroying cm_id in ibore.
This patch fixes an interopability issue between FreeBSD and non-FreeBSD
systems when the connection establishment is aborted. Refer to the
initial commit in Linux, drivers/infiniband/core/cm.c,
for a more detailed description.

Obtained from:	Linux
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2017-08-03 09:31:10 +00:00
Hans Petter Selasky
44d8a0fc60 Ticks are 32-bit in FreeBSD.
MFC after:	3 days
Sponsored by:	Mellanox Technologies
2017-08-03 09:18:25 +00:00
Hans Petter Selasky
bca9d05fdb Merge ^/head r319973 through 321382. 2017-07-23 15:22:06 +00:00
Hans Petter Selasky
9f715dc162 ibcore: Delete old files and add new ones missed in the initial commit
for this projects branch.

Sponsored by:	Mellanox Technologies
2017-07-03 08:32:52 +00:00
Hans Petter Selasky
c19d65049e ibcore: Fix for compiling header files from user-space.
Sponsored by:	Mellanox Technologies
2017-07-03 08:30:43 +00:00
Hans Petter Selasky
0faccd643c ibcore: Fix for accessing invalid network device.
Sponsored by:	Mellanox Technologies
2017-07-03 08:29:49 +00:00
Hans Petter Selasky
ddc6ee3792 ibcore: Make sure all GID entries are deleted upon a network device
departure.

Sponsored by:	Mellanox Technologies
2017-07-03 08:28:35 +00:00
Mark Johnston
4eb18346d1 Avoid including list.h in LinuxKPI headers.
list.h includes a number of FreeBSD headers as a workaround for the
LIST_HEAD name collision. To reduce pollution, avoid including list.h
in commonly used headers when it is not explicitly needed.

Reviewed by:	hselasky
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11249
2017-06-18 16:43:57 +00:00
Hans Petter Selasky
478d300572 Initial RoCE/infiniband kernel update to Linux v4.9.
This patch currently supports:
- ibcore as a kernel module only
- krping as a kernel module only
- ipoib as a kernel module only

Sponsored by:	Mellanox Technologies
2017-06-15 12:47:48 +00:00
Mark Johnston
e804572d2b Fix indentation.
MFC after:	1 week
2017-06-14 16:55:23 +00:00
Gleb Smirnoff
779f106aa1 Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
  fields that belong to normal dataflow, and unionize them.  This
  shrinks the structure a bit.
  - Take out selinfo's from the socket buffers into the socket. The
    first reason is to support braindamaged scenario when a socket is
    added to kevent(2) and then listen(2) is cast on it. The second
    reason is that there is future plan to make socket buffers pluggable,
    so that for a dataflow socket a socket buffer can be changed, and
    in this case we also want to keep same selinfos through the lifetime
    of a socket.
  - Remove struct struct so_accf. Since now listening stuff no longer
    affects struct socket size, just move its fields into listening part
    of the union.
  - Provide sol_upcall field and enforce that so_upcall_set() may be called
    only on a dataflow socket, which has buffers, and for listening sockets
    provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
  - Add a mutex to socket, to be used instead of socket buffer lock to lock
    fields of struct socket that don't belong to a socket buffer.
  - Allow to acquire two socket locks, but the first one must belong to a
    listening socket.
  - Make soref()/sorele() to use atomic(9).  This allows in some situations
    to do soref() without owning socket lock.  There is place for improvement
    here, it is possible to make sorele() also to lock optionally.
  - Most protocols aren't touched by this change, except UNIX local sockets.
    See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
  listening sockets: provide function solisten_dequeue(), and use it in
  the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
  infiniband, rpc.

o UNIX local sockets.
  - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
    local sockets.  Most races exist around spawning a new socket, when we
    are connecting to a local listening socket.  To cover them, we need to
    hold locks on both PCBs when spawning a third one.  This means holding
    them across sonewconn().  This creates a LOR between pcb locks and
    unp_list_lock.
  - To fix the new LOR, abandon the global unp_list_lock in favor of global
    unp_link_lock.  Indeed, separating these two locks didn't provide us any
    extra parralelism in the UNIX sockets.
  - Now call into uipc_attach() may happen with unp_link_lock hold if, we
    are accepting, or without unp_link_lock in case if we are just creating
    a socket.
  - Another problem in UNIX sockets is that uipc_close() basicly did nothing
    for a listening socket.  The vnode remained opened for connections.  This
    is fixed by removing vnode in uipc_close().  Maybe the right way would be
    to do it for all sockets (not only listening), simply move the vnode
    teardown from uipc_detach() to uipc_close()?

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Hans Petter Selasky
82d0140707 Add full VNET support to the inet_get_local_port_range() function in
the LinuxKPI.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-03-22 15:46:31 +00:00
Gleb Smirnoff
817e1ad9d0 Make sdp compilable after r315662. 2017-03-21 09:07:05 +00:00
Hans Petter Selasky
404027276b Add basic support for VIMAGE to the LinuxKPI and ibcore.
Support is implemented by mapping Linux's "struct net" into FreeBSD's
"struct vnet". Currently only vnet0 is supported by ibcore.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2017-03-16 09:59:35 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Navdeep Parhar
017296dbb6 cxgbe/iw_cxgbe: fix various double-close panics with iWARP sockets.
Sockets representing the TCP endpoints for iWARP connections are
allocated by the ibcore module.  Before this revision they were closed
either by the ibcore module or the iw_cxgbe hardware driver depending on
the state transitions during connection teardown.  This is error prone
and there were cases where both iw_cxgbe and ibcore closed the socket
leading to double-free panics.  The fix is to let ibcore close the
sockets it creates and never do it in the driver.

- Use sodisconnect instead of soclose (preceded by solinger = 0) in the
  driver to tear down an RDMA connection abruptly.  This does what's
  intended without releasing the socket's fd reference.

- Close the socket in ibcore when the iWARP iw_cm_id is destroyed.  This
  works for all kinds of sockets: clients that initiate connections,
  listeners, and sockets accepted off of listeners.

Reviewed by:	Steve Wise @ Open Grid Computing, hselasky@
MFC after:	3 days
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D9796
2017-02-28 19:27:41 +00:00
Navdeep Parhar
3d4f452402 Avoid NULL dereference in a couple of sysctl handlers in ibcore.
iw_cxgbe sets ib_device->dma_device to NULL (since r311880).

Reviewed by:	hselasky@
Sponsored by:	Chelsio Communications
2017-02-23 07:48:58 +00:00
Hans Petter Selasky
97549c34ec Move the ConnectX-3 and ConnectX-2 driver from sys/ofed into sys/dev/mlx4
like other PCI network drivers. The sys/ofed directory is now mainly
reserved for generic infiniband code, with exception of the mthca driver.

- Add new manual page, mlx4en(4), describing how to configure and load
mlx4en.

- All relevant driver C-files are now prefixed mlx4, mlx4_en and
mlx4_ib respectivly to avoid object filename collisions when compiling
the kernel. This also fixes an issue with proper dependency file
generation for the C-files in question.

- Device mlxen is now device mlx4en and depends on device mlx4, see
mlx4en(4). Only the network device name remains unchanged.

- The mlx4 and mlx4en modules are now built by default on i386 and
amd64 targets. Only building the mlx4ib module depends on
WITH_OFED=YES .

Sponsored by:	Mellanox Technologies
2016-09-30 08:23:06 +00:00
Hans Petter Selasky
499b089076 Set hardware stats flag to avoid double counting the number of incoming bytes.
Found by:	Ben RUBSON <ben.rubson@gmail.com>
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2016-09-29 16:36:32 +00:00
Navdeep Parhar
a5234e8ccb Do not free an uninitialized pointer on soaccept failure in the iWARP
connection manager.

Sponsored by:	Chelsio Communications
2016-08-26 08:25:28 +00:00
Hans Petter Selasky
7dc445f8d3 Add support for setting blocking and non-blocking mode on /dev/rdma_cm
by returning success on FIONBIO and FIOASYNC IOCTLs. The actual flags
handling is done by the kern_ioctl() function.

Reported by:	Alex Bowden <alex.bowden@outlook.com>
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2016-08-18 08:49:02 +00:00
Mark Johnston
f1c1f188f7 mthca: Add a wrapper for the firmware's DIAG_RPRT command.
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-08-05 21:34:09 +00:00
Mark Johnston
49fe7b5029 ipoib: Bound the number of egress mbufs buffered during pathrec lookups.
In pathological situations where the master subnet manager becomes
unresponsive for an extended period, we may otherwise end up queuing all
of the system's mbufs while waiting for a response to a path record lookup.

This addresses the same issue as commit 1e85b806f9 in Linux.

Reviewed by:	cem, ngie
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2016-08-01 22:22:11 +00:00
Mark Johnston
4e071758a7 MFV be9130cc9: "IB/cma: Check for GID on listening devices first"
This is an optimization that improves IB connection setup times.

Discussed with:	hselasky
Obtained from:	Linux
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2016-08-01 20:29:09 +00:00
Mark Johnston
82f1d3ea2f MFV 29f27e847: "IB/cma: Use cached gids"
This addresses a regression from an earlier upstream change which caused
cma_acquire_dev() to bypass the port GID cache and instead query the HCA
for each entry in its GID table. These queries can become extremely slow on
multiport devices, which has a negative impact on connection setup times.

Discussed with:	hselasky
Obtained from:	Linux
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2016-08-01 20:27:11 +00:00
Mark Johnston
adaf3c4906 sdp: Destroy the RDMA ID after destroying the connection's queue pair.
This is the ordering documented by rdma_destroy_qp(). Also add a useful
KASSERT to sdp_pcbfree().

Sponsored by:	EMC / Isilon Storage Division
2016-07-29 21:03:02 +00:00
Mark Johnston
d3461164e0 sdp: Use malloc(9) instead of the Linux compat layer.
SDP transmit and receive rings are always created in a sleepable context,
so we can use M_WAITOK and remove error checks.

Sponsored by:	EMC / Isilon Storage Division
2016-07-29 21:01:04 +00:00
Mark Johnston
f66ec15275 sdp: Use the correct socket buffer in sdp_post_recvs_needed().
Sponsored by:	EMC / Isilon Storage Division
2016-07-29 20:54:43 +00:00
Mark Johnston
7e968dab6d sdp: Always free received control packets after they're handled.
Sponsored by:	EMC / Isilon Storage Division
2016-07-29 20:51:52 +00:00
Mark Johnston
3b3e6d882f Fix the KASSERT format string arguments after r303507. 2016-07-29 20:48:42 +00:00
Mark Johnston
a3c0b052b7 sdp: Use the PCB as the rx completion handler argument.
The generic socket may be detached from the PCB before the completion
queue is drained and destroyed, so this change closes a race condition
in connection teardown.

Sponsored by:	EMC / Isilon Storage Division
2016-07-29 20:39:32 +00:00