This change includes several changes, listed below, all related to UAR.
UAR is a special PCI memory area where the so-called doorbell register and
blue flame register live. Blue flame is a feature for sending small packets
more efficiently via a PCI memory page, instead of using PCI DMA.
- All structures and functions named xxx_uuars were renamed to xxx_bfreg.
- Remove partially implemented Blueflame support from mlx5en(4) and mlx5ib.
- Implement blue flame register allocator.
- Use blue flame register allocator in mlx5ib.
- A common UAR page is now allocated by the core to support doorbell register
writes for all of mlx5en and mlx5ib, instead of allocating one UAR per
send queue.
- Add support for DEVX query UAR.
- Add support for 4K UAR for libmlx5.
Linux commits:
7c043e908a74ae0a935037cdd984d0cb89b2b970
2f5ff26478adaff5ed9b7ad4079d6a710b5f27e7
0b80c14f009758cefeed0edff4f9141957964211
30aa60b3bd12bd79b5324b7b595bd3446ab24b52
5fe9dec0d045437e48f112b8fa705197bd7bc3c0
0118717583cda6f4f36092853ad0345e8150b286
a6d51b68611e98f05042ada662aed5dbe3279c1e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
APIs that have deferred callbacks should have some kind of cleanup
function that callers can use to fence the callbacks. Otherwise things
like module unloading can lead to dangling function pointers, or worse.
The IB MR code is the only place that calls this function and had a
really poor attempt at creating this fence. Provide a good version in
the core code as future patches will add more places that need this
fence.
Linux commit:
e355477ed9e4f401e3931043df97325d38552d54
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Report EQE data upon CQ completion to let upper layers use this data.
Linux commit:
4e0e2ea1886afe8c001971ff767f6670312a9b04
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Enhance mlx5_core_create_cq() to take the command output buffer from the
caller, so that the caller can use the output.
Linux commit:
38164b771947be9baf06e78ffdfb650f8f3e908e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Make sure order of cleanup is exactly the opposite of initialization.
Linux commit:
f4044dac63e952ac1137b6df02b233d37696e2f5
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Currently the linking order of the InfiniBand (IB) modules decides the order
in which the clients are attached and detached. For example, one IB client may
use resources from another IB client. This can lead to a potential deadlock
at shutdown. For example, if ipoib is unregistered after the ib_multicast
client is detached, and ipoib is using multicast addresses, a deadlock may
happen, because ib_multicast will wait for all its resources to be freed before
returning from the remove method.
Fix this by using module_xxx_order() instead of module_xxx().
Differential Revision: https://reviews.freebsd.org/D23973
MFC after: 1 week
Sponsored by: Mellanox Technologies
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
The index should be computed as the distance from arg[0] and not from
the beginning of struct mlx5_ib_congestion.
While at it fix a use of zero length array to avoid depending
on undefined compiler behaviour.
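A minimal sketch of the index computation described above. The structure
layout, the field names and the CONG_INDEX() macro are hypothetical stand-ins,
not the driver's definitions; the values are kept in a fixed-size array inside
a union instead of behind a zero length array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver structure; layout is illustrative. */
struct cong_sketch {
	int hdr;			/* bookkeeping that precedes the values */
	union {
		uint64_t arg[3];	/* fixed size, no zero length array */
		struct {
			uint64_t rp_enable;
			uint64_t np_enable;
			uint64_t rate;
		};
	};
};

/* Index of a named field within arg[], measured from arg[0]. */
#define	CONG_INDEX(field)						\
	((offsetof(struct cong_sketch, field) -				\
	    offsetof(struct cong_sketch, arg)) / sizeof(uint64_t))

/* The first named field must map to arg[0], whatever precedes the union. */
static_assert(CONG_INDEX(rp_enable) == 0, "index must start at arg[0]");
```

Measuring from the start of the structure instead would make every index
depend on the size and padding of the members placed before arg[].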
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add the 512 bytes limit of RDMA READ and the size of remote address to the max
SGE calculation.
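A rough sketch of such a calculation, with all segment sizes passed in as
parameters. The function name, the parameter names and the subtraction of a
control segment are assumptions made for illustration; only the 512 byte
limit and the remote address term come from the description above:

```c
#include <assert.h>
#include <stddef.h>

#define	RDMA_READ_WQE_LIMIT	512	/* bytes, as stated in the message */

/* Hypothetical helper: number of SGEs that fit in both queue types. */
static size_t
max_sge_sketch(size_t max_wqe_sz_sq, size_t max_wqe_sz_rq,
    size_t ctrl_seg_sz, size_t raddr_seg_sz, size_t data_seg_sz)
{
	size_t sq_desc, max_sq_sg, max_rq_sg;

	assert(data_seg_sz != 0);

	/* Cap the send descriptor at the RDMA READ limit. */
	sq_desc = max_wqe_sz_sq < RDMA_READ_WQE_LIMIT ?
	    max_wqe_sz_sq : RDMA_READ_WQE_LIMIT;
	assert(sq_desc >= ctrl_seg_sz + raddr_seg_sz);

	/* Control (assumed) and remote address segments use up space first. */
	max_sq_sg = (sq_desc - ctrl_seg_sz - raddr_seg_sz) / data_seg_sz;
	max_rq_sg = max_wqe_sz_rq / data_seg_sz;

	return (max_sq_sg < max_rq_sg ? max_sq_sg : max_rq_sg);
}
```

With 16 byte segments, a 1024 byte send WQE and a 512 byte receive WQE, this
yields (512 - 16 - 16) / 16 = 30 rather than the uncapped 64.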
Submitted by: slavash@
Linux commit: 288c01b746aa
MFC after: 3 days
Sponsored by: Mellanox Technologies
The CM layer calls ib_modify_port() regardless of the link layer.
For Ethernet ports, qkey violation and port capabilities
are meaningless. Therefore, always return success for ib_modify_port()
calls on Ethernet ports.
Linux Commit:
ec2558796d25e6024071b6bcb8e11392538d57bf
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Make sure the active width and speed are set in case the
translate_eth_proto_oper() function doesn't recognize the
current port operation mask.
Linux commit:
7672ed33c4c15dbe9d56880683baaba4227cf940
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
If the mlx5_ib_read_cong_stats() function is running when mlx5ib is unloaded,
the timer can still be pending after the delayed work has been cancelled,
because this function unconditionally restarts the timer. To fix this, simply
loop on the delayed work cancel procedure as long as it returns non-zero.
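The loop can be illustrated with a userland toy model. Everything below
(struct toy_dwork, toy_cancel_sync(), toy_drain()) is hypothetical and only
mimics a cancel routine that reports whether work was pending while the
handler re-arms itself; it is not the kernel API:

```c
#include <assert.h>

struct toy_dwork {
	int pending;		/* non-zero while the timer is armed */
	int rearms_left;	/* times the handler will still re-arm itself */
};

/* Cancel once; returns non-zero if work was pending, like the real call. */
static int
toy_cancel_sync(struct toy_dwork *w)
{
	int was_pending = w->pending;

	w->pending = 0;
	if (was_pending && w->rearms_left > 0) {
		/* Models the handler running and restarting the timer. */
		w->rearms_left--;
		w->pending = 1;
	}
	return (was_pending);
}

/* Keep cancelling until a call reports that nothing was pending. */
static void
toy_drain(struct toy_dwork *w)
{
	while (toy_cancel_sync(w) != 0)
		;
	assert(w->pending == 0);
}
```

In this model a single toy_cancel_sync() call can return with the work armed
again, while toy_drain() only returns once nothing is left pending.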
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Although "create_srq_user" does overwrite "in.pas" on some paths, it
also contains at least one feasible path which does not overwrite it.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
"fw_rev_min(dev->mdev)" with type "unsigned short" (16 bits, unsigned) is
promoted in "fw_rev_min(dev->mdev) << 16" to type "int" (32 bits, signed), then
sign-extended to type "unsigned long" (64 bits, unsigned). If
"fw_rev_min(dev->mdev) << 16" is greater than 0x7FFFFFFF, the upper bits of the
result will all be 1.
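The effect can be reproduced in plain C. The helper names below are
hypothetical, and the first helper spells out the promotion with explicit
casts (assuming a 32-bit two's complement int) so that the sketch itself does
not rely on shifting into the sign bit:

```c
#include <assert.h>
#include <stdint.h>

/* Models "min << 16" being evaluated as a 32-bit signed int, then widened. */
static uint64_t
shift_promoted(uint16_t min)
{
	int32_t as_int = (int32_t)((uint32_t)min << 16);

	return ((uint64_t)(int64_t)as_int);	/* sign extension happens here */
}

/* Widen first, then shift: the upper 32 bits stay clear. */
static uint64_t
shift_widened(uint16_t min)
{
	uint64_t r = (uint64_t)min << 16;

	assert(r <= UINT32_MAX);
	return (r);
}
```

For a minor revision of 0x8000, shift_promoted() gives 0xFFFFFFFF80000000
while shift_widened() gives 0x80000000.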
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Driver description should be set by core and not by the Ethernet driver.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
All other mlx5_events report the port number as 1 based, which is how FW
reports it in the port event EQE. Reporting 0 for this event causes
mlx5_ib to not raise a fatal event notification to registered clients
due to a seemingly invalid port.
All switch cases in mlx5_ib_event that go through the port check are
supposed to set the port now, so just do it once at variable
declaration.
Linux commit:
aba462134634b502d720e15b23154f21cfa277e5
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The user can provide a very large cqe_size, which causes an integer
overflow.
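A generic way to guard such a multiplication is to divide before multiplying.
The function below is a hypothetical sketch of that check, not the code from
the referenced commit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store entries * cqe_size in *out and return 0, or return -1 when the
 * product would not fit in a size_t.
 */
static int
cq_buf_size_sketch(size_t entries, size_t cqe_size, size_t *out)
{
	assert(out != NULL);

	/* entries * cqe_size overflows iff entries > SIZE_MAX / cqe_size. */
	if (cqe_size != 0 && entries > SIZE_MAX / cqe_size)
		return (-1);

	*out = entries * cqe_size;
	return (0);
}
```

A caller would reject the request on a -1 return instead of allocating a
buffer from a wrapped-around size.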
Linux commit:
28e9091e3119933c38933cb8fc48d5618eb784c8
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When resetting mlx5core instances it can happen that the order of attach and
detach for mlx5ib instances is changed. Take the unit number for mlx5_%d from
the parent PCI device, similarly to what is done in mlx5en(4), so that there
is a direct relationship between mce<N> and mlx5_<N>.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The DSCP feature is controlled using a set of sysctl(8) fields under
the qos sysctl directory entry for mlx5en(4).
For Routable RoCE QPs, the DSCP should be set in the QP's address path.
The DSCP's value is derived from the traffic class.
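As an illustration of the derivation: the DSCP occupies the upper six bits of
the traffic class byte, with the lower two bits used for ECN. The helper name
is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 6-bit DSCP from an 8-bit traffic class value. */
static uint8_t
dscp_from_tclass(uint8_t traffic_class)
{
	uint8_t dscp = traffic_class >> 2;

	assert(dscp <= 63);
	return (dscp);
}
```

For example, a traffic class of 0xB8 corresponds to DSCP 46.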
Linux commit:
ed88451e1f2d400fd6a743d0a481631cf9f97550
MFC after: 1 week
Sponsored by: Mellanox Technologies
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.
The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.
Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.
The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().
MFC after: 1 week
Sponsored by: Mellanox Technologies
There is a difference when parsing a completion entry between Ethernet
and IB ports. When link layer is Ethernet the bits describe the type of
L3 header in the packet. In the case when link layer is Ethernet and VLAN
header is present the value of SL is equal to the 3 UP bits in the VLAN
header. If VLAN header is not present then the SL is undefined and consumer
of the completion should check if IB_WC_WITH_VLAN is set.
While at it, this patch also fills the vlan_id field in the completion if a
VLAN header is present.
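For reference, the 3 UP bits and the VLAN ID both come from the 16-bit 802.1Q
tag control field. The helpers below are hypothetical and only show that
layout; how the completion entry itself stores these values is not modelled:

```c
#include <assert.h>
#include <stdint.h>

/* User priority (PCP): the top three bits of the tag control field. */
static uint8_t
vlan_tci_prio(uint16_t tci)
{
	return ((uint8_t)(tci >> 13));
}

/* VLAN ID: the low twelve bits of the tag control field. */
static uint16_t
vlan_tci_id(uint16_t tci)
{
	uint16_t id = tci & 0x0fff;

	assert(id <= 4095);
	return (id);
}
```

A tag control value of 0xA064 decodes to priority 5 and VLAN ID 100.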
Linux commit:
12f8fedef2ec94c783f929126b20440a01512c14
MFC after: 1 week
Sponsored by: Mellanox Technologies
ECN configuration and statistics are available through a set of sysctl(8)
nodes under sys.class.infiniband.mlx5_X.cong. The ECN configuration
nodes can also be used as loader tunables.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Factor out port speed definitions into a new port.h header file,
similar to what is done in Linux upstream.
- Correct two existing port speed definitions in mlx5en according to
Linux upstream.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Creating a UD address handle from user-space or from kernel-space,
when the link layer is Ethernet, requires resolving the remote L3
address into an L2 address. Doing this from the kernel is easy because
the required ARP (IPv4) and ND6 (IPv6) address resolving APIs are readily
available. In user-space such an interface does not exist and kernel
help is required.
It should be noted that in an IP-based GID environment, the GID itself
does not contain all the information needed to resolve the destination
IP address. For example, information like the VLAN ID and scope ID is not
part of the GID and must be fetched from the GID attributes. Therefore
a source GID should always be referred to by its GID index. Instead of
going through various racy steps to obtain information about the
GID attributes from user-space, this is now all done by the kernel.
This patch optimises the L3 to L2 address resolving using the existing
create address handle uverbs interface, retrieving back the L2 address
as an additional user-space information structure.
This commit combines the following Linux upstream commits:
IB/core: Let create_ah return extended response to user
IB/core: Change ib_resolve_eth_dmac to use it in create AH
IB/mlx5: Make create/destroy_ah available to userspace
IB/mlx5: Use kernel driver to help userspace create ah
IB/mlx5: Report that device has udata response in create_ah
MFC after: 1 week
Sponsored by: Mellanox Technologies
Different compilers may optimise the enum type in different ways. This ensures
coherency when range checking the value of enums in ibcore.
Sponsored by: Mellanox Technologies
MFC after: 1 week
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP can not be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device is present. This resolves
interoperability issues between iWarp and RoCE in ibcore.
Reviewed by: np@
Differential Revision: https://reviews.freebsd.org/D12563
Sponsored by: Mellanox Technologies
MFC after: 3 days