Commit Graph

266892 Commits

Author SHA1 Message Date
Martin Matuska
5eb61f6c65 zfs: merge openzfs/zfs@07a4c76e9 (master) into main
Notable upstream pull request merges:
  #12299 file reference counts can get corrupted
  #12320 FreeBSD: Use unmapped I/O for scattered/gang ABD buffers

Obtained from:	OpenZFS
OpenZFS commit:	07a4c76e90
2021-07-12 23:24:45 +02:00
Warner Losh
dc5a0d6d6d ddb(4): improve wording
Incorporate feedback overlooked in revew by wblock@

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D4860
2021-07-12 15:13:13 -06:00
Daniel Gerzo
71f6aea415 loader: update autoboot description and move to loader.conf.5
Document "NO" special value for the autoboot_delay and move the
description to loader.conf.5.

imp reworked some of the wording from danger's patch.

Reviewed by:		imp
PR:			85128
Differential Revision:	https://reviews.freebsd.org/D11887
2021-07-12 15:13:03 -06:00
Brian Behlendorf
07a4c76e90
Update bug report template
- Remove the "SPL Version" line, the repositories have been merged
  since the 0.8 release and we no longer need to ask about this.

- Simply ask for the kernel version / patch level and add a hint
  about how to get this information on Linux and FreeBSD.

- Remove "Status: Triage Needed" from the template, in practice
  we really haven't been using this label so let's step setting it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #12340
2021-07-12 14:05:50 -06:00
Jessica Clarke
8b487b8292 Fix bsd.subdir.mk-related issues after 0a0f748641
Since bsd.prog.mk includes bsd.obj.mk, and thus bsd.subdir.mk, we must
ensure all our bsd.subdir.mk-affecting variables are set before
including bsd.prog.mk. Since sbin's various Makefile.arch files add to
SUBDIR this results in those not taking effect, and presumably we also
end up not having buildworld as parallel as it should be due to the fact
that SUBDIR_PARALLEL was not being set before including bsd.prog.mk.

MFC with:	0a0f748641
Reviewed by:	olivier
Differential Revision:	https://reviews.freebsd.org/D31125
2021-07-12 20:54:01 +01:00
Bryan Drewery
6da773cb1c bsd.compiler.mk: Use CCACHE_PKG_PREFIX as ports now supports. 2021-07-12 12:52:52 -07:00
Andriy Gapon
66c183f43f mmc_cam_sim_default_action: do not touch the ccb after dispatching it
If MMC_SIM_CAM_REQUEST() is successful the ccb could be running or being
completed as the method returns.  Modifying the ccb status could override
whatever status was already set by a MMC driver.

I am not sure what was the purpose of setting the status to CAM_REQ_INVALID
in the success path.  I assume that it was to catch a possibility that the
ccb could be completed without its status explicitly set.  So, I am keeping
the code, it's just moved to before the MMC_SIM_CAM_REQUEST call.

Without this change I was getting random and phantom EIO errors on Rock64
running off an SD card (dwmmc driver) plus occasional panics like:
  Memory modified after free 0xffffa00003985800(2040) val=6 @ 0xffffa00003985854
  panic: Most recently used by CAM CCB

MFC after:	1 week
2021-07-12 21:29:26 +03:00
Hans Petter Selasky
693ddf4dc4 Fix LINT kernel build issues after c3987b8ea7 .
Fixes the IPOIB_CM and SDP kernel options.

Reported by:	lwhsu @
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-07-12 18:00:30 +02:00
Hans Petter Selasky
cd2c05d323 ipoib: Fix for accessing uninitialized pointers and freed memory during attach and detach.
Call infiniband_ifdetach() early to stop ifioctl(9) calls from user-space
during device removal. Also make sure that ifioctl(9) calls are blocked from
executing until the device is fully initialized. Ideally we would delay the
infiniband_ifattach() call, but because part of the initialization is to update
the link level address, that is not possible without more significant changes.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 15:01:19 +02:00
Hans Petter Selasky
7c3eff94bd mlx5: Numa domain improvements.
Properly allocate all mlx5en(4) structures from correct numa domain.

While at it cleanup unused numa domain integers deriving from the
Linux version of mlx5en(4).

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:52:45 +02:00
Hans Petter Selasky
cbf6911e10 mlx5: Fix for uninitialized "uid" field.
Make sure the "uid" field gets properly set when destroying DCT and QP
objects by making a copy of the field when creating such objects.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:38:51 +02:00
Hans Petter Selasky
c8301cbb0f mlx4: Map core_clock page to user space only when allowed
Currently when we map the hca_core_clock page to the user space,
there are vulnerable registers, one of which is semaphore, on
this page as well. If user read the wrong offset, it can modify the
above semaphore and hang the device.

Hence, mapping the hca_core_clock page to the user space only when
user required it specifically.

After this patch, mlx4 core_clock won't be mapped to user space by
default. Oppose to current state, where mlx4 core_clock is always mapped
to user space.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:35 +02:00
Hans Petter Selasky
c8d16d1e08 mlx5en: Allow binding channels to CPUs when RSS is not enabled.
MFC after:	1 week
Submitted by:	Netflix
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
f60da09dbb ibcore: Add some functions and definitions for selecting and querying retryable ucontext cleanup.
Linux commit:
1c77483e4c50339b0306572167ccbff6b55d051b

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
9dfa21486e mlx5en: Allocate per-channel doorbells.
To avoid congestion on the same PCI memory register space when
traffic consists mostly of small packets.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
3a934ba7a3 mlx5en: Wait for all TLS connections to terminate when unloading driver.
The driver expects all TLS tags to be returned to the driver before
it can free the UMA zone where the TLS tags reside.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
30416d4e82 mlx4ib and mlx5ib: Set slid to zero in Ethernet completion struct
IB spec says that a lid should be ignored when link layer is Ethernet,
for example when building or parsing a CM request message (CA17-34).
However, since ib_lid_be16() and ib_lid_cpu16()  validates the slid,
not only when link layer is IB, we set the slid to zero to prevent
false warnings in the kernel log.

Linux commit:
65389322b28f81cc137b60a41044c2d958a7b950

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
de2437f199 mlx5en: Configure relaxed PCI read and write ordering for ethernet.
This may improve performance in some configurations.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:34 +02:00
Hans Petter Selasky
4692d9808e mlx5en: Check for pci_channel_offline() when draining sendqueue.
This speeds up detach in hypervisor environments.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
8abf5ac0e6 mlx5ib: Implement support for enabling and disabling RoCE ECN.
RoCE is short for Remote direct memory access over Converged Ethernet.
ECN is short for Explicit Congestion Notification.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
42f719d611 mlx5ib: Extend parameter macros so that more arguments may be added.
MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
e787b5acb1 mlx5core: Don't query the PCI config space for offline during a firmware command.
Querying the PCI config space for offline for every firmware command blocks
the PCI bus and affects performance. Especially for packet pacing and TLS
when objects are frequently created and destroyed.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
c3987b8ea7 ibcore: Declare ib_post_send() and ib_post_recv() arguments const
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.

Linux commit:
f696bf6d64b195b83ca1bdb7cd33c999c9dcf514
7bb1fafc2f163ad03a2007295bb2f57cfdbfb630
d34ac5cd3a73aacd11009c4fc3ba15d7ea62c411

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
4fb0a74e08 mlx5: Set default timestamp format for mlx5en(4) and mlx5ib.
MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
915fc66cb5 mlx5: Add new timestamp mode bits.
These fields declare which timestamp mode is supported
by the device per RQ/SQ/QP.

In addition add the ts_format field to the select the mode
for RQ/SQ/QP.

Linux commit:
a6a217dddcd544f6b75f0e2a60b6e84c1d494b7e

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
79b817084c ibcore: Implement ib_uverbs_get_ucontext_file().
Expose ib_ucontext from a given ib_uverbs_file. Drivers that use the ioctl(9)
API may have the ib_uverbs_file and need a way to get the related ib_ucontext
from it, this is enabled by this patch.

Downstream patches from this series will use it.

Linux commit:
7dc08dcfc8c86cb4457e383734ff6844ddaff876

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:33 +02:00
Hans Petter Selasky
05f4691919 ibcore: Clean up INIT_UDATA() and INIT_UDATA_BUF_OR_NULL() macro usage.
We get a harmless warning about the fact that we use the result of a
multiplication as a condition in INIT_UDATA_BUF_OR_NULL():

uverbs_main.c: In function 'ib_uverbs_write':
error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]

This avoids the problem by using an inline function in place of
the macro.

After changing INIT_UDATA_BUF_OR_NULL() to an inline function,
do the same change to INIT_UDATA() for consistency.

Using an inline function gives us better type safety here among other
issues with macros. I'm using u64_to_user_ptr() to convert the user
pointer to simplify the logic rather than adding lots of new type casts.

Linux commit:
12f727721eee61b3d19dedb95cb893b2baa9fe41
40a203396cc1c239f2e71c47c66ed03097123d2c

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
d92a9e5604 ibcore: Simplify ib_modify_qp_is_ok().
All callers to ib_modify_qp_is_ok() provides enum ib_qp_state makes the
checks of out-of-scope redundant. Let's remove them together with updating
function signature to return boolean result.

While at it remove unused "ll" parameter from ib_modify_qp_is_ok().

Linux commit:
19b1f54099b6ee334acbfbcfbdffd1d1f057216d
d31131bba5a1630304c55ea775c48cc84912ab59

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
0c13880ccc ibcore: Support rate limit for packet pacing
Add new member rate_limit to ib_qp_attr which holds the packet pacing
rate in kbps, 0 means unlimited.

IB_QP_RATE_LIMIT is added to ib_attr_mask and could be used by RAW
QPs when changing QP state from RTR to RTS, RTS to RTS.

Linux commit:
528e5a1bd3f0e9b760cb3a1062fce7513712a15d

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
3d2fb36a9c ibcore: Add new IB rates.
Add the new rates that were added to Infiniband spec as part of
HDR and 2x support.

Linux commit:
a5a5d1993696419e7d5357fc3128e53d219d382e

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
721b795b72 ibcore: Don't allocate method table, if already present.
This commit aligns the code in question with upstream Linux.

Linux commit:
2468b82d69e3a53d024f28d79ba0fdb8bf43dfbf

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
c6ccb08686 ibcore: Fix a use-after-free in ucma_resolve_ip().
There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0                            CPU1
ucma_resolve_ip():              ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

        list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                mutex_lock(&mut);
                idr_remove(&ctx_idr, ctx->id);
                mutex_unlock(&mut);
                ...
                mutex_lock(&mut);
                if (!ctx->closing) {
                        mutex_unlock(&mut);
                        rdma_destroy_id(ctx->cm_id);
                ...
                ucma_free_ctx(ctx);
        }

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too which tests the
refcnt and waits for the last one releasing it. The similar
pattern is already used by ucma_destroy_id().

Linux commit:
5fe23f262e0548ca7f19fb79f89059a60d087d22

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
20fea7ac64 ibcore: Define option to set ack timeout.
Define new option in 'rdma_set_option' to override calculated QP timeout
when requested to provide QP attributes to modify a QP.

At the same time, pack tos_set to be bitfield.

Linux commit:
2c1619edef61a03cb516efaa81750784c3071d10

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
df1df0c742 ibcore: Do not overreact to SM LID change event.
When IPoIB receives an SM LID change event, it reacts by flushing its
path record cache and rejoining multicast groups. This is the same
behavior it performs when it receives a reregistration event. This
behavior is unnecessary as an SM may have database backup or
synchronization mechanisms which permit the SM location or LID to change
without loss of multicast membership and without impact to path records.

Both opensm and the OPA FM issue reregistration events if a new SM is
started (or restarted with a new config) or an SM event occurs which
results in loss of multicast membership records by the SM (such as
opensm failover) or the SM encounters new nodes with Active ports (such
as after joining 2 fabrics by connecting switches via ISLs). Hence this
event can be depended on as the trigger for IPoIB cache and multicast
flushing.

It appears that some drivers, such as qib, and hfi1 issue the
IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
that Mellanox EDR HCAs do not generate SM change events and that opensm
does generate reregistration.

An SM LID change event is generated by the mentioned drivers to reflect
that sm_lid and/or sm_sl in the local port info has changed. The intent
of this event is to permit applications and ULPs which have a local copy
of this information (or an address handle using it) to update their
information.

The intent is that the reregistration event (caused by the SM via a bit
in Set(PortInfo)) be used to inform nodes that they need to rejoin
multicast groups, resubscribe for notices and potentially update path
records.

When an SM migrates or fails over, a SM LID change event can occur. In
response IPoIB discards path records and multicast membership and loses
connectivity until these records are restored via SA requests. In very
large fabrics, it may take minutes for the SM to be ready and for the SA
responses to be supplied.  This can result in undesirable and
unnecessary IPoIB connectivity impacts. It also can result in an
unnecessary storm of SA queries from all nodes in a cluster potentially
followed by yet another storm if the SM issues the reregistration
request.

The fact the Mellanox HCAs do not even generate this event, is further
evidence that on modern IB fabrics there will be no ill side effects
from the proposed changes below to reduce the reaction by 3 kernel
components to this event. So these changes should be benign for Mellanox
IB fabrics and will benefit OPA fabrics while also making ib_core and
ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
APIs.

Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
anything on SM LID change. IPoIB makes use of other ib_core components
to issue SA requests for it and those components correctly track SM LID
and SM LID changes.

Also in ib_core multicast handling,  remove the test for
IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
error state, which will trigger rejoins. This code is used by IPoIB as
well as the connection manager and other clients of multicast groups.
This kernel module centralizes group membership status and joins since a
node can only join a given group once but multiple ULPs or applications
may want to join the same group. It makes use of the sa_query.c
component in ib_core, which correctly trackes SM LID and SL. This
component does not track SM LID nor SL itself and hence need not react
to their changes.

Similarly in the ib_core cache code remove the handling for the
IB_EVENT_SM_CHANGE.  In this function. The ib_cache_update function
which is ultimately called is updating local copies of the pkey table,
gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
such it does not need to be called on an SM LID change. It technically
also does not need to be called on a reregistration. The LID_CHANGE,
PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
PORT_ACTICE) should be sufficient triggers.

It is worth noting that the alternative of simply having the hfi1 and
qib drivers not generate the SM LID change event was explored. While
this would duplicate what Mellanox drivers do now, it is not the correct
behavior and removes the ability for an SM to migrate without requiring
reregistration. Since both opensm and OPA SM have mechanisms to backup
or synchronize registration information, it is desirable to let them
perform SM migrations (with LID or SL changes) without requiring
reregistration when they deem it appropriate.

Linux commit:
ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:32 +02:00
Hans Petter Selasky
26646ba5bc ibcore: Remove debug prints after allocation failure.
The prints after [k|v][m|z|c]alloc() functions are not needed,
because in case of failure, allocator will print their internal
error prints anyway.

Linux commit:
2716243212241855cd9070883779f6e58967dec5

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
468a6b5055 ibcore: Fix use-after-free in IB mad completion handling.
We encountered a use-after-free bug when unloading the driver:

BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862

Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
__kasan_report+0x15c/0x1df
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
kasan_report+0xe/0x20
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
find_mad_agent+0xa00/0xa00 [ib_core]
qlist_free_all+0x51/0xb0
mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib]
quarantine_reduce+0x1fa/0x270
kasan_unpoison_shadow+0x30/0x40
ib_mad_recv_done+0xdf6/0x3000 [ib_core]
_raw_spin_unlock_irqrestore+0x46/0x70
ib_mad_send_done+0x1810/0x1810 [ib_core]
mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib]
_raw_spin_unlock_irqrestore+0x46/0x70
debug_object_deactivate+0x2b9/0x4a0
__ib_process_cq+0xe2/0x1d0 [ib_core]
ib_cq_poll_work+0x45/0xf0 [ib_core]
process_one_work+0x90c/0x1860
pwq_dec_nr_in_flight+0x320/0x320
worker_thread+0x87/0xbb0
__kthread_parkme+0xb6/0x180
process_one_work+0x1860/0x1860
kthread+0x320/0x3e0
kthread_park+0x120/0x120
ret_from_fork+0x24/0x30
...
Freed by task 31682:
save_stack+0x19/0x80
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x2f0
ib_mad_port_close+0x200/0x380 [ib_core]
ib_mad_remove_device+0xf0/0x230 [ib_core]
remove_client_context+0xa6/0xe0 [ib_core]
disable_device+0x14e/0x260 [ib_core]
__ib_unregister_device+0x79/0x150 [ib_core]
ib_unregister_device+0x21/0x30 [ib_core]
mlx4_ib_remove+0x162/0x690 [mlx4_ib]
mlx4_remove_device+0x204/0x2c0 [mlx4_core]
mlx4_unregister_interface+0x49/0x1d0 [mlx4_core]
mlx4_ib_cleanup+0xc/0x1d [mlx4_ib]
__x64_sys_delete_module+0x2d2/0x400
do_syscall_64+0x95/0x470
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The problem was that the MAD PD was deallocated before the MAD CQ.
There was completion work pending for the CQ when the PD got deallocated.
When the mad completion handling reached procedure
ib_mad_post_receive_mads(), we got a use-after-free bug in the following
line of code in that procedure:
sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
(the pd pointer in the above line is no longer valid, because the
pd has been deallocated).

We fix this by allocating the PD before the CQ in procedure
ib_mad_port_open(), and deallocating the PD after freeing the CQ
in procedure ib_mad_port_close().

Since the CQ completion work queue is flushed during ib_free_cq(),
no completions will be pending for that CQ when the PD is later
deallocated.

Note that freeing the CQ before deallocating the PD is the practice
in the ULPs.

Linux commit:
770b7d96cfff6a8bf6c9f261ba6f135dc9edf484

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
507389a35a ibcore: Fail early if unsupported QP is provided.
When requested QP type is not supported for a {device, port}, return the
error right away before validating all parameters during mad agent
registration time.

Linux commit:
798bba01b44b0ddf8cd6e542635b37cc9a9b739c

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
e2ae502d28 ibcore: Use inline function to validate port
Linux commit:
24dc831b77eca9361cf835be59fa69ea0e471afc

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
31525faed8 ibcore: Validate port number in query_pkey verb.
Before calling the driver's function let's make sure port is valid.

Linux commit:
9af3f5cf9d64a056eca53bc643f6288ad28bbbb5

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
912e98cede ibcore: Protect against concurrent access to hardware stats.
Currently access to hardware stats buffer isn't protected, this can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.

Linux commit:
e945130b52bea65d15f9bdf54949d4cb7a88db7f

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
d8cbfa101c mlx5core: Make sure error code is propagated on error.
If mlx5_init_once() fails, mlx5_load_one() should fail too, else the
device instance remains attached causing problems at reboot.

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
ac4174e064 ibcore: Do not expose unsupported counters.
If the provider driver (such as rdma_rxe) doesn't support PMA counters,
avoid exposing its directory similar to optional hw_counters directory.
If core fails to read the PMA counter, return an error so that user can
retry later if needed.

Linux commit:
0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
4238b4a7a2 ibcore: Introduce ib_port_phys_state enum.
In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.

Linux commit:
72a7720fca37fec0daf295923f17ac5d88a613e1

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:31 +02:00
Hans Petter Selasky
d7d833e20b ibcore: Fix unable to change lifespan entry for hw_counters.
This patch fixes the case where 'lifespan' entry of the hw_counters
is not writable. Currently write callback is not exposed for for
the hw_counters sysfs operation. Due to this, modifying lifespan
value results into permission denied error in below example.

echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan
-bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan:
Permission denied

This patch adds the hook to modify any attribute which implements
store() operation.

Linux commit:
79c4d80b43b8e43684894574a508a871f0c196bf

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
12a913d207 ibcore: Issue DREQ when receiving REQ/REP for stale QP.
From "InfiBand Architecture Specifications Volume 1":

  A QP is said to have a stale connection when only one side has
  connection information. A stale connection may result if the remote CM
  had dropped the connection and sent a DREQ but the DREQ was never
  received by the local CM. Alternatively the remote CM may have lost
  all record of past connections because its node crashed and rebooted,
  while the local CM did not become aware of the remote node's reboot
  and therefore did not clean up stale connections.

And:

  A local CM may receive a REQ/REP for a stale connection. It shall
  abort the connection issuing REJ to the REQ/REP. It shall then issue
  DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP.

This patch solves a problem with reuse of QPN. Current codebase, that
is IPoIB, relies on a REAP-mechanism to do cleanup of the structures
in CM. A problem with this is the timeconstants governing this
mechanism; they are up to 768 seconds and the interface may look
inresponsive in that period.  Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.

Linux commit:
9315bc9a133011fdb084f2626b86db3ebb64661f

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
2f05418748 ibcore: Fix memory leak in cm_req_handler error flows.
In the cm_req_handler() error flows, sometimes cm_id_priv->timewait_info
isn't free'd.

Linux commit:
8b00914654ef56ff5473f4fe1f1168254dbb8a17

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
f48e85dfe2 ibcore: Move debug counters to be under relevant IB device
The sysfs layout is created by CM incorrectly presented RDMA devices with
InfiniBand link layer. Layout of such devices represents device tree of
connections. By moving CM statistics to be under relevant port of IB
device, we will fix the following issues:

* Symlink name - It used device name instead of specific identifier.
* Target location - It was supposed to point to PCI-ID/infiniband_cm/
  instead of PCI-ID/infiniband/
* Target name - It created extra device file under already existing
  device folder, e.g. mlx5_0/mlx5_0
* Crash during boot with RDMA persistent naming patches.

sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0'
CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178
Call Trace:
dump_stack+0xcc/0x180
sysfs_warn_dup.cold.3+0x17/0x2d
sysfs_do_create_link_sd.isra.2+0xd0/0xf0
device_add+0x7cb/0x1450
device_create_groups_vargs+0x1ae/0x220
device_create+0x93/0xc0
cm_add_one+0x38f/0xf60 [ib_cm]
add_client_context+0x167/0x210 [ib_core]
enable_device_and_get+0x230/0x3f0 [ib_core]
ib_register_device+0x823/0xbf0 [ib_core]
__mlx5_ib_add+0x45/0x150 [mlx5_ib]
mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib]
mlx5_add_device+0x130/0x3a0 [mlx5_core]
mlx5_register_interface+0x1a9/0x270 [mlx5_core]
do_one_initcall+0x14f/0x5de
do_init_module+0x247/0x7c0
load_module+0x4c2f/0x60d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

After this change:
[leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_duplicates
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_rx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_msgs
drwxr-xr-x  2 root root    0 Mar 11 11:17 cm_tx_retries

Linux commit:
c87e65cfb97c7f325132a68288ed76ba7bdcd2c6

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
8d04583de5 ibcore: Fix memory leak in cm_add/remove_one.
In the process of moving the debug counters sysfs entries, the commit
mentioned below eliminated the cm_infiniband sysfs directory.

This sysfs directory was tied to the cm_port object allocated in procedure
cm_add_one().

Before the commit below, this cm_port object was freed via a call to
kobject_put(port->kobj) in procedure cm_remove_port_fs().

Since port no longer uses its kobj, kobject_put(port->kobj) was eliminated.
This, however, meant that kfree was never called for the cm_port buffers.

Fix this by adding explicit kfree(port) calls to functions cm_add_one()
and cm_remove_one().

Note that the kfree call in the first chunk below, in the cm_add_one error
flow, fixes an old, undetected memory leak.

Linux commit:
94635c36f3854934a46d9e812e028d4721bbb0e6

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
0f2e5b432d ibcore: Block processing of alternate path handling in RoCE RX CM messages.
Due to the below reasons, it is better to not support alternate path receive
messages for RoCE in near term.

1. Alternate path for RoCE is not supported at rdmacm layer.
2. It is not supported in uverbs/core layer for RoCE.
3. Alternate path for IPv6 for link local address cannot resolve route
determinstically without a valid incoming interface ID whose usecase
make sense only with dual port mode.
4. init_av_from_path while processing LAP messages for IB and RoCE can
lead to adding duplicate entry of AV into the port list, leads to list
corruption.
5. rdma-core userspace a well known userspace implementation has removed
support of libucm which use ucm.ko module, which is the only module that
can trigger alternate path related messages.
6. ucm kernel module is requested to be removed from the IB core in
the following patch, https://patchwork.kernel.org/patch/10268503/ .

Linux commit:
97c45c2c28cd291e06778d9d36a0f60ee74726bc

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00
Hans Petter Selasky
bf5075e41e ibcore: Store and restore ah_attr during LAP msg processing.
During CM LAP processing, ah_attr is reinitialized on receiving
a LAP request. First likely during CM request processing.

ah_attr might get zeroed out if LAP processing fails.
Therefore, try to create a new ah_attr for the LAP message.
If the initialization fails, continue with older ah_attr.
If the initialization passes, consider the new ah_attr by
overwriting the older one.

Linux commit:
0e225dcb7681c0a8e52fb9dc68bd8ab973de4ca2

MFC after:	1 week
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-07-12 14:22:30 +02:00