Before calling the driver's function let's make sure port is valid.
Linux commit:
9af3f5cf9d64a056eca53bc643f6288ad28bbbb5
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
Currently access to hardware stats buffer isn't protected, this can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.
Linux commit:
e945130b52bea65d15f9bdf54949d4cb7a88db7f
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
If the provider driver (such as rdma_rxe) doesn't support PMA counters,
avoid exposing its directory similar to optional hw_counters directory.
If core fails to read the PMA counter, return an error so that user can
retry later if needed.
Linux commit:
0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.
Linux commit:
72a7720fca37fec0daf295923f17ac5d88a613e1
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
This patch fixes the case where 'lifespan' entry of the hw_counters
is not writable. Currently write callback is not exposed for for
the hw_counters sysfs operation. Due to this, modifying lifespan
value results into permission denied error in below example.
echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan
-bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan:
Permission denied
This patch adds the hook to modify any attribute which implements
store() operation.
Linux commit:
79c4d80b43b8e43684894574a508a871f0c196bf
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
From "InfiBand Architecture Specifications Volume 1":
A QP is said to have a stale connection when only one side has
connection information. A stale connection may result if the remote CM
had dropped the connection and sent a DREQ but the DREQ was never
received by the local CM. Alternatively the remote CM may have lost
all record of past connections because its node crashed and rebooted,
while the local CM did not become aware of the remote node's reboot
and therefore did not clean up stale connections.
And:
A local CM may receive a REQ/REP for a stale connection. It shall
abort the connection issuing REJ to the REQ/REP. It shall then issue
DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP.
This patch solves a problem with reuse of QPN. Current codebase, that
is IPoIB, relies on a REAP-mechanism to do cleanup of the structures
in CM. A problem with this is the timeconstants governing this
mechanism; they are up to 768 seconds and the interface may look
inresponsive in that period. Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.
Linux commit:
9315bc9a133011fdb084f2626b86db3ebb64661f
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
The sysfs layout is created by CM incorrectly presented RDMA devices with
InfiniBand link layer. Layout of such devices represents device tree of
connections. By moving CM statistics to be under relevant port of IB
device, we will fix the following issues:
* Symlink name - It used device name instead of specific identifier.
* Target location - It was supposed to point to PCI-ID/infiniband_cm/
instead of PCI-ID/infiniband/
* Target name - It created extra device file under already existing
device folder, e.g. mlx5_0/mlx5_0
* Crash during boot with RDMA persistent naming patches.
sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0'
CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178
Call Trace:
dump_stack+0xcc/0x180
sysfs_warn_dup.cold.3+0x17/0x2d
sysfs_do_create_link_sd.isra.2+0xd0/0xf0
device_add+0x7cb/0x1450
device_create_groups_vargs+0x1ae/0x220
device_create+0x93/0xc0
cm_add_one+0x38f/0xf60 [ib_cm]
add_client_context+0x167/0x210 [ib_core]
enable_device_and_get+0x230/0x3f0 [ib_core]
ib_register_device+0x823/0xbf0 [ib_core]
__mlx5_ib_add+0x45/0x150 [mlx5_ib]
mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib]
mlx5_add_device+0x130/0x3a0 [mlx5_core]
mlx5_register_interface+0x1a9/0x270 [mlx5_core]
do_one_initcall+0x14f/0x5de
do_init_module+0x247/0x7c0
load_module+0x4c2f/0x60d0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
After this change:
[leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/
drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_duplicates
drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_msgs
drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_msgs
drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_retries
Linux commit:
c87e65cfb97c7f325132a68288ed76ba7bdcd2c6
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
In the process of moving the debug counters sysfs entries, the commit
mentioned below eliminated the cm_infiniband sysfs directory.
This sysfs directory was tied to the cm_port object allocated in procedure
cm_add_one().
Before the commit below, this cm_port object was freed via a call to
kobject_put(port->kobj) in procedure cm_remove_port_fs().
Since port no longer uses its kobj, kobject_put(port->kobj) was eliminated.
This, however, meant that kfree was never called for the cm_port buffers.
Fix this by adding explicit kfree(port) calls to functions cm_add_one()
and cm_remove_one().
Note that the kfree call in the first chunk below, in the cm_add_one error
flow, fixes an old, undetected memory leak.
Linux commit:
94635c36f3854934a46d9e812e028d4721bbb0e6
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
Due to the below reasons, it is better to not support alternate path receive
messages for RoCE in near term.
1. Alternate path for RoCE is not supported at rdmacm layer.
2. It is not supported in uverbs/core layer for RoCE.
3. Alternate path for IPv6 for link local address cannot resolve route
determinstically without a valid incoming interface ID whose usecase
make sense only with dual port mode.
4. init_av_from_path while processing LAP messages for IB and RoCE can
lead to adding duplicate entry of AV into the port list, leads to list
corruption.
5. rdma-core userspace a well known userspace implementation has removed
support of libucm which use ucm.ko module, which is the only module that
can trigger alternate path related messages.
6. ucm kernel module is requested to be removed from the IB core in
the following patch, https://patchwork.kernel.org/patch/10268503/ .
Linux commit:
97c45c2c28cd291e06778d9d36a0f60ee74726bc
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
During CM LAP processing, ah_attr is reinitialized on receiving
a LAP request. First likely during CM request processing.
ah_attr might get zeroed out if LAP processing fails.
Therefore, try to create a new ah_attr for the LAP message.
If the initialization fails, continue with older ah_attr.
If the initialization passes, consider the new ah_attr by
overwriting the older one.
Linux commit:
0e225dcb7681c0a8e52fb9dc68bd8ab973de4ca2
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
rdma_reject_msg() returns a pointer to a string message associated with
the transport reject reason codes.
Linux commit:
77a5db13153906a7e00740b10b2730e53385c5a8
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
In cm_form_tid(), a two bit message sequence number is OR'ed into bit
31-30 of the lower TID value.
After Linux commit f06d26537559 ("IB/cm: Randomize starting comm ID"), the
local_id is XOR'ed with a 32-bit random value. Hence, bit 31-30 in the
lower TID now has an arbitrarily value and it makes no sense to OR in
the message sequence number.
Adding to that, the evolution in use of IDR routines in cm_alloc_id()
has always had the possibility of returning a value with bit 30 set.
In addition, said bits are never checked.
Hence, remove the encoding and the corresponding enum.
Linux commit:
87a37ce9e400e40daee537ff95343e3c94743c6d
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
Remove not needed error handling when destroying a CQ. The function in
question will later on be updated to return "void".
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
Given all the code does operate on struct ifnet, the last step in this
longer series of changes now is to rename struct net_device to
struct ifnet (that is what it was defined to in the LinuxKPi code).
While mlx4 and OFED are "shared" code the decision was made years ago
to not write it based on the netdevice KPI but the native ifnet KPI
for most of it. This commit simply spells this out and with that
frees "struct netdevice" to be re-done on LinuxKPI to become a more
native/mixed implementation over time as needed by, e.g., wireless
drivers.
Sponsored by: The FreeBSD Foundation
MFC after: 10 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D30515
The LinuxKPI net_device actually is an ifnet; in order to further
clean that up so we can extend "net_device" migrate the few macros
left into ofed and make sure the header is included in all files
which need access to the macros.
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30477
This removes all unused bits from linux/netdevice.h and migrates two
inline functions into the mlx4 and ofed code respectively.
This gets the mlx4/ofed (struct ifnet) specific bits down to 7 lines
in netdevice.h.
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
Reviewed by: hselasky, kib
Differential Revision: https://reviews.freebsd.org/D30461
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
The two functions in linux/inetdevice.h are highly FreeBSD/ifnet
specific. This is a result of struct net_device being mapped to
struct ifnet.
The only known consumer of these functions are two files in the
ofed/infiniband code.
As a first step of cleaning up copy linux/inetdevice.h to
rdma/ib_addr_freebsd.h. (It stayed a separate file to preserve
copyright and license of the original file; otherwise it could be
merged into ib_addr.h where more EPOCH/vnet/.. are already used).
Slightly rename the function to not conflict with LinuxKPI
in the future.
Remove the three last, now unneeded includes of inetdevice.h and
zap linux/inetdevice.h to an empty header file with only the forward
include to netdevice.h remaining.
Sponsored-by: The FreeBSD Foundation
MFC-after: 2 weeks
Reviewed-by: hselasky, kib
X-D-R: D29366 (extracted as further cleanup)
Differential Revision: https://reviews.freebsd.org/D29434
The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
to mbuf clusters when jumbo frames are not available and it also doesn't
account for the packet header and trailer mbuf.
This causes a memory overwrite situation when IPOIB_CM is configured.
While at it add a kernel assert to ensure the mapping array is not overwritten.
PR: 254474
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
We are not aware of any out-of-tree consumers anymore
which would need KPI support for before Linux version 5.
Update the two in-tree consumers to use the new KPI.
This allows us to remove the extra version check and
will also give access to {lower,upper}_32_bits() unconditionally.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: hselasky, rlibby, rstone
MFC-after: 2 weeks
X-MFC: to 13 only
Differential Revision: https://reviews.freebsd.org/D29391
In the notifier event callback function rather than casting directly
to the expected type use the proper accessor function as the mlx drivers
already do.
This is preparational work to allow us to improve the struct net_device
is struct ifnet compat code shortcut in the future.
Obtained-from: bz_iwlwifi
Sponsored-by: The FreeBSD Foundation
MFC-after: 2 weeks
Reviewed-by: hselasky
Differential Revision: https://reviews.freebsd.org/D29364
The int in the argument to the ternary triggered -Wint-in-bool-context
from gcc. Upstream linux has a larger and more entangled patch,
12f727721eee61b3d19dedb95cb893b2baa9fe41, which doesn't apply cleanly.
When we eventually sync that, we can just drop this change.
Reviewed by: hselasky, imp, kib
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28762
TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in
exactly the same form that it appears here modulo slightly different
context. It used to be the case that there was a single pr_usrreq
method with requests dispatched to it; these exact two lines appeared in
tcp_usrreq's PRU_ATTACH handling.
The only purpose of this that I can find is to cause surprising behavior
on accepted connections. Newly-created sockets will never hit these
paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is
set on a listening socket and inherited, one would expect the timeout to
be inherited rather than changed arbitrarily like this -- noting that
SO_LINGER is nonsense on a listening socket beyond inheritance, since
they cannot be 'connected' by definition.
Neither Illumos nor Linux reset the timer like this based on testing and
inspection of Illumos, and testing of Linux.
Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D28265
When OFED was upgraded to Linux v4.9, a bunch of Linux-specific
netlink changes were dropped. Unfortunately, there was a mismerge
in this process and as a result ib_sa_cancel_query() would fail to
cancel an outstanding MAD.
This was causing rdma_destroy_id() to hang indefinitely waiting
for the MAD to complete and release the final reference.
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D28421
Reviewed by: hselasky, kib
MFC after: 2 months
This change include several changes as listed below all related to UAR.
UAR is a special PCI memory area where the so-called doorbell register and
blue flame register live. Blue flame is a feature for sending small packets
more efficiently via a PCI memory page, instead of using PCI DMA.
- All structures and functions named xxx_uuars were renamed into xxx_bfreg.
- Remove partially implemented Blueflame support from mlx5en(4) and mlx5ib.
- Implement blue flame register allocator.
- Use blue flame register allocator in mlx5ib.
- A common UAR page is now allocated by the core to support doorbell register
writes for all of mlx5en and mlx5ib, instead of allocating one UAR per
sendqueue.
- Add support for DEVX query UAR.
- Add support for 4K UAR for libmlx5.
Linux commits:
7c043e908a74ae0a935037cdd984d0cb89b2b970
2f5ff26478adaff5ed9b7ad4079d6a710b5f27e7
0b80c14f009758cefeed0edff4f9141957964211
30aa60b3bd12bd79b5324b7b595bd3446ab24b52
5fe9dec0d045437e48f112b8fa705197bd7bc3c0
0118717583cda6f4f36092853ad0345e8150b286
a6d51b68611e98f05042ada662aed5dbe3279c1e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Use the native vnode lookup functions, instead of going via the LinuxKPI,
because the file referenced is typically created outside the LinuxKPI, and
the LinuxKPI's fdget() can only resolve file descriptor numbers which
were created by itself.
The vnode pointer is used as an identifier to identify XRCD handles which
are sharing resources.
This patch fixes the so-called XRCD support in ibcore for FreeBSD.
Refer to ibv_open_xrcd(3) for more information how the file descriptor
argument is used.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Call M_SETFIB() to make sure the IPoIB packet is directed to the correct
interface-specific FIB.
This was sufficient to allow general-purpose routing using the default FIB,
and a separate FIB for routing between IPoIB on ib0 and IPoEthernet on mce0.
Reviewed by: hselasky
Obtained from: Anmol Kumar <anmolk at panasas dot com>
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D25239
Coverity claims the call to rdma_gid2ip in cma_igmp_send overwrites addr.
Use a consistent definition of sockaddr to prevent detections and code
changes in the future.
Submitted by: bret_ketchum@dell.com
Reported by: Coverity
Reviewed by: hselasky, kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26229
Currently the linking order of the infiniband, IB, modules decide in which
order the clients are attached and detached. For example one IB client may
use resources from another IB client. This can lead to a potential deadlock
at shutdown. For example if the ipoib is unregistered after the ib_multicast
client is detached, then if ipoib is using multicast addresses a deadlock may
happen, because ib_multicast will wait for all its resources to be freed before
returning from the remove method.
Fix this by using module_xxx_order() instead of module_xxx().
Differential Revision: https://reviews.freebsd.org/D23973
MFC after: 1 week
Sponsored by: Mellanox Technologies
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
This reduces the number of references to VLAN_TRUNKDEV() in ibcore.
Currently only VLAN is supported as a child interface in FreeBSD.
Remove superfluous RCU locking.
Sponsored by: Mellanox Technologies
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
When the software send queue gets filled up, callbacks to
if_transmit will stop. Make sure the transmit callback
routine checks the send queue and outputs any remaining
mbufs. Else the remaining mbufs may simply sit in the
output queue blocking the transmit path.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The mistake came about like this: the first attempt to commit was blocked by
a pre-commit hook due to missing SVN tags. svn revert doesn't delete new
files, I guess. While reapplying the fixed diff, the non-empty target file
was just concatenated with the new contents? Ugh. :-(
This regression was introduced in the r326169 Linux v4.9 Infiniband upgrade.
Restore the functionality.
Reviewed by: hselasky
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21298
This allows use of the shared _net_inet_sdp in more than one compilation
unit. (Nothing in-tree uses this today, but some of Isilon's out-of-tree
SDP enhancements add sysctls below the node.)
Sponsored by: Dell EMC Isilon
In current RDMACM implementation RDMACM server will not find a GID
index when the request was prio-tagged and the sever is non
prio-tagged and vise-versa.
According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should
be considered as untagged. Treat RDMACM request the same.
Reviewed by: hselasky, kib
MFC after: 3 Days
Sponsored by: Mellanox Technologies
This was enumerated with exhaustive search for sys/eventhandler.h includes,
cross-referenced against EVENTHANDLER_* usage with the comm(1) utility. Manual
checking was performed to avoid redundant includes in some drivers where a
common os_bsd.h (for example) included sys/eventhandler.h indirectly, but it is
possible some of these are redundant with driver-specific headers in ways I
didn't notice.
(These CUs did not show up as missing eventhandler.h in tinderbox.)
X-MFC-With: r347984
Add the new rates that were added to the Infiniband specification as part of
HDR and 2x support.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
ib_req_notify_cq may return negative value which will indicate a
failure. In the case of uncorrectable error, we will end up in an
endless loop. Fix that, by going to another loop with poll_more
only if there is anything left to poll.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead. Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: tuexen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18728
Modify QP can fail and it can be acceptable, like when moving from RST to
ERR state, all the rest are not acceptable and a message to the log
should be printed.
The current code prints on all failures and many messages like:
"Failed to modify QP to ERROR state" appear, even when supported by the
state machine of the QP object.
Linux commit:
5dc78ad1904db597bdb4427f3ead437aae86f54c
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a packet needs fragmentation, it might generate more than 3 fragments.
With the queue length 3, all fragments are generated faster than the
queue is drained, which effectively drops fourth and later fragments on
the floor.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When changing the MTU of ibX network interfaces, check that the MTU was really
changed before requesting an update of the multicast rules. Else we might go
into an infinite loop joining and leaving ibX multicast groups towards the
opensm master interface.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
It is not enough to set ifnet->if_mtu to change the interface MTU.
System saves the MTU for route in the radix tree, and route cache keeps
the interface MTU as well. Since addition of the multicast group causes
recalculation of MTU, even bringing the interface up changes MTU from
4042 to 1500, which makes the system configuration inconsistent. Worse,
ip_output() prefers route MTU over interface MTU, so large packets are
not fragmented and dropped on floor.
Fix it for ipoib(4) using the same approach (or hack) as was applied
for it_tun/if_tap in r339012. Thanks to bz@ for giving the hint.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Binding to a loopback device is not allowed. Make sure the destination
device address is global by clearing the bound device interface.
Only do this conditionally, else link local addresses won't work.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
A couple of places in the CM do
spin_lock_irq(&cm_id_priv->lock);
...
if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg))
However when the underlying transport is RoCE, this leads to a sleeping function
being called with the lock held - the callchain is
cm_alloc_response_msg() ->
ib_create_ah_from_wc() ->
ib_init_ah_from_wc() ->
rdma_addr_find_l2_eth_by_grh() ->
rdma_resolve_ip()
and rdma_resolve_ip() starts out by doing
req = kzalloc(sizeof *req, GFP_KERNEL);
not to mention rdma_addr_find_l2_eth_by_grh() doing
wait_for_completion(&ctx.comp);
to wait for the task that rdma_resolve_ip() queues up.
Fix this by moving the AH creation out of the lock.
Linux commit:
c76161181193985087cd716fdf69b5cb6cf9ee85
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Trying to validate loopback fails because rtalloc1() resolves system
local addresses to the loopback network interface, lo0. Fix this by
explicitly checking for loopback during validation of the source
and destination network address. If the source address belongs to
a local network interface and is equal to the destination address,
there is no need to run the destination address through rtalloc1().
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The master network interface and the VLANs may reside in different VNETs.
Make sure that all VNETs are searched when scanning for GID entries.
Submitted by: netapp
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This prevents code from accepting RoCEv1 connections when
only ROCEv2 is enabled and vice versa.
Linux commit:
0c4386ec77cfcd0ccbdbe8c2e67dd3a49b2a4c7f
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
(80) elements. Hence compare the array index with that value instead
of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
reports the following:
Overrunning array class->method_table of 80 8-byte elements at element index 127
(byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class)
(which evaluates to 127).
Linux commit:
2fe2f378dd45847d2643638c07a7658822087836
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The port number in the listen_id_priv has been observed to be zero which
means no port has been selected. The current code lacks a check for invalid
port number.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Various network protocol sysctl handlers were not zero-filling their
output buffers and thus would export uninitialized stack memory to
userland. Fix a number of such handlers.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: tuexen
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18301
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.
Now that we have the GID index available, introduce SGID index in
incoming CM requests and refer to the netdevice of it.
While at it fix some incorrect uses of init_net and make sure
the rdma_create_id() function stores the VNET it is passed.
Based on linux commit:
cee104334c98dd04e9dd4d9a4fa4784f7f6aada9
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
Also fix the validate_ipv4_net_dev() and validate_ipv6_net_dev() functions
which had source and destination addresses swapped, and didn't set the
scope ID for IPv6 link-local addresses.
This allows applications like krping to work using IPoIB devices.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
In r337943 ifnet's if_pcp was set to the PCP value in use
instead of IFNET_PCP_NONE.
Current ibcore code assumes that if_pcp is IFNET_PCP_NONE with
VLAN interfaces so it can identify prio-tagged traffic.
Fix that by explicitly verifying that that the if_type is IFT_ETHER
and not IFT_L2VLAN.
MFC after: 3 days
Approved by: re (Marius), hselasky (mentor), kib (mentor)
Sponsored by: Mellanox Technologies
Else a NULL VNET pointer should be ignored. This fixes address resolving
when VIMAGE is disabled.
MFC after: 3 days
Sponsored by: Mellanox Technologies
When creating address handle from multicast GID, set MAC according to
the appropriate formula instead of searching for it in the GID table:
- For IPv4 multicast GID use ip_eth_mc_map().
- For IPv6 multicast GID use ipv6_eth_mc_map().
Linux commit:
9636a56fa864464896bf7d1272c701f2b9a57737
MFC after: 1 week
Sponsored by: Mellanox Technologies
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
Linux commit:
6d337179f28cc50ddd7e224f677b4cda70b275fc
MFC after: 1 week
Sponsored by: Mellanox Technologies
ah_attr contains the port number to which cm_id is bound. However, while
searching for GID table for matching GID entry, the port number is
ignored.
This could cause the wrong GID to be used when the ah_attr is converted to
an AH.
Linux commit:
563c4ba3bd2b8b0b21c65669ec2226b1cfa1138b
MFC after: 1 week
Sponsored by: Mellanox Technologies
The current implementation assumes a static mapping between
the TOS bits and the priority code point, PCP bits.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a loopback address is detected use the network interface which
has the loopback flag set to trigger loopback logic in address resolve.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The ib_uverbs_create_ah() ind ib_uverbs_modify_qp() calls receive
the port number from user input as part of its attributes and assumes
it is valid. Down on the stack, that parameter is used to access kernel
data structures. If the value is invalid, the kernel accesses memory
it should not. To prevent this, verify the port number before using it.
Linux commit:
5ecce4c9b17bed4dc9cb58bfb10447307569b77b
a62ab66b13a0f9bcb17b7b761f6670941ed5cd62
5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3
MFC after: 1 week
Sponsored by: Mellanox Technologies
RoCEv1 does not use the IPv6 stack to resolve the link local DGID since it
uses GID address. It forms the DMAC directly from the DGID.
Linux commit:
56d0a7d9a0f045ee27a001762deac28c7d28e2e4
MFC after: 1 week
Sponsored by: Mellanox Technologies
This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().
This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().
This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.
Linux commit:
4be3a4fa51f432ef045546d16f25c68a1ab525b9
MFC after: 1 week
Sponsored by: Mellanox Technologies
Garbage supplied by user will cause to UCMA module provide zero
memory size for memcpy(), because it wasn't checked, it will
produce unpredictable results in rdma_resolve_addr().
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB. When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.
Fix this by introducing new variants
int rdma_addr_size_in6(struct sockaddr_in6 *addr);
int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);
that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in. We can use
these new variants to check what size userspace has passed in before
copying any addresses.
Linux commit:
2975d5de6428ff6d9317e9948f0968f7d42e5d74
09abfe7b5b2f442a85f4c4d59ecf582ad76088d7
84652aefb347297aa08e91e283adf7b18f77c2d5
MFC after: 1 week
Sponsored by: Mellanox Technologies
This was done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
Linux commit:
4b658d1bbc16605330694bb3ef2570c465ef383d
8b77586bd8fe600d97f922c79f7222c46f37c118
MFC after: 1 week
Sponsored by: Mellanox Technologies
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.
Linux commit:
f45765872e7aae7b81feb3044aaf9886b21885ef
MFC after: 1 week
Sponsored by: Mellanox Technologies