While at it remove unused interface state bits. This also fixes an issue
during shutdown:
There is an issue where the firmware fails during mlx5_load_one,
the health_care timer detects the issue and schedules a health_care call.
Then the mlx5_load_one detects the issue, cleans up and quits. Then
the health_care starts and calls mlx5_unload_one to clean up the resources
that no longer exist, which causes a kernel panic.
The root cause is that the bit MLX5_INTERFACE_STATE_DOWN is not set
after mlx5_load_one fails. The solution is to remove the bit
MLX5_INTERFACE_STATE_DOWN and to quit mlx5_unload_one if the
bit MLX5_INTERFACE_STATE_UP is not set. The bit MLX5_INTERFACE_STATE_DOWN
is redundant and we can use MLX5_INTERFACE_STATE_UP instead.
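
The guard described above can be sketched in plain C as follows. This is
a minimal user-space illustration, not the driver code: struct
demo_state_dev, demo_load_one(), demo_unload_one() and
DEMO_INTERFACE_STATE_UP are hypothetical stand-ins, and only the idea of
testing a single UP bit before unloading comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the single remaining interface state bit. */
#define	DEMO_INTERFACE_STATE_UP	0x1u

struct demo_state_dev {
	unsigned int	intf_state;	/* bit mask of interface state bits */
	int		resources;	/* resources currently held */
};

/* Load: the UP bit is set only when every step succeeded. */
int
demo_load_one(struct demo_state_dev *dev, bool fw_ok)
{
	dev->resources = 1;
	if (!fw_ok) {
		dev->resources = 0;	/* clean up; UP stays clear */
		return (-1);
	}
	dev->intf_state |= DEMO_INTERFACE_STATE_UP;
	return (0);
}

/* Unload: quit early when the device never came up. */
int
demo_unload_one(struct demo_state_dev *dev)
{
	if ((dev->intf_state & DEMO_INTERFACE_STATE_UP) == 0)
		return (0);		/* nothing to tear down */
	assert(dev->resources > 0);
	dev->resources = 0;
	dev->intf_state &= ~DEMO_INTERFACE_STATE_UP;
	return (0);
}
```

With this shape a late health_care call after a failed load finds the UP
bit clear and returns without touching freed resources.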
Linux commit:
10a8d00707082955b177164d4b4e758ffcbd4017
b3cb5388499c5e219324bfe7da2e46cbad82bfcf
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add support for DIM based on Linux,
with some minor adaptions specific to FreeBSD.
Linux commit
f97c3dc3c0e8d23a5c4357d182afeef4c67f5c33
MFC after: 3 days
Sponsored by: Mellanox Technologies
Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.
Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.
Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets. When called with a domain on a NUMA machine,
if_alloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain()
which uses a driver supplied device_t to call if_alloc_domain() with
the appropriate domain.
Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.
Reviewed by: glebius, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19930
This allows efficient filtering at packet ingress on mlx5en.
Note that the packets are filtered (and potentially dropped) *before*
the driver has committed to (re)allocating an mbuf for the
packet. Dropped packets are treated essentially the same as an
error. Nothing is allocated, and the existing buffer is recycled. This
allows us to drop malicious packets at close to line rate with very
little CPU use.
Reviewed by: hselasky, slavash, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19063
The backpressure indication is implemented using an unlimited rate type of
mbuf send tag. When the upper layers, typically the socket layer, have obtained such
a tag, it can then query the destination driver queue for the current
amount of space available in the send queue.
A single mbuf send tag may be referenced multiple times and a refcount has been added
to the mlx5e_priv structure to track its usage. Because the send tag resides
in the mlx5e_channel structure, there is no need to wait for refcounts to reach
zero until the mlx5en(4) driver is detached. The channels structure is persistent
during the lifetime of the mlx5en(4) driver it belongs to and can therefore be accessed
without any need of synchronization.
The mlx5e_snd_tag structure was extended to contain a type field, because there are now
two different tag types which end up in the driver which need to be distinguished.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
In order to enable HW LRO, the "hw_lro" sysctl in the mlx5en(4) config
space must be set and the ifconfig(8) LRO capability must be enabled. Any other
combination of settings will disable HW LRO.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add counters for all transmitted and received bytes. Previously only
transmitted and received packets were counted. Fix the description of the RX LRO
counters while at it.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
By allocating the worst case size channel structure array
at attach time we can eliminate various NULL checks in the
fast path and also reduce the chance of use-after-free
issues in the transmit fast path.
This change is also a requirement for implementing
backpressure support.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Writing to the debug stats variable must be locked,
else serialization will be lost which might cause
various kernel panics due to creating and destroying
sysctls out of order.
Make sure the sysctl context is initialized after freeing
the sysctl nodes, else they can be freed twice.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Inspect the ethernet compliance code to figure out actual cable type by reading
the PDDR module info register.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This can happen when connections are short lived and leads to
a firmware error printout in dmesg, syndrome 0x51cfb0, because
the SQ is in the wrong state.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
1) Don't exceed the driver's own hardcoded TX inline limit.
The blueflame register size can be much greater than the hardcoded limit
for inlining. Make sure we don't exceed the driver's own limit, because this
also means that the maximum number of TX fragments becomes invalid and
then memory size assumptions in the TX path no longer hold up.
2) Make sure the mlx5_query_min_inline() function returns an error code.
3) Header inlining is required when using TSO.
4) Catch failure to compute inline header size for TSO.
5) Add support for UDP when computing inline header size.
6) Fix for inlining issues with regard to DSCP.
Make sure we inline 4 bytes beyond the ethernet and/or
VLAN header to work around a hardware bug extracting
the DSCP field from the IPv4/v6 header.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware queues are deep enough currently and using the DRBR and associated
callbacks only leads to more task switching in the TX path. There is also a race
setting the queue_state which can lead to hung TX rings.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add support for setting the bandwidth limit as a ratio rather than in bits per
second. The ratio must be an integer number between 1 and 100 inclusively.
Implement the needed firmware commands and SYSCTLs through mlx5en(4).
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Make sure the active width and speed are set in case the
translate_eth_proto_oper() function doesn't recognize the
current port operation mask.
Linux commit:
7672ed33c4c15dbe9d56880683baaba4227cf940
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
If the mlx5_ib_read_cong_stats() function was running when mlx5ib was unloaded,
because this function unconditionally restarts the timer, the timer can still
be pending after the delayed work has been cancelled. To fix this simply loop
on the delayed work cancel procedure as long as it returns non-zero.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Although "create_srq_user" does overwrite "in.pas" on some paths, it
also contains at least one feasible path which does not overwrite it.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
"fw_rev_min(dev->mdev)" with type "unsigned short" (16 bits, unsigned) is
promoted in "fw_rev_min(dev->mdev) << 16" to type "int" (32 bits, signed), then
sign-extended to type "unsigned long" (64 bits, unsigned). If
"fw_rev_min(dev->mdev) << 16" is greater than 0x7FFFFFFF, the upper bits of the
result will all be 1.
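
The effect can be reproduced outside the driver. The sketch below is an
illustration only and assumes a platform with a 32-bit int; both
function names are hypothetical. The first function emulates the
sign extension described above (the narrowing cast it uses is
implementation-defined in C), the second shows the usual remedy of
widening the operand before the shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Emulate the unfixed expression: the 32-bit shift result is
 * reinterpreted as a signed int and then sign-extended to 64 bits.
 */
uint64_t
demo_shift_sign_extended(uint16_t rev_min)
{
	int32_t as_int;

	assert(sizeof(int) == 4);	/* illustration assumes 32-bit int */
	as_int = (int32_t)((uint32_t)rev_min << 16);
	return ((uint64_t)(int64_t)as_int);
}

/* Widening before the shift keeps the upper 32 bits clear. */
uint64_t
demo_shift_widened(uint16_t rev_min)
{
	return ((uint64_t)rev_min << 16);
}
```

Any input with bit 15 set, for example 0x8000, makes the two functions
disagree in the upper 32 bits.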
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Driver description should be set by core and not by the Ethernet driver.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
A command is either polling or event driven and the mode cannot change
during execution of a command. Make sure the event handler only handles
commands which are not polled. This is done by checking the command mode
in the command handler before completing commands.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This counter will represent transmitted packets which have more than
1518 octets.
The NIC has multiple hardware counters for counting transmitted
packets larger than 1518 octets. Each counter counts the packets
in a specific range.
We accumulate those counters to have a single counter.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When the mlx5 health mechanism detects a problem while the driver
is in the middle of init_one or remove_one, the driver needs to prevent
the health mechanism from scheduling future work; if future work
is scheduled, there is a problem with use-after-free: the system WQ
tries to run the work item (which has been freed) at the scheduled
future time.
Prevent this by disabling work item scheduling in the health mechanism
when the driver is in the middle of init_one() or remove_one().
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
All other mlx5_events report the port number as 1 based, which is how FW
reports it in the port event EQE. Reporting 0 for this event causes
mlx5_ib to not raise a fatal event notification to registered clients
due to a seemingly invalid port.
All switch cases in mlx5_ib_event that go through the port check are
supposed to set the port now, so just do it once at variable
declaration.
Linux commit:
aba462134634b502d720e15b23154f21cfa277e5
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The user can provide a very large cqe_size which will cause an integer
overflow.
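
A generic way to reject such input is to test the multiplication before
performing it. The sketch below is an assumption about the shape of the
check, not a copy of the referenced fix; demo_cq_buf_size() and its
parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store entries * cqe_size in *bytes and return true when the product
 * fits in size_t; return false when it would overflow.
 */
bool
demo_cq_buf_size(size_t entries, size_t cqe_size, size_t *bytes)
{
	assert(bytes != NULL);
	if (cqe_size != 0 && entries > SIZE_MAX / cqe_size)
		return (false);
	*bytes = entries * cqe_size;
	return (true);
}
```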
Linux commit:
28e9091e3119933c38933cb8fc48d5618eb784c8
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
If power exceeds the slot limit, or the slot limit is unknown, the ConnectX-6
firmware will shut down its port.
Inform the user via debug message.
MFC after: 3 days
Approved by: hselasky (mentor), kib (mentor)
Sponsored by: Mellanox Technologies
The receive side scaling stride parameter is a value which defines the interval
between active receive side queues. The traffic for the inactive queues is
redirected to the nearest active queue by use of modulus. The default value
of this parameter is one, which means all receive side queues are used.
The point of this feature is to redirect more traffic to fewer receive side
queues in order to take more advantage of sorted large receive offload,
sorted LRO. The sorted LRO works better when more packets are accumulated
per service interval.
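
One plausible reading of the redirection is sketched below: a queue
index is rounded down to the nearest multiple of the stride. The exact
mapping used by the driver is not stated here, so the formula and the
name demo_rss_active_queue() are assumptions.

```c
#include <assert.h>

/* Map a receive queue index to the active queue that serves it. */
unsigned int
demo_rss_active_queue(unsigned int queue, unsigned int stride)
{
	assert(stride > 0);
	return (queue - (queue % stride));
}
```

With a stride of one every queue maps to itself; with a stride of four,
queues 4 through 7 all land on queue 4.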
MFC after: 3 days
Approved by: re (marius)
Sponsored by: Mellanox Technologies
According to the PRM, no more than 0x3F data segments, DS, of size 16 bytes are
allowed.
Worst case scenario summary of DS usage:
Header is fixed: 2 DS
Maximum inlining: 98 => (98 - 2) / 16 = 6 DS
Remainder: 0x3F - 2 - 6 = 55 DS (mbuf frags)
Previously a value of 56 DS was used and this would work in the
normal case because not all inline data area was used up.
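
The budget above can be checked mechanically. The sketch below only
restates the arithmetic from the summary; the constant and function
names are hypothetical and do not come from the driver.

```c
#include <assert.h>

enum {
	DEMO_MAX_DS = 0x3F,	/* PRM limit quoted above */
	DEMO_DS_SIZE = 16,	/* bytes per data segment */
	DEMO_HEADER_DS = 2,	/* fixed header */
	DEMO_MAX_INLINE = 98	/* worst case inline bytes */
};

/* Data segments consumed by inline data, as in the summary above. */
int
demo_inline_ds(int inline_bytes)
{
	assert(inline_bytes >= 2);
	return ((inline_bytes - 2) / DEMO_DS_SIZE);
}

/* Data segments left for mbuf fragments in the worst case. */
int
demo_max_frag_ds(void)
{
	return (DEMO_MAX_DS - DEMO_HEADER_DS -
	    demo_inline_ds(DEMO_MAX_INLINE));
}
```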
MFC after: 3 days
Approved by: re (marius)
Sponsored by: Mellanox Technologies
Query the minimal inline mode supported by the card.
When creating a send queue, cache the queried mode and optimize the transmit
if no inlining is required. In this case, we can avoid touching the headers
cache line and avoid dirtying several more lines by copying headers into
the send WQEs. Also, if no inline headers are used, hardware assists in
the VLAN tag framing.
Submitted by: kib@, slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
For setups having a large amount of PCI devices, it makes sense to limit the
number of MSIX vectors per PCI device, in order to avoid running out of IRQ
vectors.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The scatter list is formed by the chunks of MCLBYTES each, and larger
than default packets are returned to the stack as the mbuf chain.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies
To access the data, set sysctl dev.mce.N.conf.debug_stats to 1.
This enables the sysctl node dev.mce.N.hw_ctx_debug. Its content is
the mapping of each channel's number to the receive queue in use and its
associated completion queue, and to the set of transmit queue numbers and
their corresponding completion queues.
Trimmed example output:
channel 30 rq 188 cq 1085
channel 30 tc 0 sq 187 cq 1084
channel 31 rq 191 cq 1087
channel 31 tc 0 sq 190 cq 1086
MFC after: 1 week
Sponsored by: Mellanox Technologies
Device detach and setting error state may deadlock over the interface mutex
like this:
a) Detach code in mlx5en waits until error state is set while the interface
mutex is locked.
b) The set error handler needs to lock the interface mutex before it can
set the error state.
The solution is to use atomics to set the error state.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When resetting mlx5core instances it can happen that the order of attach and
detach for mlx5ib instances is changed. Take the unit number for mlx5_%d from
the parent PCI device, similarly to what is done in mlx5en(4), so that there
is a direct relationship between mce<N> and mlx5_<N>.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The DSCP feature is controlled using a set of sysctl(8) fields under
the qos sysctl directory entry for mlx5en(4).
For Routable RoCE QPs, the DSCP should be set in the QP's address path.
The DSCP's value is derived from the traffic class.
Linux commit:
ed88451e1f2d400fd6a743d0a481631cf9f97550
MFC after: 1 week
Sponsored by: Mellanox Technologies
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.
The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.
Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.
The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware rate limiting feature is enabled by the RATELIMIT kernel
option. Please refer to ifconfig(8) and the txrtlmt option and the
SO_MAX_PACING_RATE set socket option for more information. This
feature is compatible with hardware TCP segmentation offload, TSO.
A set of sysctl(8) knobs under dev.mce.<N>.rate_limit are provided to
set up the ratelimit table and also to fine tune various rate limit
related parameters.
Sponsored by: Mellanox Technologies
According to the 802.1Q-2014 9.6 VLAN Tag Control Information, VID value 0
means that there is no VLAN tag assigned to the packet, and only PCP and
DEI values from the tag are meaningful. The current flow table programming
filters out such packets.
When programming the VLAN filter for the flow table, unconditionally add a rule which
accepts packets with VLAN ID 0. The packets are already handled correctly
by the network stack.
Reviewed by: hselasky, slavash
Sponsored by: Mellanox Technologies
MFC after: 1 week
Make sure the command completion handler is not called when the device is
in internal error state. This can easily trigger use after free situations.
MFC after: 3 days
Sponsored by: Mellanox Technologies
During health care IRQ resources will be reallocated.
Newbus requires that Giant is locked before accessing
these resources.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Firmware dump collection should be triggered when a firmware syndrome
is reported with the request-for-reset bit set.
MFC after: 3 days
Submitted by: slavash@
Sponsored by: Mellanox Technologies
- Move the semaphore locking and unlocking to the same function.
- Flags are no longer needed if the reset and crdump will be done in the
same function.
MFC after: 3 days
Submitted by: slavash@
Sponsored by: Mellanox Technologies
- Move firmware dump prep and cleanup to init_one() and remove_one() so that
the init and cleanup will happen only upon driver reload.
- Add some prints to indicate firmware dump.
MFC after: 3 days
Submitted by: slavash@
Sponsored by: Mellanox Technologies
The old code checked for MLX5_CR_SPACE_DOMAIN which is irrelevant here.
However, if dev->vsec_addr would be 0, an access to wrong offset would
happen.
MFC after: 3 days
Submitted by: slavash@
Sponsored by: Mellanox Technologies
This fixes 32-bit compat (no ioctl command definitions are required
as struct ifreq is the same size). This is believed to be sufficient to
fully support ifconfig on 32-bit systems.
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 week
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14900
error state.
If the device is in internal error state the hardware will not
generate completions. Just move on to destroy the resources.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Change page cleanup flow when in internal error to properly decrement
the page counts when reclaiming pages. That prevents timing out
waiting for extra pages that were actually cleaned up previously.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a PCI error is detected the PCI state could be corrupt, so don't
save it in that flow. Save the state after initialization. After
restoring the PCI state during slot reset save it again, because restoring
the state destroys the previously saved state info.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Since the FW can be shared between PCI functions it is common that
more than one health poll will detect a failure, which can lead to
multiple resets.
The solution is to use a FW locking mechanism using semaphore space to
provide a way to synchronize between functions. The FW semaphore is
acquired via config cycle access. First the VSEC gateway must be
acquired, then the semaphore can be locked by writing a value to it
and confirmed as locked by reading the same value back. The process
is the same to free the semaphore, except the value written should be
zero.
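
The write-then-read-back handshake can be sketched against a simulated
semaphore word. Everything below is hypothetical: the real semaphore is
reached through the VSEC gateway, and the rule that a non-zero write
only sticks when the word is free is an assumption made so that the
example can show a failed lock attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated semaphore word; 0 means free. */
struct demo_sem {
	uint32_t	owner;
};

/* Assumed device behavior: a non-zero write only sticks when free. */
void
demo_sem_write(struct demo_sem *s, uint32_t val)
{
	if (s->owner == 0 || val == 0)
		s->owner = val;
}

uint32_t
demo_sem_read(const struct demo_sem *s)
{
	return (s->owner);
}

/* Lock: write our value, then confirm by reading the same value back. */
bool
demo_sem_lock(struct demo_sem *s, uint32_t id)
{
	assert(id != 0);
	demo_sem_write(s, id);
	return (demo_sem_read(s) == id);
}

/* Unlock: same sequence, but the value written is zero. */
bool
demo_sem_unlock(struct demo_sem *s)
{
	demo_sem_write(s, 0);
	return (demo_sem_read(s) == 0);
}
```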
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
If a FW assert is considered fatal, indicated by a new bit in the
health buffer, reset the FW. After the reset, follow the normal
recovery flow.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Some mlx5 adapter firmware allows the driver to reset the firmware in
the event of an error. When a software reset is issued on any physical
function all PFs enter reset state. This is a recoverable condition.
The existing recovery flow was designed to allow the recovery of a
VF after a PF driver reload. This patch expands the scope of that
flow to recover PFs or VFs after a SW reset has been issued.
When a software reset is issued the following occurs:
1. The NIC interface mode is set to SW_RESET (7) while the reset is in
progress.
2. Once the reset completes the NIC interface mode is set to NIC
disabled (1).
After the reset has been issued (added in a subsequent patch) the
health poll for other functions will detect that the NIC interface
state has been set to disabled. This will cause it to enter the
existing recovery flow. If the PCI is still working (meaning it
doesn't return 0xff on all reads) it means recovery can proceed
immediately instead of waiting 60 seconds.
The error detection has also been refactored to avoid incorrect or
misleading log messages.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
When mlx5_enter_error_state() operation is forced by shutdown, the
messages surrounding setting the error state are not informational
and confuse users.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies
This patch accumulates the following Linux commits:
- 8812c24d28f4972c4f2b9998bf30b1f2a1b62adf
net/mlx5: Add fast unload support in shutdown flow
- 59211bd3b6329c3e5f4a90ac3d7f87ffa7867073
net/mlx5: Split the load/unload flow into hardware and software flows
- 4525abeaae54560254a1bb8970b3d4c225d32ef4
net/mlx5: Expose command polling interface
Submitted by: Matthew Finlay <matt@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
This patch accumulates the following Linux commits:
- 04c0c1ab38e95105d950db5b84e727637e149ce7
net/mlx5: PCI error recovery health care simulation
- 0179720d6be2096b8d0a4d143254ff9e77747daa
net/mlx5: Introduce trigger_health_work function
- 3fece5d676939f42f434c63dfe1bd42d7d94e6f0
net/mlx5: Continue health polling until it is explicitly stopped
Submitted by: Matthew Finlay <matt@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
The mlx5e_destroy_ifp() function may be called from the system workqueue and
in this case trying to flush all works will cause a deadlock.
Instead of using the system workqueue, create a designated workqueue
for each mlx5en(4) device instance.
Submitted by: slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
There is a difference when parsing a completion entry between Ethernet
and IB ports. When link layer is Ethernet the bits describe the type of
L3 header in the packet. In the case when link layer is Ethernet and VLAN
header is present the value of SL is equal to the 3 UP bits in the VLAN
header. If VLAN header is not present then the SL is undefined and consumer
of the completion should check if IB_WC_WITH_VLAN is set.
While at it, this patch also fills the vlan_id field in the completion if
present.
linux commit 12f8fedef2ec94c783f929126b20440a01512c14
MFC after: 1 week
Sponsored by: Mellanox Technologies
mlx5core.
Do not consider the inability to create a firmware dump fatal, but
inform about the situation and allow the driver to attach. The device
might not implement the needed VSC, or we might not know the layout of
the registers map. In either case, only firmware dump functionality is
limited, the network operations should be fine.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies
When the mlx5en(4) driver was converted to using BUSDMA(9) the call to
m_defrag() was moved after the part of the TX routine that strips the
header from the mbuf chain. Before it called m_defrag it first trimmed
off the now-empty mbufs from the start of the chain. This has the side
effect of also removing the head of the chain that has M_PKTHDR set.
m_defrag() will not defrag a chain that does not have M_PKTHDR set,
thus it was effectively never defragging the mbuf chains.
As it turns out, trimming the mbufs in this fashion is unnecessary since
the call to bus_dmamap_load_mbuf_sg doesn't map empty mbufs anyway, so
remove it.
Differential Revision: https://reviews.freebsd.org/D12050
Submitted by: mjoras@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Set and report vport MTU rather than physical MTU.
The driver will set both vport and physical port MTU
and will rely on the query of vport MTU.
SRIOV VFs have to report their MTU to their vport manager (PF),
and this will allow them to work with any MTU they need
without failing the request.
Also for some cases where the PF is not a port owner, PF can
work with MTU less than the physical port mtu if set physical
port mtu didn't take effect.
Based on Linux upstream commit:
cd255efff9baadd654d6160e52d17ae7c568c9d3
Submitted by: Meny Yossefi <menyy@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Currently the ifnet interface is named mceX, where X is a monotonically
incremented value. If the device is reset due to a fatal error, then the
interface name will change. Using the device unit number will keep the
naming consistent across the reset logic.
Submitted by: Matthew Finlay <matt@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
ConnectX-4/5 devices in mlx5core.
The dump is obtained by reading a predefined register map from the
non-destructive crspace, accessible by the vendor-specific PCIe
capability (VSC). The dump is stored in preallocated kernel memory and
managed by the mlx5tool(8), which communicates with the driver using a
character device node.
The utility allows storing the dump in the format
<address> <value>
into a file, to reset the dump content, and to manually initiate the
dump.
A call to mlx5_fwdump() should be added at the places where a dump
must be fetched automatically. The most likely place is right before a
firmware reset request.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add the ability to access the vendor specific space gateway in order
to support reading and writing data into the different configuration
domains.
Submitted by: Matthew Finlay <matt@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add support for PFC and implement reading the per priority statistics
using the sysctl(8) interface. PFC is used together with VLAN priority
and can be enabled and disabled on a per priority basis.
Global pause frames and PFC are incompatible features and surrounding
logic has been added to warn the user about misconfiguration.
Update relevant mlx5core APIs for PFC configuration.
MFC after: 1 week
Sponsored by: Mellanox Technologies
ECN configuration and statistics are available through a set of sysctl(8)
nodes under sys.class.infiniband.mlx5_X.cong. The ECN configuration
nodes can also be used as loader tunables.
MFC after: 1 week
Sponsored by: Mellanox Technologies
This patch accumulates the following Linux commits:
mlx5_health.c
- 78ccb25861d76a8fc5c678d762180e6918834200
mlx5_core: Fix wrong name in struct
- 171bb2c560f45c0427ca3776a4c8f4e26e559400
mlx5_core: Update health syndromes
- 0144a95e2ad53a40c62148f44fb0c1f9d2a0d1e9
mlx5_core: Use accessor functions to read from device memory
- ac6ea6e81a80172612e0c9ef93720f371b198918
mlx5_core: Use private health thread for each device
- fd76ee4da55abb21babfc69310d321b9cb9a32e0
mlx5_core: Fix internal error detection conditions
- 2241007b3d783cbdbaa78c30bdb1994278b6f9b9
mlx5: Clear health sick bit when starting health poll
- 712bfef60912d91033cb25739f7444d5b8d8c59f
mlx5: Fix version printout in case of health issue
- 89d44f0a6c732db23b219be708e2fe1e03ee4842
mlx5_core: Add pci error handlers to mlx5_core driver
mlx5_cmd.c
- be87544de8df2b1eb34bcb5e32691287d96f9ec4
mlx5_core: Fix async commands return code
- a31208b1e11df334d443ec8cace7636150bb8ce2
mlx5_core: New init and exit flow for mlx5_core
- 020446e01eebc9dbe7eda038e570ab9c7ab13586
mlx5_core: Prepare cmd interface to system errors handling
- 89d44f0a6c732db23b219be708e2fe1e03ee4842
mlx5_core: Add pci error handlers to mlx5_core driver
- 0d834442cc247c7b3f3bd6019512ae03e96dd99a
mlx5: Fix teardown errors that happen in pci error handler
mlx5_main.c
- 5fc7197d3a256d9c5de3134870304b24892a4908
mlx5: Add pci shutdown callback
Submitted by: Matthew Finlay <matt@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add support for mapping priority to traffic class via sysctl
Submitted by: Slava Shwartsman <slavash@mellanox.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Factor out port speed definitions into new port.h header file,
similarly as done in Linux upstream.
- Correct two existing port speed definitions in mlx5en according to
Linux upstream.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Adding an interface might be done outside the device_attach() routine
and will then cause a panic, due to the VNET not being set.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The current implementation does not handle timeout in case of command
with callback request, and this can lead to deadlock if the command
doesn't get firmware response. Add delayed callback timeout work
before posting the command to firmware. In case of real firmware
command completion we will cancel the delayed work. In case of
firmware command timeout the callback timeout handler will be called
and it will simulate firmware completion with timeout error.
linux commit 65ee67084589c1783a74b4a4a5db38d7264ec8b5
MFC after: 1 week
Sponsored by: Mellanox Technologies
Call command completion handler in case of timeout when working in
interrupts mode. Avoid flushing the commands workqueue after acquiring
the semaphores to prevent a potential deadlock.
linux commit 9cba4ebcf374c3772f6eb61f2d065294b2451b49
MFC after: 1 week
Sponsored by: Mellanox Technologies
Creating a UD address handle from user-space or from the kernel-space,
when the link layer is ethernet, requires resolving the remote L3
address into a L2 address. Doing this from the kernel is easy because
the required ARP(IPv4) and ND6(IPv6) address resolving APIs are readily
available. In userspace such an interface does not exist and kernel
help is required.
It should be noted that in an IP-based GID environment, the GID itself
does not contain all the information needed to resolve the destination
IP address. For example information like VLAN ID and SCOPE ID, is not
part of the GID and must be fetched from the GID attributes. Therefore
a source GID should always be referred to as a GID index. Instead of
going through various racy steps to obtain information about the
GID attributes from user-space, this is now all done by the kernel.
This patch optimises the L3 to L2 address resolving using the existing
create address handle uverbs interface, retrieving back the L2 address
as an additional user-space information structure.
This commit combines the following Linux upstream commits:
IB/core: Let create_ah return extended response to user
IB/core: Change ib_resolve_eth_dmac to use it in create AH
IB/mlx5: Make create/destroy_ah available to userspace
IB/mlx5: Use kernel driver to help userspace create ah
IB/mlx5: Report that device has udata response in create_ah
MFC after: 1 week
Sponsored by: Mellanox Technologies
checks to recognize own network devices when using mlx5ib. This patch fixes
an issue where mlx5ib fails to recognize mceX network devices for use with
RoCE.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The IA32 memory model guarantees that all writes are seen in the program
order. Also, any access to the uncacheable memory flushes the store
buffers. As the consequence, SFENCE instruction is (almost) never needed,
in particular, it is not needed to ensure the correct order of updates as
seen by a PCIe device.
Use atomic_thread_fence_rel() instead of wb() to only emit compiler barriers
on x86 there. Other architectures get the right barrier instruction as
well.
Reviewed by: hselasky
Sponsored by: Mellanox Technologies
MFC after: 1 week
Driver support is only provided for ConnectX4/5.
The system-time timestamp is calculated from the free-running counter
timestamp provided by hardware. The driver periodically samples the
counter to calibrate it against the system clock and uses linear
interpolation to convert. The crystal which drives the clock is stable
to within +-50 ppm at the operating temperature, which makes the
algorithm good enough.
The calculation is somewhat delicate because all values are 64-bit and
the naive formula for linear interpolation overflows. The calculation
therefore drops the least significant bits in advance, see the PREC
shift in mlx5_mbuf_tstmp().
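The idea can be sketched as follows. This is not the code from
mlx5_mbuf_tstmp(); the structure, the function name and the PREC_SHIFT
value of 10 are assumptions made for illustration only. Both interval
lengths are shifted right before the multiplication, which preserves
their ratio while leaving headroom in the 64-bit product.

```c
#include <stdint.h>

#define PREC_SHIFT 10	/* assumed number of low bits to drop */

/* Hypothetical calibration record holding the two latest samples. */
struct clk_calib {
	uint64_t base_prev;	/* system time at previous sample, ns */
	uint64_t base_curr;	/* system time at latest sample, ns */
	uint64_t clk_prev;	/* hardware counter at previous sample */
	uint64_t clk_curr;	/* hardware counter at latest sample */
};

static uint64_t
hw_to_sys_ns(const struct clk_calib *c, uint64_t hw)
{
	/* Interval lengths with the least significant bits dropped. */
	uint64_t dt = (c->base_curr - c->base_prev) >> PREC_SHIFT;
	uint64_t dc = (c->clk_curr - c->clk_prev) >> PREC_SHIFT;
	/* Counter ticks elapsed since the previous sample. */
	uint64_t off = hw - c->clk_prev;

	if (dc == 0)
		return (c->base_prev);	/* no usable calibration yet */

	/* Linear interpolation: base_prev + off * (dt / dc). */
	return (c->base_prev + off * dt / dc);
}
```

The sketch still assumes that off * dt fits in 64 bits, which holds as
long as the stamp lies close to the calibration interval.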
Hardware stamps can be turned off by 'ifconfig mceN -hwrxtsmp'. Buggy
firmware might result in small but visible errors in the reported
timestamps, detectable e.g. by nonsensical (negative) RTT values for
LAN pings.
Reviewed by: gallatin, hselasky
Sponsored by: Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D12638
Different compilers may optimise the enum type in different ways. This ensures
coherency when range checking the value of enums in ibcore.
Sponsored by: Mellanox Technologies
MFC after: 1 week
The hardware is capable of 2 requestor endianness modes for standard 8
byte atomics: BE (0x0) and host endianness (0x1). Read the supported
modes from hca atomic capabilities and configure HW to host endianness
mode if supported.
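A minimal sketch of the selection logic, assuming the capability query
has already been reduced to a single boolean. The enum and function
names are hypothetical; only the mode values 0x0 and 0x1 come from the
description above.

```c
#include <stdbool.h>

/* Requestor endianness modes for standard 8 byte atomics. */
enum req_endianness {
	REQ_ENDIANNESS_BE = 0x0,	/* big endian */
	REQ_ENDIANNESS_HOST = 0x1	/* host endianness */
};

/* Prefer host endianness when the device reports support for it. */
static enum req_endianness
pick_req_endianness(bool host_mode_supported)
{
	return (host_mode_supported ?
	    REQ_ENDIANNESS_HOST : REQ_ENDIANNESS_BE);
}
```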
Sponsored by: Mellanox Technologies
MFC after: 1 week
the coming ibcore and mlx5ib updates in order to support traffic redirection
to so-called raw ethernet QPs.
Remove unused E-switch related routines and files while at it.
Sponsored by: Mellanox Technologies
MFC after: 1 week
iWarp and RoCE in ibcore. The selection of RDMA_PS_TCP cannot be used
to indicate iWarp protocol use. Backport the proper IB device
capabilities from Linux upstream to distinguish between iWarp and
RoCE. Only allocate the additional socket required for iWarp for RDMA
IDs when at least one iWarp device is present. This resolves
interoperability issues between iWarp and RoCE in ibcore.
Reviewed by: np @
Differential Revision: https://reviews.freebsd.org/D12563
Sponsored by: Mellanox Technologies
MFC after: 3 days
Remote DMA over Converged Ethernet, RoCE, for the ConnectX-4 series of
PCI Express network cards.
There is currently no user-space support and this driver only supports
kernel side non-routable RoCE V1. The krping kernel module can be used
to test this driver. Full user-space support including RoCE V2 will be
added as part of the ongoing upgrade to ibcore from Linux 4.9. Otherwise
this driver is feature equivalent to mlx4ib(4). The mlx5ib(4) kernel
module will only be built when WITH_OFED=YES is specified.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
input errors in the mlx5en(4) driver. This improves the sysadmin view of
physical port errors.
Submitted by: gallatin@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Code inspection reveals that the busdma unload and free functions
do not write to the associated DMA tag and do not need to be
serialized. This allows mlx5_fwp_free() to be called from
software interrupt context.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add io_mapping_init_wc() and add a third (unused) parameter to
io_mapping_map_wc().
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11286
The MLX5 driver has four different types of DMA allocations which are
now allocated using busdma:
1) The 4K firmware DMA-able blocks. One busdma object per 4K allocation.
2) Data for firmware commands use the 4K firmware blocks split into four 1K blocks.
3) The 4K firmware blocks are also used for doorbell pages.
4) The RQ-, SQ- and CQ- DMA rings. One busdma object per allocation.
After this patch the mlx5en driver can be used with DMAR enabled in
the FreeBSD kernel.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- When the device disappears from PCI, indicate the error device state and:
1) Trigger command completion for all pending commands
2) Prevent new commands from executing and return:
- success for modify and remove/cleanup commands
- failure for create/query commands
3) When reclaiming pages for a device in error state, do not ask FW to
return all given pages; just release the allocated memory
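The policy in step 2 can be sketched as a small lookup. The command
classes, the function name and the choice of EIO as the failure code are
assumptions for illustration; the driver classifies real firmware
opcodes and may use different status values.

```c
#include <errno.h>

/* Hypothetical coarse classification of firmware commands. */
enum cmd_class {
	CMD_CREATE,
	CMD_QUERY,
	CMD_MODIFY,
	CMD_DESTROY
};

/*
 * Status to report for a new command while the device is in the
 * error state, without involving the firmware.
 */
static int
cmd_status_in_error_state(enum cmd_class cls)
{
	switch (cls) {
	case CMD_MODIFY:
	case CMD_DESTROY:
		/* Let teardown paths make progress. */
		return (0);
	case CMD_CREATE:
	case CMD_QUERY:
	default:
		/* Nothing can be created on or read from a dead device. */
		return (-EIO);
	}
}
```

Reporting success for modify and remove/cleanup commands lets callers
unwind their state normally even though the hardware is gone.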
MFC after: 1 week
Sponsored by: Mellanox Technologies
PCI device(s), changes:
- alloc_entry() now clears bit for page slot entry as well
- update of cmd->ent_arr[] is now under cmd->alloc_lock
- complete command if alloc_entry() fails
MFC after: 1 week
Sponsored by: Mellanox Technologies
By default reading the diagnostic counters is disabled. The firmware
decides which counters are supported and only those supported show up
in the dev.mce.X.diagnostics sysctl tree.
To enable reading of diagnostic counters set one or more of the
following sysctls to one:
dev.mce.X.conf.diag_general_enable=1
dev.mce.X.conf.diag_pci_enable=1
MFC after: 1 week
Sponsored by: Mellanox Technologies
consistent return values from the mlx5e_sq_has_room_for()
function. The two counters are incremented by different threads under
different locks.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Add new sysctl node to control the transmit packet bufring.
- Add optimised version of the transmit routine which outputs packets
directly to the DMA ring instead of using bufring in case the transmit
lock is congested. This can reduce the number of task switches, which in
turn influences the overall system CPU usage, depending on the
workload.
- Add " TX" suffix to debug name for transmit mutexes to silence some
witness warnings about acquiring duplicate locks having the same name.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Suggested by: gallatin @
Add a separate state variable to track if a sendqueue is stopped or not.
This will prevent traffic from entering the sendqueue while it is
being destroyed.
Update drain function to wait for traffic to be transmitted before
returning when the link state is active.
Add extra checks in transmit path for stopped SQs.
While at it:
- Use likely() for an mbuf pointer check.
- Remove redundant IFF_DRV_RUNNING check.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The firmware/hardware does not generate additional completion
events unless we post new buffers. Use a timer to try to post
more buffers in case we are temporarily out of mbufs. Otherwise
reception stops completely.
Sponsored by: Mellanox Technologies
MFC after: 1 week
When mlx5e_sq_xmit() returns an error code and the mbuf pointer is set,
we should not free the mbuf, because the caller will keep the mbuf in
the drbr. Make sure the mbuf pointer is correctly set upon function
exit.
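The ownership rule can be sketched in userland as below. struct pkt is
a hypothetical stand-in for an mbuf and sketch_xmit() is not the
driver's mlx5e_sq_xmit(); the error codes and the ring_full argument are
assumptions. The point is only that the pointer value on return tells
the caller whether it still owns the packet.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mbuf. */
struct pkt {
	size_t len;
};

/*
 * Contract sketched here:
 *  - returns 0:                 packet consumed, *pp set to NULL
 *  - returns error, *pp != NULL: caller still owns the packet
 *  - returns error, *pp == NULL: packet was dropped and freed
 */
static int
sketch_xmit(struct pkt **pp, int ring_full)
{
	struct pkt *p = *pp;

	if (ring_full)
		return (ENOBUFS);	/* *pp untouched, caller keeps it queued */

	if (p->len == 0) {
		free(p);		/* malformed packet: drop it here */
		*pp = NULL;
		return (EINVAL);
	}

	free(p);			/* stands in for handing off to hardware */
	*pp = NULL;
	return (0);
}
```

A caller would put the packet back on its queue only when an error is
returned and the pointer is still set.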
Sponsored by: Mellanox Technologies
MFC after: 1 week
The hardware MTU size can't be set to a value less than 1500 bytes due
to side-band management support. Allow setting the software MTU size
below 1500 bytes, thus creating a mismatch between hardware and
software MTU sizes.
Sponsored by: Mellanox Technologies
MFC after: 1 week
Reuse code to set up sendqueues where possible by making some static
functions global. Further split the mlx5e_close_sq_wait() function to
separate out reusable parts.
Sponsored by: Mellanox Technologies
MFC after: 1 week
This change also reduces the size of the mlx5e_sq structure so that the last
queue_state element fits into the previous cacheline, making the mlx5e_sq
structure one cacheline smaller on amd64.
Sponsored by: Mellanox Technologies
MFC after: 1 week
Make some functions and structures global to allow for code reuse
when creating rate limiting sendqueues.
Sponsored by: Mellanox Technologies
MFC after: 1 week
Move setting of CQ moderation mode together with the other
CQ moderation parameters. Pass completion event vector as
a separate argument to mlx5e_open_cq(), because its value is
different for each call. Pass mlx5e_priv pointer instead of
mlx5e_channel pointer so that code can be used by rate
limiting sendqueues.
Sponsored by: Mellanox Technologies
MFC after: 1 week
This change allows for reusing the transmit path for so-called
rate limited sendqueues. While at it, optimise some pointer lookups
in the fast path.
Sponsored by: Mellanox Technologies
MFC after: 1 week
- Add new firmware commands and update existing ones.
- Add more firmware related structures and update existing ones.
- Some minor fixes, like adding missing \n to some prints.
Sponsored by: Mellanox Technologies
MFC after: 1 week