As unp_internalize() processes the input control messages, it builds
an output mbuf chain containing the internalized representations of
those messages. In one special case, that of an empty SCM_RIGHTS
message, the message is simply discarded. However, the loop which
appends mbufs to the output chain assumed that each iteration would
produce an mbuf, resulting in a null pointer dereference if an empty
SCM_RIGHTS message was followed by a non-empty message.
Fix this by advancing the output mbuf chain tail pointer only if an
internalized control message was produced.
Reported by: syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This adds a TOE hook to allocate a KTLS session. It also recognizes
TLS mbufs in the socket buffer and sends those to the NIC using a TLS
work request to encrypt the record before segmenting it.
TOE TLS support must be enabled via the dev.t6nex.<N>.tls sysctl in
addition to enabling KTLS.
Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler. This also adds some counters
for active TOE sessions.
The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().
To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE. Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.
Reviewed by: np, gallatin
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21891
ABI already guarantees the direction is forward. Note this does not take care
of i386-specific cld's.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21906
The PCI block in the adapter requires this field to be set to a valid
queue ID. It is not clear why it did not fail on all machines, but
the effect was that crypto operations reading input data via DMA
failed with an internal PCI read error on machines with 128G or more
of RAM.
Reported by: gallatin
Reviewed by: np
MFC after: 3 days
Sponsored by: Chelsio Communications
As noted by the commit message, callouts are now persistant
and should not be in the auto-zero section of the RQ's and SQ's.
This fixes an assert when using the TX completion event
factor feature with mlx5en(4).
Found by: gallatin@
MFC after: 3 days
Sponsored by: Mellanox Technologies
protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous
if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(),
likely by a mistake. Later it was switched to NET_EPOCH_ENTER(). Then I
incorrectly added NET_EPOCH_ASSERT() here.
In reality ifp->if_addr never goes away and never changes its length. So,
doing bcopy() in it is always "safe", meaning it won't dereference a wrong
pointer or write into someone's else memory. Of course doing two bcopy() in
parallel would result in a mess of two addresses, but net epoch doesn't
protect against that, neither IF_ADDR_RLOCK() did.
So for now, just remove the assertion and leave for later a proper fix.
Reported by: markj
During a CoW fault, we must check for both 4KB and 2MB mappings before
clearing PGA_WRITEABLE on the old mapping's page. Previously we were
only checking for 4KB mappings. This was missed in r344106.
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
This matches the state prior to r353149 and fixes crashes with DRM
modules.
Reported and tested by: cy, garga, Krasznai Andras
Fixes: r353149 ("amd64 pmap: implement per-superpage locks")
Sponsored by: The FreeBSD Foundation
pmap_remove_l3() may remove the last mapping of a page, in which case
it must clear PGA_WRITEABLE.
Reported by: Jenkins, via lwhsu
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
As long as we support ZFS on 32-bit platforms we should do this for all
64-bit variables that are modified in a lockless fashion using atomic
operations. Otherwise, there is a risk of a reading a torn value.
Here is a rationale for why I am doing this in dmu_object_alloc_impl:
- it's very recent code
- the code deals with object IDs and a number of objects in a file
system can overflow 32 bits
- incorrect allocation of an object ID may result in hard to debug
problems
- fixing all plain reads of 64-bit atomic variables is not a trivial
undertaking to do in one shot, so I chose to do it incrementally
MFC after: 3 weeks
X-MFC after: r353301, r353176
Make sure the vnet_shutdown field is not set until after all
VNET_SYSUNINIT()'s in the SI_SUB_VNET_DONE subsystem have been
executed. Especially the vnet_if_return() functions requires that
if_move() is still operational.
Reported by: lwhsu@
MFC after: 1 week
Sponsored by: Mellanox Technologies
At the moment i386 does not provide 64-bit atomic operations in
userland. Exposing some atomic_*_64 defines can cause unnecessary
confusion.
Discussed with: kib
MFC after: 2 weeks
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.
Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
|
This adds two implementations for each atomic_fcmpset_ and atomic_cmpset_
short and char functions, selectable at compile time for the target
architecture. By default, it uses a generic shift-and-mask to perform atomic
updates to sub-components of 32-bit words from <sys/_atomic_subword.h>.
However, if ISA_206_ATOMICS is defined it uses the ll/sc instructions for
halfword and bytes, introduced in PowerISA 2.06. These instructions are
supported by all IBM processors from POWER7 on, as well as the Freescale/NXP
e6500 core. Although the e5500 and e500mc both implement PowerISA 2.06 they
do not implement these instructions.
As part of this, clean up the atomic_(f)cmpset_acq and _rel wrappers, by
using macros to reduce code duplication.
ISA_206_ATOMICS requires clang or newer binutils (2.20 or later).
Differential Revision: https://reviews.freebsd.org/D21682
Acquire the inp lock before checking whether the socket is already bound,
and around updates to the inp_vflag field.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21867
About 11 minutes of poudriere -s -j 104 and probing on return value of
trylocks reveals that over 10% of attempts fail, which in turn means
there are more atomics performed than necessary.
Trylocking was there to try preventing migration, but it's not very likely
to happen if the lock is uncontested.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21925
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.
Submitted by: Richard Scheffenegger
Reviewed by: bcr@
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21619
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof). Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs. The stats(3) framework can be used in
both userspace and the kernel.
See the stats(3) manual page for details.
This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.
The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.
Reviewed by: sef (earlier version)
Obtained from: Netflix
Relnotes: yes
Sponsored by: Klara Inc, Netflix
Differential Revision: https://reviews.freebsd.org/D20477
Remove the now obsolete vnet_state field. This greatly simplifies the
detection of VNET shutdown and avoids code duplication.
Discussed with: bz@
MFC after: 1 week
Sponsored by: Mellanox Technologies
The compatibility code for the atomic operations in ZFS code is a bit
messy. In some cases the native definitions are directly made
available, in some cases there are emulated operations in
opensolaris_atomic.c and in yet other cases there are atomic operations
implemented in assembly that were obtained from OpenSolaris / illumos.
This commit adds atomic_swap_64 for use with i386 userland.
The code is copied from illumos.
I am not sure why FreeBSD does not provide that operation natively.
Maybe because we try (or pretend) to support processors that did not
have the necessary instructions.
While here I also added atomic_load_64 for the same reasons.
This is original code based on iilumos atomic_swap_64 and FreeBSD
atomic_load_acq_64_i586.
Pointyhat to: avg
MFC after: 1 week
8423 8199 7432 Implement large_dnode pool feature
7432 Large dnode pool feature
8199 multi-threaded dmu_object_alloc()
8423 Implement large_dnode pool feature
10406 large_dnode changes broke zfs recv of legacy stream
llumos/illumos-gate@54811da5ac54811da5achttps://www.illumos.org/issues/8423https://www.illumos.org/issues/8199https://www.illumos.org/issues/7432illumos/illumos-gate@811964cd9f811964cd9fhttps://www.illumos.org/issues/10406
ZoL issues:
Improved dnode allocation #6564
Clean up large dnode code #6262
Fix dnode_hold() freeing dnode behavior #8172
Fix dnode allocation race #6414, #6439
Partial: Raw sends must be able to decrease nlevels #6821, #6864
Remove unnecessary txg syncs from receive_object() Closes#7197
This updates FreeBSD large_dnode code (that was imported from ZoL) to a
version that was committed to illumos. It has some cleanups,
improvements and fixes comparing to what we have in FreeBSD now.
I think that the most significant update is 8199 multi-threaded
dmu_object_alloc().
This commit reverts r351077 that was a revert of r351074 and r351076 and
restores those changes. Required atomic operations should be available
now on all platforms where we build ZFS.
Obtained from: illumos
MFC after: 3 weeks
DTS Import of Linux 5.3 added a patch that rework the L3 mmc instance
in the AM335x SoC but removed the status = 'disabled' on the node.
This cause the kernel to probe the device even if the board doesn't
have this mmc used and since we don't correctly activate the clock
for this module we panic with an external data abort.
Beaglebone(s) don't have this device anyway so simply disabling it.
Patch for the DTS was sent upstream.
https://patchwork.kernel.org/patch/11176921/
PR: 241089
Reported by: phk
Previously, the code used a plain store on platforms that lacked
atomic_swap_64 and possibly some other platforms as the condition worked
only if atomic_swap_64 was a macro.
MFC after: 1 week
X-MFC after: r353166, r353167
Some 32-bit platforms do not provide 64-bit atomic operations that ZFS
requires, either in userland or at all. We emulate those operations for
those platforms using a mutex. That is not entirely correct and it's
very efficient. Besides, the loads are plain loads, so torn values are
possible.
Nevertheless, the emulation seems to work for some definition of work.
This change adds atomic_swap_64, which is already used in ZFS code, and
atomic_load_64 that can be used to prevent torn reads.
MFC after: 1 week
According to ian, the only armv6 cpu we support is the 1176, so this
change is effectively a no-op.
The change is just to make the code more self-consistent.
The issue was noticed by a standalone module build for armv6.
Reviewed by: ian
MFC after: 3 weeks
no longer worked. The problem was that the defines used the
same space as the VLAN id. This commit does three things.
1) Move the LRO used fields to the PH_per fields. This is
safe since the entire PH_per is used for IP reassembly
which LRO code will not hit.
2) Remove old unused pace fields that are not used in mbuf.h
3) The VLAN processing is not in the mbuf queueing code. Consequently
if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing
for now until rack_bbr_common is updated to handle the VLAN properly.
Reported by: Brad Davis
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.
Reviewed by: jeff (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646
The current 256-lock sized array is a problem in the following ways:
- it's way too small
- there are 2 locks per cacheline
- it is not NUMA-aware
Solve these issues by introducing per-superpage locks backed by pages
allocated from respective domains.
This significantly reduces contention e.g. during poudriere -j 104.
See the review for results.
Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21833
As pointed out by mjg, without the parentheses the calculations done against
these macros are incorrect, resulting in only 1/3 of locks being used.
Reported by: mjg
that this path is taken by setting the tail pointer correctly.
There is still bug related to handling unordered unfragmented messages
which were delayed in deferred handling.
This issue was found by OSS-Fuzz testing the usrsctp stack and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794
MFC after: 3 days
SI_CHEAPCLONE was introduced in r66067 for use with cloned bpfs. It was
later also used in tty, tun, tap at points. The rough timeline for being
removed in each of these is as follows:
- r181690: bpf switched to use cdevpriv API by ed@
- r181905: ed@ rewrote the TTY later to be mpsafe
- r204464: kib@ removes it from tun/tap, declaring it unused
I've not yet been able to dig up any other consumers in the intervening 9
years. It is no longer set on any devices in the tree and leaves an
interesting situation in make_dev_sv where we're ok with the device already
being set SI_NAMED.
Attempting to initialize si_drv{1,2} with mda_si_drv{1,2} does not work if
you are operating on cloned devices.
clone_create must be called prior to the make_dev* family to create/return
the device on the clonelist as needed. This device is later returned early
in newdev(), prior to si_drv{0,1,2} initialization.
This patch simply breaks out of the loop if we've found a device and
finishes init.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21904
vn_write already checks for vnode type to see if bwillwrite should be called.
This effectively reverts r244643.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21905
We only support host mode so those functions are just added so
we won't panic when generic-{e,o}hci will set the phy to host mode.
MFC after: 1 month
X-MFC-With: r353062
CTLFLAG_STATS identifies a sysctl OID as statistical or informational,
as opposed to a configurable/tunable OID that changes behavior.
This can be used, for example, to verfiy that the kyua tests do not
modify configurable OIDs when allow_sysctl_side_effects is true.
Add CTLFLAG_STATS to all COUNTER_U64* OIDs.
I will add the flag to more OIDs in a few subsequent commits, to
facilitate MFC. The flag should be added to many more OIDs. I plan to
add it those that my test found and some nearby that looked obvious.
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
nvdimm_e820 is a newbus pseudo driver that looks for "legacy" e820 PRAM
spans and creates ordinary-looking SPA devfs nodes for them
(/dev/nvdimm_spaN).
As these legacy regions lack real NFIT SPA regions and namespace
definitions, they must be administratively sliced up externally using
device.hints. This is similar in purpose to the Linux memmap= mechanism.
It is assumed that systems with working NFIT tables will not have any use
for this driver, and that that will be the prevailing style going forward,
so if there are no explicit hints provided, this driver does not
automatically create any devices.
Reviewed by: kib (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21885
The registers in ilumos and FreeBSD have a different number.
In the illumos, last 32-bits register defined is SS an in FreeBSD is GS.
While translating register we should comper it to the highest one.
PR: 240358
Reported by: lwhsu@
MFC after: 2 weeks
Realistically, this cannot work. We don't allow the tun to be opened twice,
so it must be done via fd passing, fork, dup, some mechanism like these.
Applications demonstrably do not enforce strict ordering when they're
handing off tun devices, so the parent closing before the child will easily
leave the tun/tap device in a bad state where it can't be destroyed and a
confused user because they did nothing wrong.
Concede that we can't leave the tun/tap device in this kind of state because
of software not playing the TUNSIFPID game, but it is still good to find and
fix this kind of thing to keep ifconfig(8) up-to-date and help ensure good
discipline in tun handling.
MFC after: 3 days
During readdir() we guarantee that the tn_dir.tn_parent does not go
away, but it might be replaced by a parallel rename. Read tn_parent
only once, then use the cached value.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Create an attachment file for the existing ACPI attachment, and create a
new FDT attachment for the generic-ehci driver.
Submitted by: andrew (Original version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19389
Currently, if you do:
$ ifconfig tun0 create
$ ifconfig tun0 name wg0
$ ls -l /dev | egrep 'wg|tun'
You will see tun0, but no wg0. In fact, it's slightly more annoying to make
the association between the new name and the old name in order to open the
device (if it hadn't been opened during the rename).
Register an eventhandler for ifnet_arrival_events and catch interface
renames. We can determine if the ifnet is a tun easily enough from the
if_dname, which matches the cevsw.d_name from the associated tuntap_driver.
Some locking dance is required because renames don't require the device to
be opened, so it could go away in the middle of handling the ioctl, but as
soon as we've verified this isn't the case we can attempt to busy the tun
and either bail out if the tun device is dying, or we can proceed with the
rename.
We only create these aliases on a best-effort basis. Renaming a tun device
to "usbctl", which doesn't exist as an ifnet but does as a /dev, is clearly
not that disastrous, but we can't and won't create a /dev for that.
A future commit will create device aliases when a tuntap device is renamed
so that it's still easily found in /dev after the rename. Said mechanism
will want to keep the tun alive long enough to either realize that it's
about to go away or complete the alias creation, even if the alias is about
to get destroyed.
While we're introducing it, using it to prevent open devices from going away
makes plenty of sense and keeps the logic on waking up tun_destroy clean, so
we don't have multiple places trying to cv_broadcast unless it's still in
use elsewhere.
Four drivers (hpt27xx, hptmv, hptnr, hptrr, hpt27xx) include precompiled
binary objects; have users load them as modules if they are needed.
Additional work (i.e., integrating devmatch) required before MFC.
Reviewed by: markj
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21865
The feature is implemented as an extension of the existing
ZFS_IOC_RENAME ioctl. Both the userland and the DSL interfaces support
renaming only a single bookmark at a time. As of now, there is no ZCP
interface to the new functionality. I am going to add it once the DSL
interface passes a test of time.
This change picks up support for zfs_ioc_namecheck_t::ENTITY_NAME that
was added to ZoL as part of Redacted Send/Receive feature by Paul
Dagnelie <pcd@delphix.com>. This is needed to allow a bookmark name in
zc_name.
Discussed with: mahrens
Reviewed by: bcr (man page)
Sponsored by: CyberSecure
Differential Revision: https://reviews.freebsd.org/D21795
can handle. Instead using an array on node private data, use per-hook
private data.
- Use NG_NODE_FOREACH_HOOK() to traverse through hooks instead of array.
PR: 240787
Submitted by: Lutz Donnerhacke <lutz donnerhacke.de>
Differential Revision: https://reviews.freebsd.org/D21803
It was defined with the wrong MACHINE_ARCH previously. This permits
using an MFS image without defining MD_ROOT_SIZE which has various
benefits (one being that the build is able to treat the MFS image as
a dependency and properly re-link the kernel with the new image when
building with NO_CLEAN).
MFC after: 2 weeks
Sponsored by: DARPA
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL. Thus there is no need to increase the
buffer size here.
The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21876
Regression introduced in r343629 when malloc result was renamed from spa to
spa_mapping and the 'spa' name was instead used to iterate a table, but the
free() target was not updated.
Reviewed by: kib, scottph
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21871
Most of this diff is refactoring to reduce duplication between the different
acq_ and rel_ variants.
Differential Revision: https://reviews.freebsd.org/D21822
Most of this diff is refactoring to reduce duplication between the different
acq_ and rel_ variants.
Differential Revision: https://reviews.freebsd.org/D21822
Provide *cmpset_{8,16} as wrappers around atomic_fcmpset_32. Initial users
will be mips and sparc64, and perhaps parts of powerpc.
This are not for general consumption; machine/atomic.h should include this
header as needed to provide atomic_{,f}cmpset_{8,16} and machine/atomic.h
should provide acq_ and rel_ variants.
Reviewed by: jhibbits, kib
Differential Revision: https://reviews.freebsd.org/D21822
The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.
The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.
Reported by: syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21864
For ll/sc architectures, atomic(9) allows failure modes where *old == val
due to write failure and callers should compensate for this. Do not retry on
failure, just leave 0 in ret and fail the operation if we couldn't sc it.
This lets the caller determine if it should retry or not.
Reviewed by: kib
Looks ok: imp
Differential Revision: https://reviews.freebsd.org/D21836
The PRM suggests random 0 - 10ms to prevent multiple waiters on the same
interval in order to avoid starvation.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
Before attempting to initialize the command interface we must wait till
the fw_initializing bit is clear.
If we fail to meet this condition the hardware will drop our
configuration, specifically the descriptors page address. This scenario
can happen when the firmware is still executing an FLR flow and did not
finish yet so the driver needs to wait for that to finish.
Linux commits:
6c780a0267b8
b8a92577f4be.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in mlx5en(4) after r348254.
The unlimited send tags are shared amount multiple connections and are
not allocated per send tag allocation request. Only increment the refcount.
MFC after: 3 days
Sponsored by: Mellanox Technologies
It may happen during link down that the running state may be observed
non-zero in the transmit routine, right before the running state is
cleared. This may end up using a destroyed mutex.
Make all channel mutexes and callouts persistant.
Preserve receive and send queue statistics during link toggle.
MFC after: 3 days
Sponsored by: Mellanox Technologies
in mlx5core. The EEPROM information is not only a property of the
mlx5en(4) driver.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
The following sysctls are added:
dev.mce.N.conf.qos.cable_length
dev.mce.N.conf.qos.buffers_size
dev.mce.N.conf.qos.buffers_prio
Submitted by: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
All prints in mlx5core should use on of the macros:
mlx5_core_err/dbg/warn
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
In case of health counter fails to increment it indicates a bad device health.
In case when the syndrome indicated by firmware is 0x0, this indicates that
firmware is unable to respond to initialization segment reads.
Add proper print in this case.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
MPFS is a logical switch in the Mellanox device which forward packets
based on a hardware driven L2 address table, to one or more physical-
or virtual- functions. The physical- or virtual- function is required
to tell the MPFS by using the MPFS firmware commands, which unicast
MAC addresses it is requesting from the physical port's traffic.
Broadcast and multicast traffic however, is copied to all listening
physical- and virtual- functions and does not need a rule in the MPFS
switching table.
Linux commit: eeb66cdb682678bfd1f02a4547e3649b38ffea7e
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add the 512 bytes limit of RDMA READ and the size of remote address to the max
SGE calculation.
Submitted by: slavash@
Linux commit: 288c01b746aa
MFC after: 3 days
Sponsored by: Mellanox Technologies
When the software send queue gets filled up, callbacks to
if_transmit will stop. Make sure the transmit callback
routine checks the send queue and outputs any remaining
mbufs. Else the remaining mbufs may simply sit in the
output queue blocking the transmit path.
MFC after: 3 days
Sponsored by: Mellanox Technologies
kern_shm_open2(), since conception, completely fails to pass the mode along
to kern_shm_open(). This breaks most uses of it.
Add tests alongside this that actually check the mode of the returned
files.
PR: 240934 [pulseaudio breakage]
Reported by: ler, Andrew Gierth [postgres breakage]
Diagnosed by: Andrew Gierth (great catch)
Tested by: ler, tmunro
Pointy hat to: kevans
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it became
~124KB+/4KB+ in two 128KB log blocks to free space in the second block
for another record. Unfortunately in case of 128KB only writes, when space
in the second block remained unused, that change increased write latency by
imbalancing checksum computation time between parallel threads.
This change introduces new 68KB log block size, used for both writes below
67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB
block to not increase processing overhead. Writes above 131KB are still
using full 128KB blocks, since possible saving there is small. Mixed loads
will likely also fall back to previous 128KB, since code uses maximum of
the last 10 requested block sizes.
On a simple 128KB write test with queue depth of 1 this change demonstrates
~15-20% performance improvement.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This adds 8 and 16 bit versions of the cmpset and fcmpset functions. Macros
are used to generate all the flavors from the same set of instructions; the
macro expansion handles the couple minor differences between each size
variation (generating ldrexb/ldrexh/ldrex for 8/16/32, etc).
In addition to handling new sizes, the instruction sequences used for cmpset
and fcmpset are rewritten to be a bit shorter/faster, and the new sequence
will not return false when *dst==*old but the store-exclusive fails because
of concurrent writers. Instead, it just loops like ldrex/strex sequences
normally do until it gets a non-conflicted store. The manpage allows LL/SC
architectures to bogusly return false, but there's no reason to actually do
so, at least on arm.
Reviewed by: cognet