9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue
illumos/illumos-gate@20b5dafb42
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
9102 zfs should be able to initialize storage devices
The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.
illumos/illumos-gate@094e47e980
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
Nominal Mhz is either expressed via the clock-frequency property
or can be get via the clock property that holds the cpu clock.
Add support for the later.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D16346
The nvmem interface helps provider of nvmem data to expose themselves to consumer.
NVMEM is generally present on some embedded board in a form of eeprom or fuses.
The nvmem api are helpers for consumer to read/write the cell data from a provider.
Differential Revision: https://reviews.freebsd.org/D16419
OBJ_ONEMAPPING flag is set. In other words, allow recycling of existing
but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set.
Such situations are increasingly common with jemalloc >= 5.0. This
change has the expected effect of reducing the number of vm map entry and
object allocations and increasing the number of superpage promotions.
Reviewed by: kib, markj
Tested by: pho
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16501
The buffer descriptor list entries should be in little endian format. Byte swap
them on BE. This is the last piece of the puzzle for snd_hda(4) to work on
PowerPC.
The interface uses struct timespec, which needs a translation.
Reported and reviewed by: asomers
PR: 230175
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16525
This calls into the Arm Trusted Firmware to enable and disable the
workaround for the Speculative Store Bypass Disable (SSBD) issue, also
known as Spectre Variant 4.
As this may have a large performance overhead, and how exploitable SSBD is
is unknown we follow the Linux lead of allowing the administrator to select
between always on, always off, or only enabled in the kernel, with the
latter being the default.
PR: 228955
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15819
Else a NULL VNET pointer should be ignored. This fixes address resolving
when VIMAGE is disabled.
MFC after: 3 days
Sponsored by: Mellanox Technologies
r336940 introduced an "unused variable" warning on platforms which
support INET, but not INET6, like MALTA and MALTA64 as reported
by Mark Millard. Improve the #ifdefs to address this issue.
Sponsored by: Netflix, Inc.
illumos/illummos-gate@df477c0afa
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).
illumos/illumos-gate@adb52d9262
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.
illumos/illumos-gate@21f7c81cc1
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.
illumos/illumos-gate@a3b5583021
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.
illumos/illumos-gate@3a4b1be953
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.
illumos/illumos-gate@f78cdc34af
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.
illumos/illumos-gate@17f11284b4
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@b6bf6e1540
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
TCP/IPv4 allows an implicit connection setup using sendto(), which
is used for TTCP and TCP fast open. This patch adds support for
TCP/IPv6.
While there, improve some tests for detecting multicast addresses,
which are mapped.
Reviewed by: bz@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16458
When sending TCP segments from the timewait code path, a stored
value of the last sent window is used. Use the same code for
computing this in the timewait code path as in the main code
path used in tcp_output() to avoiv inconsistencies.
Reviewed by: rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16503
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close() the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.
Requested by: mmacy
These missing probe are mostly in the syncache and timewait code.
Reviewed by: markj@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16369
The CORB and RIRB buffers exist in DMA memory, but the device reads them as
little-endian only. Read and write as LE into the DMA memory block, to work on
BE platforms.
The latter matches the rest of the tree better [0]. The UPDATING entry has
been updated to reflect this, and the new tunable is now documented in
loader(8) [1].
Reported by: imp [0], Shawn Webb [1]
As noted in UDPATING, the new loader tunable efi.rt_disabled may be used to
disable EFIRT at runtime. It should have no effect if you are not booted via
UEFI boot.
MFC after: 6 weeks
Leading up to enabling EFIRT in GENERIC, allow runtime services to be
disabled with a new tunable: efi.rt_disabled. This makes it so that EFIRT
can be disabled easily in case we run into some buggy UEFI implementation
and fail to boot.
Discussed with: imp, kib
MFC after: 1 week
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
It's easy to confuse the error code as naked it looks decimal (EINVAL is
reported as error 16, instead of error 22, so first reading looks like EBUSY).
modules. It also fails in the same way, we are unable to relocate static
variables as the compiler uses PC-relative loads with nothing for the
kernel linker to relocate.
Sponsored by: DARPA, AFRL
This fixes the panic caused by deadlocking when grant-table free
callbacks are used.
The cause of the recursion is: check_free_callbacks() is always called
with the lock gnttab_list_lock held. In turn the callback function is
also called with the lock held. Then when the client uses any of the grant
reference methods which also attempt the lock the gnttab_list_lock
mutex from within the free callback a deadlock happens.
Fix this by making the gnttab_list_lock recursive.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Revision: https://reviews.freebsd.org/D16505
If gnttab_grant_foreign_access() fails for any of the indirection
pages, the code breaks out of both the loops without freeing the local
variable indirectpages, causing a memory leak.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16136
Length is an unsigned integer, so checking against < 0 doesn't make
sense. While there also make clear that a length of 0 always succeeds.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16045
sending an invalid segment into the reassembly
queue. This would happen if you enabled the
data after close option.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16453
Precompute the new PTE before entering the critical section.
Eliminate duplication of the pmap and pv list unlock operations in
pmap_enter() by implementing a single return path. Otherwise, the
duplication will only increase with the upcoming support for psind == 1.
Reviewed by: mmel
Tested by: mmel
Discussed with: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D16443
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access. Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
According to the man page, getrusage(2) should return EFAULT if the rusage
argument lies outside of the process's address space. But due to an
oversight in r100384, that's never been the case during 32-bit emulation.
Fix it.
PR: 230153
Reported by: tests(7)
Reviewed by: cem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16500
It allows locking or unlocking physical pages in memory within a jail
This allows running elasticsearch with "bootstrap.memory_lock" inside a jail
Reviewed by: jamie@
Differential Revision: https://reviews.freebsd.org/D16342
initial call to sched_balance() during startup is meant to initialize
balance_ticks, but does not actually do that since smp_started is
still zero at that time. Since balance_ticks does not get set,
there are no further calls to sched_balance(). Fix this by setting
balance_ticks in sched_initticks() since we know the value of
balance_interval at that time, and eliminate the useless startup
call to sched_balance(). We don't need to randomize the intial
value of balance_ticks.
Since there is now only one call to sched_balance(), we can hoist
the tests at the top of this function out to the caller and avoid
the overhead of the function call when running a SMP kernel on UP
hardware.
PR: 223914
Reviewed by: kib
MFC after: 2 weeks
I believe that a ReclaimComplete with rca_one_fs == TRUE is only
to be used after a file system has been transferred to a different
file server. However, RFC5661 is somewhat vague w.r.t. this and
the ESXi 6.7 client does both a ReclaimComplete with rca_one_fs == TRUE
and one with ReclaimComplete with rca_one_fs == FALSE.
Therefore, just ignore the rca_one_fs == TRUE operation and return
NFS_OK without doing anything instead of replying NFS4ERR_NOTSUPP.
This allows the ESXi 6.7 NFSv4.1 client to do a mount.
After discussion on the NFSv4 IETF working group mailing list, doing this
along with setting a flag to note that a ReclaimComplete with rca_one_fs TRUE
was an appropriate way to handle this.
The flag that indicates that a ReclaimComplete with rca_one_fs == TRUE was
done may be used to disable replies of NFS4ERR_GRACE for non-reclaim
state operations in a future commit.
This patch along with r332790, r334492 and r336357 allow ESXi 6.7 NFSv4.1 mounts
work ok. ESX 6.5 NFSv4.1 mounts do not work well, due to what I believe are
violations of RFC-5661 and should not be used.
Reported by: andreas.nagy@frequentis.com
Tested by: andreas.nagy@frequentis.com, daniel@ftml.net (earlier version)
MFC after: 2 weeks
Relnotes: yes
blocking vm map entry and object coalescing for the calling process.
However, there is no reason that mlockall(MCL_FUTURE) should block
such coalescing. This change enables it.
Reviewed by: kib, markj
Tested by: pho
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D16413
As discussed in arm@. This is a scaled back version of the prior
commit because xscale is overlaoded in places to mean armv5 or
similar. The OLD XSCALE stuff hasn't been useful in a while. The
original committer (cognet@) was the only one that had boards for
it. He's blessed this removal. Newer XSCALE (GUMSTIX) is for hardware
that's quite old. After discussion on arm@, it was clear there was no
support for keeping it.
Noticed by: andrew@
r336773 removed all things xscale. However, some things xscale are
really armv5. Revert that entirely. A more modest removal will follow.
Noticed by: andrew@
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).
Differential Review: https://reviews.freebsd.org/D16290
The OLD XSCALE stuff hasn't been useful in a while. The original
committer (cognet@) was the only one that had boards for it. He's
blessed this removal. Newer XSCALE (GUMSTIX) is for hardware that's
quite old. After discussion on arm@, it was clear there was no support
for keeping it.
Differential Review: https://reviews.freebsd.org/D16313
Add std.ralink to define common things across all ralink configs.
Add cpu, machine and options INTRNG to this file.
Remove RT1310.hints file reference: that file isn't in our tree.
This port hasn't been updated since it was committed, apart from
housekeeping. There's no known users, and the known hardware for
this port is too thin to run FreeBSD/arm these days well.
This also removes the last armv4 port. We've had no reports of armv4
systems working since FreeBSD 8. All the kernel support for armv4 has
not been removed since it's too intertwined with armv5 support (which
remains in the tree).
RelNotes: Yes
No objection from: arm@
The last known robust version of this code base was FreeBSD 8.2. There
are no users of this on current, and all users of it have abandoned
this platform or are in legacy mode with a prior version of FreeBSD.
All known users on arm@ approved this removal, and there were no
objections.
Differential Revision: https://reviews.freebsd.org/D16312
register to determine if trap is from userspace.
Otherwise if we jump to kernel address from userspace, then
TRAPF_USERMODE failed to detect usermode and then do_ast
triggers a panic "ast in kernel mode".
Reviewed by: markj@
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16469
Do not use vm_map_remove() to release KVA back to the system. Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables. Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().
Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.
While here, port r329171 to i386.
Reported by: alc
Reviewed by: alc, kib
X-MFC with: r336505
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16426
the AMD document 55449 'Revision Guide for AMD Family 17h Models
00h-0Fh Processors' rev 1.12.
The errata numbers are mentioned near each action.
It seems that newer BIOSes already include required chicken bits
settings, so the magic MSR updates are only needed when BIOS cannot be
updated. On the other hand, MWAIT avoidance seems to be important.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If a timer is updated (re-added) with a different time period
(specified in the .data field of the kevent), the new time period has
no effect; the timer will not expire until the original time has
elapsed. This violates the documented behavior as the kqueue(2) man
page says (in part) "Re-adding an existing event will modify the
parameters of the original event, and not result in a duplicate
entry."
This modification, adapted from a patch submitted by cem@ to PR214987,
fixes the kqueue system to allow updating a timer entry. The
kevent timer behavior is changed to:
* When a timer is re-added, update the timer parameters to and
re-start the timer using the new parameters.
* Allow updating both active and already expired timers.
* When the timer has already expired, dequeue any undelivered events
and clear the count of expirations.
All of these changes address the original PR and also bring the
FreeBSD and macOS kevent timer behaviors into agreement.
A few other changes were made along the way:
* Update the kqueue(2) man page to reflect the new timer behavior.
* Fix man page style issues in kqueue(2) diagnosed by igor.
* Update the timer libkqueue system test to test for the updated
timer behavior.
* Fix the (test) libkqueue common.h file so that it includes
config.h which defines various HAVE_* feature defines, before the
#if tests for such variables in common.h. This enables the use of
the actual err(3) family of functions.
* Fix the usages of the err(3) functions in the tests for incorrect
type of variables. Those were formerly undiagnosed due to the
disablement of the err(3) functions (see previous bullet point).
PR: 214987
Reported by: Brian Wellington <bwelling@xbill.org>
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D15778
There were some bits that were being set in cmd_flags (a field of AHCI's
command list structure) after cmd_flags was converted to little endian.
On a big endian host, such as PowerPC, this would set the wrong bits.
This was preventing AHCI driver from working on these hosts.
Reviewed by: jhibbits
Approved by: jhibbits (mentor)
read bias so we do reads in preference to TRIMs. This helps a lot when
many trims are delivered at once from the upper layers as they tend to
delay READs due to priority inversion in the code today.
The non iosched case will be fixed when the trim comibing changes
needed for nvme come in later this year.
Sponsored by: Netflix
In the V5 ABI the flags are EF_ARM_ABI_FLOAT_HARD and
EF_ARM_ABI_FLOAT_SOFT. The flags have the same values as the legacy GCC
flags EF_ARM_VFP_FLOAT and EF_ARM_SOFT_FLOAT respectively.
The legacy names are kept for compatibility.
Reported by: Peter Smith (Linaro)
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Start in "class" instead of "flow" mode. This eliminates the need to
specify an MTU, which is not available that early anyway. It also
allows the user to manually configure ch-rl rate limiting after attach.
This used to fail because ch-rl isn't supported if cl-rl "flow" mode is
configured.
Set all traffic classes to 1Gbps during initialization. The goal is to
start off with _any_ valid configuration and 1Gbps works even for
gigabit cards.
Sponsored by: Chelsio Communications
Previously, a step by PT_STEP resulted in no signal being raised to
the debugger so that a step was silently completed with the program
continuing to execute after the step. Fix by raising a SIGTRAP
signal with TRAP_TRACE as the signal code.
To simplify the error handling cases (if ptrace_clear_single_step()
fails, etc.) move the handling of PTRACE_BREAKPOINT into the
gdb_trapper() function. If ptrace_clear_single_step() fails,
gdb_trapper() won't claim the fault, and the default case of
SIGILL / ILL_OPC will be used.
Differential Revision: https://reviews.freebsd.org/D16100
It works excellent, but KDB disassembler and DTrace FBT provider for
RISC-V do lack support for it. They currently handle 4-byte instructions
only, while C-compressed ISA extension introduces 2-byte instructions
freely mixing them together.
So disable it for now.
Reviewed by: markj@
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16436
Some old compatibility bits were removed in r227503 but this comment
was left behind.
Reported by: br, via review D11962
Sponsored by: The FreeBSD Foundation
code never sees FPU pcb flags not consistent with the hardware state.
This is uncovered by the eager FPU switch mode.
Analyzed, reviewed and tested by: gleb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is an OFW initrd module that would load the initrd from device tree
parameters and give the to the md driver.
With this patch, it is possible to pass a rootfs image through kexec in PowerNV
mode (powerpc64). In order to user it, you should set the MD_ROOT_MEM option in
your kernel configuration.
Reviewed by: jhibbits
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D15705
declaired static. This will allow us to change the definition on arm64
as it has the same issues described in r336349.
Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147
jedec_dimm(4) is a superset of the functionality of jedec_ts(4). Mark
jedec_ts(4) as removed in FreeBSD 12, and include a pointer to the migration
instructions in the jedec_dimm(4) manpage, in both the jedec_ts(4) code and
the jedec_ts(4) manpage. Add a note to the jedec_dimm(4) manpage about the
fact that it is a superset of jedec_ts(4).
This change will be MFCed to stable/11 and stable/10; the followup change
to actually remove jedec_ts(4) from -HEAD will not.
Reviewed by: avg
MFC after: 1 week
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16412
The primary reason for this commit is to separate mechanical and nearly
mechanical code changes from an upcoming fix for unsafe teardown of
shared interrupt handlers that have only filters (see D15905).
The technical rationale is that SLIST is sufficient. The only operation
that gets worse performance -- O(n) instead of O(1) is a removal of a
handler, but it is not a critical operation and the list is expected to
be rather short.
Additionally, it is easier to reason about SLIST when considering the
concurrent lock-free access to the list from the interrupt context and
the interrupt thread.
CK_SLIST is used because the upcoming change depends on the memory order
provided by CK_SLIST insert and the fact that CL_SLIST remove does not
trash the linkage in a removed element.
While here, I also fixed a couple of whitespace issues, made code under
ifdef notyet compilable, added a lock assertion to ithread_update() and
made intr_event_execute_handlers() static as it had no external callers.
Reviewed by: cem (earlier version)
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D16016
o The correct value for _JB_SIGMASK is 27.
o The storage size for double-precision floating
point register is 8 bytes.
Submitted by: "James Clarke" <jrtc4@cam.ac.uk>
Reviewed by: markj@
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16344
This change allows one to set the busy_detect flag
required by the synopsys UART at the loader prompt.
This is needed by the EPYC 3000 SoC.
This will give users a working console up to the point where getty is required:
hw.uart.console="mm:0xfedc9000,rs:2,bd:1"
Reviewed by: imp
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D16399
Since r336313, TSO support for LEM-class devices is removed again as it
was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a
result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus,
inappropriate watermarks were used for this class.
This is really only a band-aid, though, because so far iflib(9) doesn't
fully take into account that DMA engines can support different maxima
of segments for transfers of TSO and non-TSO packets. For example, the
DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC
used isc_tx_tso_segments_max only. For most in-tree consumers that
doesn't make a difference as the maxima are the same for both kinds of
transfers (that is, apart from the fact that TSO may require up to 2
sentinel descriptors but also not with every MAC supported). However,
isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default
with ixl(4).
tests for avail > 0, avail can never be 0 within that loop. Thus, move
decrementing avail and budget_left into the loop and before the code which
checks for additional descriptors having become available in case all the
previous ones have been processed but there still is budget left so the
latter code works as expected. [1]
- In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m
and n respectively. [2, 3]
- In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing
it. [4]
- Remove a duplicate assignment of segs in iflib_encap().
Reported by: Coverity
CID: 1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]
- Don't bother calling if_setbaudrate(9) as iflib_link_state_change(9)
takes care of that,
- correctly check for E1000_CTRL_EXT_LINK_MODE_GMII in E1000_CTRL_EXT [1],
- properly convert the uint16_t link_speed to a uint64_t baudrate by
using IF_Mbps() which contains an appropriate cast [2],
- remove the duplicate link down announcement when bootverbose isn't
zero and bring the remaining one in line with the other link state
messages.
o Remove a dead store to rid in em_if_msix_intr_assign(). [3]
o Or in the DMA coalescing Rx threshold so the other bits set in E1000_DMACR
remain intact as intended in igb_init_dmac(). [4]
Reported by: Coverity
CID: 1378464 [1], 1368765 [2], 1381681 [3], 1304929 [4]
These syscalls were always supposed to have been auditted, but due to
oversights never were.
PR: 228374
Reported by: aniketp
Reviewed by: aniketp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16388
When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has
a destructor (newreno_cb_destroy) for per connection state. Other congestion
controls may allocate and free cc_data on entry and exit, but the field is
never explicitly NULLed if moving back to NewReno which only internally
allocates stateful data (no entry contstructor) resulting in a situation where
newreno_cb_destory might be called on a junk pointer.
- NULL out cc_data in the framework after calling {cc}_cb_destroy
- free(9) checks for NULL so there is no need to perform not NULL checks
before calling free.
- Improve a comment about NewReno in tcp_ccalgounload
This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.
Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282
We've got a set of probably damaged hard disks, reporting 0x04,0x02
("Logical unit not ready, initializing command required") in response
to READ CAPACITY(16), where attempts to use START STOP UNIT for recovery
results in 0x44,0x00 ("Internal target failure") after ~1 second delay.
As result of all recovery retries, device open attempt took ~3 seconds
before finally reporting to GEOM that device is opened, but has no media.
If the open was for writing and since it hasn't formally failed, following
close triggered GEOM retaste, opening device few more times with respective
delays.
This change reduces whole time of this cycle from ~12 seconds to ~3 by
giving up on recovery after the first failure.
Reviewed by: ken
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This fixes usage/report size calculation of Microsoft`s "Touch Hardware
Quality Assurance" certificate blob found in many touchscreens.
While here, join several "c->flags = dval" lines in to single line.
Reviewed by: hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16357
Use CLSET_TIMEOUT to set the timeout for connections to DSs instead of
specifying a timeout on each RPC. This is done so that SO_SNDTIMEO
is set on the TCP socket as well as specifying a time limit when
waiting for an RPC reply. Useful if the send queue for the TCP
connection has become constipated, due to a failed DS.
The choice of lease_duration / 4 is fairly arbitrary, but seems to work
ok, with a lower bound of 10sec.
For client connections to a DS, set the retry limit to vfs.nfsd.dsretries,
which is 2 by default.
This patch should only affect pNFS connections to DSs.
This patch requires r336542.
MFC after: 2 weeks
r323954 changed the mp ring behaviour when 64-bit atomics were
available to abdicate the TX ring rather than having one become a
consumer thereby running to completion on TX. The consumer of the mp
ring was then triggered in the tx task rather than blocking the TX call.
While this significantly lowered the number of RX drops in small-packet
forwarding, it also negatively impacts TX performance.
With this change, the default behaviour is reverted, causing one TX ring
to become a consumer during the enqueue call. A new sysctl,
dev.X.Y.iflib.tx_abdicate is added to control this behaviour.
Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16302
Use the timer to poll for TX completions when there are
outstanding TX slots. Track when the last driver timer was called
to prevent overcalling it. Also clean up some kring vs NIC ring
usage.
Reviewed by: marius, Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16300
See the commit log messages for r321378 and r336288 for descriptions of
this functionality.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D16303
/chosen/stdout-path is a string, not ihandle. Treat it as such.
With this, ofwfb now starts correctly on a POWER9 system when launched from
the local console (not serial).
The FDT implementation of OF_instance_to_package() backend checks the
cross-reference to get the node. On failure, this returns the input handle
unchanged. In the case of ofwfb attachment, if /chosen/stdout property does not
exist, sc->sc_handle is either garbage or 0, which then gets propagated to node.
This will prevent "screen" from being used, resulting in not properly attaching.
Correct this by matching the code in ofwfb_probe().
Fire UDP receive probes when a packet is received and there is no
endpoint consuming it. Fire the probe also if the TTL of the
received packet is smaller than the minimum required by the endpoint.
Clarify also in the man page, when the probe fires.
Reviewed by: dteske@, markj@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16046
During testing of the pNFS client, it was observed that an RPC could get
stuck in sosend() for a very long time if the network connection to a DS
had failed. This is fixed by setting SO_SNDTIMEO on the TCP socket.
This is only done when CLSET_TIMEOUT is done and this is not done by any
use of the krpc currently in the source tree, so there should be no effect
on extant uses.
A future patch will use CLSET_TIMEOUT for TCP connections to DSs.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16293
As an aside: 1200074 should be used as the last version with big
endian arm support, should that be needed. it was actually removed
a day later, but no bump was made until now.
Code analysis and runtime analysis using truss(8) indicate that the only
privileged operations performed by ntpd are adjusting system time, and
(re-)binding to privileged UDP port 123. These changes add a new mac(4)
policy module, mac_ntpd(4), which grants just those privileges to any
process running with uid 123.
This also adds a new user and group, ntpd:ntpd, (uid:gid 123:123), and makes
them the owner of the /var/db/ntp directory, so that it can be used as a
location where the non-privileged daemon can write files such as the
driftfile, and any optional logfile or stats files.
Because there are so many ways to configure ntpd, the question of how to
configure it to run without root privs can be a bit complex, so that will be
addressed in a separate commit. These changes are just what's required to
grant the limited subset of privs to ntpd, and the small change to ntpd to
prevent it from exiting with an error if running as non-root.
Differential Revision: https://reviews.freebsd.org/D16281
We can now have efifb being setup correctly.
Enjoy video output on some boards when you couldn't before.
Tested-On: Pine64
Tested-On: Pine64-LTS
Tested-On: Pinebook
Some driver (like efifb) needs to map more than the current L2_SIZE
Raise the size so we can map the framebuffer setup by the bootloader.
Reviewed by: cognet
I had naively assumed that building kernel would be sufficient to test that
the header is sane. However, it turns out this now needs -fms-extensions to
build. Rather than sprinkling -fms-extensions all over the place, revert
for now, and revisit with a better fix.
Summary:
Ports like sysutils/lsof troll through kernel structures, and
therefore include kernel headers and all the dirty secrets involved. struct
vm_page includes the struct md_page inline, which currently is only defined
if AIM or BOOKE is defined. Thus, by default, sysutils/lsof cannot build,
due to the struct md_page having an incomplete type. Fix this by merging
the two struct definitions into an anonymous struct-union.
A similar change could be made to unify the pmap structures as well.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D16232
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data. This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded. Previously,
it would have remained unused.
Reviewed by: kib, royger
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16330
The basename will never match against the preload metadata, so these
calls previously had no effect.
Reviewed by: kib, royger
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16330
multithreaded programs that was addressed by r329254 was in the
implementation of pmap_enter() on some architectures, notably, amd64.
kib@, markj@ and I have audited all of the pmap_enter() implementations,
and fixed the broken ones, specifically, amd64 (r335784, r335971), i386
(r336092), mips (r336248), and riscv (r336294).
To be clear, the reason to address the problem within pmap_enter() and
revert r329254 is not just a matter of principle. An effect of r329254
was that a copy-on-write fault actually entailed two page faults, not
one, even for single-threaded programs. Now, in the expected case for
either single- or multithreaded programs, we are back to a single page
fault to complete a copy-on-write operation. (In extremely rare
circumstances, a multithreaded program could suffer two page faults.)
Reviewed by: kib, markj
Tested by: truckman
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D16301
Oppv2 add more flexibility on regulator value for the core voltage amongst
other new thing.
For now only shared opp table is supported as I don't have hardware with
non-shared opp table.
Tested-On: OrangePi One (with oppv1 and oppv2)
Tested-On: Pine64-LTS
In order to setup an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.
This allows booting FreeBSD as a PVHv2 DomU and Dom0.
Sponsored by: Citrix Systems R&D
Allow the hypercall page to be initialized very early, even before
vtophys is functional. Also make the function global so it can be
called by other files.
This will be needed in order to perform the early bringup on PVHv2
guests.
Sponsored by: Citrix Systems R&D
When booted as PVHv2, there's no ACPI CPU object, so attach the PV CPU
device in order to take it's place.
This is required in case some device or driver tries to poke at the
PCPU device field.
Sponsored by: Citrix Systems R&D
HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM
and PVHv2 guests get this data from HVM parameters that are fetched
using a hypercall.
Instead provide a set of helper functions that should be used to fetch
this data. The helper functions have different implementations
depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest
type.
This helps to cleanup generic Xen code by removing quite a lot of
xen_pv_domain and xen_hvm_domain macro usages.
Sponsored by: Citrix Systems R&D
The PVHv2 entry point is fairly similar to the multiboot1 one. The
kernel is started in protected mode with paging disabled. More
information about the exact BSP state can be found in the pvh.markdown
document on the Xen tree.
This entry point is going to be joined with the native entry point at
hammer_time, and in order to do so the BSP needs to be bootstrapped
into long mode with the same set of page tables as used on bare metal.
Sponsored by: Citrix Systems R&D
found in some MacBook Pro.
PR: 229727
Reported by: Stephan Neuhaus <sten@artdecode.de> and others
Tested by: Stephan Neuhaus <sten@artdecode.de>
Approved by: mav (mentor)
MFC after: 1 month
These changes ensure that reclaim_pv_chunk() can be safely be
executed concurrently by multiple threads.
Reviewed by: alc
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16304
This is a trivial comment-only fix. The man page for kevent(2) gives
the definition of struct kevent, including a comment on each
field. The actual definition in sys/event.h omitted the comments on
some fields. Add the comments in. Not only does this make the man page
and include file agree, but the comments are useful in and of
themselves.
Reviewed by: kib (D15778: commented that this should be a separate commit)
MFC after: 3 days
Sponsored by: Dell EMC
Query the minimal inline mode supported by the card.
When creating a send queue, cache the queried mode and optimize the transmit
if no inlining is required. In this case, we can avoid touching the headers
cache line and avoid dirtying several more lines by copying headers into
the send WQEs. Also, if no inline headers are used, hardware assists in
the VLAN tag framing.
Submitted by: kib@, slavash@
MFC after: 1 week
Sponsored by: Mellanox Technologies
Issue: IO fails immediately after doing port-toggle.
Fix: Added LDT(Device Lost Timer)- we wait a specific period of time prior to telling the OS about lost device.
Approved by: ken, mav
MFC after: 3 days
Differential Revision: D16196
Track session objects in the framework, and pass handles between the
framework (OCF), consumers, and drivers. Avoid redundancy and complexity in
individual drivers by allocating session memory in the framework and
providing it to drivers in ::newsession().
Session handles are no longer integers with information encoded in various
high bits. Use of the CRYPTO_SESID2FOO() macros should be replaced with the
appropriate crypto_ses2foo() function on the opaque session handle.
Convert OCF drivers (in particular, cryptosoft, as well as myriad others) to
the opaque handle interface. Discard existing session tracking as much as
possible (quick pass). There may be additional code ripe for deletion.
Convert OCF consumers (ipsec, geom_eli, krb5, cryptodev) to handle-style
interface. The conversion is largely mechnical.
The change is documented in crypto.9.
Inspired by
https://lists.freebsd.org/pipermail/freebsd-arch/2018-January/018835.html .
No objection from: ae (ipsec portion)
Reported by: jhb
1. Fix taskqueues drain/free to fix panic seen when interface is being
bought down and in parallel asynchronous link events happening.
2. Fix bxe_ifmedia_status()
Submitted by:Vaishali.Kulkarni@cavium.com and Anand.Khoje@cavium.com
MFC after:5 days
Remove all the big-endian arm architectures (ixp425 and ixp435)
support in the kernel and associated drivers.
Differential Revision: https://reviews.freebsd.org/D16257
Also, I misspoke in r336428. Any devices on sparc64 machines on "isa"
that can do DMA can do 32-bit address DMA and aren't limited to
24-bits of address.
For setups having a large amount of PCI devices, it makes sense to limit the
number of MSIX vectors per PCI device, in order to avoid running out of IRQ
vectors.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The scatter list is formed by the chunks of MCLBYTES each, and larger
than default packets are returned to the stack as the mbuf chain.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies
This deduplicates the code a bit, and also implicitly adds missing
callout_stop() to in[6]_lltable_delete_entry() functions.
PR: 209682, 225927
Submitted by: hselasky (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D4605
To access the data, set sysctl dev.mce.N.conf.debug_stats to 1.
This enables the sysctl node dev.mce.N.hw_ctx_debug. Its content is
the mapping of each channel' number to used receive queue and associated
completion queue, set of the transmit queues numbers and corresponding
completion queues.
Trimmed example output:
channel 30 rq 188 cq 1085
channel 30 tc 0 sq 187 cq 1084
channel 31 rq 191 cq 1087
channel 31 tc 0 sq 190 cq 1086
MFC after: 1 week
Sponsored by: Mellanox Technologies
Device detach and setting error state may deadlock over the interface mutex
like this:
a) Detach code in mlx5en waits until error state is set while the interface
mutex is locked.
b) The set error handler needs to lock the interface mutex before it can
set the error state.
The solution is to use atomics to set the error state.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When resetting mlx5core instances it can happen that the order of attach and
detach for mlx5ib instances is changed. Take the unit number for mlx5_%d from
the parent PCI device, similarly to what is done in mlx5en(4), so that there
is a direct relationship between mce<N> and mlx5_<N>.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The DSCP feature is controlled using a set of sysctl(8) fields under
the qos sysctl directory entry for mlx5en(4).
For Routable RoCE QPs, the DSCP should be set in the QP's address path.
The DSCP's value is derived from the traffic class.
Linux commit:
ed88451e1f2d400fd6a743d0a481631cf9f97550
MFC after: 1 week
Sponsored by: Mellanox Technologies
When creating address handle from multicast GID, set MAC according to
the appropriate formula instead of searching for it in the GID table:
- For IPv4 multicast GID use ip_eth_mc_map().
- For IPv6 multicast GID use ipv6_eth_mc_map().
Linux commit:
9636a56fa864464896bf7d1272c701f2b9a57737
MFC after: 1 week
Sponsored by: Mellanox Technologies
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
Linux commit:
6d337179f28cc50ddd7e224f677b4cda70b275fc
MFC after: 1 week
Sponsored by: Mellanox Technologies
ah_attr contains the port number to which cm_id is bound. However, while
searching for GID table for matching GID entry, the port number is
ignored.
This could cause the wrong GID to be used when the ah_attr is converted to
an AH.
Linux commit:
563c4ba3bd2b8b0b21c65669ec2226b1cfa1138b
MFC after: 1 week
Sponsored by: Mellanox Technologies
The current implementation assumes a static mapping between
the TOS bits and the priority code point, PCP bits.
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a loopback address is detected use the network interface which
has the loopback flag set to trigger loopback logic in address resolve.
MFC after: 1 week
Sponsored by: Mellanox Technologies
The ib_uverbs_create_ah() ind ib_uverbs_modify_qp() calls receive
the port number from user input as part of its attributes and assumes
it is valid. Down on the stack, that parameter is used to access kernel
data structures. If the value is invalid, the kernel accesses memory
it should not. To prevent this, verify the port number before using it.
Linux commit:
5ecce4c9b17bed4dc9cb58bfb10447307569b77b
a62ab66b13a0f9bcb17b7b761f6670941ed5cd62
5a7a88f1b488e4ee49eb3d5b82612d4d9ffdf2c3
MFC after: 1 week
Sponsored by: Mellanox Technologies
RoCEv1 does not use the IPv6 stack to resolve the link local DGID since it
uses GID address. It forms the DMAC directly from the DGID.
Linux commit:
56d0a7d9a0f045ee27a001762deac28c7d28e2e4
MFC after: 1 week
Sponsored by: Mellanox Technologies
This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().
This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().
This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.
Linux commit:
4be3a4fa51f432ef045546d16f25c68a1ab525b9
MFC after: 1 week
Sponsored by: Mellanox Technologies
Garbage supplied by user will cause to UCMA module provide zero
memory size for memcpy(), because it wasn't checked, it will
produce unpredictable results in rdma_resolve_addr().
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB. When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.
Fix this by introducing new variants
int rdma_addr_size_in6(struct sockaddr_in6 *addr);
int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);
that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in. We can use
these new variants to check what size userspace has passed in before
copying any addresses.
Linux commit:
2975d5de6428ff6d9317e9948f0968f7d42e5d74
09abfe7b5b2f442a85f4c4d59ecf582ad76088d7
84652aefb347297aa08e91e283adf7b18f77c2d5
MFC after: 1 week
Sponsored by: Mellanox Technologies
This was done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
Linux commit:
4b658d1bbc16605330694bb3ef2570c465ef383d
8b77586bd8fe600d97f922c79f7222c46f37c118
MFC after: 1 week
Sponsored by: Mellanox Technologies
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.
Linux commit:
f45765872e7aae7b81feb3044aaf9886b21885ef
MFC after: 1 week
Sponsored by: Mellanox Technologies
As part of ib_uverbs_remove_one which might be triggered upon
reset flow, we trigger IB_EVENT_DEVICE_FATAL event to userspace
application.
If device was removed after uverbs fd was opened but before
ib_uverbs_get_context was called, the event file will be accessed
before it was allocated, result in NULL pointer dereference:
Linux commit:
870201f95fcbd19538aef630393fe9d583eff82e
MFC after: 1 week
Sponsored by: Mellanox Technologies
The attempt to join multicast group without ensuring that CMA device
exists will lead to the following crash reported by syzkaller.
Linux commit:
7688f2c3bbf55e52388e37ac5d63ca471a7712e1
MFC after: 1 week
Sponsored by: Mellanox Technologies
Prior to access UCMA commands, the context should be initialized
and connected to CM_ID with ucma_create_id(). In case user skips
this step, he can provide non-valid ctx without CM_ID and cause
to multiple NULL dereferences.
Also there are situations where the create_id can be raced with
other user access, ensure that the context is only shared to
other threads once it is fully initialized to avoid the races.
Linux commit:
e8980d67d6017c8eee8f9c35f782c4bd68e004c9
MFC after: 1 week
Sponsored by: Mellanox Technologies
When receiving a PCP change all GID entries are reloaded.
This ensures the relevant GID entries use prio tagging,
by setting VLAN present and VLAN ID to zero.
The priority for prio tagged traffic is set using the regular
rdma_set_service_type() function.
Fake the real network device to have a VLAN ID of zero
when prio tagging is enabled. This is logic is hidden inside
the rdma_vlan_dev_vlan_id() function which must always be used
to retrieve the VLAN ID throughout all of ibcore and the
infiniband network drivers.
The VLAN presence information then propagates through all
of ibcore and so incoming connections will have the VLAN
bit set. The incoming VLAN ID is then checked against the
return value of rdma_vlan_dev_vlan_id().
MFC after: 1 week
Sponsored by: Mellanox Technologies
cma_iboe_set_mgid() is updated to reflect the RoCEv2 GID check.
Linux commit:
5c181bda77f409d89ad513528eccac5f3a416474
MFC after: 1 week
Sponsored by: Mellanox Technologies
RoCEv2 Annex states that for RoCEv2 over IPv4, the corresponding
IPv4 address is encoded into the GID according to the following rule:
GID= :ffff:<IPv4 address>
Remove the 0xff0e prefix for RoCEv2 packets with IPv4 and leave it
zeroed and change rdma_is_multicast_addr() to consider the new logic.
Linux commit:
be1d325a335840a86c133a56c6a911c368bac0fd
1c3aea2bc8f0b2e5b57375ead40457ff75a3a2ec
MFC after: 1 week
Sponsored by: Mellanox Technologies
The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).
Add check to verify that the MLID value is in the correct address
range.
RoCE Annex (A16.9.10/11) declares that during attach (detach) QP to a
multicast group, if the QP is associated with a RoCE port, the
multicast group MLID is unused and is ignored.
During attach or detach multicast, when the QP is associated with a
port, it is enough to check the port's link layer and validate the
LID only if it is Infiniband. Otherwise, avoid validating the
multicast LID.
Linux commit:
8561eae60ff9417a50fa1fb2b83ae950dc5c1e21
5236333592244557a19694a51337df6ac018f0a7
MFC after: 1 week
Sponsored by: Mellanox Technologies
Implement a more generic solution for detecting loopback.
The problem was that the default netdevice was resolved
for loopback also when VLAN was used. Use real network
device instead of loopback device for bound device
interface.
How to test:
ucmatose -b 127.0.0.1 -p 20090
ucmatose -s 5.6.5.1 -p 20090
Note that RDMA treats the IPv4 and IPv6 loopback
addresses like any address.
MFC after: 1 week
Sponsored by: Mellanox Technologies
A list of MGID/MLID pairs is built when doing a multicast attach. When
the multicast detach is called, the list is searched, and regardless of
the search outcome, the driver detach is called.
If an MGID/MLID pair is not on the list, driver detach should not be
called, and an error should be returned. Calling the driver without
removing an MGID/MLID pair from the list can leave the core and driver
out of sync.
Linux commit:
20c7840a77ddcb2ed2fbd66e8197db2868495751
MFC after: 1 week
Sponsored by: Mellanox Technologies