The goal of this change is to fix a problem with PCI shared interrupts
during suspend and resume.
I have observed a couple of variations of the following scenario.
Devices A and B are on the same PCI bus and share the same interrupt.
Device A's driver is suspended first and the device is powered down.
Device B generates an interrupt. Interrupt handlers of both drivers are
called. Device A's interrupt handler accesses registers of the powered
down device and gets back bogus values (I assume all 0xff). That data is
interpreted as interrupt status bits, etc. So, the interrupt handler
gets confused and may produce some noise or enter an infinite loop, etc.
This change affects only PCI devices. The pci(4) bus driver marks a
child's interrupt handler as suspended after the child's suspend method
is called and before the device is powered down. This is done only for
traditional PCI interrupts, because only they can be shared.
At the moment the change is only for x86.
Notable changes in core subsystems / interfaces:
- BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus
interface along with convenience functions bus_suspend_intr and
bus_resume_intr;
- rman_set_irq_cookie and rman_get_irq_cookie functions are added to
provide a way to associate an interrupt resource with an interrupt
cookie;
- intr_event_suspend_handler and intr_event_resume_handler functions
are added to the MI interrupt handler interface.
I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to
implement the new intr_event functions. IH_SUSP marks a suspended
interrupt handler. IH_CHANGED is used to implement a barrier that
ensures that a change to the interrupt handler's state is visible
to future interrupts.
While there, I fixed some whitespace issues in comments and changed a
couple of logically boolean variables to be bool.
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D15755
It has been reported that on some systems (with real hardware passed
through to a virtual machine) the WP detection causes USB disk probing
failures.
While here, also fix the selection of the next state in the case
of malloc failure in DA_STATE_PROBE_WP. It was DA_STATE_PROBE_RC
unconditionally even when it should have been DA_STATE_PROBE_RC16.
PR: 225794
Reported by: David Boyd <David.Boyd49@twc.com>
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D18496
or the likes. Add new control message types: setdlt and getdlt to switch
from default DLT_RAW (no encapsulation) to DLT_EN10MB (ethernet).
Approved by: glebius
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18535
check-hash fails. Panic'ing is not an appropriate response. So, check
for an error return from VFS_ROOT() and when an error is reported,
unwind and return the error.
Reported by: Gary Jennejohn (gj)
Sponsored by: Netflix
the check-hash fails. Prior to the fix in -r342133 the inode with the
zeroed out check-hash was written back to disk causing further confusion.
Reported by: Gary Jennejohn (gj)
Sponsored by: Netflix
before copying in the inode so that the mode and link-count are not set
if the check-hash fails. This change ensures that the vnode will be properly
unwound and recycled rather than being held in the cache.
Initialize the file mode is zero so that if the loading of the inode
fails (for example because of a check-hash failure), the vnode will be
properly unwound and recycled.
Reported by: Gary Jennejohn (gj)
Sponsored by: Netflix
"panic: softdep_update_inodeblock: bad link count" when releasing
a partially initialized vnode after an inode check-hash failure.
Reported by: Gary Jennejohn <gljennjohn@gmail.com>
Reported by: Peter Holm (pho)
Sponsored by: Netflix
This change is causing TCP connections using cubic to hang. Need to dig more to
find exact cause and fix it.
Reported by: tj at mrsk dot me, Matt Garber (via twitter)
Discussed with: sbruno (previously), allanjude, cperciva
MFC after: 3 days
Use the sysctl_handle_int() handler to write out the old value and read
the new value into a temporary variable. Use the temporary variable
for any checks of values rather than using the CAST_PTR_INT() macro on
req->newptr. The prior usage read directly from userspace memory if the
sysctl() was called correctly. This is unsafe and doesn't work at all on
some architectures (at least i386.)
In some cases, the code could also be tricked into reading from kernel
memory and leaking limited information about the contents or crashing
the system. This was true for CDG, newreno, and siftr on all platforms
and true for i386 in all cases. The impact of this bug is largest in
VIMAGE jails which have been configured to allow writing to these
sysctls.
Per discussion with the security officer, we will not be issuing an
advisory for this issue as root access and a non-default config are
required to be impacted.
Reviewed by: markj, bz
Discussed with: gordon (security officer)
MFC after: 3 days
Security: kernel information leak, local DoS (both require root)
Differential Revision: https://reviews.freebsd.org/D18443
This can drastically lower the load on gssd(8) on large NFS servers.
Submitted by: Per Andersson <pa at chalmers dot se>
Reviewed by: rmacklem@
MFC after: 2 weeks
Sponsored by: Chalmers University of Technology
Differential Revision: https://reviews.freebsd.org/D18393
PR: maybe related to 233998 (inconclusive at this time)
Submitted by: byuu <byuu AT tutanota.com> (previous version)
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D18506
This fixes building in CheriBSD with a strict tmp path since we don't
bootstrap a cpp but pass the full path to clang-cpp instead.
While touching this file also fix all shellcheck warnings in make_dtb.sh.
Reviewed By: manu
Differential Revision: https://reviews.freebsd.org/D18376
- Panic immediately if witness says we're holding non-sleepable locks.
This helps ensure that we don't recurse on the pmap lock in
pmap_fault_fixup().
- Panic if the kernel faults on a user address without setting an
onfault handler.
- Panic if the fault occurred in a critical section or interrupt
handler, like we do on other platforms.
- Fix some style issues in trap_pfault().
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18561
pmap_remove_pages() is called during process termination, when it is
guaranteed that no other CPU may access the mappings being torn down.
In particular, it unnecessary to invalidate each mapping individually
since we do a pmap_invalidate_all() at the end of the function.
Also don't call pmap_invalidate_all() while holding a PV list lock, the
global pvh lock is sufficient.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18562
pmaps on RISC-V always have an L1 page table page, so we don't need to
check for this when performing lookups.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18563
fcmpset returns true/false as a int, so make the return types and
variables match the int to be consistent with other arch.
Reviewed by: cognet@
Differential Revision: https://reviews.freebsd.org/D18557
- Build up phys_avail[] in a single loop, excluding memory used by
the loaded kernel.
- Fix an array indexing bug in the aforementioned phys_avail[]
initialization.[1]
- Remove some unneeded code copied from the arm64 implementation.
PR: 231515 [1]
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18464
It was written basing on:
TCG PC Client Platform TPM Profile (PTP) Specification Version 22, Revision 1.03.
It only supports Locality 0. Interrupts are only supported in FIFO mode.
The driver in FIFO mode was tested on x86 with Infineon SLB9665 discrete TPM chip.
Driver in both modes was also tested on qemu with swtpm running on host.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D18048
This fix booting on A64 boards when disabling the unused regulators at boot.
We did disable all the regulator handled by register 0x13 which of course contain
mandatory regulators for the board to be up.
Reported by: Mark Millard <marklmi@yahoo.com>
X-MFC-With: r340848
This is based on a patch developed by
Tetsuya Uemura <t_uemura@macome.co.jp>.
Many thanks!
Submitted by: Tetsuya Uemura <t_uemura@macome.co.jp> (earlier version)
Tested by: Tetsuya Uemura <t_uemura@macome.co.jp>
MFC after: 2 weeks
very minimal prints and even few important messages will not get logged.
Submitted by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed by: Kashyap Desai <Kashyap.Desai@broadcom.com>
Approved by: ken
MFC after: 3 days
Sponsored by: Broadcom Inc
capable IOs. NVME specification supports specific type of scatter gather list
called as PRP (Physical Region Page) for IO data buffers. Since NVME drive is
connected behind SAS3.5 tri-mode adapter, MegaRAID driver/firmware has to convert
OS SGLs in native NVMe PRP format. For IOs sent to firmware, MegaRAID firmware
does this job of OS SGLs to PRP translation and send PRPs to backend NVME device.
For fastpath IOs, driver will do this OS SGLs to PRP translation.
Submitted by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed by: Kashyap Desai <Kashyap.Desai@broadcom.com>
Approved by: ken
MFC after: 3 days
Sponsored by: Broadcom Inc
required Write IOs as Fast Path IOs (after the appropriate checks
allowing Fast Path to be used) to the appropriate physical drives
(translated from the OS logical IO) and wait for all Write IOs to complete.
Design: A write IO on RAID volume will be examined if it can be sent in
Fast Path based on IO size and starting LBA and ending LBA falling on to
a Physical Drive boundary. If the underlying RAID volume is a RAID 1/10,
driver issues two fast path write IOs one for each corresponding physical
drive after computing the corresponding start LBA for each physical drive.
Both write IOs will have the same payload and are posted to HW such that
replies land in the same reply queue.
If there are no resources available for sending two IOs, driver will send
the original IO from upper layer to RAID volume through the Firmware.
When both IOs are completed by HW, the resources will be released
and SCSI IO completion handler will be called.
Submitted by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed by: Kashyap Desai <Kashyap.Desai@broadcom.com>
Approved by: ken
MFC after: 3 days
Sponsored by: Broadcom Inc
stream to help HBA Firmware do the Full Stripe Writes. For read IOs on
certain RAID volumes like Read Ahead volumes,this will help driver to
send it to Firmware even if the IOs can potentially be sent to
hardware directly (called fast path) bypassing firmware.
Design: 8 streams are maintained per RAID volume as per the combined
firmware/driver design. When there is no stream detected the LRU stream
is used for next potential stream and LRU/MRU map is updated to make this
as MRU stream. Every time a stream is detected the MRU map
is updated to make the current stream as MRU stream.
Submitted by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed by: Kashyap Desai <Kashyap.Desai@broadcom.com>
Approved by: ken
MFC after: 3 days
Sponsored by: Broadcom Inc
for different number of supported VDs for SAS3.5 MegaRAID adapters.
Submitted by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed by: Kashyap Desai <Kashyap.Desai@broadcom.com>
Approved by: ken
MFC after: 3 days
Sponsored by: Broadcom Inc
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init
Sponsored by: The FreeBSD Foundation
dtrace has its own routines which were not updated after SMAP support got
implemented. Use ifunc just like for other routines.
This in particular fixes ustack().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18542
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.
Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.
Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
Previous commits have made VM_MIN_KERNEL_ADDRESS its own separate entity,
and rebased the kernel around that address instead of KERNBASE. This commit
pulls the trigger to rebase KERNBASE to a physical load address. The
eventual goal is to align the address with the AIM KERNBASE, but at this
time that's not an option.
Currently a Book-E kernel must be loaded on a 64MB boundary, due to size
issues. The common load address is at the 64MB mark (0x04000000), so simply
make that the default KERNBASE.
As of this commit, Book-E kernels can be loaded and booted with ubldr.
MFC after: 3 weeks
Optimize the exception handler to only save and load the upper word of the
GPRs used in the emulating instruction. This reduces the save/load
overhead, and as a side effect does not overwrite the upper word of any
temporary register.
With this commit I am now able to run editors/abiword and math/gnumeric on a
e500-based system.
MFC after: 1 week
MFC With: r341752,r341751
Right now, aesni_cipher_alloc does a bit of special-casing
for CRYPTO_F_IOV, to not do any allocation if the first uio
is large enough for the requested size. While working on ZFS
crypto port, I ran into horrible performance because the code
uses scatter-gather, and many of the times the data to encrypt
was in the second entry. This code looks through the list, and
tries to see if there is a single uio that can contain the
requested data, and, if so, uses that.
This has a slight impact on the current consumers, in that the
check is a little more complicated for the ones that use
CRYPTO_F_IOV -- but none of them meet the criteria for testing
more than one.
Submitted by: sef at ixsystems.com
Reviewed by: cem@
MFC after: 3 days
Sponsored by: iX Systems
Differential Revision: https://reviews.freebsd.org/D18522
MIPS64 has 64-bit longs, so use uint64_t for it, otherwise uint32_t.
sizeof(long) == sizeof(ptr) for all platforms, so define
atomic_swap_ptr in terms of atomic_swap_long.
Submitted by: hps@
icu is a interrupt concentrator in the CP110 block and gicp
is a gic extension to allow interrupts in the CP block to be turned
into GIC SPI interrupts
Sponsored by: Rubicon Communications, LLC ("Netgate")
The cp110 clock controller controls the clocks and gate of the CP110
hardware block.
Every clock/gate are implemented except the NAND clock.
Sponsored by: Rubicon Communications, LLC ("Netgate")
The first two clocks are for the clusters and their frequencies can be
found reading a register. Then a fixed 1200Mhz clock is present and two
fixed clocks, 'mss' which is 1200 / 6 and 'sdio' which is 1200 / 3.
Sponsored by: Rubicon Communications, LLC ("Netgate")
The pwm subsystem consist of API for PWM controllers, pwmbus to register them
and a pwm(8) utility to talk to them from userland.
Reviewed by: oshgobo (capsicum), bcr (manpage), 0mp (manpage)
Differential Revision: https://reviews.freebsd.org/D17938
When we try to find a source port in pf_get_sport() it's possible that
all available source ports will be in use. In that case we call
pf_map_addr() to try to find a new source IP to try from. If there are
no more available source IPs pf_map_addr() will return 1 and we stop
trying.
However, if sticky-address is set we'll always return the same IP
address, even if we've already tried that one.
We need to check the supplied address, because if that's the one we'd
set it means pf_get_sport() has already tried it, and we should error
out rather than keep trying.
PR: 233867
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18483
I mistakenly added a lock assertion to this routine at the last minute
without confirming it was held during g_mirror_create. It isn't (it isn't
even initialized yet). Mea culpa. Access is exclusive in both callers,
just not always by that particular lock.
Reported by: lwhsu
X-MFC-With: r341840, r341674
If bus_dmamap_load_mbuf() fails following a defrag, the caller of
bwn_dma_tx_start() would free the original mbuf after m_defrag() had
already done so. Fix this by returning the defragged mbuf to the
caller instead. Update bwn_pio_tx_start() similarly for consistency.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: landonf
Tested by: landonf
MFC after: 3 days
admbug: 820
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18342
PR: 217505
Submitted by: John O. Brickley <obryan.brickley@gmail.com>, updated by Maciej Pasternacki <maciej@pasternacki.net>
Reported by: John O. Brickley <obryan.brickley@gmail.com>
MFC after: 1 week
r341674 inadvertently introduced a bug where newer mirror components being
tasted would clear the high sc_flags that are not controlled by component
metadata, such as G_MIRROR_DEVICE_FLAG_TASTING. This could plausibly expose
a small window of time during STARTING where device destruction might race
with mirror component addition, probably resulting in a crash.
Reviewed by: markj
X-MFC-With: r341674
Differential Revision: https://reviews.freebsd.org/D18521
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.
Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.
Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
Mainly states of established TCP connections would be affected resulting
in immediate state removal once the number of states is bigger than
adaptive.start. Disabling adaptive timeouts is a workaround to avoid this bug.
Issue found and initial diff by Mathieu Blanc (mathieu.blanc at cea dot fr)
Reported by: Andreas Longwitz <longwitz AT incore.de>
Obtained from: OpenBSD
MFC after: 2 weeks
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.
Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.
Reported by: lev
MFC after: 1 week
Sponsored by: Limelight Networks
The error was caused by map_ucode() casting a vm_paddr_t to a void *.
Use a uintptr_t instead to match the caller. Fix some style bugs while
here.
Reported by: bde
Reviewed by: bde
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
It is a function call only to accomodate *some* ABIs which install a hook.
They only care for 3 types of limits: DATA, STACK, VMEM
Instead of always calling the func, see at compilation time if the requested
limit is something else and just do the read if so.
Sponsored by: The FreeBSD Foundation
initialize the controller.
According to the datasheet, the old code checks if port 2 (P2E, 0x4) was
the only enabled port (except port 0, which was ignored by mask 0xfe),
and issue a write to the PCS register to disable all but port 0, right
before ahci_ctlr_reset.
Some other operating systems would issue a port enable to all ports, but
since the current code only does the special initialization for ICH8M,
it entirely and rely on BIOS to do the right thing (the alternative
would be https://reviews.freebsd.org/D18300?id=50922 , should we see
reports that we really need to do it).
Reviewed by: mav
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D18300
Bootstacks are unused after APs executed sched_throw() in
init_secondary_tail() and started executing on proper idle thread
stack. Add sysinit that detects that the idle thread for each CPU was
scheduled at least once, and free corresponding bootstack.
Slight addition of the code (~200 bytes) is compensated by the saving,
because even on typical small modern desktop CPU we leak 128K of
memory otherwise (4 pages x 8 threads).
Reviewed by: jhb
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18486
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.
Submitted by: Jack Halford <jack@gandi.net>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18501
Inline tests for PTE_* bits are easy to read and don't really require a
predicate function, and predicates which operate on a pt_entry_t are
inconvenient when working with L1 and L2 page table entries.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18461
The code was a near exact copy of the code in startup, but it doesn't need
the complexity since the kernel is already relocated. With
VM_MIN_KERNEL_ADDRESS as currently set to KERNBASE, this doesn't cause a
problem, because it's a zero offset. However, when KERNBASE is changed to a
physical load address, it then has a non-zero offset, and ends up with an
invalid stack pointer, causing the AP to hang.
Once a signal's siginfo was copied to 'td_si' as part of the signal
exchange in issignal(), it was never cleared. This caused future
thread events that are reported as SIGTRAP events without signal
information to report the stale siginfo in 'td_si'. For example, if a
debugger created a new process and used SIGSTOP to stop it after
PT_ATTACH, future system call entry / exit events would set PL_FLAG_SI
with the SIGSTOP siginfo in pl_siginfo. This broke 'catch syscall' in
current versions of gdb as it assumed PL_FLAG_SI with SIGTRAP
indicates a breakpoint or single step trap.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18487
This change adds a hypervisor trap handler for exception 0x1500 (soft patch),
normalizing all VSX registers and returning.
This avoids a kernel panic due to unknown exception.
Change made with the collaboration of leonardo.bianconi_eldorado.org.br,
that found out that this is a hypervisor exception and not a supervisor one,
and fixed this in the code.
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D17806
On EF10 HW we can avoid sending packets without checksum offload
or with IP-only checksum offload to dedicated queues. Instead, we
can use option descriptors to change offload policy on any queue
during runtime. Thus, we don't need to create two dedicated queues.
Submitted by: Ivan Malov <Ivan.Malov at oktetlabs.ru>
Sponsored by: Solarflare Communications, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18390
The number of Tx queues on event queue 0 can depend on the NIC family type,
and this property will be leveraged by future patches.
This patch prepares the code for this change.
Submitted by: Ivan Malov <Ivan.Malov at oktetlabs.ru>
Sponsored by: Solarflare Communications, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18389
FreeBSD driver needs a patch to provide a means for packets
which do not need checksum offload but have flow ID set
to avoid hitting only the first Tx queue (which has been used
for packets not needing checksum offload).
This should be possible on Huntington, Medford or Medford2 chips
since these support toggling checksum offload on any given queue
dynamically by means of pushing option descriptors.
The patch for FreeBSD driver will then need a means to figure out
whether the feature can be used, and testing adapter family might
not be a good solution.
This patch adds a feature bit specifically to indicate support
for checksum option descriptors. The new feature bits may have
more users in future, apart from the mentioned FreeBSD patch.
Submitted by: Ivan Malov <Ivan.Malov at oktetlabs.ru>
Sponsored by: Solarflare Communications, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18388
In order to find out why the first event queue and corresponding
interrupt is triggered more frequent, it is useful to know which
events go to each event queue.
Sponsored by: Solarflare Communications, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18418
and includes the last block represented by the leaf. The reasoning is that,
if the last block is included, then there must be no solution before that
one in the leaf, so the leaf cannot provide an allocation that big again;
indeed, the leaf cannot provide a solution bigger than range1.
Which is all correct, except that if the value of blk passed in did not
represent the first block of the leaf, because the cursor was pointing to
the middle of the leaf, then a possible solution before the cursor may have
been ignored, and bighint cannot be updated.
Consider the sequence allocate 63 (returning address 0), free 0,63 (freeing
that same block, and allocate 1 (returning 63). The result is that one
block is allocated from the first leaf, and the value of bighint is 0, so
that nothing can be allocated from that leaf until the only block allocated
from that leaf is freed. This change detects that skipped-over solution,
and when there is one it makes sure that the value of bighint is not changed
when the last block is allocated.
Submitted by: Doug Moore <dougm@rice.edu>
Tested by: pho
X-MFC with: r340402
Differential Revision: https://reviews.freebsd.org/D18474
devstat_end_transaction() was called before the i/o was actually ended
(by delivering it to GEOM), so at least the i/o length was messed up.
It was always recorded as 0, so the average transaction size and the
average transfer rate was always displayed as 0.
devstat_end_transaction() was not called at all for the error case, so
there were sometimes multiple starts per end. I didn't observe this in
practice and don't know if it did much damage. I think it extended the
length of the i/o to the next transaction.
Reviewed by: kib
I also learned that 'mips' is overly broad and covers 64bit architectures
too. However, it's not worth the fight right now, so any refinements
will have to come another day.
mpr as a module for powerpc or mips. An upcoming commit will cause these
drivers to rely on the presence of 64bit atomic operations. Discussed
with jhibbits.
With the introduction of M_EXEC support for kmem_malloc(), some kernel
mappings start having NX bit set in the paging structures early, for
PAE kernels on machines with NX support, i.e. practically on all
machines. In particular, AP trampoline and initialization needs to
access pages which translations has NX bit set, before initializecpu()
is called.
Check for CPUID NX feature and enable EFER.NXE before we enable paging
in mp boot trampoline. This allows the CPU to use the kernel page
table instead of generating page fault due to reserved bit set.
PR: 233819
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Use the newly defined SRAT/SLIT parsing APIs in arm64 to support
ACPI based NUMA.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D17943
ACPI SRAT table on arm64 uses GICC entries to provide CPU locality
information. These entries use an AcpiProcessorUid to identify the
CPU (unlike on x86 where the entries have an APIC ID).
Update acpi_pxm.c to extend the cpu_add/cpu_find/cpu_get_info
functions to handle AcpiProcessorUid. Use the updated functions
while parsing ACPI_SRAT_GICC_AFFINITY entry for arm64.
Also update sys/conf/files.arm64 to build acpi_pxm.c when ACPI is
enabled.
Reviewed by: markj (previous version)
Differential Revision: https://reviews.freebsd.org/D17942
This moves the architecture independent parts of sys/x86/acpica/srat.c
to sys/dev/acpica/acpi_pxm.c, to be used later on arm64. The function
declarations are moved to sys/dev/acpica/acpivar.h
We also need to update sys/conf/files.{i386,amd64} to use the new file.
No functional changes.
Reviewed by: markj, imp
Differential Revision: https://reviews.freebsd.org/D17941
The SLIT and SRAT ACPI tables needs to be parsed on arm64 as well, on
systems that use UEFI/ACPI firmware and support NUMA. To do this, we
need to move most of the logic of x86/acpica/srat.c to dev/acpica and
provide an API that architectures can use to parse and configure ACPI
NUMA information.
This commit adds the API in srat.c as a first step, without making any
functional changes. We will move the common code to sys/dev/acpica
as the next step.
The functions added are:
* int acpi_pxm_init(int ncpus, vm_paddr_t maxphys) - to allocate and
initialize data structures used
* void acpi_pxm_parse_tables(void) - parse SRAT/SLIT, save the cpu and
memory proximity information
* void acpi_pxm_set_mem_locality(void) - use the saved data to set
memory locality
* void acpi_pxm_set_cpu_locality(void) - use the saved data to set cpu
locality
* void acpi_pxm_free(void) - free data structures allocated by init
On arm64, we do not have an cpu APIC id that can be used as index to
store CPU data, we need to use the Processor Uid. To help with this,
define internal functions cpu_add, cpu_find, cpu_get_info to store
and get CPU proximity information.
Reviewed by: markj, jhb (previous version)
Differential Revision: https://reviews.freebsd.org/D17940
If all IDs from trypid to pid_max were used as pids, the code would enter
a loop which would be infinite if none of the IDs could become free (e.g.
they all belong to processes which did not transitioned to zombie).
Fixes: r341684 ("Manage process-related IDs with bitmaps")
Sponsored by: The FreeBSD Foundation
kqueue would always relock immediately afterwards.
While here drop the NULL check for list itself. The list is
always allocated.
Sponsored by: The FreeBSD Foundation
Originally read value is still safely kept. Re-reading code was there
for previous iterations which were partially shared with i386.
Sponsored by: The FreeBSD Foundation
When we detected that the vnode is not symlink, return immediately.
This moves the readlink code out of else branch and unindents it.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Braces were put in the wrong place, causing failing EAGAIN check to
return zero result. Remove the problematic assignment from the
conditional expression at all.
While there, remove used once variable vp, and wrap too long line.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Because of that typo the driver would try to attach to every device
on acpi bus. That disrupted acpi attachment of uart driver, at least.
MFC after: 4 days
X-MFC with: r339754
This adds more detail and fixes some inaccuracies.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18463
Add a subroutine for updating satp, for use when updating the
active pmap. No functional change intended.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18462
The kernel was already doing this prior to r329615. It was changed
to reduce contention on allproc. However, introduction of pidhash
locks and removal of proctree -> allproc ordering from fork thanks
to bitmaps fixed things enough to make this change pessimal.
waitpid takes proctree on each call and this change (now) causes
avoidable stalls if allproc is held.
Sponsored by: The FreeBSD Foundation
Currently unique pid allocation on fork often requires a full walk of
process, group, session lists to make sure it is not used by anything.
This has a side effect of requiring proctree to be held along with allproc,
which adds more contention in poudriere -j 128.
The patch below implements trivial bitmaps which gets rid of the problem.
Dedicated lock is introduced to manage IDs.
While here a bug was discovered: all processes would inherit reap id from
the first process spawned by init. This had a side effect of keeping the
ID used and when allocation rolls over to the beginning it keeps being
skipped.
The patch is loosely based on initial work by mjoras@.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
The current ifdefs are not sufficient to distinguish 32- and 64- bit
variants, which results e.g. in powerpc64 not using atomics.
While some 32-bit archs provide 64-bit atomics, there is no huge advantage
of using them on these platforms.
Reported by: many
Suggested by: jhb
Sponsored by: The FreeBSD Foundation
The iflib subsystem implements netmap support in a driver-independent
way (sys/net/iflib.c). We can therefore remove the headers that
used to implement netmap support for all the drivers now supported
by iflib (em, igb, ixl, ixgbe, lem).
MFC after: 1 week
Re-apply r341665 with format strings fixed.
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
When we are instructed to forget mirror components, bump the generation id
to avoid confusion with such stale components later.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
through.
cxgb4vf doesn't own the buffer size list but still expects the first two
entries to be 4K and some power of 2 respectively. The BSD cxgbe
doesn't care where its preferred buffer sizes are as long as they're in
the list somewhere, so just move its entries towards the end as a
workaround.
MFC after: 1 month
Sponsored by: Chelsio Communicatons
pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.
Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.
The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup, from ~30%
to ~100%.
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D18373
Specifically, assume that the device is present if evaluation of _STA
method fails.
Before r330957 we ignored any _STA evaluation failure (which was
performed by AcpiGetObjectInfo in ACPICA contrib code) for the purpose
of acpi_DeviceIsPresent and acpi_BatteryIsPresent. ACPICA 20180313
removed evaluation of _STA from AcpiGetObjectInfo. So, we added
evaluation of _STA to acpi_DeviceIsPresent and acpi_BatteryIsPresent.
One important difference is that the new code ignored a failure only if
_STA did not exist (AE_NOT_FOUND). Any other kind of failure was
treated as a fatal failure. Apparently, on some systems we can get
AE_NOT_EXIST when evaluating _STA. And that error is not an evil twin
of AE_NOT_FOUND, despite a very similar name, but a distinct error
related to a missing handler for an ACPI operation region.
It's possible that for some people the problem was already fixed by
changes in ACPICA and/or in acpi_ec driver (or even in BIOS) that fixed
the AE_NOT_EXIST failure related to EC operation region.
This work is based on a great analysis by cem and an earlier patch by
Ali Abdallah <aliovx@gmail.com>.
PR: 227191
Reported by: 0mp
MFC after: 2 weeks
This allows tcpdump to capture outbound kernel packets while
in netmap mode
Submitted by: Marc de la Gueronniere <mdelagueronniere@verisign.com>
Reviewed by: vmaffione
MFC after: 1 week
Sponsored by: Verisign, Inc.
Differential Revision: https://reviews.freebsd.org/D17896
card initialization. This is an expanded version of r333682.
Break up prep_firmware into simpler routines while here. Load the
firmware/config KLD only if needed.
MFC after: 1 month
Sponsored by: Chelsio Communications
The POWER9 does not recognize 'or 27,27,27' as a thread priority NOP. On
earlier POWER architectures, this NOP would note to the processor to give up
resources if able, to improve performance of other threads.
All processors that support the thread priority NOPs recognize the
'or 31,31,31' NOP as very low priority, so use this to perform a similar
function, and not burn cycles on POWER9.
The jump slot is a function pointer, not a descriptor pointer, in ELFv2. Just
write the pointer itself over, not the contents of the pointer, which would be
the first instruction of the function.
The 'interrupts' property is actually 2 words, not one, on macgpio child
nodes. Open Firmware's getprop function might be returning the value
copied, not the total size of the property, but FDT's returns the total
size. Prior to this patch, this would cause the SYS_RES_IRQ resource list
to not be populated when running with the 'usefdt' loader variable set, to
convert the OFW device tree to a FDT. Since the property is always 2 words,
read both words, and ignore the second.
Tested by: Dennis Clarke (previous attempt)
MFC after: 2 weeks
deletion is active, specifically after a call to ffs_blkrelease_start()
but before the call to ffs_blkrelease_finish(), ffs_blkrelease_start()
will have handed out SINGLETON_KEY rather than starting a collection
sequence. Thus if we get a SINGLETON_KEY passed to ffs_blkrelease_finish(),
we just return rather than trying to finish the nonexistent sequence.
Reported by: Warner Losh (imp@)
Sponsored by: Netflix
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.
This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.
Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.
Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.
Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix
With the removal of BOOTCDROM and fastboot support, this code always
passed "-s" or "--". The latter simply terminates getopt(3) processing
in init so we only need to pass "-s" in the single user case, or nothing
in other cases.
The passing of "--" seems to have been done to ensure that the number of
arguments passed to init was always the same and thus that argc was the
same.
Also GC the write-only variable pathlen (not in reviewed version).
Reviewed by: kib, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18441
cursor == 0.
Every call to blst_meta_alloc but the one at the root is made only when the
meta-node is known to include a free block, so that either the allocation
will succeed, the node hint will be updated, or the last block of the meta-
node range is, and remains, free. But the call at the root is made without
checking that there is a free block, so in the case that every block is
allocated, there is no hint update to prevent the current code from looping
forever.
Submitted by: Doug Moore <dougm@rice.edu>
Reported by: pho
Reviewed by: pho
Tested by: pho
X-MFC with: r340402
Differential Revision: https://reviews.freebsd.org/D17999
When BOOTCDROM is defined (via CFLAGS as there is no config option)
it causes -C to be passed to init, but our init and the version of
sysinstall I glanced at in 6.x don't support -C. The last plausibly
related support was removed from the tree in 1995.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18431
Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to
INP_PCBPORTHASH.
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D17803
The flag is not used by anything for years and supporting it requires an
explicit read from the lock when entering slow path.
Flag value is left unused on purpose.
Sponsored by: The FreeBSD Foundation
The backpressure indication is implemented using an unlimited rate type of
mbuf send tag. When the upper layers typically the socket layer has obtained such
a tag, it can then query the destination driver queue for the current
amount of space available in the send queue.
A single mbuf send tag may be referenced multiple times and a refcount has been added
to the mlx5e_priv structure to track its usage. Because the send tag resides
in the mlx5e_channel structure, there is no need to wait for refcounts to reach
zero until the mlx4en(4) driver is detached. The channels structure is persistant
during the lifetime of the mlx5en(4) driver it belongs to and can so be accessed
without any need of synchronization.
The mlx5e_snd_tag structure was extended to contain a type field, because there are now
two different tag types which end up in the driver which need to be distinguished.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
In order to enable HW LRO, both the "hw_lro" sysctl in the mlx5en(4) config
space must be set, and the ifconfig(8) LRO capability must be set. Any other
settings will disable HW LRO.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add counter for all transmitted and received bytes. Currently only all
transmitted and received packets were counted. Fix description of RX LRO
counters while at it.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
By allocating the worst case size channel structure array
at attach time we can eliminate various NULL checks in the
fast path. And also reduce the chance for use-after-free
issues in the transmit fast path.
This change is also a requirement for implementing
backpressure support.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Writing to the debug stats variable must be locked,
else serialization will be lost which might cause
various kernel panics due to creating and destroying
sysctls out of order.
Make sure the sysctl context is initialized after freeing
the sysctl nodes, else they can be freed twice.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Inspect the ethernet compliance code to figure out actual cable type by reading
the PDDR module info register.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This can happen when connections are short lived and leads to
a firmware error printout in dmesg, syndrome 0x51cfb0, because
the SQ is in the wrong state.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
1) Don't exceed the drivers own hardcoded TX inline limit.
The blueflame register size can be much greater than the hardcoded limit
for inlining. Make sure we don't exceed the drivers own limit, because this
also means that the maximum number of TX fragments becomes invalid and
then memory size assumptions in the TX path no longer hold up.
2) Make sure the mlx5_query_min_inline() function returns an error code.
3) Header inlining is required when using TSO.
4) Catch failure to compute inline header size for TSO.
5) Add support for UDP when computing inline header size.
6) Fix for inlining issues with regards to DSCP.
Make sure we inline 4 bytes beyond the ethernet and/or
VLAN header to workaround a hardware bug extracting
the DSCP field from the IPv4/v6 header.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware queues are deep enough currently and using the DRBR and associated
callbacks only leads to more task switching in the TX path. The is also a race
setting the queue_state which can lead to hung TX rings.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Add support for setting the bandwidth limit as a ratio rather than in bits per
second. The ratio must be an integer number between 1 and 100 inclusivly.
Implement the needed firmware commands and SYSCTLs through mlx5en(4).
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Make sure the active width and speed is set in case the
translate_eth_proto_oper() function doesn't recognize the
current port operation mask.
Linux commit:
7672ed33c4c15dbe9d56880683baaba4227cf940
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
If the mlx5_ib_read_cong_stats() function was running when mlx5ib was unloaded,
because this function unconditionally restarts the timer, the timer can still
be pending after the delayed work has been cancelled. To fix this simply loop
on the delayed work cancel procedure as long as it returns non-zero.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Although "create_srq_user" does overwrite "in.pas" on some paths, it
also contains at least one feasible path which does not overwrite it.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
"fw_rev_min(dev->mdev)" with type "unsigned short" (16 bits, unsigned) is
promoted in "fw_rev_min(dev->mdev) << 16" to type "int" (32 bits, signed), then
sign-extended to type "unsigned long" (64 bits, unsigned). If
"fw_rev_min(dev->mdev) << 16" is greater than 0x7FFFFFFF, the upper bits of the
result will all be 1.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Driver description should be set by core and not by the Ethernet driver.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
A command is either polling or event driven and the mode cannot change
during execution of a command. Make sure the event handler only handle
commands which are not polled. This is done by checking the command mode
in the command handler before completing commands.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This counter will represent transmitted packets which has more than
1518 octets.
The NIC has multiple hardware counters for counting transmitted
packets larger than 1518 octets. Each counter counts the packets
in specific range.
We accumulate those counters to have a single counter.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When the mlx5 health mechanism detects a problem while the driver
is in the middle of init_one or remove_one, the driver needs to prevent
the health mechanism from scheduling future work; if future work
is scheduled, there is a problem with use-after-free: the system WQ
tries to run the work item (which has been freed) at the scheduled
future time.
Prevent this by disabling work item scheduling in the health mechanism
when the driver is in the middle of init_one() or remove_one().
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
All other mlx5_events report the port number as 1 based, which is how FW
reports it in the port event EQE. Reporting 0 for this event causes
mlx5_ib to not raise a fatal event notification to registered clients
due to a seemingly invalid port.
All switch cases in mlx5_ib_event that go through the port check are
supposed to set the port now, so just do it once at variable
declaration.
Linux commit:
aba462134634b502d720e15b23154f21cfa277e5
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The user can provide very large cqe_size which will cause to integer
overflow.
Linux commit:
28e9091e3119933c38933cb8fc48d5618eb784c8
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Copy small packets like TCP ACKs into a new mbuf
reusing the existing mbuf to receive a new ethernet
frame. This avoids wasting buffer space for
small sized packets.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Adding an interface might be done outside the device_attach() routine
and will then cause a panic, due to the VNET not being defined.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The "priv->pkstats.rx_dropped" is written twice in a row.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Also when the MTU is greater than MCLBYTES.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The hardware queues are deep enough currently and using the DRBR and associated
callbacks only leads to more task switching in the TX path. The is also a race
setting the queue_state which can lead to hung TX rings.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
In last mlx4 update (r325841) we lost the sysctl to show the
firmware version for mlx4 devices.
Add both board identifier and firmware version under:
sys.device.mlx4_core0.hw sysctl node.
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Modify QP can fail and it can be acceptable, like when moving from RST to
ERR state, all the rest are not acceptable and a message to the log
should be printed.
The current code prints on all failures and many messages like:
"Failed to modify QP to ERROR state" appear, even when supported by the
state machine of the QP object.
Linux commit:
5dc78ad1904db597bdb4427f3ead437aae86f54c
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When a packet needs fragmentation, it might generate more than 3 fragments.
With the queue length 3, all fragments are generated faster than the
queue is drained, which effectively drops fourth and later fragments on
the floor.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
When changing the MTU of ibX network interfaces, check that the MTU was really
changed before requesting an update of the multicast rules. Else we might go
into an infinite loop joining and leaving ibX multicast groups towards the
opensm master interface.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
It is not enough to set ifnet->if_mtu to change the interface MTU.
System saves the MTU for route in the radix tree, and route cache keeps
the interface MTU as well. Since addition of the multicast group causes
recalculation of MTU, even bringing the interface up changes MTU from
4042 to 1500, which makes the system configuration inconsistent. Worse,
ip_output() prefers route MTU over interface MTU, so large packets are
not fragmented and dropped on floor.
Fix it for ipoib(4) using the same approach (or hack) as was applied
for it_tun/if_tap in r339012. Thanks to bz@ for giving the hint.
Submitted by: kib@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Binding to a loopback device is not allowed. Make sure the destination
device address is global by clearing the bound device interface.
Only do this conditionally, else link local addresses won't work.
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
A couple of places in the CM do
spin_lock_irq(&cm_id_priv->lock);
...
if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg))
However when the underlying transport is RoCE, this leads to a sleeping function
being called with the lock held - the callchain is
cm_alloc_response_msg() ->
ib_create_ah_from_wc() ->
ib_init_ah_from_wc() ->
rdma_addr_find_l2_eth_by_grh() ->
rdma_resolve_ip()
and rdma_resolve_ip() starts out by doing
req = kzalloc(sizeof *req, GFP_KERNEL);
not to mention rdma_addr_find_l2_eth_by_grh() doing
wait_for_completion(&ctx.comp);
to wait for the task that rdma_resolve_ip() queues up.
Fix this by moving the AH creation out of the lock.
Linux commit:
c76161181193985087cd716fdf69b5cb6cf9ee85
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
Trying to validate loopback fails because rtalloc1() resolves system
local addresses to the loopback network interface, lo0. Fix this by
explicitly checking for loopback during validation of the source
and destination network address. If the source address belongs to
a local network interface and is equal to the destination address,
there is no need to run the destination address through rtalloc1().
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The master network interface and the VLANs may reside in different VNETs.
Make sure that all VNETs are searched when scanning for GID entries.
Submitted by: netapp
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
This prevents code from accepting RoCEv1 connections when
only ROCEv2 is enabled and vice versa.
Linux commit:
0c4386ec77cfcd0ccbdbe8c2e67dd3a49b2a4c7f
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies
The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
(80) elements. Hence compare the array index with that value instead
of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
reports the following:
Overrunning array class->method_table of 80 8-byte elements at element index 127
(byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class)
(which evaluates to 127).
Linux commit:
2fe2f378dd45847d2643638c07a7658822087836
Submitted by: hselasky@
Approved by: hselasky (mentor)
MFC after: 1 week
Sponsored by: Mellanox Technologies