Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
This simplifies the runtime logic and reduces the number of
runtime-constant branches.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16736
Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254. MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI. Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.
Replace statically assigned ranges with dynamic ranges. Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required. Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed. To minimize runtime complexity these arrays are only sized
once during bootup. The PIC range is determined by the PICs present
in the system. The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.
As a result, various places are updated to use dynamic limits instead
of constants. In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump. This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.
This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry). Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.
If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.
While here, reserve room in intrcnts for the Hyper-V counters.
PR: 229429, 130483
Reviewed by: kib, royger, cem
Tested by: royger (Xen), kib (DMAR)
Approved by: re (gjb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16861
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
Such items may be allocated in the I/O path used by the dumper,
potentially causing the dump to fail. Since there is some precedent
in the DMAR driver for avoiding this problem using _NODUMP, apply
this workaround to the zone as well.
Reported and tested by: mmacy
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14422
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag. Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13545
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Also keep the calculated vm_page_alloc_contig() flags in the variable
to not re-evaluate it on the loop iteration.
Noted by: alc
Sponsored by: The FreeBSD Foundation
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
remapping.
VT-d specification requires use of PCI rid as source id for IOAPICs
enumerated by PCI bus. The values from the DMAR ACPI table should be
only used when IOAPIC is not on PCI.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Hardware provided by: Intel
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12205
--Remove special-case handling of sparc64 bus_dmamap* functions.
Replace with a more generic mechanism that allows MD busdma
implementations to generate inline mapping functions by
defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This
is currently useful for sparc64, x86, and arm64, which all
implement non-load dmamap operations as simple wrappers
around map objects which may be bus- or device-specific.
--Remove NULL-checked bus_dmamap macros. Implement the
equivalent NULL checks in the inlined x86 implementation.
For non-x86 platforms, these checks are a minor pessimization
as those platforms do not currently allow NULL maps. NULL
maps were originally allowed on arm64, which appears to have
been the motivation behind adding arm[64]-specific barriers
to bus_dma.h, but that support was removed in r299463.
--Simplify the internal interface used by the bus_dmamap_load*
variants and move it to bus_dma_internal.h
--Fix some drivers that directly include sys/bus_dma.h
despite the recommendations of bus_dma(9)
Reviewed by: kib (previous revision), marius
Differential Revision: https://reviews.freebsd.org/D10729
Do not queue dmar_map_entries with zeroed gseq to
dmar_qi_invalidate_locked(). Zero gseq stops the processing in the qi
task. Do not assign possibly uninitialized on-stack gseq to map
entries when requeuing them on unit tlb_flush queue. Random garbage
in gsec is interpreted as too high invalidation sequence number and
again stop the processing in the task.
Make the sequence numbers generation completely contained in
dmar_qi_invalidate_locked() and dmar_qi_emit_wait_seq(). Upper code
directly passes boolean requesting emiting wait command instead of
trying to provide hint to avoid it by passing NULL gseq pointer.
Microoptimize the requeueing to tlb_flush queue by doing it for the
whole queue.
Diagnosed and tested by: Brett Gutstein <bgutstein@rice.edu>
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Implement timeouts for register-based DMAR commands. Tunable/sysctl
hw.dmar.timeout specifies the timeout in nanoseconds, set it to zero
to allow infinite wait. Default is 1ms.
Runtime modification of the sysctl is not safe, it is allowed for
debugging.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Kernel environment variable hw.busdma.default can take values 'bounce'
and 'dmar' and selects corresponding busdma backend as default.
Per-device environment variable hw.busdma.pci<domain>.<bus>.<slot>.<func>
takes the same values and overrides hw.busdma.default for the given device.
Note that even with hw.busdma.default=bounce, DMA translation engines
are still started if DMARs are enabled, to disable them use
hw.dmar.dma tunable, as before.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
As noted in the comment, nothing special needs to be done to destroy
the unneeded context after the allocation race, but the context memory
itself still should to be freed.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It does not make sense since identity mapping already provides the
required mapping for RMRR ranges. More, since identity page tables do
not reflect content of map entries for id domains, creating RMRR
entries makes domain data inconsistent.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Add constants for the fields in the root-entry table address register,
namely the root type type (RTT) and root table address (RTA) mask.
- Add macros for the bitmask of the domain ID field in the second word
of context table entries as well as a helper macro (DMAR_CTX2_GET_DID)
to extract the domain ID from a context table entry.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Chelsio Communications
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
which queued invalidation completion interrupt is requested with
regard to the queued invalidation requests. In other words, setting
the value of the knob to N requests completion interrupt after N items
are processed. Existing behaviour is restored by setting
hw.dmar.batch_coalesce=1.
The knob significantly decreases the DMAR qi interrupt rate at the
cost of slightly longer DMAR map entries recycling.
Sponsored by: The FreeBSD Foundation
taskqueue_enqueue() was changed to support both fast and non-fast
taskqueues 10 years ago in r154167. It has been a compat shim ever
since. It's time for the compat shim to go.
Submitted by: Howard Su <howard0su@gmail.com>
Reviewed by: sephe
Differential Revision: https://reviews.freebsd.org/D5131
acpi_GetInteger() execution. Intel DMAR interrupt remapping code
needs to know UID of the HPET to properly route the FSB interrupts
from the HPET, even when interrupt remapping is disabled, and the code
is executed under some non-sleepable mutexes.
Cache HPET UIDs in the device softc at the attach time and provide
lock-less method to get UID, use the method from the dmar hpet
handling code instead of calling GetInteger().
Reported and tested by: Larry Rosenman <ler@lerctr.org>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and related data structures. Contexts attach requests initiators to
domains. There is still 1:1 correspondence between contexts and
domains on the running system, since only busdma currently allocates
them, using dmar_get_ctx_for_dev().
Large part of the change is formal rename of the ctx to domain, but
patch also reworks the context allocation and free to allow for
independent domain creation.
The helper dmar_move_ctx_to_domain() is introduced for future use, to
reassign request initiator from one domain to another. The hard issue
which is not yet resolved with the context move is proper handling (or
reserving) RMRR entries in the destination domain as required by ACPI
DMAR table for moved context.
Tested by: pho
Sponsored by: The FreeBSD Foundation
buildkernel run.
Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros. It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.
Review: https://reviews.freebsd.org/D2665
Reviewed by: rodrigc
Looked at by: bjk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
queue is started, not relying on the interrupt remaping method to
happen. Also disable interrupts when shooting down the queue.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
dmar_map_entry. Non-zero offset both increases the required mapping
size, which is handled in dmar_bus_dmamap_load_something1(), and makes
it possible that allocated range crosses boundary, which needs a check
in dmar_gas_match_one().
Reported and tested by: jimharris
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
requested size. If tag restrictions caused split entry, its size is
less then requsted.
Hardware provided by: Michael Fuckner <michael@fuckner.net>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
translation. In particular, despite IO-APICs only take 8bit apic id,
IR translation structures accept 32bit APIC Id, which allows x2APIC
mode to function properly. Extend msi_cpu of struct msi_intrsrc and
io_cpu of ioapic_intsrc to full int from one byte.
KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid
bringing all dmar headers into interrupt code. The non-PCI(e) devices
which generate message interrupts on FSB require special handling. The
HPET FSB interrupts are remapped, while DMAR interrupts are not.
For each msi and ioapic interrupt source, the iommu cookie is added,
which is in fact index of the IRE (interrupt remap entry) in the IR
table. Cookie is made at the source allocation time, and then used at
the map time to fill both IRE and device registers. The MSI
address/data registers and IO-APIC redirection registers are
programmed with the special values which are recognized by IR and used
to restore the IRE index, to find proper delivery mode and target.
Map all MSI interrupts in the block when msi_map() is called.
Since an interrupt source setup and dismantle code are done in the
non-sleepable context, flushing interrupt entries cache in the IR
hardware, which is done async and ideally waits for the interrupt,
requires busy-wait for queue to drain. The dmar_qi_wait_for_seq() is
modified to take a boolean argument requesting busy-wait for the
written sequence number instead of waiting for interrupt.
Some interrupts are configured before IR is initialized, e.g. ACPI
SCI. Add intr_reprogram() function to reprogram all already
configured interrupts, and call it immediately before an IR unit is
enabled. There is still a small window after the IO-APIC redirection
entry is reprogrammed with cookie but before the unit is enabled, but
to fix this properly, IR must be started much earlier.
Add workarounds for 5500 and X58 northbridges, some revisions of which
have severe flaws in handling IR. Use the same identification methods
as employed by Linux.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Discussed with: jhb
Tested by: glebius, pho (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
queue. They are for first-level translations and device TLB.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
allocator tries to move the entry up, after the boundary. The new
location may still fail to satisfy boundary requirement, for instance,
if the boundary is set to page size, and allocation is of multiple
pages.
Recheck that boundary is not crossed after the move. If it is
crossed, give up on allocating the whole entry and split it.
Reported by: Michael Fuckner <michael@fuckner.net>, running nvme(4)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
next entry does not intersect with the tail of the new entry, but also
that previous entry is also before new entry start.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
follow specification and do not provide PCIe capability.
Verify if the port above such bridge is downstream PCIe (or root port)
and treat the bridge as PCIe/PCI then. This allows to avoid
maintaining the table of device ids for bridges without capability,
while still calculate correct request originator for devices behind
the bridge.
Submitted by: Jason Harmening <jason.harmening@gmail.com>
MFC after: 1 week
VT-d specification. Also add definitions for the interrupt remapping
table and IEC.
Print new capabilities on boot. although there is no hardware which
support it.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
cache for whole page containing modified pte, and more, only last page
in the series of the consequtive pages is flushed (i.e. the affected
mappings should be larger than 2MB).
Avoid excessive flushing and do missed neccessary flushing, by
splitting invalidation and unmapping. For now, flush exactly the
range of the changed pte. This is still somewhat bigger than
neccessary, since pte is 8 bytes, while cache flush line is at least
32 bytes.
The originator of the issue reports that after the change,
'dmar_bus_dmamap_unload went from 13,288 cycles down to
3,257. dmar_bus_dmamap_load_buffer went from 9,686 cycles down to
3,517. and I am now able to get line 1GbE speed with Netperf TCP
(even with 1K message size).'
Diagnosed and tested by: Nadav Amit <nadav.amit@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In my case on the test machine, I have hierarchy of
pcib2 (PCIe port on host bridge with PCIe capability) -> pci2 ->
pcib3 (ITE PCIe/PCI bridge) -> pci3 -> em1
The device to check PCIe capability is pcib2 and not pcib3, as it is
currently done in the code. Also, in case of the bridge, we shall
step to pcib2 for the loop iteration, since pcib3 does not carry PCIe
capability info and would force wrong recalculation of rid.
Also change the returned requester to the PCIe bus which provides port
for the bridge. This only results in changing
hw.busdma.pciX.X.X.X.bounce tunable to force identity-mapped context
for the device.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
after dmar driver was converted to use rids. The bus component to
calculate context page must be taken from the requestor rid, which is
a bridge, and not from the device bus number.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe