We can switch into long mode directly with LA57 enabled.
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273
so we don't ifdef for every arch in busdma_iommu.c;
o No need to include specialreg.h for x86, remove it.
Requested by: andrew
Reviewed by: kib
Sponsored by: DARPA/AFRL
Differential Revision: https://reviews.freebsd.org/D25957
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
so x86 can support Intel DMAR and AMD IOMMU simultaneously.
Reviewed by: kib
Sponsored by: DARPA/AFRL
Differential Revision: https://reviews.freebsd.org/D25894
APEI allows platform to report different kinds of errors to OS in several
ways. We've found that Supermicro X10/X11 motherboards report PCIe errors
appearing on hot-unplug via this interface using NMI. Without respective
driver it ended up in kernel panic without any additional information.
This driver introduces support for the APEI Generic Hardware Error Source
reporting via NMI, SCI or polling. It decodes the reported errors and
either pass them to pci(4) for processing or just logs otherwise. Errors
marked as fatal still end up in kernel panic, but some more informative.
When somebody get to native PCIe AER support implementation both of the
reporting mechanisms should get common error recovery code. Since in our
case errors happen when the device is already gone, there is nothing to
recover, so the code just clears the error statuses, practically ignoring
the otherwise destructive NMIs in nicer way.
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
For purposes of handling hardware error reported via NMIs I need a way to
escape NMI context, being too restrictive to do something significant.
To do it this change introduces new swi_sched() flag SWI_FROMNMI, making
it careful about used KPIs. On platforms allowing IPI sending from NMI
context (x86 for now) it immediately wakes clk_intr_event via new IPI_SWI,
otherwise it works just like SWI_DELAY. To handle the delayed SWIs this
patch calls clk_intr_event on every hardclock() tick.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25754
from Intel DMAR support, so it can be used on other IOMMU systems.
Reviewed by: kib
Sponsored by: DARPA/AFRL
Differential Revision: https://reviews.freebsd.org/D25743
It allows safe IPI sending to current CPU from NMI context.
Unlike other ipi_*() functions this waits for delivery to leave LAPIC in
a state safe for interrupted code.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Sending IPI to self or all CPUs does not require write into upper part of
the ICR, prone to races. Previously the code disabled interrupts, but it
was not enough for NMIs. Instead of that when possible write only lower
part of the register, or use special SELF IPI register in x2APIC mode.
This also removes ICR reads used to preserve reserved bits on write.
It was there from the beginning, but I failed to find explanation why,
neither I see Linux doing it. Specification even tells that ICR content
may be lost in deep C-states, so if hardware does not bother to preserve
it, why should we?
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
pmc_cpuid was uninitialized for most AMD processor families. We can still
populate this string for unimplemented families.
Also added a CPUID_TO_STEPPING macro and converted existing code to use it.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25673
Stop using smp_ipi_mtx to protect global shootdown state, and
move/multiply the global state into pcpu. Now each CPU can initiate
shootdown IPI independently from other CPUs. Initiator enters
critical section, then fills its local PCPU shootdown info
(pc_smp_tlb_XXX), then clears scoreboard generation at location (cpu,
my_cpuid) for each target cpu. After that IPI is sent to all targets
which scan for zeroed scoreboard generation words. Upon finding such
word the shootdown data is read from corresponding cpu' pcpu, and
generation is set. Meantime initiator loops waiting for all zeroed
generations in scoreboard to update.
Initiator does not disable interrupts, which should allow
non-invalidation IPIs from deadlocking, it only needs to disable
preemption to pin itself to the instance of the pcpu smp_tlb data.
The generation is set before the actual invalidation is performed in
handler. It is safe because target CPU cannot return to userspace
before handler finishes. In principle only NMI can preempt the
handler, but NMI would see the kernel handler frame and not touch
not-invalidated user page table.
Handlers loop until they do not see zeroed scoreboard generations.
This, together with hardware keeping one pending IPI in LAPIC IRR
should prevent lost shootdowns.
Notes.
1. The code does protect writes to LAPIC ICR with exclusion. I believe
this is fine because we in fact do not send IPIs from interrupt
handlers. More for !x2APIC mode where ICR access for write requires
two registers write, we disable interrupts around it. If considered
incorrect, I can add per-cpu spinlock around ipi_send().
2. Scoreboard lines owned by given target CPU can be padded to the
cache line, to reduce ping-pong.
Reviewed by: markj (previous version)
Discussed with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D25510
so it can be used on other IOMMU systems.
Provide MI iommu_unit, iommu_domain and iommu_ctx structs in sys/iommu.h;
use them as a first member of MD dmar_unit, dmar_domain and dmar_ctx.
Change the namespace in DMAR backend: use iommu_ prefix instead of dmar_.
Move some macroses and function prototypes to sys/iommu.h.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D25574
when it has passed the synchronization test.
"Processor Programming Reference (PPR) for AMD Family 17h" states that
the TSC uses a common reference for all sockets, cores and threads.
MFC after: 1 month
Given that 64c/128t CPUs are currently available, and that many
devices (nvme, many NICs) desire to map 1 MSI-X vector per core,
or even 1 per-thread, it is becoming far easier to see MSI-X interrupt
setup fail due to msi vector exhaustion, and devices fail to attach at
boot on large system.
This bump costs 12KB on amd64 (and 6KB on i386), which seems
worth the trade off for a better out of the box experience on
high end hardware.
Reviewed by: jhb
MFC after: 21 days
Sponsored by: Netflix
In particular, invalidation of the preloaded modules text to allow
execution from it was broken after D25188/r362031.
Reviewed by: markj
Reported by: delphij, dhw
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
Right now code first flushes all local TLB entries that needs to be
flushed, then signals IPI to remote cores, and then waits for
acknowledgements while spinning idle. In the VMWare article 'Don’t
shoot down TLB shootdowns!' it was noted that the time spent spinning
is lost, and can be more usefully used doing local TLB invalidation.
We could use the same invalidation handler for local TLB as for
remote, but typically for pmap == curpmap we can use INVLPG for locals
instead of INVPCID on remotes, since we cannot control context
switches on them. Due to that, keep the local code and provide the
callbacks to be called from smp_targeted_tlb_shootdown() after IPIs
are fired but before spin wait starts.
Reviewed by: alc, cem, markj, Anton Rang <rang at acm.org>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D25188
When I did some bus_dma cleanup in r320528, I brought forward some sketchy
WITNESS checks from the prior x86 busdma wrappers, instead of recognizing
them as technical debt and just dropping them. Two of these were removed in
r346351 and r346851, but one remains in bounce_bus_dmamem_alloc(). This check
could be constrained to only apply in the BUS_DMA_NOWAIT case, but it's cleaner
to simply remove it and rely on the checks already present in the sleepable
allocation paths used by this function.
While here, remove another unnecessary witness check in bus_dma_tag_create
(the tag is always allocated with M_NOWAIT), and fix a couple of typos.
Reported by: cem
Reviewed by: kib, cem
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25107
On amd64 we already avoid using memory below 4GB in order to prevent
clashes with MMIO regions, but i386 was allowed to use any hole in
the physical memory map in order to map Xen pages.
Limit this to memory above the 1MB boundary on i386 in order to avoid
clashes with the MMIO holes in that area.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
The flush is needed to prevent cross-process ret2spec, which is not handled
on kernel entry if IBPB is enabled but SMEP is present.
While there, add i386 RSB flush.
Reported by: Anthony Steinhauser <asteinhauser@google.com>
Reviewed by: markj, Anthony Steinhauser
Discussed with: philip
admbugs: 961
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This function is responsible for setting pc_domain in each pcpu
structure. Call it from the main function that starts APs, rather than
a separate SYSINIT. This makes it easier to close the window where
UMA's per-CPU slab allocator may be called while pc_domain is
uninitialized. In particular, the allocator uses pc_domain to allocate
domain-local pages, so allocations before this point end up using domain
0 for everything.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24757
Release kernels have no KDB backends enabled, so they discard an NMI
if it is not due to a hardware failure. This includes NMIs from
IPMI BMCs and hypervisors.
Furthermore, the interaction of panic_on_nmi, kdb_on_nmi, and
debugger_on_panic is confusing.
Respond to all NMIs according to panic_on_nmi and debugger_on_panic.
Remove kdb_on_nmi. Expand the meaning of panic_on_nmi by making
it a bitfield. There are currently two bits: one for NMIs due to
hardware failure, and one for all others. Leave room for more.
If panic_on_nmi and debugger_on_panic are both true, don't actually panic,
but directly enter the debugger, to allow someone to leave the debugger
and [hopefully] resume normal execution.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes: machdep.kdb_on_nmi is gone; machdep.panic_on_nmi changed
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24558
Stop attempting to use FADT legacy hardware flag, it is almost always
a lie.
Instead, unless user explicitly disabled the calibration, calibrate
against 8254 ISA clock. Then, obtain the rough value of the expected
TSC frequency from CPUID leafs 0x15/0x16 or even from the CPU
marketing name string. If calibration results look unbelievably bogus
comparing with CPUID leafs report, use CPUID one.
Intel does not recommend to use CPUID leaf 0x16 for the value of the
system time frequency, indeed the error there might be up to 1% which
e.g. makes ntpd give up. If ISA clock is present, we win, if not, we
get some frequency that allows the machine to boot without enormous
delay.
Next improvement would be to use HPET for re-calibration if we decided
that ISA clock gives bogus results, after HPETs are enumerated. This
is a much bigger change since we probably would need to re-evaluate
some constants depending on TSC frequency.
Reviewed by: emaste, jhb, scottl
Tested by: scottl
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D24426
If EOI suppression is supported but reported ioapic version is so old
that it does not has EOI register (weird virtualization setup), fix
Intel trick of eoi-ing by flipping pin type (edge/level) to account
for the disabled pin.
Reported by: Juniper
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D23965
It does not serve any purpose now, the io apic id is not seen by
software, and some Intel documents claim that the register is
implemented for FUD reasons. More, renumbering seems to not work on
new Intel machines which actually have mismatched MADT and hw IDs.
On older machines where separate APIC bus existed, unique numbering of
all APICs was required for bus arbitration to work, but it is no
longer true (that machines were SMP from pre-Pentium IV era).
When matching PCIe IOAPIC device against MADT-enumerated IOAPICs,
compare io_apic_id from BAR against io_apic_id read from the
MADT-pointed register page.
Reviewed by: jhb
Tested by: flo (previous version), pho
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D23965