Currently NMIs are sent over event channels, but that defeats the
purpose of NMIs since event channels can be masked. Fix this by
issuing NMIs using a hypercall, which injects a NMI (vector #2) to the
desired vCPU.
Note that NMIs could also be triggered using the emulated local APIC,
but using a hypercall is better from a performance point of view
since it doesn't involve instruction decoding when not using x2APIC
mode.
Reported and Tested by: avg
Sponsored by: Citrix Systems R&D
The main differences with the currently implemented method are:
- Requires a local APIC EOI, since it doesn't bypass the local APIC
as the previous method used to do.
- Can be set to use different IDT vectors on each vCPU. Note that
FreeBSD doesn't make use of this feature since the event channel
IDT vector is reserved system wide.
Note that the old method of setting the event channel upcall is
not removed, and will be used as a fallback if this newly introduced
method is not available.
MFC after: 1 month
Sponsored by: Citrix Systems R&D
Just allow MSI interrupts to always start at the end of the I/O APIC
pins. Since existing machines already have more than 255 I/O APIC
pins, IRQ 255 is no longer reliably invalid, so just remove the
minimum starting value for MSI.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D17991
The number of MSI IRQs still defaults to 512, but it can now be
changed at boot time via the machdep.num_msi_irqs tunable.
Reviewed by: kib, royger (older version)
Reviewed by: markj
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D17977
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.
Convert places which need them. Xen bits by royger.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17487
Register interrupts using the PIC pic_register_sources method instead
of doing it in apic_setup_io. This is now required, since the internal
interrupt structures are not yet setup when calling apic_setup_io.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
The recommended way to obtain the vcpu id is using the cpuid
instruction with a specific leaf value. This leaf value must be
obtained at runtime, and it's done when populating the hypercall page.
Legacy PVH however will get the hypercall page populated by the
hypervisor itself before booting, so the cpuid leaf was not actually
set, thus preventing setting the vcpu id value from cpuid.
Fix this by making sure the cpuid leaf has been probed before
attempting to set the vcpu id.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
That's the only mode in FreeBSD that requires the usage of PIRQs, so
there's no need to attach the PIRQ PIC when running in other modes.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
When adding support for the new PVH mode the kenv handling was
switched to use a boot time allocated scratch space, however the
legacy PVH early boot code was not modified to allocate such space.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
The vcpu_id for legacy PVH mode can be set from the output of cpuid,
so there's no need to have a special function to set it.
Also note that xenpv_set_ids should have been executed only for PV
guests, but was executed for all guests types and vcpu_id was later
fixed up for HVM guests.
Reported by: cperciva
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
So that it's done when the vcpu_id has been set. For the BSP the
vcpu_id is set at SUB_INTR, while for the APs it's done in
init_secondary_tail that's called at SUB_SMP order FIRST.
Reported and tested by: cperciva
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D17013
Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254. MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI. Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.
Replace statically assigned ranges with dynamic ranges. Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required. Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed. To minimize runtime complexity these arrays are only sized
once during bootup. The PIC range is determined by the PICs present
in the system. The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.
As a result, various places are updated to use dynamic limits instead
of constants. In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump. This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.
This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry). Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.
If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.
While here, reserve room in intrcnts for the Hyper-V counters.
PR: 229429, 130483
Reviewed by: kib, royger, cem
Tested by: royger (Xen), kib (DMAR)
Approved by: re (gjb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16861
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)
Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825
In order to setup an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.
This allows booting FreeBSD as a PVHv2 DomU and Dom0.
Sponsored by: Citrix Systems R&D
Allow the hypercall page to be initialized very early, even before
vtophys is functional. Also make the function global so it can be
called by other files.
This will be needed in order to perform the early bringup on PVHv2
guests.
Sponsored by: Citrix Systems R&D
HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM
and PVHv2 guests get this data from HVM parameters that are fetched
using a hypercall.
Instead provide a set of helper functions that should be used to fetch
this data. The helper functions have different implementations
depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest
type.
This helps to cleanup generic Xen code by removing quite a lot of
xen_pv_domain and xen_hvm_domain macro usages.
Sponsored by: Citrix Systems R&D
boot_parse_arg to parse a single arg
boot_parse_cmdline to parse a command line string
boot_parse_args to parse all the args in a vector
boot_howto_to_env Convert howto bits to env vars
boot_env_to_howto Return howto mask mased on what's set in the environment.
All these routines return an int that's the bitmask of the args
translated to RB_* flags. As a special case, the 'S' flag sets the
comconsole_speed env var. Any arg that looks like a=b will set the env
key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'. This
should help us reduce the number of redundant copies of these routines
in the tree. It should also give a more uniform experience between
platforms.
Also, invent a new flag RB_PROBE that's set when 'P' is parsed. On
x86 + BIOS, this means 'probe for the keyboard, and if it's not there
set both RB_MULTIPLE and RB_SERIAL (which means show the output on
both video and serial consoles, but make serial primary). Others it
may be some similar concept of probing, but it's loader dependent
what, exactly, it means.
These routines are suitable for /boot/loader and/or the kernel,
though they may not be suitable for the tightly hand-rolled-for-space
environments like boot2.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16205
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.
In preparation to fix this create two macros depending on if the data is
global or static.
Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140
The interface already guarantees that the number of hypercall pages is
always going to be 1, see the comment in interface/arch-x86/cpuid.h
Sponsored by: Citrix Systems R&D
Install appropriate pti-aware shootdown IPI handlers, otherwise user
page tables do not get enough invalidations. The non-pti handlers
were used so far.
Reported and tested by: cperciva
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
platforms. Original commit message as follows:
Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14838
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.
Reviewed by: jhb, kib
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14838
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Populate the lapics arrays and call cpu_add/lapic_create in the setup
phase instead. Also store the max APIC ID found in the newly
introduced max_apic_id global variable.
This is a requirement in order to make the static arrays currently
using MAX_LAPIC_ID dynamic.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11911
The change introduced a dependency between genassym.c and header files
generated from .m files, but that dependency is not specified in the
make files.
Also, the change could be not as useful as I thought it was.
Reported by: dchagin, Manfred Antar <null@pozo.com>, and many others
The change is more intrusive than I would like because the feature
requires that a vector number is written to a special register.
Thus, now the vector number has to be provided to lapic_eoi().
It was readily available in the IO-APIC and MSI cases, but the IPI
handlers required more work.
Also, we now store the VMM IPI number in a global variable, so that it
is available to the justreturn handler for the same reason.
Reviewed by: kib
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9880
Current Xen IPI setup functions require that the caller provide a device in
order to obtain the name of the interrupt from it. With early AP startup this
device is no longer available at the point where IPIs are bound, and a KASSERT
would trigger:
panic: NULL pcpu device_t
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82233a20
vpanic() at vpanic+0x186/frame 0xffffffff82233aa0
kassert_panic() at kassert_panic+0x126/frame 0xffffffff82233b10
xen_setup_cpus() at xen_setup_cpus+0x5b/frame 0xffffffff82233b50
mi_startup() at mi_startup+0x118/frame 0xffffffff82233b70
btext() at btext+0x2c
Fix this by no longer requiring the presence of a device in order to bind IPIs,
and simply use the "cpuX" format where X is the CPU identifier in order to
describe the interrupt.
Reported by: sbruno, cperciva
Tested by: sbruno
X-MFC-With: r310177
Sponsored by: Citrix Systems R&D
Add a reference count to xenisrc. This is required for implementation of
unmap-notifications in the grant table userspace device (gntdev). We need to
hold a reference to the event channel port, in case the user deallocates the
port before we send the notification.
Submitted by: jaggi
Reviewed by: royger
Differential review: https://reviews.freebsd.org/D7429
If BIOS performed hand-off to OS with BSP LAPIC in the x2APIC mode,
system usually consumes such configuration without a notice, since
x2APIC is turned on by OS if possible (nop). But if BIOS
simultaneously requested OS to not use x2APIC, code assumption that
that xAPIC is active breaks.
In my opinion, we cannot safely turn off x2APIC if control is passed
in this mode. Make madt.c ignore user or BIOS requests to turn x2APIC
off, and do not check the x2APIC black list. Just trust the config
and try to continue, giving a warning in dmesg.
Reported and tested by: Slawa Olhovchenkov <slw@zxy.spb.ru> (previous version)
Diagnosed by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.
PR: 212014
Reported by: Glyn Grinstead <glyn@grinstead.org>
MFC after: 3 days
Event channel handlers cannot be removed during resume because there might
be an interrupt thread running on a CPU currently blocked in the
cpususpend_handler, which prevents the call to intr_remove_handler from
finishing and completely freezes the system during resume. r291022 tried to
fix this by allowing recursion in intr_remove_handler, but that's clearly
not enough.
Instead don't remove the handlers at the interrupt resume phase, and let
each driver remove the handler by itself during resume. In order to do this,
change the opaque event channel handler cookie to use the global interrupt
vector instead of the event channel port. The event channel port cannot be
used because after resume all event channels are reset, and the port numbers
can change.
Sponsored by: Citrix Systems R&D
MFC after: 5 days
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.
It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D7148
Summary:
The idea behind this is '~0ul' is well-defined, and casting to uintmax_t, on a
32-bit platform, will leave the upper 32 bits as 0. The maximum range of a
resource is 0xFFF.... (all bits of the full type set). By dropping the 'ul'
suffix, C type promotion rules apply, and the sign extension of ~0 on 32 bit
platforms gets it to a type-independent 'unsigned max'.
Reviewed By: cem
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5255
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.
Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly. Most startup code wasn't
doing so. Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.
Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.
The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values. Now if the size is zero, the provided buffer
full of existing env data is installed. A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.
Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp. A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data. Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
Current Xen resume code clears all pending bitmap IPIs on resume, which is
not correct. Instead re-inject bitmap IPI vectors on resume to all CPUs in
order to acknowledge any pending bitmap IPIs.
Sponsored by: Citrix Systems R&D
MFC after: 2 weeks