/*-
 * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
 *
 * Copyright (c) Peter Wemm <peter@netplex.com.au>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $FreeBSD$
 */

#ifndef _MACHINE_PCPU_H_
#define	_MACHINE_PCPU_H_

#ifndef _SYS_CDEFS_H_
#error "sys/cdefs.h is a prerequisite for this file"
#endif

#include <machine/segments.h>
#include <machine/tss.h>

#define PC_PTI_STACK_SZ 16

struct monitorbuf {
	int idle_state;		/* Used by cpu_idle_mwait. */
	int stop_state;		/* Used by cpustop_handler. */
	char padding[128 - (2 * sizeof(int))];
};
_Static_assert(sizeof(struct monitorbuf) == 128, "2x cache line");
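
/*
 * Descriptive note: cpu_idle_mwait arms MONITOR on this buffer, and
 * MONITOR watches a cache-line-sized region, so the structure is padded
 * (and, in PCPU_MD_FIELDS below, aligned) to two cache lines to keep the
 * monitored words away from unrelated per-CPU traffic that could cause
 * spurious wakeups.
 */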

/*
 * The SMP parts are set up in pmap.c and locore.s for the BSP, and
 * mp_machdep.c sets up the data for the APs to "see" when they awake.
 * The reason for doing it via a struct is so that an array of pointers
 * to each CPU's data can be set up for things like "check curproc on all
 * other processors".
 */
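
/*
 * Illustrative sketch only (not part of this header): the "array of
 * pointers" is the MI cpuid_to_pcpu[] table declared in <sys/pcpu.h>,
 * so walking every CPU's private data can look like this (CPU_FOREACH()
 * comes from <sys/smp.h>):
 *
 *	struct pcpu *pc;
 *	int i;
 *
 *	CPU_FOREACH(i) {
 *		pc = cpuid_to_pcpu[i];
 *		if (pc != NULL && pc->pc_curthread != NULL)
 *			printf("cpu%d: apic id %u\n", i, pc->pc_apic_id);
 *	}
 */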

#define	PCPU_MD_FIELDS \
	struct monitorbuf pc_monitorbuf __aligned(128); /* cache line */\
	struct pcpu *pc_prvspace;	/* Self-reference */ \
	struct pmap *pc_curpmap; \
	struct amd64tss *pc_tssp;	/* TSS segment active on CPU */ \
	void *pc_pad0; \
	uint64_t pc_kcr3; \
	uint64_t pc_ucr3; \
	uint64_t pc_saved_ucr3; \
	register_t pc_rsp0; \
	register_t pc_scratch_rsp;	/* User %rsp in syscall */ \
	register_t pc_scratch_rax; \
	u_int pc_apic_id; \
	u_int pc_acpi_id;	/* ACPI CPU id */ \
	/* Pointer to the CPU %fs descriptor */ \
	struct user_segment_descriptor *pc_fs32p; \
	/* Pointer to the CPU %gs descriptor */ \
	struct user_segment_descriptor *pc_gs32p; \
	/* Pointer to the CPU LDT descriptor */ \
	struct system_segment_descriptor *pc_ldt; \
	/* Pointer to the CPU TSS descriptor */ \
	struct system_segment_descriptor *pc_tss; \
	u_int pc_cmci_mask;	/* MCx banks for CMCI */ \
	uint64_t pc_dbreg[16];	/* ddb debugging regs */ \
	uint64_t pc_pti_stack[PC_PTI_STACK_SZ]; \
	register_t pc_pti_rsp0; \
	int pc_dbreg_cmd;	/* ddb debugging reg cmd */ \
	u_int pc_vcpu_id;	/* Xen vCPU ID */ \
	uint32_t pc_pcid_next; \
	uint32_t pc_pcid_gen; \
	uint32_t pc_unused; \
	uint32_t pc_ibpb_set; \
	void *pc_mds_buf; \
	void *pc_mds_buf64; \
	uint32_t pc_pad[4]; \
	uint8_t pc_mds_tmp[64]; \
	u_int pc_ipi_bitmap; \
	struct amd64tss pc_common_tss; \
	struct user_segment_descriptor pc_gdt[NGDT]; \
	void *pc_smp_tlb_pmap; \
	uint64_t pc_smp_tlb_addr1; \
	uint64_t pc_smp_tlb_addr2; \
	uint32_t pc_smp_tlb_gen; \
	u_int pc_smp_tlb_op; \
	uint64_t pc_ucr3_load_mask; \
	char __pad[2916]	/* pad to UMA_PCPU_ALLOC_SIZE */
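
/*
 * Illustrative note (an assumption about the MI side, not defined here):
 * <sys/pcpu.h> appends this list to the machine-independent members of
 * struct pcpu, roughly as
 *
 *	struct pcpu {
 *		struct thread	*pc_curthread;
 *		...
 *		PCPU_MD_FIELDS;
 *	} __aligned(CACHE_LINE_SIZE);
 *
 * so each field above becomes a per-CPU member reachable through %gs.
 */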

#define	PC_DBREG_CMD_NONE	0
#define	PC_DBREG_CMD_LOAD	1

#ifdef _KERNEL

#define	MONITOR_STOPSTATE_RUNNING	0
#define	MONITOR_STOPSTATE_STOPPED	1

#if defined(__GNUCLIKE_ASM) && defined(__GNUCLIKE___TYPEOF)

/*
 * Evaluates to the byte offset of the per-cpu variable name.
 */
#define	__pcpu_offset(name) \
	__offsetof(struct pcpu, name)

/*
 * Evaluates to the type of the per-cpu variable name.
 */
#define	__pcpu_type(name) \
	__typeof(((struct pcpu *)0)->name)

/*
 * Evaluates to the address of the per-cpu variable name.
 */
#define	__PCPU_PTR(name) __extension__ ({ \
	__pcpu_type(name) *__p; \
	\
	__asm __volatile("movq %%gs:%1,%0; addq %2,%0" \
	    : "=r" (__p) \
	    : "m" (*(struct pcpu *)(__pcpu_offset(pc_prvspace))), \
	      "i" (__pcpu_offset(name))); \
	\
	__p; \
})
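
/*
 * Descriptive note: the movq loads pc_prvspace (the per-CPU self-pointer)
 * from the %gs-relative pcpu area, and the addq then adds the byte offset
 * of the requested member.  The result is a plain pointer into this CPU's
 * pcpu area; if the thread later migrates, it still names the original
 * CPU's copy, not the new current CPU's.
 */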

/*
 * Evaluates to the value of the per-cpu variable name.
 */
#define	__PCPU_GET(name) __extension__ ({ \
	__pcpu_type(name) __res; \
	struct __s { \
		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)]; \
	} __s; \
	\
	if (sizeof(__res) == 1 || sizeof(__res) == 2 || \
	    sizeof(__res) == 4 || sizeof(__res) == 8) { \
		__asm __volatile("mov %%gs:%1,%0" \
		    : "=r" (__s) \
		    : "m" (*(struct __s *)(__pcpu_offset(name)))); \
		*(struct __s *)(void *)&__res = __s; \
	} else { \
		__res = *__PCPU_PTR(name); \
	} \
	__res; \
})
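
/*
 * Descriptive note: the one-member struct __s gives the compiler an
 * object of exactly the field's size (capped at 8 bytes), so fields of
 * 1, 2, 4 or 8 bytes are fetched with a single %gs-relative mov; larger
 * fields fall back to dereferencing __PCPU_PTR().
 */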

/*
 * Adds the value to the per-cpu counter name.  The implementation
 * must be atomic with respect to interrupts.
 */
#define	__PCPU_ADD(name, val) do { \
	__pcpu_type(name) __val; \
	struct __s { \
		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)]; \
	} __s; \
	\
	__val = (val); \
	if (sizeof(__val) == 1 || sizeof(__val) == 2 || \
	    sizeof(__val) == 4 || sizeof(__val) == 8) { \
		__s = *(struct __s *)(void *)&__val; \
		__asm __volatile("add %1,%%gs:%0" \
		    : "=m" (*(struct __s *)(__pcpu_offset(name))) \
		    : "r" (__s)); \
	} else \
		*__PCPU_PTR(name) += __val; \
} while (0)
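
/*
 * Descriptive note: for the common sizes the update is a single
 * read-modify-write instruction on a %gs-relative address, so an
 * interrupt on this CPU cannot observe or clobber a half-done add; the
 * fallback path through __PCPU_PTR() does not have that property.
 */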

/*
 * Sets the value of the per-cpu variable name to value val.
 */
#define	__PCPU_SET(name, val) { \
	__pcpu_type(name) __val; \
	struct __s { \
		u_char	__b[MIN(sizeof(__pcpu_type(name)), 8)]; \
	} __s; \
	\
	__val = (val); \
	if (sizeof(__val) == 1 || sizeof(__val) == 2 || \
	    sizeof(__val) == 4 || sizeof(__val) == 8) { \
		__s = *(struct __s *)(void *)&__val; \
		__asm __volatile("mov %1,%%gs:%0" \
		    : "=m" (*(struct __s *)(__pcpu_offset(name))) \
		    : "r" (__s)); \
	} else { \
		*__PCPU_PTR(name) = __val; \
	} \
}

#define	get_pcpu() __extension__ ({ \
	struct pcpu *__pc; \
	\
	__asm __volatile("movq %%gs:%1,%0" \
	    : "=r" (__pc) \
	    : "m" (*(struct pcpu *)(__pcpu_offset(pc_prvspace)))); \
	__pc; \
})

#define	PCPU_GET(member)	__PCPU_GET(pc_ ## member)
#define	PCPU_ADD(member, val)	__PCPU_ADD(pc_ ## member, val)
#define	PCPU_PTR(member)	__PCPU_PTR(pc_ ## member)
#define	PCPU_SET(member, val)	__PCPU_SET(pc_ ## member, val)
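
/*
 * Usage sketch (illustrative only): read and write MD per-CPU fields
 * declared in PCPU_MD_FIELDS above.  The critical section pins the
 * thread so both %gs-relative accesses hit the same CPU's pcpu area.
 *
 *	u_int apic_id;
 *
 *	critical_enter();
 *	apic_id = PCPU_GET(apic_id);
 *	PCPU_SET(dbreg_cmd, PC_DBREG_CMD_NONE);
 *	critical_exit();
 */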

#define	IS_BSP()	(PCPU_GET(cpuid) == 0)

#define	zpcpu_offset_cpu(cpu)	((uintptr_t)&__pcpu[0] + UMA_PCPU_ALLOC_SIZE * cpu)
#define	zpcpu_base_to_offset(base) (void *)((uintptr_t)(base) - (uintptr_t)&__pcpu[0])
#define	zpcpu_offset_to_base(base) (void *)((uintptr_t)(base) + (uintptr_t)&__pcpu[0])
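
/*
 * Descriptive note: struct pcpu is padded to UMA_PCPU_ALLOC_SIZE (see
 * __pad above), so CPU n's pcpu area lives at &__pcpu[0] +
 * n * UMA_PCPU_ALLOC_SIZE.  Per-CPU UMA allocations are handed around in
 * "offset" form (base minus &__pcpu[0]); adding %gs, which points at the
 * current CPU's pcpu area, to that offset reaches the current CPU's copy.
 * The zpcpu_*() accessors below rely on this to use a single %gs-relative
 * instruction.
 */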

#define	zpcpu_sub_protected(base, n) do { \
	ZPCPU_ASSERT_PROTECTED(); \
	zpcpu_sub(base, n); \
} while (0)

#define	zpcpu_set_protected(base, n) do { \
	__typeof(*base) __n = (n); \
	ZPCPU_ASSERT_PROTECTED(); \
	switch (sizeof(*base)) { \
	case 4: \
		__asm __volatile("movl\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	case 8: \
		__asm __volatile("movq\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	default: \
		*zpcpu_get(base) = __n; \
	} \
} while (0)

#define	zpcpu_add(base, n) do { \
	__typeof(*base) __n = (n); \
	CTASSERT(sizeof(*base) == 4 || sizeof(*base) == 8); \
	switch (sizeof(*base)) { \
	case 4: \
		__asm __volatile("addl\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	case 8: \
		__asm __volatile("addq\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	} \
} while (0)
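
/*
 * Usage sketch (illustrative only): "c" is assumed to be a per-CPU UMA
 * allocation, e.g. obtained with uma_zalloc_pcpu() and therefore already
 * in the offset form described above.  The update is a single
 * %gs-relative instruction, so it lands on whichever CPU the thread is
 * running on without an explicit critical section:
 *
 *	uint64_t *c;
 *	...
 *	zpcpu_add(c, (uint64_t)1);
 */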

#define	zpcpu_add_protected(base, n) do { \
	ZPCPU_ASSERT_PROTECTED(); \
	zpcpu_add(base, n); \
} while (0)

#define	zpcpu_sub(base, n) do { \
	__typeof(*base) __n = (n); \
	CTASSERT(sizeof(*base) == 4 || sizeof(*base) == 8); \
	switch (sizeof(*base)) { \
	case 4: \
		__asm __volatile("subl\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	case 8: \
		__asm __volatile("subq\t%1,%%gs:(%0)" \
		    : : "r" (base), "ri" (__n) : "memory", "cc"); \
		break; \
	} \
} while (0)

#else /* !__GNUCLIKE_ASM || !__GNUCLIKE___TYPEOF */

#error "this file needs to be ported to your compiler"

#endif /* __GNUCLIKE_ASM && __GNUCLIKE___TYPEOF */

#endif /* _KERNEL */

#endif /* !_MACHINE_PCPU_H_ */