Commit Graph

96 Commits

Konstantin Belousov
3ec7e1695c amd64 pmap: microoptimize local shootdowns for PCID PTI configurations
When pmap operates in PTI mode, we must reload %cr3 on return to
userspace.  In non-PCID mode the reload always flushes all non-global
TLB entries and we take advantage of this by only invalidating the KPT
TLB entries (there are no cached UPT entries at all).

In PCID mode, we flush both KPT and UPT TLB entries explicitly, but we
can take advantage of the fact that the PCID-mode command to reload
%cr3 includes a flag selecting whether the target TLB is flushed.  In
particular, we can avoid the explicit flush for the UPT and instead
record that the load of pc_ucr3 into %cr3 on return to usermode should
flush.  This is done by providing either all-1s or ~CR3_PCID_MASK in
pc_ucr3_load_mask.  The mask is automatically reset to all-1s on
return to usermode.
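
A minimal sketch of the deferred-flush idea, using the names from the
description above (the real code lives in pmap.c and the return-to-user
asm, and differs in detail):

/* Instead of flushing the UPT TLB entries now, arrange for the next
 * load of pc_ucr3 into %cr3 to do the flush. */
PCPU_SET(ucr3_load_mask, ~CR3_PCID_MASK);

/* Conceptually, on return to usermode: */
load_cr3(PCPU_GET(ucr3) & PCPU_GET(ucr3_load_mask));
PCPU_SET(ucr3_load_mask, ~0ul);    /* reset the mask to all-1s */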

Similarly, we can avoid flushing the UPT TLB on context switch,
replacing the flush by setting pc_ucr3_load_mask.  This unifies the
INVPCID and non-INVPCID PTI ifuncs, leaving only 4 cases instead of 6.
The same trick applies to the TLB shootdown IPI handlers, since the
handlers interrupt the target thread.

But then we need to check pc_curpmap in the handlers, and this would
reopen for INVPCID machines the same race that was fixed in r306350
for non-INVPCID ones.  To avoid reintroducing the bug, unconditionally
do spinlock_enter() in pmap_activate().

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25483
2020-07-18 18:19:57 +00:00
Konstantin Belousov
dc43978aa5 amd64: allow parallel shootdown IPIs
Stop using smp_ipi_mtx to protect the global shootdown state, and
move (and replicate) that state into the pcpu.  Now each CPU can
initiate a shootdown IPI independently of the other CPUs.  The
initiator enters a critical section, then fills its local PCPU
shootdown info (pc_smp_tlb_XXX), then clears the scoreboard generation
at location (cpu, my_cpuid) for each target CPU.  After that an IPI is
sent to all targets, which scan for zeroed scoreboard generation
words.  Upon finding such a word, the shootdown data is read from the
corresponding CPU's pcpu and the generation is set.  Meanwhile the
initiator loops, waiting for all the zeroed generations in the
scoreboard to be updated.

The initiator does not disable interrupts, which should keep
non-invalidation IPIs from deadlocking; it only needs to disable
preemption to pin itself to its instance of the pcpu smp_tlb data.

The generation is set before the actual invalidation is performed in
the handler.  This is safe because the target CPU cannot return to
userspace before the handler finishes.  In principle only an NMI can
preempt the handler, but an NMI would see the kernel handler frame and
not touch the not-yet-invalidated user page table.

Handlers loop until they no longer see zeroed scoreboard generations.
This, together with the hardware keeping one pending IPI in the LAPIC
IRR, should prevent lost shootdowns.
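
A sketch of the scoreboard handshake (a simplified model, not the
actual pmap.c code: pc_smp_tlb_op/pc_smp_tlb_gen stand in for the
pc_smp_tlb_XXX fields, perform_invalidation() is a hypothetical
helper, and the generation bookkeeping is condensed):

/* Initiator, preemption disabled.  scoreboard[t][i] holds the
 * generation last acknowledged by target t for initiator i. */
CPU_FOREACH(cpu)
    if (CPU_ISSET(cpu, &targets))
        atomic_store_rel_64(&scoreboard[cpu][mycpu], 0);
ipi_selected(targets, IPI_INVLOP);
CPU_FOREACH(cpu)
    if (CPU_ISSET(cpu, &targets))
        while (atomic_load_acq_64(&scoreboard[cpu][mycpu]) == 0)
            ia32_pause();        /* wait for the ack */

/* Handler, on a target CPU: scan for zeroed generations, read the
 * request, acknowledge, then invalidate. */
for (i = 0; i <= mp_maxid; i++) {
    if (atomic_load_acq_64(&scoreboard[mycpu][i]) != 0)
        continue;
    op = pcpu_find(i)->pc_smp_tlb_op;
    atomic_store_rel_64(&scoreboard[mycpu][i],
        pcpu_find(i)->pc_smp_tlb_gen);    /* ack precedes invalidation */
    perform_invalidation(op);
}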

Notes.
1. The code does not protect writes to the LAPIC ICR with exclusion.
   I believe this is fine because we in fact do not send IPIs from
   interrupt handlers.  Moreover, for !x2APIC mode, where an ICR write
   requires two register writes, we disable interrupts around it.  If
   this is considered incorrect, I can add a per-cpu spinlock around
   ipi_send().
2. Scoreboard lines owned by a given target CPU can be padded to the
   cache line, to reduce ping-pong.

Reviewed by:	markj (previous version)
Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25510
2020-07-14 20:37:50 +00:00
Mateusz Guzik
2318ed2508 amd64: provide custom zpcpu set/add/sub routines
Note that the clobbers are highly overzealous; they can be cleaned up later.
2020-02-12 11:15:33 +00:00
Mateusz Guzik
fb886947d9 amd64: store per-cpu allocations subtracted by __pcpu
This eliminates a runtime subtraction from counter_u64_add.

before:
mov    0x4f00ed(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
sub    0x808ff6(%rip),%rax        # 0xffffffff80f1a698 <__pcpu>
addq   $0x1,%gs:(%rax)

after:
mov    0x4f02fd(%rip),%rax        # 0xffffffff80c01788 <numfullpathfail4>
addq   $0x1,%gs:(%rax)

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23570
2020-02-12 11:12:13 +00:00
Konstantin Belousov
a7af4a3e7d amd64: move GDT into PCPU area.
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22302
2019-11-12 15:51:47 +00:00
Konstantin Belousov
98158c753d amd64: move common_tss into pcpu.
This saves some memory, around 256K I think.  It removes some code,
e.g. KPTI does not need to specially map common_tss anymore.  Also,
common_tss becomes domain-local.

Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22231
2019-11-10 09:28:18 +00:00
Konstantin Belousov
a2a0f90654 Centralize __pcpu definitions.
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted around the sources.  The issue is that the definition is
MD, but it cannot be provided by machine/pcpu.h because the actual
struct pcpu is defined in sys/pcpu.h, after machine/pcpu.h has already
been included.  This forced the copying whenever other code needed
direct access to __pcpu.  There is no way around that ordering, since
machine/pcpu.h supplies part of the struct pcpu fields.

To work around the problem, add a new machine/pcpu_aux.h header, which
supplies any needed MD definitions after the struct pcpu definition is
complete.  This allows removing the copies of __pcpu spread around the
source.  On x86 it also makes it possible to remove workarounds like
OFFSETOF_CURTHREAD and clang-specific warning suppressions.
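
The resulting header layout looks roughly like this (sketch):

/* sys/pcpu.h */
#include <machine/pcpu.h>    /* contributes PCPU_MD_FIELDS */

struct pcpu {
    struct thread    *pc_curthread;
    /* ... MI fields ... */
    PCPU_MD_FIELDS;
};

#include <machine/pcpu_aux.h>    /* MD code needing the complete struct */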

Reported and tested by:	lwhsu, bcran
Reviewed by:	imp, markj (previous version)
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21418
2019-08-29 07:25:27 +00:00
Konstantin Belousov
7355a02bdd Mitigations for Microarchitectural Data Sampling.
Microarchitectural buffers on some Intel processors utilizing
speculative execution may allow a local process to obtain a memory
disclosure.  An attacker may be able to read secret data from the
kernel or from a process when executing untrusted code (for example,
in a web browser).

Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html
Security:	CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091
Security:	FreeBSD-SA-19:07.mds
Reviewed by:	jhb
Tested by:	emaste, lwhsu
Approved by:	so (gtetlow)
2019-05-14 17:02:20 +00:00
Mateusz Guzik
a8c2fcb287 x86: store pending bitmapped IPIs in per-cpu areas
This gets rid of the global cpu_ipi_pending array.

While here, replace cmpset with fcmpset in the delivery code and
opportunistically check whether the given IPI is already pending.
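
The delivery path then follows this pattern (sketch; the pending word
now lives in the pcpu instead of the old global array):

/* Set an IPI bit, skipping the atomic when it is already pending. */
u_int old, new;

old = *pending;
do {
    if ((old & ipi_bit) != 0)
        return;            /* already pending, nothing to do */
    new = old | ipi_bit;
} while (!atomic_fcmpset_int(pending, &old, new));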

Sponsored by:	The FreeBSD Foundation
2019-05-12 06:36:54 +00:00
Conrad Meyer
665919aaaf x86: Implement MWAIT support for stopping a CPU
IPI_STOP is used after panic or when ddb is entered manually.  MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning.  Something similar is already used at idle.

It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP.  (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)

It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off).  This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.

Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.
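
Conceptually the stopped CPU waits like this (sketch; cpu_monitor()
and cpu_mwait() are the cpufunc.h wrappers, while the MWAIT_C1 hint
and the loop condition are simplified from the real cpustop_handler()):

/* Sleep in C1 until restarted, instead of spinning. */
while (!CPU_ISSET(cpu, &started_cpus)) {
    if (use_mwait) {
        cpu_monitor(&started_cpus, 0, 0);    /* arm the monitor */
        if (!CPU_ISSET(cpu, &started_cpus))
            cpu_mwait(0, MWAIT_C1);          /* doze until a write */
    } else
        ia32_pause();
}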

Submitted by:   Anton Rang <rang AT acm.org> (earlier version)
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20135
2019-05-04 20:34:26 +00:00
Conrad Meyer
83dc49beaf x86: Define pc_monitorbuf as a logical structure
Rather than just accessing it via pointer cast.

No functional change intended.

Discussed with:	kib (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20135
2019-05-04 17:35:13 +00:00
Mateusz Guzik
c88205e76e amd64: relax constraints in curthread and curpcb
This makes the compiler less likely to reload the value from %gs.

The 'P' operand modifier drops all syntax prefixes and the 'n'
constraint treats the input as an immediate integer known at compile
time.

An example victim of the reloading was spinlock_enter().
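
The resulting inline looks roughly like this (sketch; at the time the
offset was spelled via the OFFSETOF_CURTHREAD constant):

static __inline __pure2 struct thread *
__curthread(void)
{
    struct thread *td;

    __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (OFFSETOF_CURTHREAD));
    return (td);
}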

Stolen from:	OpenBSD

Reported by:	jtl
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17615
2018-10-20 17:00:18 +00:00
Konstantin Belousov
53dec71d39 Expand x86 struct pcpus to UMA_PCPU_ALLOC_SIZE AKA PAGE_SIZE.
This restores counters(9) operation.
Revert r336024. Improve assert of pcpu size on x86.

Reviewed by:	mmacy
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16163
2018-07-06 19:50:44 +00:00
Konstantin Belousov
fb0a281196 Revert to recommit with the proper message. 2018-07-06 19:50:25 +00:00
Konstantin Belousov
1614716655 Save a call to pmap_remove() if entry cannot have any pages mapped.
Due to the way rtld creates mappings for shared objects, each dso
causes the unmap of at least three guard map entries.  For instance,
in the buildworld load, this change reduces the number of
pmap_remove() calls by 1/5.
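
Conceptually the saving comes from a test of this shape in the
map-entry deletion path (hypothetical simplification; the actual
condition in vm_map_delete() is more careful):

/* Skip the pmap walk when the entry cannot have pages mapped. */
if ((entry->eflags & MAP_ENTRY_GUARD) == 0)
    pmap_remove(map->pmap, entry->start, entry->end);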

Profiled by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16148
2018-07-06 19:48:47 +00:00
Tycho Nightingale
27275f8a52 Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be
excluded from PTI.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15100
2018-04-27 12:44:20 +00:00
Jung-uk Kim
ea4fe1da62 Change size of padding to reflect reality. No functional change.
Discussed with:		kib
2018-02-15 20:42:38 +00:00
Konstantin Belousov
319117fd57 IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, a reading of
the patches posted on lkml, and some additional consultations with
Intel.

For existing processors, you need a microcode update which adds the
IBRS CPU features, and the mitigation must be manually enabled by
setting the tunable/sysctl hw.ibrs_disable to 0.  The current status
can be checked with the sysctl hw.ibrs_active.  The mitigation might
be inactive if the CPU feature is not patched in, or if the CPU
reports that IBRS use is not required, via the IA32_ARCH_CAP_IBRS_ALL
bit.
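
For example, to enable the mitigation at runtime and verify that it
engaged:

sysctl hw.ibrs_disable=0
sysctl hw.ibrs_active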

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
Konstantin Belousov
b4dfc9d7ad PTI: Trap if we returned to userspace with kernel (full) page table
still active.

Map the userspace portion of the VA range in the PTI kernel-mode page
table as non-executable.  This way, if we ever miss reloading ucr3
into %cr3 on the return to usermode, the process traps instead of
executing in a potentially vulnerable setup.  Catch the condition of
such a trap and verify the user-mode %cr3, which is saved by the page
fault handler.

I picked up this trick from an article about the Linux implementation.
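
A hypothetical sketch of the marking (pg_nx and NUPML4E are the real
amd64 pmap symbols; the loop and the pml4 variable are simplified):

/* Make the user half of the PTI kernel-mode page table non-exec, so
 * user code reached under the wrong %cr3 faults instead of running. */
for (i = 0; i < NUPML4E; i++)
    pml4[i] |= pg_nx;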

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
Differential revision:	https://reviews.freebsd.org/D13956
2018-01-19 22:10:29 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
This is the first version of the Kernel Page Table Isolation (KPTI)
implementation for amd64.  It provides a workaround for the 'Meltdown'
vulnerability.  PTI is turned off by default for now; enable it with
the loader tunable vm.pmap.pti=1.

The pmap page table is split into a kernel-mode table and a user-mode
table.  The kernel-mode table is identical to the non-PTI table, while
the user-mode table is obtained from the kernel table by keeping the
userspace mappings intact but leaving only the following parts of the
kernel mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

The kernel switches to the user page table before returning to
usermode, and restores the full kernel page table on entry.  The
initial kernel-mode stack for the PTI trampoline is allocated in the
PCPU; it is only 16 qwords.  The kernel entry trampoline switches page
tables, then the hardware trap frame is copied to the normal kstack,
and execution continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course the page table switch is still
performed.

On return to usermode, the trampoline is used again: the iret frame
is copied to the trampoline stack, the page tables are switched, and
iretq is executed.  The case of iretq faulting due to an invalid
usermode context is tricky, since the fault frame is appended to the
trampoline frame.  Besides copying the fault frame and the original
(corrupted) frame to the kstack, the fault frame must be patched to
make it look as if the fault occurred on the kstack; see the comment
on the doreti_iret detection code in trap().

Currently the kernel pages which are mapped during trampoline
operation are identical for all pmaps.  They are registered using
pmap_pti_add_kva(), as illustrated below.  Besides the initial
registrations done during boot, the LDT and non-common TSS segments
are registered if the user requested their use.  In principle, they
could be installed into the kernel page table per pmap with some work.
Similarly, the PCPU could be hidden from the userspace mapping using a
trampoline PCPU page, but again I do not see much benefit besides the
complexity.
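
Registration amounts to calls of this shape (illustrative; the range
is made up, and the last argument selects executability):

/* Keep a per-process LDT visible to the PTI trampoline. */
pmap_pti_add_kva((vm_offset_t)ldt, (vm_offset_t)ldt + len, false);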

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know, in advance of a new
pmap creation, the pml4 entries which are copied to the top-level
paging structure page.  I enforce this to avoid iterating over all
existing pmaps if a new PDPE page is needed for the PTI kernel
mappings.  Such iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode makes global pages (PG_G) meaningless and even harmful, so PG_G
use is disabled in the PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for the developer's benefit.

MCE is known to be broken: it requires an IST stack to operate
completely correctly even in the non-PTI case, and it absolutely needs
a dedicated IST stack because an MCE delivered while the trampoline
has not yet switched off the PTI stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Pedro F. Giffuni
c49761dd57 sys/amd64: further adoption of SPDX licensing ID tags.
This mainly focuses on files that use the BSD 2-Clause license;
however, the tool I was using misidentified many licenses, so this was
mostly a manual, error-prone task.

The Software Package Data Exchange (SPDX) group provides a
specification to make it easier for automated tools to detect and
summarize well-known open source licenses.  We are gradually adopting
the specification, noting that the tags are considered only advisory
and do not, in any way, supersede or replace the license texts.
2017-11-27 15:03:07 +00:00
Konstantin Belousov
383f241dce Remove lint support from system headers and MD x86 headers.
Reviewed by:	dim, jhb
Discussed with:	imp
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13156
2017-11-23 11:40:16 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only the global
  vmmeter in place.  To do per-cpu stats, convert all fields that
  previously were maintained in the pcpu vmmeters to counter(9).
- Since some vmmeter stats may be touched at very early stages of
  boot, before UMA is set up and counter_u64_alloc() can be used,
  provide an early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named
    pc_early_dummy_counter.
  o Point the counter(9) fields of vmmeter to
    pcpu[0].pc_early_dummy_counter, so that at early stages of boot,
    before counters are allocated, we already point to a counter that
    can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to
  64-bit, to match the kernel representation.
- struct vmmeter is hidden under _KERNEL, with only vmstat(1) as an
  exception.

This is based on benno@'s 4-year-old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Jason A. Harmening
e2a8d17887 Bring back r313037, with fixes for mips:
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.

Reviewed by:	andreast, kan, lidl
Tested by:	lidl(mips, sparc64), andreast(powerpc)
Differential Revision:	https://reviews.freebsd.org/D9587
2017-02-19 02:03:09 +00:00
Jason A. Harmening
ad62ba6e96 Revert r313037
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.

Reported by:	kan, lidl
2017-02-04 06:24:49 +00:00
Jason A. Harmening
65ed483615 Implement get_pcpu() for the remaining architectures and use it to
replace pcpu_find(curcpu) in MI code.
2017-02-01 03:32:49 +00:00
Konstantin Belousov
83c001d3c2 Re-apply r306516 (by cem):
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags

Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-10-04 17:01:24 +00:00
Conrad Meyer
31f575777c Revert r306516 for now, it is incomplete on i386
Noted by:	kib
2016-09-30 18:58:50 +00:00
Conrad Meyer
2965d505f6 Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.

On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one.  (Running a basic file
server workload.)

Submitted by:	Anton Rang <rang at acm.org>
Reviewed by:	cem (earlier version), kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8041
2016-09-30 18:12:16 +00:00
Konstantin Belousov
a546448b8d Rewrite the amd64 PCID implementation to follow an algorithm described
in Vahalia's "Unix Internals", section 15.12, "Other TLB Consistency
Algorithms".  The same algorithm is already utilized by the MIPS pmap
to handle ASIDs.

The PCID for an address space is now allocated per-CPU during a
context switch to a thread using the pmap, when either no PCID was
ever allocated for it on this CPU, or the current PCID has been
invalidated.  If the PCID is reused, bit 63 of %cr3 can be set to
avoid the TLB flush.

Each CPU has a generation count for the PCID algorithm, which is
saved in the pmap's pcpu block when a pcpu PCID is allocated.  On
invalidation, the pmap's generation count is zeroed, which signals to
the context switch code that the already-allocated PCID is no longer
valid.  The implication is a TLB shootdown for the given cpu/address
space, effected by the allocation of a new PCID.
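
A sketch of the resulting context-switch check (field names follow the
amd64 pmap; CR3_PCID_SAVE is bit 63 of %cr3, and pcid_alloc() is a
hypothetical stand-in for the allocation path):

cr3 = pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid;
if (pmap->pm_pcids[cpuid].pm_gen == PCPU_GET(pcid_gen)) {
    cr3 |= CR3_PCID_SAVE;    /* reuse: keep this PCID's TLB entries */
} else {
    /* Stale: allocate a fresh PCID; the non-saving %cr3 load below
     * flushes its TLB entries, standing in for a shootdown. */
    pmap->pm_pcids[cpuid].pm_pcid = pcid_alloc(cpuid);
    pmap->pm_pcids[cpuid].pm_gen = PCPU_GET(pcid_gen);
    cr3 = pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid;
}
load_cr3(cr3);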

The pm_save mask no longer has to be tracked, which (significantly)
reduces the targets of the TLB shootdown IPIs.  Previously, pm_save
was reset only on pmap_invalidate_all(), which made it accumulate the
cpuids of all processors on which the thread was scheduled between
full TLB shootdowns.

Besides reducing the number of TLB shootdowns and removing the
atomics that updated pm_save in the context switch code, the algorithm
is much simpler than the maintenance of pm_save and the selection of
the right address space in the shootdown IPI handler.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-05-09 19:11:01 +00:00
Justin T. Gibbs
5fdd34ee20 Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id
field.  Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard-coded
assumption about the mapping algorithm (acpi or apic ID) in use.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)

amd64/include/pcpu.h:
i386/include/pcpu.h:
	Add vcpu_id to the amd64 and i386 pcpu structures.

dev/xen/timer/timer.c
x86/xen/xen_intr.c
	Use new vcpu_id instead of assuming acpi_id == vcpu_id.

i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
	Perform Xen HVM and Xen full PV vcpu_id mapping.

x86/xen/hvm.c:
x86/acpica/madt.c
	Change SYSINIT ordering of acpi CPU enumeration so that it
	is guaranteed to be available at the time of Xen HVM vcpu
	id mapping.
2013-10-05 23:11:01 +00:00
Justin T. Gibbs
e44af46e4c Implement PV IPIs for PVHVM guests and further converge PV and HVM
IPI implementations.

Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Submitted by: gibbs (misc cleanup, table driven config)
Reviewed by:  gibbs
MFC after: 2 weeks

sys/amd64/include/cpufunc.h:
sys/amd64/amd64/pmap.c:
	Move invltlb_globpcid() into cpufunc.h so that it can be
	used by the Xen HVM version of tlb shootdown IPI handlers.

sys/x86/xen/xen_intr.c:
sys/xen/xen_intr.h:
	Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(),
	and remove the ipi vector parameter.  This api allocates
	an event channel port that can be used for ipi services,
	but knows nothing of the actual ipi for which that port
	will be used.  Removing the unused argument and cleaning
	up the comments surrounding its declaration helps clarify
	its actual role.

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	Implement a generic framework for amd64 and i386 that allows
	the implementation of certain CPU management functions to
	be selected at runtime.  Currently this is only used for
	the ipi send function, which we optimize for Xen when running
	on a Xen hypervisor, but can easily be expanded to support
	more operations.

sys/x86/xen/hvm.c:
	Implement Xen PV IPI handlers and operations, replacing native
	send IPI.

sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
sys/i386/include/smp.h:
	Remove NR_VIRQS and NR_IPIS from FreeBSD headers.  NR_VIRQS
	is defined already for us in the xen interface files.
	NR_IPIS is only needed in one file per Xen platform and is
	easily inferred by the IPI vector table that is defined in
	those files.

sys/i386/xen/mp_machdep.c:
	Restructure to more closely match the HVM implementation by
	performing table driven IPI setup.
2013-09-06 22:17:02 +00:00
Konstantin Belousov
6aceaa3e17 Tidy up some loose ends in the PCID code:
- Restore the pre-PCID TLB shootdown handlers for the whole-address-space
  and single-page invalidation asm code, and assign the IPI handler to
  them when PCID is not supported or disabled.  The old handlers have
  linear control flow, but still use the common return sequence.

- Stop using the pcpu for INVPCID descriptors in the invlrng handler.
  It is enough to allocate the descriptors on the stack.  As a result,
  two SWAPGS instructions are shaved off from the code for Haswell+.

- Fix the inverted condition in invlrng for the check of PCID support
  [1]; also, in invlrng, check that the pmap is the kernel pmap before
  performing other tests.  For the kernel pmap, which provides global
  mappings, INVLPG must always be used for invalidation.

- Save the pre-computed pmap %CR3 value in the struct pmap.  This
  allows removing several checks for pm_pcid validity when %CR3 is
  reloaded [2].

Noted by:   gibbs [1]
Discussed with:	alc [2]
Tested by:	pho, flo
Sponsored by:	The FreeBSD Foundation
2013-09-04 23:31:29 +00:00
Konstantin Belousov
37eed8419c Implement support for the process-context identifiers ('PCID') on
Intel CPUs.  The feature tags TLB entries with the ID of the address
space and allows avoiding TLB invalidation on context switch; it is
available only in long mode.  In microbenchmarks, using PCID decreased
the latency of context switches by ~30% on Sandy Bridge class desktop
CPUs, measured with the lat_ctx program from lmbench.

If available, the INVPCID instruction is used when a TLB entry in a
non-current address space needs to be invalidated.  The instruction is
typically available starting with Haswell.

If needed, the use of PCID can be turned off with the
vm.pmap.pcid_enabled loader tunable set to 0.  The state of the
feature is reported by the vm.pmap.pcid_enabled sysctl.  The sysctl
vm.pmap.pcid_save_cnt reports the number of context switches which
avoided invalidating the TLB; compare with the total number of context
switches, available as sysctl vm.stats.sys.v_swtch.
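
With INVPCID, invalidating a single page of a non-current address
space becomes a sketch like the following (the invpcid() wrapper and
descriptor come from cpufunc.h; the pm_pcid field name follows the
description above):

struct invpcid_descr d;

d.pcid = pmap->pm_pcid;        /* target address space */
d.pad = 0;
d.addr = va;                   /* page to invalidate */
invpcid(&d, INVPCID_ADDR);     /* invalidate an individual address */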

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
Tested by:	pho, bf
2013-08-30 07:59:49 +00:00
Justin T. Gibbs
76acc41fb7 Implement vector callback for PVHVM and unify event channel implementations
Re-structure Xen HVM support so that:
	- Xen is detected and hypercalls can be performed very
	  early in system startup.
	- Xen interrupt services are implemented using FreeBSD's native
	  interrupt delivery infrastructure.
	- the Xen interrupt service implementation is shared between PV
	  and HVM guests.
	- Xen interrupt handlers can optionally use a filter handler
	  in order to avoid the overhead of dispatch to an interrupt
	  thread.
	- interrupt load can be distributed among all available CPUs.
	- the overhead of accessing the emulated local and I/O apics
	  on HVM is removed for event channel port events.
	- a similar optimization can eventually, and fairly easily,
	  be used to optimize MSI.

Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure,
and misc Xen cleanups:

Sponsored by: Spectra Logic Corporation

Unification of PV & HVM interrupt infrastructure, bug fixes,
and misc Xen cleanups:

Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D

sys/x86/x86/local_apic.c:
sys/amd64/include/apicvar.h:
sys/i386/include/apicvar.h:
sys/amd64/amd64/apic_vector.S:
sys/i386/i386/apic_vector.s:
sys/amd64/amd64/machdep.c:
sys/i386/i386/machdep.c:
sys/i386/xen/exception.s:
sys/x86/include/segments.h:
	Reserve IDT vector 0x93 for the Xen event channel upcall
	interrupt handler.  On Hypervisors that support the direct
	vector callback feature, we can request that this vector be
	called directly by an injected HVM interrupt event, instead
	of a simulated PCI interrupt on the Xen platform PCI device.
	This avoids all of the overhead of dealing with the emulated
	I/O APIC and local APIC.  It also means that the Hypervisor
	can inject these events on any CPU, allowing upcalls for
	different ports to be handled in parallel.

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	Map Xen per-vcpu area during AP startup.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
	Increase the FreeBSD IRQ vector table to include space
	for event channel interrupt sources.

sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
	Remove Xen HVM per-cpu variable data.  These fields are now
	allocated via the dynamic per-cpu scheme.  See xen_intr.c
	for details.

sys/amd64/include/xen/hypercall.h:
sys/dev/xen/blkback/blkback.c:
sys/i386/include/xen/xenvar.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/xen/gnttab.c:
	Prefer FreeBSD primitives to Linux ones in Xen support code.

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/console/xencons_ring.c:
sys/dev/xen/control/control.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/xenpci/xenpci.c:
sys/i386/i386/machdep.c:
sys/i386/include/pmap.h:
sys/i386/include/xen/xenfunc.h:
sys/i386/isa/npx.c:
sys/i386/xen/clock.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/mptable.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/xen_rtc.c:
sys/xen/evtchn/evtchn_dev.c:
sys/xen/features.c:
sys/xen/gnttab.c:
sys/xen/gnttab.h:
sys/xen/hvm.h:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_if.m:
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenstore/xenstore.c:
sys/xen/xenstore/xenstore_dev.c:
sys/xen/xenstore/xenstorevar.h:
	Pull common Xen OS support functions/settings into xen/xen-os.h.

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
	Remove constants, macros, and functions unused in FreeBSD's Xen
	support.

sys/xen/xen-os.h:
sys/i386/xen/xen_machdep.c:
sys/x86/xen/hvm.c:
	Introduce new functions xen_domain(), xen_pv_domain(), and
	xen_hvm_domain().  These are used in favor of #ifdefs so that
	FreeBSD can dynamically detect and adapt to the presence of
	a hypervisor.  The goal is to have an HVM optimized GENERIC,
	but more is necessary before this is possible.

sys/amd64/amd64/machdep.c:
sys/dev/xen/xenpci/xenpcivar.h:
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/sys/kernel.h:
	Refactor magic ioport, Hypercall table and Hypervisor shared
	information page setup, and move it to a dedicated HVM support
	module.

	HVM mode initialization is now triggered during the
	SI_SUB_HYPERVISOR phase of system startup.  This currently
	occurs just after the kernel VM is fully setup which is
	just enough infrastructure to allow the hypercall table
	and shared info page to be properly mapped.

sys/xen/hvm.h:
sys/x86/xen/hvm.c:
	Add definitions and a method for configuring Hypervisor event
	delivery via a direct vector callback.

sys/amd64/include/xen/xen-os.h:
sys/x86/xen/hvm.c:

sys/conf/files:
sys/conf/files.amd64:
sys/conf/files.i386:
	Adjust kernel build to reflect the refactoring of early
	Xen startup code and Xen interrupt services.

sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
sys/dev/xen/control/control.c:
sys/dev/xen/evtchn/evtchn_dev.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/xen/xenstore/xenstore.c:
sys/xen/evtchn/evtchn_dev.c:
sys/dev/xen/console/console.c:
sys/dev/xen/console/xencons_ring.c
	Adjust drivers to use new xen_intr_*() API.

sys/dev/xen/blkback/blkback.c:
	Since blkback defers all event handling to a taskqueue,
	convert this task queue to a "fast" taskqueue, and schedule
	it via an interrupt filter.  This avoids an unnecessary
	ithread context switch.

sys/xen/xenstore/xenstore.c:
	The xenstore driver is MPSAFE.  Indicate as much when
	registering its interrupt handler.

sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbusvar.h:
	Remove unused event channel APIs.

sys/xen/evtchn.h:
	Remove all kernel Xen interrupt service API definitions
	from this file.  It is now only used for structure and
	ioctl definitions related to the event channel userland
	device driver.

	Update the definitions in this file to match those from
	NetBSD.  Implementing this interface will be necessary for
	Dom0 support.

sys/xen/evtchn/evtchnvar.h:
	Add a header file for implementation-internal APIs related
	to managing event channel event delivery.  This is used
	to allow, for example, the event channel userland device
	driver to access low-level routines that typical kernel
	consumers of event channel services should never access.

sys/xen/interface/event_channel.h:
sys/xen/xen_intr.h:
	Standardize on the evtchn_port_t type for referring to
	an event channel port id.  In order to prevent low-level
	event channel APIs from leaking to kernel consumers who
	should not have access to this data, the type is defined
	twice: Once in the Xen provided event_channel.h, and again
	in xen/xen_intr.h.  The double declaration is protected by
	__XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared
	twice within a given compilation unit.

sys/xen/xen_intr.h:
sys/xen/evtchn/evtchn.c:
sys/x86/xen/xen_intr.c:
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/xenpci/xenpcivar.h:
	New implementation of Xen interrupt services.  This is
	similar in many respects to the i386 PV implementation with
	the exception that events bound to event channel ports
	(i.e. not IPI, virtual IRQ, or physical IRQ) are further
	optimized to avoid mask/unmask operations that aren't
	necessary for these edge triggered events.

	Stubs exist for supporting physical IRQ binding, but will
	need additional work before this implementation can be
	fully shared between PV and HVM.

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
sys/i386/xen/mp_machdep.c
sys/x86/xen/hvm.c:
	Add support for placing vcpu_info into an arbitrary memory
	page instead of using HYPERVISOR_shared_info->vcpu_info.
	This allows the creation of domains with more than 32 vcpus.

sys/i386/i386/machdep.c:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/exception.s:
	Add support for the new event channel implementation.
2013-08-29 19:52:18 +00:00
Konstantin Belousov
4f493a25f4 Fix the hardware watchpoints on SMP amd64.  Load the updated %dr
registers also on the other CPUs, besides the CPU which happens to be
executing ddb.  The debugging registers are stored in the pcpu area,
together with the command which is executed by the IPI stop handler
upon resume.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:24:32 +00:00
Gleb Smirnoff
17dece86fe Merge from projects/counters:
Pad struct pcpu so that its size is a divisor of PAGE_SIZE.  This
is done to reduce memory waste in UMA_PCPU_ZONE zones.
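
The invariant can be written as a compile-time check (illustrative):

/* PAGE_SIZE must be an exact multiple of the padded struct pcpu. */
CTASSERT((PAGE_SIZE / sizeof(struct pcpu)) * sizeof(struct pcpu) ==
    PAGE_SIZE);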

Sponsored by:	Nginx, Inc.
2013-04-08 19:19:10 +00:00
Konstantin Belousov
59c1a8315c Forcibly shut up clang warning about NULL pointer dereference.
MFC after:	3 weeks
2012-07-23 19:16:31 +00:00
Konstantin Belousov
83b22b05e6 Introduce a curpcb magic variable, similar to curthread; it is
amd64-MD.  It is implemented as a __pure2 inline with a non-volatile
asm read from the pcpu, which allows a compiler to cache its result.

Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.

Note that __curthread() uses the magic value 0 as offsetof(struct
pcpu, pc_curthread).  It is done this way because machine/pcpu.h needs
to be processed before sys/pcpu.h, since machine/pcpu.h contributes
machine-dependent fields to the struct pcpu definition.  As a result,
machine/pcpu.h cannot use struct pcpu yet.

__curpcb() also uses a magic constant instead of offsetof(struct
pcpu, pc_curpcb), for the same reason.  The constants are now defined
as symbols, and CTASSERTs are added to ensure that future KBI changes
do not break the code.
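
The pattern looks like this (sketch; the constants' values are as on
amd64 at the time):

/* machine/pcpu.h: usable before struct pcpu is complete. */
#define	OFFSETOF_CURTHREAD	0
#define	OFFSETOF_CURPCB		32

/* machdep.c: verified once struct pcpu is fully visible. */
CTASSERT(OFFSETOF_CURTHREAD == offsetof(struct pcpu, pc_curthread));
CTASSERT(OFFSETOF_CURPCB == offsetof(struct pcpu, pc_curpcb));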

Requested and reviewed by: bde
MFC after:    3 weeks
2012-07-19 19:09:12 +00:00
Konstantin Belousov
5ba2a4998c Add macro IS_BSP() to check whether the current CPU is BSP.
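
The macro reduces to a cpuid check:

#define	IS_BSP()	(PCPU_GET(cpuid) == 0)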
MFC after:	1 week
2012-01-17 07:21:23 +00:00
John Baldwin
536af0d751 Mark the __curthread() functions as __pure2 and remove the volatile
keyword from the inline assembly.  This allows the compiler to cache
invocations of curthread since its value does not change within a
thread context.

Submitted by:	zec (i386)
MFC after:	1 week
2010-07-29 18:44:10 +00:00
John Baldwin
58ccad7ddc Add support for corrected machine check interrupts. CMCI is a new local
APIC interrupt that fires when a threshold of corrected machine check
events is reached.  CMCI also includes a count of events when reporting
corrected errors in the bank's status register.  Note that individual
banks may or may not support CMCI.  If they do, each bank includes its own
threshold register that determines when the interrupt fires.  Currently
the code uses a very simple strategy where it doubles the threshold on
each interrupt until it succeeds in throttling the interrupt to occur
only once a minute (this interval can be tuned via sysctl).  The threshold
is also adjusted on each hourly poll which will lower the threshold once
events stop occurring.
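
A hypothetical sketch of the interrupt-side half of that policy (the
names are made up; the real logic lives in the MCA code and interacts
with the hourly poll):

/* Double the bank threshold until interrupts arrive at most once
 * per cmc_throttle seconds. */
if (time_uptime - cc->last_intr < cmc_throttle &&
    cc->threshold < cc->max_threshold)
    cc->threshold <<= 1;
cc->last_intr = time_uptime;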

Tested by:	Sailaja Bangaru  sbappana at yahoo com
MFC after:	1 month
2010-05-24 15:45:05 +00:00
Doug Rabson
8480241102 Fix XENHVM build. 2009-05-06 17:48:39 +00:00
Konstantin Belousov
2c66cccab7 Save and restore segment registers on amd64 when entering and leaving
the kernel.  Fill and read segment registers for mcontext and
signals.  Handle traps caused by the restoration of invalidated
selectors.

Implement user-mode creation and manipulation of the process-specific
LDT descriptors for amd64, see sysarch(2).

Implement support for TSS i/o port access permission bitmap for amd64.

Context-switch the LDT and TSS.  Do not save and restore segment
registers on the context switch; that is handled by the kernel
enter/leave trampolines now.  Remove the segment restore code from the
signal trampolines for freebsd/amd64, freebsd/ia32 and linux/i386 for
the same reason.

Implement amd64-specific compat shims for sysarch.

The Linuxolator has (temporarily?) switched to using gsbase for the
thread_area pointer.

TODO:
Currently, gdb is not adapted to show the segment registers from
struct reg.  Also, no machine-dependent ptrace command is added to set
the segment registers for the debugged process.

In collaboration with:	pho
Discussed with:	peter
Reviewed by:	jhb
Linuxolator tested by:	dchagin
2009-04-01 13:09:26 +00:00
Doug Rabson
1267802438 Merge in support for Xen HVM on amd64 architecture. 2009-03-11 15:30:12 +00:00
Konstantin Belousov
3bd5e467b2 The pcb_gs32p should be a per-cpu, not a per-thread, pointer.  It is
the location in the GDT where the segment descriptor from pcb_gs32sd
is copied to, and that location is in the CPU-local GDT.

Noted and reviewed by:	peter
MFC after:	1 week
2008-09-08 09:59:05 +00:00
John Baldwin
70d12a18f2 Export 'struct pcpu' to userland w/o requiring _KERNEL. A few ports
already define _KERNEL to get to this and I'm about to add hooks to
libkvm to access per-CPU data.

MFC after:	1 week
2008-08-19 19:53:52 +00:00
Jeff Roberson
6c47aaae12 - Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
 - Add a new MD routine, cpu_wake_idle(), to wake up idle threads that
   are suspended in CPU-specific states.  This function can fail and
   cause the scheduler to fall back to another mechanism (ipi).
 - Implement support for mwait in cpu_idle() on i386/amd64 machines
   that support it.  mwait is a higher performance way to synchronize
   cpus as compared to hlt & ipis.
 - Allow selecting the idle routine by name via sysctl machdep.idle.
   This replaces machdep.cpu_idle_hlt.  Only idle routines supported
   by the current machine are permitted.

Sponsored by:	Nokia
2008-04-25 05:18:50 +00:00
Attilio Rao
6759608248 Rework the PCPU_* (MD) interface:
- Rename PCPU_LAZY_INC to PCPU_INC.
- Add the PCPU_ADD interface, which simply adds a given value to the
  pcpu member.

Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussion/work over the next days.
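
Illustrative usage (the per-cpu vmmeter fields were the typical
consumers at the time):

PCPU_INC(cnt.v_trap);        /* increment a per-cpu member by 1 */
PCPU_ADD(cnt.v_soft, n);     /* add an arbitrary value */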

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:38:48 +00:00
Alan Cox
c640357f04 Push down the implementation of PCPU_LAZY_INC() into the machine-dependent
header file.  Reimplement PCPU_LAZY_INC() on amd64 and i386 making it
atomic with respect to interrupts.

Reviewed by: bde, jhb
2007-03-11 05:54:29 +00:00