Commit Graph

240 Commits

Author SHA1 Message Date
Konstantin Belousov
fd8d844f76 amd64 KPTI: add control from procctl(2).
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting.  PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.

Requested by:	jhb
Reviewed by:	jhb, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19514
2019-03-16 11:44:33 +00:00
Konstantin Belousov
7e0a345bc5 Add symbolic name for TSC_AUX MSR address.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-03-15 16:39:05 +00:00
Konstantin Belousov
3dcf329ee5 Add register number, CPUID bits, and print identification for TSX
force abort errata.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-03-12 18:59:01 +00:00
John Baldwin
2e43efd0bb Drop "All rights reserved" from my copyright statements.
Reviewed by:	rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D19485
2019-03-06 22:11:45 +00:00
Konstantin Belousov
a2d95495ee Add usermode helpers for for Intel userspace protection keys feature.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:56:23 +00:00
Konstantin Belousov
e7a9df16e6 Add kernel support for Intel userspace protection keys feature on
Skylake Xeons.

See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:51:13 +00:00
Konstantin Belousov
5671e0d62e Add definition for %cr4 PKRU enable bit.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-19 19:13:48 +00:00
Konstantin Belousov
eb785fab3b Port sysctl kern.elf32.read_exec from amd64 to i386.
Make it more comprehensive on i386, by not setting nx bit for any
mapping, not just adding PF_X to all kernel-loaded ELF segments.  This
is needed for the compatibility with older i386 programs that assume
that read access implies exec, e.g. old X servers with hand-rolled
module loader.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-07 02:17:34 +00:00
Konstantin Belousov
ccc2d07e77 Update CPUID bits definitions and CPU identification based on changes
in SDM rev. 069.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-02-04 23:57:59 +00:00
Konstantin Belousov
9a52756044 i386: Merge PAE and non-PAE pmaps into same kernel.
Effectively all i386 kernels now have two pmaps compiled in: one
managing PAE pagetables, and another non-PAE. The implementation is
selected at cold time depending on the CPU features. The vm_paddr_t is
always 64bit now. As result, nx bit can be used on all capable CPUs.

Option PAE only affects the bus_addr_t: it is still 32bit for non-PAE
configs, for drivers compatibility. Kernel layout, esp. max kernel
address, low memory PDEs and max user address (same as trampoline
start) are now same for PAE and for non-PAE regardless of the type of
page tables used.

Non-PAE kernel (when using PAE pagetables) can handle physical memory
up to 24G now, larger memory requires re-tuning the KVA consumers and
instead the code caps the maximum at 24G. Unfortunately, a lot of
drivers do not use busdma(9) properly so by default even 4G barrier is
not easy. There are two tunables added: hw.above4g_allow and
hw.above24g_allow, the first one is kept enabled for now to evaluate
the status on HEAD, second is only for dev use.

i386 now creates three freelists if there is any memory above 4G, to
allow proper bounce pages allocation. Also, VM_KMEM_SIZE_SCALE changed
from 3 to 1.

The PAE_TABLES kernel config option is retired.

In collaboarion with: pho
Discussed with:	emaste
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18894
2019-01-30 02:07:13 +00:00
Konstantin Belousov
957b9bbf3c x86 busdma: fix mis-use of bus_addr_t where vm_paddr_t is assumed.
Right now bus_addr_t and vm_paddr_t are always aliased to the same
underlying integer type on x86, which makes the interchange hard to
detect.  Shortly, i386 kernel would use uint64_t for vm_paddr_t to
enable automatic use of PAE paging structures if hardware allows it,
while bus_addr_t would be extended to 64bit only when PAE option is
specified.

Fix all places that were identified as using bus_addr_t while page
address was assumed.  This was performed by testing the complete PAE
merging patch on machine with > 4G of RAM enabled.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18854
2019-01-18 13:38:56 +00:00
Conrad Meyer
16068ae479 Add definitions for AMD Spectre/Meltdown CPUID information
No functional change, aside from printing recognized bits in CPU
identification.

The bits are documented in 111006-B "Indirect Branch Control Extension"[1] and
124441 "Speculative Store Bypass Disable."[2]

Notably missing (left as future work):
  * Integration with hw.spec_store_bypass_disable and hw_ssb_active flag,
    which are currently Intel-specific
  * Integration with hw_ibrs_active global flag, which are currently
    Intel-specific
  * SSB_NO integration in hw_ssb_recalculate()
  * Bhyve integration (PR 235010)

[1]:
https://developer.amd.com/wp-content/resources/111006-B_AMD64TechnologyIndirectBranchControlExtenstion_WP_7-18Update_FNL.pdf

[2]:
https://developer.amd.com/wp-content/resources/124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf

PR:		235010 (related, but does not fix)
MFC after:	a week
2019-01-17 19:44:47 +00:00
Ben Widawsky
91890b73ad Add definitions for Intel Speed Shift
These definitions will be used by a driver to implement Hardware
P-States (autonomous control of HWP, via Intel Speed Shift technology).

Reviewed by:	kib
Approved by:	emaste (mentor)
Differential Revision:	https://reviews.freebsd.org/D18050
2018-11-21 00:21:58 +00:00
John Baldwin
e13507f6f0 Axe MINIMUM_MSI_INT.
Just allow MSI interrupts to always start at the end of the I/O APIC
pins.  Since existing machines already have more than 255 I/O APIC
pins, IRQ 255 is no longer reliably invalid, so just remove the
minimum starting value for MSI.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D17991
2018-11-16 23:39:39 +00:00
Konstantin Belousov
2343757338 Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068.
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.

But also, the document defines SSB_NO as bit 4, while FreeBSD used but
2 to detect the need to work-around Speculative Store Bypass
issue.  Change code to use the bit from SDM.

Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY.  We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.

In fact my Apollo Lake machine with latest ucode shows this:
    IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>

Reviewed by:	bwidawsk
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D18006
2018-11-16 21:27:11 +00:00
John Baldwin
b6b42932db Convert the number of MSI IRQs on x86 from a constant to a tunable.
The number of MSI IRQs still defaults to 512, but it can now be
changed at boot time via the machdep.num_msi_irqs tunable.

Reviewed by:	kib, royger (older version)
Reviewed by:	markj
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D17977
2018-11-15 18:37:41 +00:00
Konstantin Belousov
83813c6696 Apply fix to un-cripple max cpu id on BSP earlier.
We need to know actual value for the standard extended features before
ifuncs are resolved.

Reported and tested by:	madpilot
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-11-12 19:17:26 +00:00
Brooks Davis
c3adaa3305 Consolidate identical ELF auxargs type defintions.
All platforms except powerpc use the same values and powerpc shares a
majority of them.

Go ahead and declare AT_NOTELF, AT_UID, and AT_EUID in favor of the
unused AT_DCACHEBSIZE, AT_ICACHEBSIZE, and AT_UCACHEBSIZE for powerpc.

Reviewed by:	jhb, imp
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17397
2018-10-22 22:24:32 +00:00
Konstantin Belousov
632227a739 Update x86/ifunc.h.
Remove ifunc emulation.
Add helper for usermode ifunc resolver definition.
Update copyright years.

Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
MFC after:	1 week
2018-09-30 16:57:30 +00:00
Mark Johnston
d5089b3aed Log a message after a successful boot-time microcode update.
Reviewed by:	kib
Approved by:	re (delphij)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17135
2018-09-14 17:04:36 +00:00
John Baldwin
fd036deac1 Dynamically allocate IRQ ranges on x86.
Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts.  Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254.  MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI.  Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.

Replace statically assigned ranges with dynamic ranges.  Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required.  Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed.  To minimize runtime complexity these arrays are only sized
once during bootup.  The PIC range is determined by the PICs present
in the system.  The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.

As a result, various places are updated to use dynamic limits instead
of constants.  In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump.  This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.

This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry).  Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.

If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.

While here, reserve room in intrcnts for the Hyper-V counters.

PR:		229429, 130483
Reviewed by:	kib, royger, cem
Tested by:	royger (Xen), kib (DMAR)
Approved by:	re (gjb)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16861
2018-08-28 21:09:19 +00:00
John Baldwin
a800b45c18 Merge amd64 and i386 <machine/intr_machdep.h> headers.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16803
2018-08-20 12:31:39 +00:00
John Baldwin
a568818913 Remove some vestiges of IPI_LAZYPMAP on i386.
The support for lazy pmap invalidations on i386 was removed in r281707.
This removes the constant for the IPI and stops accounting for it when
sizing the interrupt count arrays.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16801
2018-08-19 16:14:59 +00:00
Konstantin Belousov
8d32b46379 Add definitions related to the L1D flush operation capability and MSR.
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:19:11 +00:00
Mark Johnston
97edfc1b45 Implement kernel support for early loading of Intel microcode updates.
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel.  Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method.  For the time being only Intel updates are supported.

Microcode update files are passed to the kernel via loader(8).  The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update.  Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update.  Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16370
2018-08-13 17:13:09 +00:00
Mark Johnston
a18e40aad4 Use the existing MSR_BIOS_SIGN on AMD.
Reported by:	kib
Sponsored by:	The FreeBSD Foundation
2018-07-13 20:56:20 +00:00
Mark Johnston
5612bb23d0 Define the MSR used to fetch the current microcode patch level on AMD.
It is defined in the AMD family 17h register reference.

MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-07-13 19:42:59 +00:00
Konstantin Belousov
300c34e431 Add a name for the MSR controlling standard extended features report on AMD.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:44:18 +00:00
Konstantin Belousov
fe15b8543e Order the portion of the AMD-specific MSRs names definitions numerically.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:34:01 +00:00
Konstantin Belousov
7705dd4df0 Provide a helper function acpi_get_fadt_bootflags() to fetch the FADT
x86 boot flags.

Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16004
MFC after:	1 week
2018-06-25 11:01:12 +00:00
John Baldwin
9e2154ff1c Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
Konstantin Belousov
3621ba1ede Add Intel Spec Store Bypass Disable control.
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.

Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto

Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).

SSBD bit detection and control was verified with prerelease microcode.

Security:	CVE-2018-3639
Tested by:	emaste (previous version, without updated microcode)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:08:19 +00:00
Konstantin Belousov
9be4bbbb21 Add definition for Intel Speculative Store Bypass Disable MSR bits
Security:	CVE-2018-3639
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:07:13 +00:00
Konstantin Belousov
d5effb01f1 Add helper macros to hide some boring repeatable ceremonies to define
ifuncs on x86.

Also keep helpers to define 'pseudo-ifuncs' which are emulated by the
indirect jmp.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:45:59 +00:00
Konstantin Belousov
986c4ca387 Turn off IBRS on suspend.
Resume starts CPU from the init state, which clears any loaded
microcode updates.  As result, IBRS MSRs are no longer available,
until the microcode is reloaded.

I have to forcibly clear cpu_stdext_feature3, which assumes that CPUID
leaf 7 reg %ebx does not report anything except Meltdown/Spectre bugs
bits.  If future CPUs add new bits there, hw_ibrs_recalculate() and
identify_cpu1()/identify_cpu2() need to be adjusted for that.

Submitted and tested by:	Michael Danilov <mike.d.ft402@gmail.com>
PR:	227866
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15236
2018-04-30 20:18:32 +00:00
Roger Pau Monné
9dba82a442 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
John Baldwin
d41e41f9f0 Remove very old and unused signal information codes.
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0.  The FPE_*_TRAP ones were deprecated even
earlier in 1999.

PR:		226579 (exp-run)
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14637
2018-03-27 20:57:51 +00:00
Konstantin Belousov
8fbcc3343f Move the CR0.WP manipulation KPI to x86.
This should allow to avoid some #ifdefs in the common x86/ code.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-20 20:20:49 +00:00
John Baldwin
7af5f2acfb Fix a typo.
Reviewed by:	kib
2018-03-19 17:14:56 +00:00
Ed Maste
315fbaeca2 Correct pseudo misspelling in sys/ comments
contrib code and #define in intel_ata.h unchanged.
2018-02-23 18:15:50 +00:00
Warner Losh
ef1fcaf0f5 Do not include float interfaces when using libsa.
We don't support float in the boot loaders, so don't include
interfaces for float or double in systems headers. In addition, take
the unusual step of spiking double and float to prevent any more
accidental seepage.
2018-02-23 04:04:25 +00:00
Konstantin Belousov
c688c9051b Fix build with gas.
Do not use C constant suffixes.  Bit values are small enough to not
require typing, despite they are used for 64bit MSR writes.  The added
cast in hw_ibrs_recalculate() is redundand but I prefer to add it for
clarity.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-13 15:30:31 +00:00
Warner Losh
62bca77843 Move __va_list and related defines to sys/sys/_types.h
__va_list and related defines are identical in all the
ARCH/include/_types.h files. Move them to sys/sys/_types.h

Sponsored by: Netflix
2018-02-12 14:48:20 +00:00
Konstantin Belousov
319117fd57 IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0.  Current status can be checked in sysctl
hw.ibrs_active.  The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
Konstantin Belousov
c8f9c1f3d9 Use PCID to optimize PTI.
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.

Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address.  Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.

On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc (some aspects)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D13985
2018-01-27 11:49:37 +00:00
Ed Maste
b3327f62f0 Enable KPTI by default on amd64 for non-AMD CPUs
Kernel Page Table Isolation (KPTI) was introduced in r328083 as a
mitigation for the 'Meltdown' vulnerability.  AMD CPUs are not affected,
per https://www.amd.com/en/corporate/speculative-execution:

    We believe AMD processors are not susceptible due to our use of
    privilege level protections within paging architecture and no
    mitigation is required.

Thus default KPTI to off for AMD CPUs, and to on for others.  This may
be refined later as we obtain more specific information on the sets of
CPUs that are and are not affected.

Submitted by:	Mitchell Horne
Reviewed by:	cem
Relnotes:	Yes
Security:	CVE-2017-5754
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D13971
2018-01-19 15:42:34 +00:00
Konstantin Belousov
bd50262f70 PTI for amd64.
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability.  PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords.  Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed.  The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame.  Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps.  They are registered using
pmap_pti_add_kva().  Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use.  In principle, they can be installed into kernel page table per
pmap with some work.  Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation.  I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case.  Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled.  PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal.  The fix is pending.

Reviewed by:	markj (partially)
Tested by:	pho (previous version)
Discussed with:	jeff, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-01-17 11:44:21 +00:00
Konstantin Belousov
e8c770a66e Enumerate and print Intel CPU features for Speculative Execution Side
Channel Mitigations.

The definitions are taken from the document 336996-001.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-14 12:36:23 +00:00
Conrad Meyer
233933cb00 amd64: Add a 48-bit MAXADDR constant
Some devices (e.g., ccp(4) -- to be committed) can only access the low 48
bits of physical memory.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
2018-01-13 17:55:22 +00:00
Jeff Roberson
6f4acaf4c9 Add support for NUMA domains to bus dma tags. This causes all memory
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag.  Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.

Reviewed by:	jhb, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13545
2018-01-12 23:34:16 +00:00