Commit Graph

1964 Commits

Author SHA1 Message Date
Konstantin Belousov
8fba5348fc amd64: ensure that curproc->p_vmspace pmap always matches PCPU
curpmap.

When performing context switch on a machine without PCID, if current
%cr3 equals to the new pmap %cr3, which is typical for kernel_pmap
vs. kernel process, I overlooked to update PCPU curpmap value.  Remove
check for %cr3 not equal to pm_cr3 for doing the update.  It is
believed that this case cannot happen at all, due to other changes in
this revision.

Also, do not set the very first curpmap to kernel_pmap, it should be
vmspace0 pmap instead to match curproc.

Move the common code to activate the initial pmap both on BSP and APs
into pmap_activate_boot() helper.

Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by:	alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16618
2018-08-14 16:37:14 +00:00
Konstantin Belousov
b3a7db3b06 Use SMAP on amd64.
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access.  Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D13838
2018-07-29 20:47:00 +00:00
Warner Losh
67d33338c0 Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM.
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).

Differential Review: https://reviews.freebsd.org/D16290
2018-07-27 18:34:20 +00:00
Konstantin Belousov
53dec71d39 Expand x86 struct pcpus to UMA_PCPU_ALLOC_SIZE AKA PAGE_SIZE.
This restores counters(9) operation.
Revert r336024. Improve assert of pcpu size on x86.

Reviewed by:	mmacy
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16163
2018-07-06 19:50:44 +00:00
Konstantin Belousov
fb0a281196 Revert to recommit with the proper message. 2018-07-06 19:50:25 +00:00
Konstantin Belousov
1614716655 Save a call to pmap_remove() if entry cannot have any pages mapped.
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries.  For instance, in
the buildworld load, this change reduces the amount of pmap_remove()
calls by 1/5.

Profiled by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16148
2018-07-06 19:48:47 +00:00
Hans Petter Selasky
a7a7f5b472 Make sure kernel modules built by default are portable between UP and
SMP systems by extending defined(SMP) to include defined(KLD_MODULE).

This is a regression issue after r335873 .

Discussed with:		mmacy@
Sponsored by:		Mellanox Technologies
2018-07-06 10:13:42 +00:00
Matt Macy
428194fed2 counter(9): unbreak amd64 following r336020
Apply temporary fix to counter until daylight hours.
The fact that the assembly for counter_u64_add relied on the sizeof(struct pcpu) was
the basis for the otherwise arbitrary offset never came up in D15933.
critical_{enter,exit} is now inline so the only real added overhead is the
added (mostly false) conditional branch in exit.
2018-07-06 10:10:00 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
John Baldwin
79ba91952d Use 'e' instead of 'i' constraints with 64-bit atomic operations on amd64.
The ADD, AND, OR, and SUB instructions take at most a 32-bit
sign-extended immediate operand.  64-bit constants that do not fit into
that constraint need to be loaded into a register.  The 'i' constraint
tells the compiler it can pass any integer constant to the assembler,
whereas the 'e' constrain only permits constants that fit into a 32-bit
sign-extended value.  This fixes using
atomic_add/clear/set/subtract_long/64 with constants that do not fit into
a 32-bit sign-extended immediate.

Reported by:	several folks
Tested by:	Pete Wright <pete@nomadlogic.org>
MFC after:	2 weeks
2018-07-03 22:03:28 +00:00
Matt Macy
f4b3640475 inline atomics and allow tied modules to inline locks
- inline atomics in modules on i386 and amd64 (they were always
  inline on other arches)
- allow modules to opt in to inlining locks by specifying
  MODULE_TIED=1 in the makefile

Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
2018-07-02 19:48:38 +00:00
Konstantin Belousov
7f12ebe583 Do not leave stray qword on top of stack for interrupts and exceptions
without error code.  Doing so it mis-aligned the stack.

Since the only consumer of the SSE instructions with the alignment
requirements is AES-NI module, and since the FPU context cannot be
accessed in interrupts, the only situation where the alignment matter
are the compat32 syscalls, as reported in the PR.

PR:	229222
Reported and tested by:	 dewayne@heuristicsystems.com.au
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-06-25 11:29:04 +00:00
Mark Johnston
f090f67503 Tell the compiler that rdtscp clobbers %ecx. 2018-06-09 18:31:19 +00:00
Matt Macy
155046394a cpufunc: add rdtscp for x86 2018-06-07 00:54:11 +00:00
Matt Macy
07d80fd8dc hwpmc: ABI fixes
- increase pmc cpuid field from 8 to 12 bits
- add cpuid version string to initialize entry in the log
  so that filter can identify which counter index an
  event name maps to
- GC unused config flags
- make fixed counter assignment more robust as well as the
  changes needed to be properly identified for filter
2018-06-04 02:05:48 +00:00
Bruce Evans
49c871278a Fix high resolution kernel profiling just enough to not crash at boot
time, especially for SMP.  If configured, it turns itself on at boot
time for calibration, so is fragile even if never otherwise used.

Both types of kernel profiling were supposed to use a global spinlock
in the SMP case.  If hi-res profiling is configured (but not necessarily
used), this was supposed to be optimized by only using it when
necessary, and slightly more efficiently, in asm.  But it was not done
at all for mcount entry where it is necessary.  This caused crashes
in the SMP case when either type of profiling was enabled.  For mcount
exit, it only caused wrong times.  The times were wrongest with an
i8254 timer since using that requires exclusive access to the hardware.
The i8254 timer was too slow to use here 20 years ago and is much less
usable now, but it is the default for the SMP case since TSCs weren't
invariant when SMP was new.  Do the locking in all hi-res SMP cases for
simplicity.

Calibration uses special asms, and the clobber lists in these were sort
of inverted.  They contained the arg and return registers which are not
clobbered, but on amd64 they didn't contain the residue of the call-used
registers which may be clobbered (%r10 and %r11).  This usually caused
hangs at boot time.  This usually affected even the UP case.
2018-06-02 05:48:44 +00:00
Matt Macy
e92a1350b5 hwpmc: remove unused pre-table driven bits for intel
Intel now provides comprehensive tables for all performance counters
and the various valid configuration permutations as text .json files.
Libpmc has been converted to use these and hwpmc_core has been greatly
simplified by moving to passthrough of the table values.

The one gotcha is that said tables don't support pentium pro and and pentium
IV. There's very few users of hwpmc on _amd64_ kernels on new hardware. It is
unlikely that anyone is doing low level optimization on 15 year old Intel
hardware. Nonetheless, if someone feels strongly enough to populate the
corresponding tables for p4 and ppro I will reinstate the files in to the
build.

Code for the K8 counters and !x86 architectures remains unchanged.
2018-05-31 22:41:07 +00:00
Dimitry Andric
b451efbedc Resolve conflicts between macros in fenv.h and ieeefp.h
This is a follow-up to r321483, which disabled -Wmacro-redefined for
some lib/msun tests.

If an application included both fenv.h and ieeefp.h, several macros such
as __fldcw(), __fldenv() were defined in both headers, with slightly
different arguments, leading to conflicts.

Fix this by putting all the common macros in the machine-specific
versions of ieeefp.h.  Where needed, update the arguments in places
where the macros are invoked.

This also slightly reduces the differences between the amd64 and i386
versions of ieeefp.h.

Reviewed by:	kib
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D15633
2018-05-31 20:22:47 +00:00
Andriy Gapon
279be68bfd re-synchronize TSC-s on SMP systems after resume, if necessary
The TSC-s are checked and synchronized only if they were good
originally.  That is, invariant, synchronized, etc.

This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second.  The APs are in sync with each other.  Not
sure if this is a hardware quirk or a firmware bug.

This is what I see after a resume with this change:
    SMP: passed TSC synchronization test after adjustment
    acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
2018-05-25 07:33:20 +00:00
Konstantin Belousov
14f7050dba Enable IBRS when entering an interrupt handler from usermode.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 13:25:15 +00:00
John Baldwin
9e2154ff1c Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
Konstantin Belousov
3621ba1ede Add Intel Spec Store Bypass Disable control.
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.

Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto

Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).

SSBD bit detection and control was verified with prerelease microcode.

Security:	CVE-2018-3639
Tested by:	emaste (previous version, without updated microcode)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:08:19 +00:00
Antoine Brodin
147d12a7d3 vmmdev: return EFAULT when trying to read beyond VM system memory max address
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.

Reviewed by:	imp, grehan, anish
Approved by:	grehan
Differential Revision:	https://reviews.freebsd.org/D15156
2018-05-15 17:20:58 +00:00
John Baldwin
0b3e6e4c50 Make the common interrupt entry point labels local labels.
Kernel debuggers depend on symbol names to find stack frames with a
trapframe rather than a normal stack frame.  The labels used for the
shared interrupt entry point for the PTI and non-PTI cases did not
match the existing patterns confusing debuggers.  Add the '.L' prefix
to mark these symbols as local so they are not visible in the symbol
table.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-05-14 17:27:53 +00:00
Tycho Nightingale
27275f8a52 Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be
excluded from PTI.

Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15100
2018-04-27 12:44:20 +00:00
Mark Johnston
5cd29d0f3c Improve VM page queue scalability.
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.

Reviewed by:	kib, jeff (previous versions)
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14893
2018-04-24 21:15:54 +00:00
Tycho Nightingale
6ac73777ea Add SDT probes to vmexit on Intel.
Submitted by:	domagoj.stolfa_gmail.com
Reviewed by:	grehan, tychon
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D14656
2018-04-13 17:23:05 +00:00
Rodney W. Grimes
01d822d33b Add the ability to control the CPU topology of created VMs
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).

The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.

Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse.  [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.

bhyvectl --get-cpu-topology option added.

Reviewed by:	grehan (maintainer, earlier version),
Reviewed by:	bcr (manpages)
Approved by:	bde (mentor), phk (mentor)
Tested by:	Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after:	1 week
Relnotes:	Y
Differential Revision:	https://reviews.freebsd.org/D9930
2018-04-08 19:24:49 +00:00
John Baldwin
fc276d92ae Add a way to temporarily suspend and resume virtual CPUs.
This is used as part of implementing run control in bhyve's debug
server.  The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor.  Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).

The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server).  The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs.  The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.

Reviewed by:	avg, grehan
Tested on:	Intel (jhb), AMD (avg)
Differential Revision:	https://reviews.freebsd.org/D14466
2018-04-06 22:03:43 +00:00
Roger Pau Monné
9dba82a442 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
Jeff Roberson
27a3c9d710 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
Jeff Roberson
261c408744 Backout r331606 until I can identify why it does not boot on some
machines.
2018-03-27 10:20:50 +00:00
Jeff Roberson
a48de40bcc Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:	jhb, kib
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14838
2018-03-27 03:37:04 +00:00
Konstantin Belousov
8fbcc3343f Move the CR0.WP manipulation KPI to x86.
This should allow to avoid some #ifdefs in the common x86/ code.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-20 20:20:49 +00:00
Konstantin Belousov
2337dc6430 Provide KPI for handling of rw/ro kernel text.
This is a pure syntax patch to create an interface to enable and later
restore write access to the kernel text and other read-only mapped
regions.  It is in line with e.g. vm_fault_disable_pagefaults() by
allowing the nesting.

Discussed with:	Peter Lei <peter.lei@ieee.org>
Reviewed by:	jtl
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14768
2018-03-20 17:43:50 +00:00
Ian Lepore
c7053bbe54 Revert r330780, it was improperly tested and results in taking a spin
mutex before acquiring sleep mutexes.

Reported by:	kib@
2018-03-11 20:13:15 +00:00
Ian Lepore
86051be993 Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking. 2018-03-11 19:22:58 +00:00
Tycho Nightingale
490768e24a Fix a lock recursion introduced in r327065.
Reported by:	kmacy
Reviewed by:	grehan, jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14548
2018-03-07 18:03:22 +00:00
Jonathan T. Looney
beb2406556 amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits
correctly for the data contained on each memory page.

There are several components to this change:
 * Add a variable to indicate the start of the R/W portion of the
   initial memory.
 * Stop detecting NX bit support for each AP.  Instead, use the value
   from the BSP and, if supported, activate the feature on the other
   APs just before loading the correct page table.  (Functionally, we
   already assume that the BSP and all APs had the same support or
   lack of support for the NX bit.)
 * Set the RW and NX bits correctly for the kernel text, data, and
   BSS (subject to some caveats below).
 * Ensure DDB can write to memory when necessary (such as to set a
   breakpoint).
 * Ensure GDB can write to memory when necessary (such as to set a
   breakpoint).  For this purpose, add new MD functions gdb_begin_write()
   and gdb_end_write() which the GDB support code can call before and
   after writing to memory.

This change is not comprehensive:
 * It doesn't do anything to protect modules.
 * It doesn't do anything for kernel memory allocated after the kernel
   starts running.
 * In order to avoid excessive memory inefficiency, it may let multiple
   types of data share a 2M page, and assigns the most permissions
   needed for data on that page.

Reviewed by:	jhb, kib
Discussed with:	emaste
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14282
2018-03-06 14:28:37 +00:00
John Baldwin
5f8754c077 Add a new variant of the GLA2GPA ioctl for use by the debug server.
Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify
the guest.  In particular, it does not inject any faults or modify
PTEs in the guest when performing an address space translation.

This is used by bhyve's debug server to read and write memory for
the remote debugger.

Reviewed by:	grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D14075
2018-02-26 19:19:05 +00:00
Conrad Meyer
849ce31a82 Remove unused error return from API that cannot fail
No implementation of fpu_kern_enter() can fail, and it was causing needless
error checking boilerplate and confusion. Change the return code to void to
match reality.

(This trivial change took nine days to land because of the commit hook on
sys/dev/random.  Please consider removing the hook or otherwise lowering the
bar -- secteam never seems to have free time to review patches.)

Reported by:	Lachlan McIlroy <Lachlan.McIlroy AT isilon.com>
Reviewed by:	delphij
Approved by:	secteam (delphij)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14380
2018-02-23 20:15:19 +00:00
John Baldwin
4f8666989a Add two new ioctls to bhyve for batch register fetch/store operations.
These are a convenience for bhyve's debug server to use a single
ioctl for 'g' and 'G' rather than a loop of individual get/set
ioctl requests.

Reviewed by:	grehan
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D14074
2018-02-22 00:39:25 +00:00
Konstantin Belousov
13cad9af82 Use local symbol for offset.
Small global symbols confuse ddb which matches them against small
unrelated displacements and makes the disassembly ugly.

Reported by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-02-16 13:32:46 +00:00
Jung-uk Kim
ea4fe1da62 Change size of padding to reflect reality. No functional change.
Discussed with:		kib
2018-02-15 20:42:38 +00:00
Warner Losh
982e7bdafc We don't support gcc < 4.2.1, so varargs.h now is just #error
always. Unifdef for versions prior to 4.2.1 and remove now-unused
header files.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14323
2018-02-12 14:48:14 +00:00
Konstantin Belousov
319117fd57 IBRS support, AKA Spectre hardware mitigation.
It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0.  Current status can be checked in sysctl
hw.ibrs_active.  The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14029
2018-01-31 14:36:27 +00:00
Konstantin Belousov
c8f9c1f3d9 Use PCID to optimize PTI.
Use PCID to avoid complete TLB shootdown when switching between user
and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has
1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID.
Full kernel-mode TLB shootdown is performed on context switches, since
KVA TLB invalidation only works in the current pmap. User-mode part of
TLB is flushed on the pmap activations as well.

Similarly, IPI TLB shootdowns must handle both kernel and user address
spaces for each address.  Note that machines which implement PCID but
do not have INVPCID instructions, cause the usual complications in the
IPI handlers, due to the need to switch to the target PCID temporary.
This is racy, but because for PCID/no-INVPCID we disable the
interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent
state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers.

On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set
and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID
allocated for the kernel pmap. Then, there is no need to shootdown
kernel TLB entries on context switch. But copyout(3) would need to
either use method similar to proc_rwmem() to access the userspace
data, or (in reverse) provide a temporal mapping for the kernel buffer
into user mode PCID and use trampoline for copy.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc (some aspects)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D13985
2018-01-27 11:49:37 +00:00
Konstantin Belousov
b4dfc9d7ad PTI: Trap if we returned to userspace with kernel (full) page table
still active.

Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup.  Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.

I peek this trick in some article about Linux implementation.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
DIfferential revision:	https://reviews.freebsd.org/D13956
2018-01-19 22:10:29 +00:00
Nathan Whitehorn
9a8196ce19 Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.

As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.

Reviewed by:		kib
Suggestions from:	marius, alc, kib
Runtime tested on:	amd64, powerpc64, powerpc, mips64
2018-01-19 17:46:31 +00:00
John Baldwin
68fd3b0ef5 Use a dedicated per-CPU stack for machine check exceptions.
Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF.  This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs.  Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3.  Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).

This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.

While here, bypass trap() entirely and just call mca_intr().  This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).

Reviewed by:	kib
Tested by:	avg (on AMD without PTI)
Differential Revision:	https://reviews.freebsd.org/D13962
2018-01-18 23:50:21 +00:00