- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().
- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.
- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.
Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.
- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.
Reviewed by: markj (earlier version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D20309
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.
This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22315
Disable the use of executable 2M page mappings in EPT-format page
tables on affected CPUs. For bhyve virtual machines, this effectively
disables all use of superpage mappings on affected CPUs. The
vm.pmap.allow_2m_x_ept sysctl can be set to override the default and
enable mappings on affected CPUs.
Alternate approaches have been suggested, but at present we do not
believe the complexity is warranted for typical bhyve's use cases.
Reviewed by: alc, emaste, markj, scottl
Security: CVE-2018-12207
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21884
This saves some memory, around 256K I think. It removes some code,
e.g. KPTI does not need to specially map common_tss anymore. Also,
common_tss become domain-local.
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22231
This saves 320 bytes of the precious stack space.
The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that. Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D22138
No functional change (in program code). Additional DWARF metadata is
generated in the .eh_frame section. Also, it is now a compile-time
requirement that machine/asm.h ENTRY() and END() macros are paired. (This
is subject to ongoing discussion and may change.)
This DWARF metadata allows llvm-libunwind to unwind program stacks when the
program is executing the function. The goal is to collect accurate
userspace stacktraces when programs have entered syscalls.
(The motivation for "Call Frame Information," or CFI for short -- not to be
confused with Control Flow Integrity -- is to sufficiently annotate assembly
functions such that stack unwinders can unwind out of the local frame
without the requirement of a dedicated framepointer register; i.e.,
-fomit-frame-pointer. This is necessary for C++ exception handling or
collecting backtraces.)
For the curious, a more thorough description of the metadata and some
examples may be found at [1] and documentation at [2]. You can also look at
'cc -S -o - foo.c | less' and search for '.cfi_' to see the CFI directives
generated by your C compiler.
[1]: https://www.imperialviolet.org/2017/01/18/cfi.html
[2]: https://sourceware.org/binutils/docs/as/CFI-directives.html
Reviewed by: emaste, kib (with reservations)
Differential Revision: https://reviews.freebsd.org/D22122
This updates the protection attributes of subranges of the kernel map.
Unlike pmap_protect(), which is typically used for user mappings,
pmap_change_prot() does not perform lazy upgrades of protections.
pmap_change_prot() also updates the aliasing range of the direct map.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21758
ABI already guarantees the direction is forward. Note this does not take care
of i386-specific cld's.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21906
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- Use ptoa() instead of the archaic ctob().
- Use pagezero() to zero a PDP page.
- Remove PA_MIN_ADDRESS, orphaned by r351742.
- Remove unneeded parens and an unnecessary control flow statement.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21495
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose. One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.
To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.
Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21491
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source. Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
Move pcpu KVA out of .bss into dynamically allocated VA at
pmap_bootstrap(). This avoids demoting superpage mapping .data/.bss.
Also it makes possible to use pmap_qenter() for installation of
domain-local pcpu page on NUMA configs.
Refactor pcpu and IST initialization by moving it to helper functions.
Reviewed by: markj
Tested by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21320
NUMA domain that the pages describe. Patch original from gallatin.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21252
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
Such speculations could use user-controlled %gs base, esp. since
FreeBSD supports WRGSBASE instructions.
Place LFENCEs on entry for each basic block after the test for
previous kernel/user mode on the kernel entry, which prevents the
speculation. Code accesses %gs-based PCPU before any serialization
instructions are executed, like %cr3 reload for KPTI.
With pti disabled, on haswell i7-4770S machine, "syscall_timings getppid"
shows when no lfence is added to syscall path:
test loop time iterations periteration
getppid 0 1.040918865 4643611 0.000000224
getppid 1 1.004985962 4481816 0.000000224
getppid 2 1.005196483 4482363 0.000000224
with lfence:
getppid 0 1.043701091 4554779 0.000000229
getppid 1 1.016930328 4438094 0.000000229
getppid 2 1.023223117 4466640 0.000000229
and ministat reports 'No difference proven at 95.0% confidence.'
Security: CVE-2019-1125
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
PTI-mode entry points were coded to set up the environment identical
to non-PTI entry and then fall-through to non-PTI handlers, mostly.
This has the drawback of requiring two more SWAPGS, first to access
PCPU, and then to return to the state expected by the non-PTI entry
point.
Eliminate the duplication by doing more in entry stubs both for PTI
and non-PTI, and adjusting the common code to expect that SWAPGS and
some minimal registers saving is done by entries.
Some less often used entries, in particular, #GP, #NP, and #SS, which
can fault on doreti, are left as is because there are basically four
variants of entrance, and they are not performance-critical,
esp. comparing with e.g. #PF or interrupts.
Reviewed by: markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.
Reviewed by: jhb, kib, markj, Patrick Mooney
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21036
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.
PR: 234134
MFC after: 1 week
Differential Revision: http://reviews.freebsd.org/D20924
If service code faulted, we might end up unwinding with interrupts
disabled. Top-level kernel code should have interrupts enabled, which
is enforced by checks.
Save %rflags before entering EFI, and restore to the known good value
on return. This handles situation with disabled interrupts on fault
and perhaps other potential bugs, e.g. invalid value for PSL_D.
Reported and tested by: Jan Martin Mikkelsen <janm@transactionware.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Similar to PG_FRAME and PG_PS_FRAME, it denotes the mask of the
physical address component of 1G superpage PDP entry.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20386
For machines having cmpxcgh16b instruction, i.e. everything but very
early Athlons, provide lockless implementation of delayed
invalidation.
The implementation maintains lock-less single-linked list with the
trick from the T.L. Harris article about volatile mark of the elements
being removed. Double-CAS is used to atomically update both link and
generation. New thread starting DI appends itself to the end of the
queue, setting the generation to the generation of the last element
+1. On DI finish, thread donates its generation to the previous
element. The generation of the fake head of the list is the last
passed DI generation. Basically, the implementation is a queued
spinlock but without spinlock.
Many thanks both to Peter Holm and Mark Johnson for keeping with me
while I produced intermediate versions of the patch.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
MFC note: td_md.md_invl_gen should go to the end of struct thread
Differential revision: https://reviews.freebsd.org/D19630
Microarchitectural buffers on some Intel processors utilizing
speculative execution may allow a local process to obtain a memory
disclosure. An attacker may be able to read secret data from the
kernel or from a process when executing untrusted code (for example,
in a web browser).
Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html
Security: CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091
Security: FreeBSD-SA-19:07.mds
Reviewed by: jhb
Tested by: emaste, lwhsu
Approved by: so (gtetlow)
This gets rid of the global cpu_ipi_pending array.
While replace cmpset with fcmpset in the delivery code and opportunistically
check if given IPI is already pending.
Sponsored by: The FreeBSD Foundation
IPI_STOP is used after panic or when ddb is entered manually. MONITOR/
MWAIT allows CPUs that support the feature to sleep in a low power way
instead of spinning. Something similar is already used at idle.
It is perhaps especially useful in oversubscribed VM environments, and is
safe to use even if the panic/ddb thread is not the BSP. (Except in the
presence of MWAIT errata, which are detected automatically on platforms with
known wakeup problems.)
It can be tuned/sysctled with "machdep.stop_mwait," which defaults to 0
(off). This commit also introduces the tunable
"machdep.mwait_cpustop_broken," which defaults to 0, unless the CPU has
known errata, but may be set to "1" in loader.conf to signal that mwait
wakeup is broken on CPUs FreeBSD does not yet know about.
Unfortunately, Bhyve doesn't yet support MONITOR extensions, so this doesn't
help bhyve hypervisors running FreeBSD guests.
Submitted by: Anton Rang <rang AT acm.org> (earlier version)
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135
Rather than just accessing it via pointer cast.
No functional change intended.
Discussed with: kib (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20135
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.
This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298
Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: bde (mentor), jhb (maintainer)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18755
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
Avoid using DELAY() since it can try to use spin locks on CPUs without
a P-state invariant TSC. For cpu_lock_delay(), always use the TSC if
it exists (even if it is not P-state invariant) to delay for a
microsecond. If the TSC does not exist, read from I/O port 0x84 to
delay instead.
PR: 228768
Reported by: Roger Hammerstein <cheeky.m@live.com>
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17851
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms. However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.
Reviewed by: kib
MFC after: 3 days
On some Intel devices BIOS does not properly reserve memory (called
"stolen memory") for the GPU. If the stolen memory is claimed by the
OS, functions that depend on stolen memory (like frame buffer
compression) can't be used.
A function called pci_early_quirks that is called before the virtual
memory system is started was added. In Linux, this PCI early quirks
function iterates through all PCI slots to check for any device that
require quirks. While this more generic solution is preferable I only
ported the Intel graphics specific parts because I think my
implementation would be too similar to Linux GPL'd solution after
looking at the Linux code too much.
The code regarding Intel graphics stolen memory was ported from
Linux. In the case of Intel graphics stolen memory this
pci_early_quirks will read the stolen memory base and size from north
bridge registers. The values are stored in global variables that is
later read by linuxkpi_gplv2. Linuxkpi stores these values in a
Linux-specific structure that is read by the drm driver.
Relevant linuxkpi code is here:
https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37
For now, only amd64 arch is suppor ted since that is the only arch
supported by the new drm drivers. I was told that Intel GPUs are
always located on 0:2:0 so these values are hard coded for now.
Note that the structure and early execution of the detection code is
not required in its current form, but we expect that the code will be
added shortly which fixes the potential BIOS bugs by reserving the
stolen range in phys_avail[]. This must be done as early as possible
to avoid conflicts with the potential usage of the memory in kernel.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Reviewed by: bwidawsk, imp
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16719
Differential revision: https://reviews.freebsd.org/D17775
The knob allows to select the flushing mode or turn it off/on. The
idea, as well as the list of the ignored syscall errors, were taken
from https://www.openwall.com/lists/kernel-hardening/2018/10/11/10 .
I was not able to measure statistically significant difference between
flush enabled vs disabled using syscall_timing getuid.
Reviewed by: bwidawsk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17536
This makes the compiler less likely to reload the content from %gs.
The 'P' modifier drops all synteax prefixes and 'n' constraint treats
input as a known at compilation time immediate integer.
Example reloading victim was spinlock_enter.
Stolen from: OpenBSD
Reported by: jtl
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17615