Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.
Due to the instruction encoding peculiarity, also emulate SFENCE.
PR: 232081
Reported by: phk
Reviewed by: araujo, avg, jhb (all: previous version)
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17482
The reasoning is the same as with the memset change, see r339205
Reviewed by: kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17441
The change is a no-op for architectures which don't ifunc memset,
memcpy, or memmove.
Convert places which need them. Xen bits by royger.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17487
The VT-x VMCS only stores the base address of the GDTR and IDTR. As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR, losing the smaller limits set when the initial GDT is loaded
on each CPU during boot. Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.
Similarly, explicitly save and restore the LDT selector. VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.
PR: 230773
Reported by: John Levon <levon@movementarian.org>
Reviewed by: kib
Tested by: araujo
Approved by: re (rgrimes)
MFC after: 1 week
configuring kernels for i386, amd64, and arm64.
The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration
in r337967.
Approved by: re (kib@)
Reviewed by: ler@
Differential Revision: https://reviews.freebsd.org/D17458
Sponsored by: Netflix, Inc.
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel, I investigated the performance impact of just
regular movs.
The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a 256-byte breaking
point at which it falls back to rep stos. It provides a significant win
over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit, even for sizes that are
multiples of 64.
See the review for benchmark data.
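The scheme described above can be sketched in portable C. This is only an
illustration, not the committed assembly: sketch_memset() and
SKETCH_REP_THRESHOLD are made-up names, the library memset() stands in for
the rep stos fallback, and whether the fallback triggers at exactly 256 or
only above it is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; the message gives 256 as the breaking point. */
#define SKETCH_REP_THRESHOLD 256

/*
 * Illustrative sketch only: fill small buffers with a 32-byte loop of
 * plain 8-byte stores, then finish the remainder of at most 31 bytes.
 */
static void *
sketch_memset(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	/* Replicate the fill byte into all eight bytes of a 64-bit word. */
	uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;

	assert(dst != NULL || len == 0);

	/* Assumption: sizes above the threshold take the fallback. */
	if (len > SKETCH_REP_THRESHOLD)
		return (memset(dst, c, len));	/* stand-in for rep stos */

	/* Main loop: 32 bytes per iteration as four 8-byte stores. */
	while (len >= 32) {
		memcpy(p, &pattern, 8);
		memcpy(p + 8, &pattern, 8);
		memcpy(p + 16, &pattern, 8);
		memcpy(p + 24, &pattern, 8);
		p += 32;
		len -= 32;
	}

	/* Remainder: at most 31 bytes, stored one at a time here. */
	while (len > 0) {
		*p++ = (unsigned char)c;
		len--;
	}
	return (dst);
}
```

The fixed-size memcpy() calls are used so the sketch stays free of alignment
and aliasing problems; compilers typically turn them into single stores.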
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17398
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17377
Such data may later be unmapped. This occurs, for example, when a
loader-provided microcode update file is discarded.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17340
For PCID case, there is a dependency between pm_gen zeroing and
reading pm_active for IPI target selection, to ensure that the
invalidation is not missed.
Reported and tested by: mjg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
The function stopped swapping rdi and rsi, but the error handling
code was not updated with the new register name.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
This reverts part of r333368. The attempt to clear DR6 was occurring
too soon as trapsignal() does not pause to let the debugger notice the
SIGTRAP and query DR6. The signal exchange does not occur until much
later during ast(). As a result, GDB was no longer recognizing
hardware breakpoints and watchpoints on x86.
In addition, any userland programs that want to inspect DR6 in a
SIGTRAP handler don't have a way to do this if we clear DR6 in the
exception handler.
Instead of relying on the kernel to clear DR6, debuggers will have to
explicitly clear it after a trace trap (which they needed to do on
older kernels anyway).
Reviewed by: kib
Approved by: re (delphij)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17319
- remove a forward branch in the common case
- replace xchg + lodsb/stosb loop with simple movs
A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying
/foo/bar/baz in a loop goes from 295715863 ops/s to 465807408.
Further changes are pending.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17281
- move the PSL.AC comment to the fault handler
- stop testing for zero-sized ops. after several minutes of package
building there were no copyin calls with zero bytes and very few
copyout. the semantic of returning 0 in this case is preserved
- shorten exit paths by clearing %eax earlier
- replace xchg with 3 movs. this is what compilers do. a naive
benchmark on EPYC suggests about 1% increase in throughput thanks to
this change.
- remove the useless movb %cl,%al from copyout. it looks like a
leftover from many years ago
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17286
Both the in-kernel C variant and libc asm variant have very poor performance.
The former compiles to a single byte comparison loop, which breaks down even
for small sizes. The latter uses rep cmpsq/b which turn out to have very poor
throughput and are slower than a hand-coded 32-byte comparison loop.
Depending on size this is about 3-4 times faster than the current routines.
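A rough C rendering of a 32-byte comparison loop follows. It is a sketch
under assumptions, not the committed assembly: sketch_memcmp() and
sketch_bytecmp() are hypothetical names, and locating the first differing
byte with a byte loop is just one simple way to produce the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-wise compare, used for tails and mismatches. */
static int
sketch_bytecmp(const unsigned char *a, const unsigned char *b, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (a[i] != b[i])
			return ((int)a[i] - (int)b[i]);
	}
	return (0);
}

/*
 * Illustrative sketch only: compare 32 bytes per iteration as four
 * 8-byte words and drop to the byte loop once a block differs.
 */
static int
sketch_memcmp(const void *s1, const void *s2, size_t len)
{
	const unsigned char *a = s1, *b = s2;
	uint64_t x[4], y[4];

	assert(len == 0 || (s1 != NULL && s2 != NULL));

	while (len >= 32) {
		memcpy(x, a, 32);
		memcpy(y, b, 32);
		if (x[0] != y[0] || x[1] != y[1] ||
		    x[2] != y[2] || x[3] != y[3])
			return (sketch_bytecmp(a, b, 32));
		a += 32;
		b += 32;
		len -= 32;
	}
	/* Fewer than 32 bytes left. */
	return (sketch_bytecmp(a, b, len));
}
```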
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17328
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.
Found with: syzkaller
Reviewed by: araujo, tychon (previous version)
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17192
dmaplimit is the first byte after the end of DMAP.
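Read as a half-open interval, that statement means the limit itself is not a
valid DMAP address. The sketch below illustrates the bound check under that
reading; struct sketch_dmap and both functions are hypothetical and are not
the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the direct map bounds. */
struct sketch_dmap {
	uint64_t base;	/* first byte of the direct map */
	uint64_t limit;	/* first byte AFTER the direct map (exclusive) */
};

/* A single address is inside the map only if it is below the limit. */
static bool
sketch_in_dmap(const struct sketch_dmap *d, uint64_t addr)
{
	assert(d->base <= d->limit);
	return (addr >= d->base && addr < d->limit);
}

/*
 * A range [addr, addr + len) is inside the map if its one-past-the-end
 * address does not exceed the limit; written to avoid overflow.
 */
static bool
sketch_range_in_dmap(const struct sketch_dmap *d, uint64_t addr, uint64_t len)
{
	assert(d->base <= d->limit);
	if (len > d->limit - d->base)
		return (false);
	return (addr >= d->base && addr <= d->limit - len);
}
```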
Reported by: "Johnson, Archna" <Archna.Johnson@netapp.com>
Reviewed by: alc, markj
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17318
For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel
pmaps, as it was done before the conversion to ifuncs. The reset is
useless but innocent for kernel_pmap. Coverity reported that cpuid is
used uninitialized in this case.
Reported by: cem
Reviewed by: alc, cem, markj
CID: 1395807
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17314
Split calculation of mask for shootdown IPI and local
invalidation. Reorder IPI before local.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D17277
- _fault handlers for both primitives are identical, provide just one
- change the copying scheme to match memcpy (in particular jump
avoidance for the most common case of multiple of 8)
- stop re-reading pcb address on exit, just store it locally (in r9)
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17265
If the size is 15 bytes or less avoid spinning up rep just to copy the 8
bytes. In my tests on EPYC and old Intel microarchs without ERMS (like
Westmere) it provided a nice win over the current version (e.g. for EPYC
memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all
while almost not pessimizing the other cases.
Data collected during package building shows that sizes below 16 bytes are
pretty common.
Verified with the glibc test suite.
Approved by: re (kib)
Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a
cpu without ERMS and with size being a multiple of 8 a page fault would be
triggered resulting in EFAULT.
Pointy hat: mjg
Approved by: re (implicit)
A lot of functions have the following check:
	cmpq	%rax,%rdi	/* verify address is valid */
	ja	fusufault
The label is present earlier in kernel .text, which means this is a jump
backwards. Absent any information in the branch predictor, the cpu predicts it
as taken. Since it is almost never taken in practice, this results in a
completely avoidable misprediction.
Move it past all consumers, so that it is predicted as not taken.
Approved by: re (kib)
This simplifies the runtime logic and reduces the number of
runtime-constant branches.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16736
pm_pcid is unsigned.
Reviewed by: cem, markj
CID: 1395727
Noted by: cem
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D17235
Patch removes all checks for pti/pcid/invpcid from the context switch
path. I verified this by looking at the generated code, compiling with
the in-tree clang. The invpcid_works1 trick required an inline attribute
for pmap_activate_sw_pcid_pti() to work.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Tested with the glibc test suite.
Approved by: re (kib)
This will be used in following conversion of pmap_activate_sw().
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is a braino in the non-erms variant which breaks the
functionality.
Will be fixed at a later time with a different patch.
Reported by: Manfred Antar
Approved by: re (implicit)
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Approved by: re (kib)
Intel docs claim such a memset (rep stosb + 4096 bytes) is
special-cased by microarchs. They also switched Linux to use
it for this purpose.
Approved by: re (gjb)
The stac/clac combo around each byte copy is causing a measurable
slowdown in benchmarks. Do it only before and after all data is
copied. While here reorder the code to avoid a forward branch in
the common case.
Note the copying loop (originating from copyinstr) is avoidably slow
and will be fixed later.
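The shape of the change can be sketched in C. The real stac/clac
instructions are privileged, so the hypothetical sketch_stac() and
sketch_clac() below only count how often the access window is toggled; the
two copy functions are illustrations, not the kernel routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for stac/clac: they only count invocations. */
static unsigned long sketch_toggles;
static void sketch_stac(void) { sketch_toggles++; }
static void sketch_clac(void) { sketch_toggles++; }

/* Old shape: open and close the access window around every byte. */
static size_t
sketch_copystr_per_byte(char *dst, const char *src, size_t maxlen)
{
	size_t n = 0;

	assert(dst != NULL && src != NULL);
	while (n < maxlen) {
		char ch;

		sketch_stac();
		ch = src[n];
		sketch_clac();
		dst[n++] = ch;
		if (ch == '\0')
			break;
	}
	return (n);	/* bytes copied, including the terminator */
}

/* New shape: open the window once before and close it once after. */
static size_t
sketch_copystr_hoisted(char *dst, const char *src, size_t maxlen)
{
	size_t n = 0;

	assert(dst != NULL && src != NULL);
	sketch_stac();
	while (n < maxlen) {
		char ch = src[n];

		dst[n++] = ch;
		if (ch == '\0')
			break;
	}
	sketch_clac();
	return (n);
}
```

For a string of n bytes including the terminator, the per-byte variant
toggles 2n times while the hoisted variant toggles twice regardless of
length.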
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17063
Also this fixes the eflags.ac leak from copyin_smap() when the copied
data length is a multiple of eight bytes.
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Non-PTI mode does not switch kcr3, which means that kcr3 is almost
always stale. This is important for the NMI handler, which reloads
%cr3 with PCPU(kcr3) if the value is different from PMAP_NO_CR3.
The end result is that curpmap in the NMI handler does not match the page
table loaded into hardware. The manifestation was copyin(9) looping
forever when a usermode access page fault cannot be resolved by
vm_fault() updating a different page table.
Reported by: mmacy
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Approved by: re (gjb)
This appeared to be required to have EFI RT support and EFI RTC
enabled by default, because there are too many reports of faulting
calls on many different machines. The knob is added to leave the
exceptions unhandled to allow debugging of the actual bugs.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
handling.
This is split into a separate commit from the main change to make it
easier to handle a possible revert after the upcoming KBI freeze.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
trap_pfault() KPTI violation check.
EFI RT may set curpmap to NULL for the duration of the call for some
machines (PCID but no INVPCID). Since apparently EFI RT code must be
ready for exceptions from the calls, avoid dereferencing curpmap until
we know that this call does not come from usermode.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.
Based on the submission by: royger
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Approved by: re (marius)
Differential revision: https://reviews.freebsd.org/D16881
table allocation.
At the time that mp_bootaddress() is called, phys_avail[] array does
not reflect some memory reservations already done, like kernel
placement. Recent changes to DMAP protection which make kernel text
read-only in DMAP revealed this, where on some machines AP boot page
tables selection appears to intersect with the kernel itself.
Fix this by checking the addresses selected using the same algorithm
as bootaddr_rwx(). Also, try to chomp pages for the page table not
only at the start of the contiguous range, but also at the end. This
should improve robustness when the only suitable range is already
consumed by the kernel.
Reported and tested by: Michael Gmelin <freebsd@grem.de>
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16907
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
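The idea of picking the arena from a per-page flag can be sketched as
follows. Everything here is hypothetical: the flag value, struct
sketch_page, and the arena enum are illustrative stand-ins and do not
reproduce the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; only the flag's meaning comes from the text. */
#define SKETCH_VPO_KMEM_EXEC	0x01

/* Hypothetical arenas: one for ordinary kmem, one for executable kmem. */
enum sketch_arena {
	SKETCH_ARENA_KERNEL,
	SKETCH_ARENA_KERNEL_RWX
};

/* Hypothetical per-page state carrying the flag. */
struct sketch_page {
	uint8_t oflags;
};

/*
 * Decide which arena a page's virtual address goes back to by looking
 * at the flag recorded when the page was mapped, rather than relying on
 * the caller to pass the right arena.
 */
static enum sketch_arena
sketch_arena_for_page(const struct sketch_page *m)
{
	assert(m != NULL);
	if ((m->oflags & SKETCH_VPO_KMEM_EXEC) != 0)
		return (SKETCH_ARENA_KERNEL_RWX);
	return (SKETCH_ARENA_KERNEL);
}
```

Recording the property on the page means the free path cannot be handed the
wrong arena by its caller, which is the class of error mentioned above.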
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845