Commit Graph

8387 Commits

Author SHA1 Message Date
John Baldwin
908dca3ef4 Pull the check for VM ownership into ppt_find().
This reduces some code duplication.  One behavior change is that
ppt_assign_device() will now only succeed if the device is unowned.
Previously, a device could be assigned to the same VM multiple times,
but each time it was assigned, the device's state was reset.

Reviewed by:	markj, grehan
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D27301
2020-11-24 23:56:33 +00:00
John Baldwin
1925586e03 Honor the disabled setting for MSI-X interrupts for passthrough devices.
Add a new ioctl to disable all MSI-X interrupts for a PCI passthrough
device and invoke it if a write to the MSI-X capability registers
disables MSI-X.  This avoids leaving MSI-X interrupts enabled on the
host if a guest device driver has disabled them (e.g. as part of
detaching a guest device driver).

This was found by Chelsio QA when testing that a Linux guest could
switch from MSI-X to MSI interrupts when using the cxgb4vf driver.

While here, explicitly fail requests to enable MSI on a passthrough
device if MSI-X is enabled and vice versa.

Reported by:	Sony Arpita Das @ Chelsio
Reviewed by:	grehan, markj
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D27212
2020-11-24 23:18:52 +00:00
Jung-uk Kim
926ce35a7e Port rtsx(4) driver for Realtek SD card reader from OpenBSD.
This driver provides support for Realtek PCI SD card readers.  It attaches
mmc(4) bus on card insertion and detaches it on card removal.  It has been
tested with RTS5209, RTS5227, RTS5229, RTS522A, RTS525A and RTL8411B.  It
should also work with RTS5249, RTL8402 and RTL8411.

PR:			204521
Submitted by:		Henri Hennebert (hlh at restart dot be)
Reviewed by:		imp, jkim
Differential Revision:	https://reviews.freebsd.org/D26435
2020-11-24 21:28:44 +00:00
Konstantin Belousov
4815f175d0 Linuxolator: Replace use of eventhandlers by sysent hooks.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27309
2020-11-23 18:18:16 +00:00
Mark Johnston
431fb8abd7 vm_phys: Try to clean up NUMA KPIs
It can be useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures.  We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.

Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain().  Make the latter an inline function.

Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers.  Include it from vm_page.h so that
vm_page_domain() can be defined there.

Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.

Reviewed by:	alc
Reviewed by:	dougm, kib (earlier versions)
Differential Revision:	https://reviews.freebsd.org/D27207
2020-11-19 03:59:21 +00:00
Conrad Meyer
77eb984147 'make sysent' for r367773
X-MFC-With:	r367773
2020-11-17 19:53:59 +00:00
Conrad Meyer
de774e422e linux(4): Implement name_to_handle_at(), open_by_handle_at()
They are similar to our getfhat(2) and fhopen(2) syscalls.

Differential Revision:	https://reviews.freebsd.org/D27111
2020-11-17 19:51:47 +00:00
Mark Johnston
6f5a960678 vmm: Make pmap_invalidate_ept() wait synchronously for guest exits
Currently EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs currently running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before executing the VM if the
cached generation number is not the most recent value.
pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing
guest instructions and reload the generation number.  However, it does
not actually wait for vCPUs to exit, potentially creating a window where
guests may continue to reference stale TLB entries.

Fix the problem by bracketing guest execution with an SMR read section
which is entered before loading the invalidation generation.  Then,
pmap_invalidate_ept() increments the current write sequence before
loading pm_active and sending IPIs, and polls readers to ensure that all
vCPUs potentially operating with stale TLB entries have exited before
pmap_invalidate_ept() returns.

Also ensure that unsynchronized loads of the generation counter are
wrapped with atomic(9), and stop (inconsistently) updating the
invalidation counter and pm_active bitmask with acquire semantics.

Reviewed by:	grehan, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26910
2020-11-11 15:01:17 +00:00
Conrad Meyer
e9b13c6612 linux(4): Deduplicate unimpl/dummy syscall handlers
No functional change.

Reviewed by:	emaste, trasz
Differential Revision:	https://reviews.freebsd.org/D27099
2020-11-05 19:30:31 +00:00
Mark Johnston
72143e89bb Add qat(4)
This provides an OpenCrypto driver for Intel QuickAssist devices.  The
driver was initially ported from NetBSD and comes with a few
improvements:
- support for GMAC/AES-GCM, AES-CTR and AES-XTS, and support for
  SHA/HMAC-authenticated encryption
- support for detaching the driver
- various bug fixes
- DH895X support

Discussed with:	jhb
MFC after:	3 days
Sponsored by:	Rubicon Communications, LLC (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26963
2020-11-05 15:55:23 +00:00
Mark Johnston
cff169880e amd64: Make it easier to configure exception stack sizes
The amd64 kernel handles certain types of exceptions on a dedicated
stack.  Currently the sizes of these stacks are all hard-coded to
PAGE_SIZE, but for at least NMI handling it can be useful to use larger
stacks.  Add constants to intr_machdep.h to make this easier to tweak.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27076
2020-11-04 16:42:20 +00:00
Mateusz Piotrowski
a858a39b31 Fix a typo
2020-11-04 10:38:25 +00:00
Alan Cox
9b4e77cb97 Tidy up the #includes. Recent changes, such as the introduction of
VM_ALLOC_WAITOK and vm_page_unwire_noq(), have eliminated the need for
many of the #includes.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27052
2020-11-02 19:20:06 +00:00
Mateusz Guzik
82c174a3b4 malloc: delegate M_EXEC handling to dedicated routines
It is almost never needed and adds an avoidable branch.

While here, do minor cleanups in preparation for larger changes.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27019
2020-10-30 20:02:32 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Edward Tomasz Napierala
866b1f5147 Fix misnomer - linux_to_bsd_errno() does the exact opposite.
Reported by:	arichardson
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26965
2020-10-27 12:49:40 +00:00
Kyle Evans
d42a83b1a9 audit: also correctly audit linux_execve()
Linux execve() gets audited as AUE_EXECVE as well; we should also interpret
the return from this correctly, for the same reason as in r367002.

MFC with:	r367002
2020-10-26 17:30:17 +00:00
Konstantin Belousov
c0b5fcf692 Improve FPU Tag Word reconstruction on i386 to indicate register states.
Improve the code reconstructing en_tw in struct fpreg32 from FXSAVE
results so that all register states are indicated correctly.  The
previous code unconditionally mapped non-empty register state to
'normalized value' constant.  The new code explicitly distinguishes
the 'zero value' and 'special value' constants as well.  This improves
consistency between real FSAVE and translation from FXSAVE, and
ensures that tests using PT_GETFPREGS can rely on a single correct
value independently of the underlying implementation.

PR:	250454
Sponsored by:	The FreeBSD Foundation
Obtained from:	Moritz Systems
Submitted by:	Michał Górny <mgorny@moritz.systems>
Discussed with:	emaste
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26856
2020-10-21 00:15:12 +00:00
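The tag word reconstruction described in c0b5fcf692 can be illustrated with a small user-space sketch. It is not the committed kernel code: the struct and function names are hypothetical, and the classification rules are taken from the general x87 tag definitions (00 valid, 01 zero, 10 special, 11 empty). FXSAVE keeps only one "non-empty" bit per physical register, while the register contents are stored in stack order, so the sketch also has to translate between the two using TOP from the status word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full x87 tag values, two bits per physical register. */
enum { TAG_VALID = 0, TAG_ZERO = 1, TAG_SPECIAL = 2, TAG_EMPTY = 3 };

/* Hypothetical view of one 80-bit register as saved by FXSAVE. */
struct x87_reg {
	uint64_t significand;	/* bit 63 is the explicit integer bit */
	uint16_t sign_exp;	/* sign in bit 15, exponent in bits 14:0 */
};

/* Classify a register that the abridged tag marks as non-empty. */
static int
classify_reg(const struct x87_reg *r)
{
	uint16_t exp = r->sign_exp & 0x7fff;

	if (exp == 0)		/* zero, or a denormal */
		return (r->significand == 0 ? TAG_ZERO : TAG_SPECIAL);
	if (exp == 0x7fff)	/* infinity or NaN */
		return (TAG_SPECIAL);
	/* A clear integer bit with a normal exponent is an unnormal. */
	return ((r->significand >> 63) != 0 ? TAG_VALID : TAG_SPECIAL);
}

/*
 * Rebuild the 16-bit tag word.
 *   abridged: 8-bit FXSAVE tag, bit j set when physical register j is in use
 *   status:   x87 status word, with TOP in bits 13:11
 *   st:       registers in stack order, st[i] holding ST(i)
 */
static uint16_t
rebuild_tag_word(uint8_t abridged, uint16_t status, const struct x87_reg st[8])
{
	unsigned top = (status >> 11) & 7;
	uint16_t tw = 0;

	assert(st != NULL);
	for (unsigned phys = 0; phys < 8; phys++) {
		int tag = TAG_EMPTY;

		/* Physical register j is ST((j - TOP) mod 8). */
		if (abridged & (1u << phys))
			tag = classify_reg(&st[(phys - top) & 7]);
		tw |= (uint16_t)(tag << (2 * phys));
	}
	return (tw);
}
```

With TOP = 6, a 1.0 in ST(0) and a +0 in ST(1), the abridged tag 0xC0 expands to a tag word in which physical registers 0 through 5 are empty, register 6 is valid and register 7 is zero.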
John Baldwin
ba610be90a Add a kernel crypto driver using assembly routines from OpenSSL.
Currently, this supports SHA1 and SHA2-{224,256,384,512} both as plain
hashes and in HMAC mode on both amd64 and i386.  It uses the SHA
intrinsics when present similar to aesni(4), but uses SSE/AVX
instructions when they are not.

Note that some files from OpenSSL that normally wrap the assembly
routines have been adapted to export methods usable by 'struct
auth_xform' as is used by existing software crypto routines.

Reviewed by:	gallatin, jkim, delphij, gnn
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26821
2020-10-20 17:50:18 +00:00
Mark Johnston
8e2cbc5660 vmx: Implement pmap (de)activation in C
Rewrite the code that maintains pm_active and invalidates EPTP-tagged
TLB entries in C.  Previously this work was done in vmx_enter_guest(),
in assembly, but there is no good reason for that and it makes the TLB
invalidation algorithm for nested page tables harder to review.

No functional change intended.  Now, an error from the invept
instruction results in a kernel panic rather than a vmexit.  Such errors
should occur only as a result of VMM bugs.

Reviewed by:	grehan, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26830
2020-10-19 15:24:35 +00:00
Edward Tomasz Napierala
6221ec6064 Stop calling set_syscall_retval() from linux_set_syscall_retval().
The former clobbers some registers that shouldn't be touched.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26406
2020-10-18 16:16:22 +00:00
Edward Tomasz Napierala
c0d07d326f Slightly tweak linux ptrace(2) debug message; no functional changes.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26815
2020-10-18 15:56:47 +00:00
Konstantin Belousov
546df7a45d amd64 pmap.h: explicitly provide constants values instead of relying
on some more advanced C features.

This fixes gcc-toolchain build of exception.S.

Reported and tested by:	kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-16 16:22:32 +00:00
Mitchell Horne
ce4900bc8a Simplify preload_dump() condition
Hiding this feature behind RB_VERBOSE is gratuitous. The tunable is enough
to limit its use to only those who explicitly request it.

Suggested by:	kevans
2020-10-15 20:21:15 +00:00
Konstantin Belousov
e406235000 Fix for mis-interpretation of PCB_KERNFPU.
Right now PCB_KERNFPU is used both as an indication that the kernel prepared
hardware FPU context to use and that the thread is fpu-kern
thread.  This also breaks fpu_kern_enter(FPU_KERN_NOCTX), since
fpu_kern_leave() then clears PCB_KERNFPU.

Introduce new flag PCB_KERNFPU_THR which indicates that the thread is
fpu-kern.  Do not clear PCB_KERNFPU if fpu-kern thread leaves noctx
fpu region.

Reported and tested by:	jhb (amd64)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25511
2020-10-14 23:01:41 +00:00
Konstantin Belousov
d3ba71b2b1 Limit workaround for errata E400 to appropriate AMD cpus.
From Linux sources and several datasheets I looked at, it seems that
the workaround is only needed on families 0xf and 0x10.  For instance,
Ryzens do not implement the accessed MSR at all, it is documented as
reserved.  Also, hypervisors should not allow guest to put CPU into
idle state, so activate workaround only when on bare hardware.

While there, style the code:
    move MSR defines to specialreg.h
    move identification to initcpu.c

Reported by:	whu
Reviewed by:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26470
2020-10-14 22:57:50 +00:00
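The condition described in d3ba71b2b1 (AMD families 0xf and 0x10 only, and only on bare hardware) can be sketched as a predicate. The struct and function names below are hypothetical and do not come from the commit; the family decoding follows the usual CPUID leaf 1 convention, in which the extended family field is added only when the base family is 0xf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the identification code knows about the CPU. */
struct cpu_info {
	bool is_amd;		/* vendor is AMD */
	uint32_t cpuid1_eax;	/* EAX from CPUID leaf 1 */
	bool in_hypervisor;	/* running as a guest */
};

/* Display family: base family (bits 11:8) plus extended family (bits 27:20). */
static unsigned
cpuid_family(uint32_t eax)
{
	unsigned base = (eax >> 8) & 0xf;
	unsigned ext = (eax >> 20) & 0xff;

	return (base == 0xf ? base + ext : base);
}

/* Apply the workaround only on bare-metal AMD family 0xf or 0x10 parts. */
static bool
needs_e400_workaround(const struct cpu_info *ci)
{
	unsigned family;

	assert(ci != NULL);
	if (!ci->is_amd || ci->in_hypervisor)
		return (false);
	family = cpuid_family(ci->cpuid1_eax);
	return (family == 0xf || family == 0x10);
}
```

A family 0x17 part (base 0xf, extended 0x8) is rejected by this predicate, which matches the commit's observation that Ryzens do not need the workaround.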
Konstantin Belousov
6f3b523c9a Avoid dump_avail[] redefinition.
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h.  This fixes default gcc build for mips.

Reviewed by:	alc, scottph
Tested by:	kevans (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26741
2020-10-14 22:51:40 +00:00
Tycho Nightingale
42360f5c5b eliminate possible race in parallel TLB shootdown IPI
On the target side TLB shootdown IPI handler, prevent the compiler
from performing a forward store optimization which may mask a
subsequent update to the scoreboard by the initiator.

Reported by:	Max Laier, Anton Rang
Discussed with:	kib
Sponsored by:	Dell EMC Isilon
2020-10-13 18:28:48 +00:00
Emmanuel Vadot
7113afc84c 10Gigabit Ethernet driver for AMD SoC
This patch adds a driver for the 10 Gigabit Ethernet controller in AMD
SoCs. The driver is written to be compatible with the iflib framework.
The existing driver is for the old version of the hardware. The driver
submitted here is for recent versions of the hardware, where the
Ethernet controller is PCI-E based.

Submitted by:	Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D25793
2020-10-11 16:01:16 +00:00
Conrad Meyer
f8e8a06d23 random(4) FenestrasX: Push root seed version to arc4random(3)
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled.  Otherwise, there is no
functional change.  The mechanism can be disabled with
debug.fxrng_vdso_enable=0.

arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time.  Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed.  On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.

This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator.  That change is
left for future discussion/work.

Reviewed by:	kib (previous version)
Approved by:	csprng (me -- only touching FXRNG here)
Differential Revision:	https://reviews.freebsd.org/D22839
2020-10-10 21:52:00 +00:00
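The seed-version check that f8e8a06d23 describes for arc4random(3) can be sketched as follows. This is a simplified model under stated assumptions: the types and function names are hypothetical, the shared page is represented by an ordinary struct, and the actual reseeding from the kernel is reduced to a comment and a counter. A real implementation would also need an atomic load of the published version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the version the kernel publishes in the shared page. */
struct shared_page {
	uint64_t root_seed_version;
};

/* Hypothetical per-process generator state. */
struct proc_rng {
	const struct shared_page *shp;	/* pointer obtained once, at allocation */
	uint64_t seed_version;		/* root version seen at the last rekey */
	unsigned reseed_count;		/* bookkeeping for this sketch only */
};

static void
rng_init(struct proc_rng *rng, const struct shared_page *shp)
{
	assert(rng != NULL && shp != NULL);
	rng->shp = shp;
	rng->seed_version = shp->root_seed_version;
	rng->reseed_count = 0;
}

/* Run before producing output; returns true when a reseed was required. */
static bool
rng_check_reseed(struct proc_rng *rng)
{
	uint64_t current;

	assert(rng != NULL);
	current = rng->shp->root_seed_version;
	if (current == rng->seed_version)
		return (false);
	/* Fresh key material would be fetched from the kernel at this point. */
	rng->seed_version = current;
	rng->reseed_count++;
	return (true);
}
```

The comparison costs one load from the shared page on each read request, so the common case (no root reseed since the last rekey) stays cheap.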
Warner Losh
7e46dafa58 Create in-tree LINT files
Now that config(8) has supported include for 19 years, transition to
including the NOTES files. include support didn't exist at the time,
nor did the envvar stuff recently added. Now that it does, eliminate
the building of LINT files by just including everything you need.

Note: This may cause conflicts with updating in some cases.
	find sys -name LINT\* -delete
is suggested across this commit to remove the generated LINT
files.

Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D26540
2020-10-09 01:48:14 +00:00
Mitchell Horne
22e6a67086 Add a routine to dump boot metadata
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.

Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.

Since the output can be lengthy, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.

Reviewed by:	tsoome
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26687
2020-10-08 18:02:05 +00:00
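Commit 22e6a67086 describes the boot metadata as a series of type-length-value entries. A generic walker for such a buffer might look like the sketch below. The record layout used here (a 32-bit type, a 32-bit length, a value padded to 8 bytes, and an all-zero header as terminator) is an assumption made for illustration, and tlv_walk() and tlv_visit_fn are hypothetical names rather than the routines added by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_ALIGN	8u	/* assumed padding of each value */

/* Callback invoked once per record. */
typedef void (*tlv_visit_fn)(uint32_t type, const uint8_t *val, uint32_t len,
    void *arg);

/*
 * Walk records until the buffer ends or an all-zero header is found.
 * Returns the number of records visited, or -1 if a value would run
 * past the end of the buffer.
 */
static int
tlv_walk(const uint8_t *buf, size_t size, tlv_visit_fn visit, void *arg)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (size - off >= 8) {
		uint32_t type, len;
		size_t padded, remaining;

		/* memcpy avoids unaligned or aliasing-unsafe loads. */
		memcpy(&type, buf + off, sizeof(type));
		memcpy(&len, buf + off + 4, sizeof(len));
		if (type == 0 && len == 0)
			break;			/* terminator record */
		off += 8;
		remaining = size - off;
		if (len > remaining)
			return (-1);		/* malformed: value truncated */
		if (visit != NULL)
			visit(type, buf + off, len, arg);
		padded = ((size_t)len + TLV_ALIGN - 1) & ~(size_t)(TLV_ALIGN - 1);
		off += padded < remaining ? padded : remaining;
		count++;
	}
	return (count);
}
```

A dump routine in this style would pass a callback that prints each record's type and a formatted view of its value.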
Mitchell Horne
8481aab1ac Print symbol index for unsupported relocation types
It is unlikely, but possible, that an unrecognized or unsupported
relocation type is encountered while trying to load a kernel module. If
this occurs we should offer the symbol index as a hint to the user.

While here, fix some small style issues.

Reviewed by:	markj, kib (amd64 part, in D26701)
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
2020-10-07 18:48:10 +00:00
Konstantin Belousov
df01340989 amd64: Store full 64bit of FIP/FDP for 64bit processes when using XSAVE.
If current process is 64bit, use rex-prefixed version of XSAVE
(XSAVE64).  If current process is 32bit and CPU supports saving
segment registers cs/ds in the FPU save area, use non-prefixed variant
of XSAVE.

Reported and tested by:	Michał Górny <mgorny@moritz.systems>
PR:	250043
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26643
2020-10-03 23:17:29 +00:00
Konstantin Belousov
9f2a3e3b0a Fix pmap_pti_add_kva() call for doublefault stack page.
After r354889 stack got struct nmi_pcpu at top, which makes IST top
not page-aligned.  Since pmap_pti_add_kva() truncates/rounds up
addresses, it erroneously entered a page mapped before double fault
stack into the pti page table.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2020-10-03 23:11:20 +00:00
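The effect described in 9f2a3e3b0a follows from page truncation and rounding arithmetic, and can be reproduced in a few lines. The sketch below is one plausible reading of the bug, not the kernel code: the helper names and the 4 KiB page size are assumptions, and the 16-byte figure in the usage note simply stands in for the size of struct nmi_pcpu.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SIZE	4096ULL			/* assumed page size */
#define EX_PAGE_MASK	(EX_PAGE_SIZE - 1)

/* Round an address down to the start of its page. */
static uint64_t
ex_trunc_page(uint64_t va)
{
	return (va & ~EX_PAGE_MASK);
}

/* Round an address up to the next page boundary. */
static uint64_t
ex_round_page(uint64_t va)
{
	return ((va + EX_PAGE_MASK) & ~EX_PAGE_MASK);
}

/*
 * Number of pages touched by a routine that truncates the start of
 * [sva, eva) and rounds up its end before mapping.
 */
static uint64_t
pages_covered(uint64_t sva, uint64_t eva)
{
	assert(sva <= eva);
	return ((ex_round_page(eva) - ex_trunc_page(sva)) / EX_PAGE_SIZE);
}
```

If a one-page stack starts at `base` and its usable top sits 16 bytes below the end of the page, then passing `top - EX_PAGE_SIZE` and `top` covers two pages, and the first of them is the page below `base`. Passing the page-aligned bounds of the stack covers exactly one page.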
Konstantin Belousov
5e8ea68fd8 Move ctx_switch_xsave declaration to amd64 md_var.h.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2020-10-03 23:07:09 +00:00
Emmanuel Vadot
90b8c0ea10 Fix LINT: Add backlight to NOTES
2020-10-02 20:52:09 +00:00
Mark Johnston
494955366a Remove svn:executable from a couple of vmm(4) source files.
MFC after:	3 days
2020-10-01 22:20:29 +00:00
John Baldwin
a3f2a9c57e Clear the upper 32-bits of registers in x86_emulate_cpuid().
Per the Intel manuals, CPUID is supposed to unconditionally zero the
upper 32 bits of the involved (rax/rbx/rcx/rdx) registers.
Previously, the emulation would cast pointers to the 64-bit register
values down to `uint32_t`, which while properly manipulating the lower
bits, would leave any garbage in the upper bits uncleared.  While no
existing guest OSes seem to stumble over this in practice, the bhyve
emulation should match x86 expectations.

This was discovered through alignment warnings emitted by gcc9, while
testing it against SmartOS/bhyve.

SmartOS bug:	https://smartos.org/bugview/OS-8168
Submitted by:	Patrick Mooney
Reviewed by:	rgrimes
Differential Revision:	https://reviews.freebsd.org/D24727
2020-10-01 16:45:11 +00:00
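The difference that a3f2a9c57e fixes can be modelled without any pointer casts. In this sketch, which uses hypothetical function names, store_low32_only() reproduces the old outcome (only the low half of the 64-bit register changes) and store_zero_extended() reproduces the CPUID behavior the commit message cites, where the upper 32 bits are always cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Old outcome: equivalent to writing through a uint32_t pointer aimed
 * at the low half of a 64-bit register; stale upper bits survive.
 */
static void
store_low32_only(uint64_t *reg, uint32_t val)
{
	assert(reg != NULL);
	*reg = (*reg & 0xffffffff00000000ULL) | val;
}

/* Fixed outcome: the 32-bit result is zero-extended into the register. */
static void
store_zero_extended(uint64_t *reg, uint32_t val)
{
	assert(reg != NULL);
	*reg = val;
}
```

Starting from a register holding 0xdeadbeef00000001, the first helper leaves 0xdeadbeef in the upper half, while the second yields a clean 32-bit value.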
Ruslan Bukin
6186bfbd18 Rename kernel option ACPI_DMAR to IOMMU.
This is mostly needed for a common arm64/amd64 iommu code.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D26587
2020-09-29 20:29:07 +00:00
Edward Tomasz Napierala
1e2521ffae Get rid of sa->narg. It serves no purpose; use sa->callp->sy_narg instead.
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26458
2020-09-27 18:47:06 +00:00
Edward Tomasz Napierala
0c5bd5f993 Regen after r366145.
Sponsored by:	DARPA
2020-09-25 10:05:38 +00:00
Mark Johnston
78257765f2 Add a vmparam.h constant indicating pmap support for large pages.
Enable SHM_LARGEPAGE support on arm64.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26467
2020-09-23 19:34:21 +00:00
Warner Losh
f9ba2bbe3a Use envvar rather than nonstandard hint. lines
The NOTES files have a bunch of hint lines that are removed when
generating LINT. However, we can achieve the same effect by prepending
each of the lines with 'envvar' so the NOTES files become standard
config(8) files. No functional changes as the sed script to generate
the LINT files filters these either way.

Suggested by: kevans
2020-09-23 19:18:53 +00:00
Konstantin Belousov
b82149116a amd64 pmap: More unification for psind = 1 vs 2 in pmap_enter_largepage().
Move
  pkru check
  wait for page alloc
  wire accounting update
  asserting allowed updates for valid mappings
out of psind conditions.

Also add assert that psind references supported page size.
Remove an untrue comment.
Avoid unnecessary page table walks from top level.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26513
2020-09-22 23:28:06 +00:00
D Scott Phillips
00e6614750 Sparsify the vm_page_dump bitmap
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 MiB to > 2 GiB
(and overflowing `int` for the size).

Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.

Reviewed by:	jhb
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26131
2020-09-21 22:21:59 +00:00
D Scott Phillips
ab041f713a Move vm_page_dump bitset array definition to MI code
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitions in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.

Reviewed by:	kib, markj
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26129
2020-09-21 22:20:37 +00:00
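The 8 TiB figure in ab041f713a follows from the width of the old index type: with 4 KiB pages, a signed 32-bit page index runs out at 2^31 pages, which is 2^43 bytes. The sketch below works through that arithmetic and shows a bitmap indexed by a 64-bit value. It uses plain C rather than the bitset(9) macros, and all names and the page size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SHIFT	12	/* 4 KiB pages assumed */

/* Page index of a physical address. */
static uint64_t
page_index(uint64_t pa)
{
	return (pa >> EX_PAGE_SHIFT);
}

/* First physical address whose page index no longer fits in an int32_t. */
static uint64_t
first_unindexable_addr(void)
{
	return (((uint64_t)INT32_MAX + 1) << EX_PAGE_SHIFT);
}

/* Set bit idx in a bitmap made of 64-bit words. */
static void
bit_set64(uint64_t *bitmap, uint64_t idx)
{
	assert(bitmap != NULL);
	bitmap[idx / 64] |= (uint64_t)1 << (idx % 64);
}

/* Return 1 if bit idx is set, 0 otherwise. */
static int
bit_test64(const uint64_t *bitmap, uint64_t idx)
{
	assert(bitmap != NULL);
	return ((int)((bitmap[idx / 64] >> (idx % 64)) & 1));
}
```

Because the index stays 64 bits wide from page_index() through the bitmap helpers, a page at or above 8 TiB maps to a distinct bit instead of a truncated or negative index.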
Konstantin Belousov
6d4b6bd3ce amd64 pmap: only calculate page table page when needed.
Noted by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26499
2020-09-21 15:53:41 +00:00
Konstantin Belousov
7149d7209e amd64 pmap: handle cases where pml4 page table page is not allocated.
Possible in LA57 pmap config.

Noted by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26492
2020-09-20 22:16:24 +00:00
Mark Johnston
d26ab2bec0 Fix some nits in 1G page support in the amd64 pmap.
- Move assertions out of the main loop to avoid duplicate conditional
  expressions, and improve assertion messages.
- Fix va_next updates.  In some cases we were not doing the wraparound
  check before continuing the loop.
- Use the right va_next.  In pmap_advise() and pmap_copy() we would step
  through 1G pages 2M at a time.
- Copy 1G mappings in pmap_copy().

Reviewed by:	alc, kib
MFC with:	r365518
Sponsored by:	Juniper Networks, Inc., Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26463
2020-09-19 15:22:04 +00:00
Eric van Gyzen
d8d2dda141 amd64 pmap_pkru_same: prev_ppr was always NULL
Fix the logic so it works as it appears.

Reported by:	Coverity
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	D26211 (in progress, so omitting full URL)
2020-09-18 20:53:40 +00:00
Mark Johnston
04636a71c6 Ensure that a protection key is selected in pmap_enter_largepage().
Reviewed by:	alc, kib
Reported by:	Coverity
MFC with:	r365518
Differential Revision:	https://reviews.freebsd.org/D26464
2020-09-18 12:30:39 +00:00
Edward Tomasz Napierala
70890254b3 Get rid of sv_errtbl and SV_ABI_ERRNO().
Reviewed by:	kib
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26388
2020-09-17 11:39:33 +00:00
Ed Maste
09860d44e4 bhyve: do not permit write access to VMCB / VMCS
Reported by:	Patrick Mooney
Submitted by:	jhb
Security:	CVE-2020-24718
2020-09-15 21:04:27 +00:00
Konstantin Belousov
101d5b527a bhyve: intercept AMD SVM instructions.
Intercept and report #UD to VM on SVM/AMD in case VM tried to execute an
SVM instruction.  Otherwise, SVM allows execution of them, and instructions
operate on host physical addresses despite being executed in guest mode.

Reported by:	Maxime Villard <max@m00nbsd.net>
admbug:	972
CVE:	CVE-2020-7467
Reviewed by:	grehan, markj
Differential revision:	https://reviews.freebsd.org/D26313
2020-09-15 20:22:50 +00:00
Edward Tomasz Napierala
c26391f4dd Move SV_ABI_ERRNO translation into linux-specific code, to simplify
the syscall path and declutter it a bit.  No functional changes intended.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26378
2020-09-15 16:41:21 +00:00
John Baldwin
d8dc46f6e9 Add constant for the DE_CFG MSR on AMD CPUs.
Reported by:	Patrick Mooney <pmooney@pfmooney.com>
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D25885
2020-09-11 20:32:40 +00:00
John Baldwin
385f4a5ac8 Use vmcb_read/write for the vmcb snapshot functions.
This avoids some unnecessary layers of indirection.
2020-09-10 22:22:23 +00:00
Konstantin Belousov
6cadbcd203 Add pmap_enter(9) PMAP_ENTER_LARGEPAGE flag and implement it on amd64.
The flag requests entry of non-managed superpage mapping of size
pagesizes[psind] into the page table.

Pmap supports fake wiring of the largepage mappings.  Only attributes
of the largepage mapping can be changed by calling pmap_enter(9) over an
existing mapping; the physical address of the page must be unchanged.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:50:24 +00:00
Konstantin Belousov
fceae0fd3c Fix assert.
Noted by:	alc
MFC after:	1 week
2020-09-09 21:35:44 +00:00
Konstantin Belousov
4ebfc4edaf amd64 pmap: teach functions walking user page tables about PG_PS bit in PDPE.
Only unmanaged 1G superpages are handled.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:08:45 +00:00
Konstantin Belousov
6e64bebb6f amd64: report support for 1G superpages in getpagesizes(2).
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2020-09-09 21:01:36 +00:00
Mark Johnston
847ab36bf2 Include the psind in data returned by mincore(2).
Currently we use a single bit to indicate whether the virtual page is
part of a superpage.  To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl.  MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26238
2020-09-02 18:16:43 +00:00
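The encoding described in 847ab36bf2 can be sketched as a two-bit field inside each mincore(2) result byte. The EX_ names below are hypothetical, and the numeric values (0x20 for the old flag, 0x60 for the new mask, a shift of 5) are my assumption about the layout; they are chosen so that MINCORE_PSIND(1) equals the old MINCORE_SUPER, as the commit message states.

```c
#include <assert.h>

#define EX_MINCORE_INCORE	0x01	/* page is resident */
#define EX_MINCORE_SUPER_OLD	0x20	/* assumed old single-bit flag */
#define EX_MINCORE_SUPER	0x60	/* assumed two-bit mask */
#define EX_MINCORE_PSIND_SHIFT	5
#define EX_MINCORE_PSIND(i) \
	(((i) << EX_MINCORE_PSIND_SHIFT) & EX_MINCORE_SUPER)

/* Build a result byte for a resident page mapped with page size index psind. */
static unsigned char
mincore_encode(unsigned psind)
{
	assert(psind <= 3);	/* two bits: indices 0 through 3 */
	return ((unsigned char)(EX_MINCORE_INCORE | EX_MINCORE_PSIND(psind)));
}

/* Recover the page size index from a result byte. */
static unsigned
mincore_psind(unsigned char vec_entry)
{
	return ((vec_entry & EX_MINCORE_SUPER) >> EX_MINCORE_PSIND_SHIFT);
}
```

A caller would use the decoded index to look up the mapping size in the array returned by getpagesizes(3). Code that only tests the old single bit still recognises index 1 but would misread index 2.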
Mark Johnston
2d838cd867 Add the MEM_EXTRACT_PADDR ioctl to /dev/mem.
This allows privileged userspace processes to find information about the
physical page backing a given mapping.  It is useful in applications
such as DPDK which perform some of their own memory management.

Reviewed by:	kib, jhb (previous version)
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D26237
2020-09-02 18:12:47 +00:00
Konstantin Belousov
8f8838c059 Fix a page table pages leak after LA57.
If the call to _pmap_allocpte() is not sleepable, it is possible that
allocation of PML4 or PDP page is successful but either PDP or PD page
is not.  Restructured code in _pmap_allocpte() leaves zero-referenced
page in the paging structure.

Handle it by checking refcount of the page one level above failed
alloc and free that page if its reference count is zero.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26293
2020-09-02 15:55:16 +00:00
Mateusz Guzik
543769bf83 amd64: clean up empty lines in .c and .h files
2020-09-01 21:16:54 +00:00
Matt Macy
fb702b4446 ZFS: clarify dependencies for static linking
2020-08-28 17:06:35 +00:00
Konstantin Belousov
f1e1c4ad73 Restore workaround for sysret fault on non-canonical address after LA57.
Sponsored by:	The FreeBSD Foundation
2020-08-24 22:12:45 +00:00
Peter Grehan
a333a508a2 cpu_auxmsr: assert caller is preventing CPU migration.
Submitted by:	Adam Fenn (adam at fenn dot io)
Requested by:	kib
Reviewed by:	kib, grehan
Approved by:	kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D26166
2020-08-24 11:49:49 +00:00
Konstantin Belousov
f446480b5f amd64: Handle 5-level paging on wakeup.
We can switch into long mode directly with LA57 enabled.

Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25273
2020-08-23 20:43:23 +00:00
Konstantin Belousov
177622f1fd amd64: Handle 5-level paging for efirt calls.
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25273
2020-08-23 20:40:35 +00:00
Konstantin Belousov
f3eb12e4a6 Add bhyve support for LA57 guest mode.
Noted and reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25273
2020-08-23 20:37:21 +00:00
Konstantin Belousov
25f2da2e64 Add amd64 procctl(2) ops to manage forced LA48/LA57 VA after exec.
Tested by:	pho (LA48 hardware)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25273
2020-08-23 20:32:13 +00:00
Konstantin Belousov
9ce875d9b5 amd64 pmap: LA57 AKA 5-level paging
Since LA57 was moved to the main SDM document with revision 072, it
seems that we should have support for it, and silicon is coming.

This patch makes pmap support both LA48 and LA57 hardware.  The
selection of page table level is done at startup, kernel always
receives control from loader with 4-level paging.  It is not clear how
UEFI spec would adapt to LA57; for instance, it could hand off control in
LA57 mode sometimes.

To switch from LA48 to LA57 requires turning off long mode, requesting
LA57 in CR4, then re-entering long mode.  This is somewhat delicate
and done in pmap_bootstrap_la57().  AP startup in LA57 mode is much
easier, we only need to toggle a bit in CR4 and load right value in CR3.

I decided to not change kernel map for now.  Single PML5 entry is
created that points to the existing kernel_pml4 (KML4Phys) page, and a
pml5 entry to create our recursive mapping for vtopte()/vtopde().
This decision is motivated by the fact that we cannot overcommit for
KVA, so large space there is unusable until machines start providing
wider physical memory addressing.  Another reason is that I do not
want to break our fragile autotuning, so the KVA expansion is not
included into this first step.  Nice side effect is that minidumps are
compatible.

On the other hand, (very) large address space is definitely
immediately useful for some userspace applications.

For userspace, numbering of pte entries (or page table pages) is
always done for 5-level structures even if we operate in 4-level mode.
The pmap_is_la57() function is added to report the mode of the
specified pmap, this is done not to allow simultaneous 4-/5-levels
(which is not allowed by hw), but to accommodate EPT, which has
separate level control and in principle might not allow 5-level EPT
even though x86 paging supports it. Anyway, it does not seem critical to
have 5-level EPT support now.
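
As a rough illustration of the difference between the two modes, the sketch
below computes the per-level page-table index of a virtual address.  It
relies only on the architectural layout (4 KiB pages, 9 index bits per
level); the helper names pt_index() and va_bits() are hypothetical and are
not taken from the pmap code.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* 4 KiB pages */
#define LEVEL_BITS	9	/* 512 entries per page-table page */

/*
 * Index into the page-table page at the given level for va.
 * Level 0 is the leaf page table, level 3 is the PML4 and level 4 is
 * the PML5, which is only walked when 5-level paging (LA57) is on.
 */
static unsigned
pt_index(uint64_t va, int level)
{
	return ((va >> (PAGE_SHIFT + LEVEL_BITS * level)) &
	    ((1u << LEVEL_BITS) - 1));
}

/* Number of virtual-address bits translated by a walk of 'levels' levels. */
static int
va_bits(int levels)
{
	return (PAGE_SHIFT + LEVEL_BITS * levels);
}
```

With four levels this gives 48 translated bits (LA48) and with five levels
57 bits (LA57).  An address such as 1 << 48 has PML5 index 1, so it is only
reachable in LA57 mode.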

Tested by:	pho (LA48 hardware)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25273
2020-08-23 20:19:04 +00:00
Eric van Gyzen
d17136fc3d amd64 pmap: potential integer overflowing expression
Coverity has identified the line in this change as "Potential integer
overflowing expression" due to the variable i declared as an int
and used in an expression with vm_paddr_t, a 64bit variable.

This change has very little effect, as when this line is executed
nkpt is small and phys_addr is at the beginning of physical memory.
But there is no explicit protection that the above is true.
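
A minimal sketch of this class of problem, not the actual pmap line: the
product is computed in the narrow type of the index and only then widened.
The sketch uses uint32_t instead of int so that the wraparound is
well-defined and can be observed; paddr_t, addr_narrow() and addr_wide() are
hypothetical stand-ins.

```c
#include <stdint.h>

typedef uint64_t paddr_t;	/* stand-in for vm_paddr_t */
#define PAGE_SIZE	4096u

/* The multiply happens in 32 bits and wraps before the 64-bit add. */
static paddr_t
addr_narrow(paddr_t base, uint32_t i)
{
	uint32_t off = i * PAGE_SIZE;

	return (base + off);
}

/* Widening the index first keeps the whole expression in 64 bits. */
static paddr_t
addr_wide(paddr_t base, uint32_t i)
{
	return (base + (paddr_t)i * PAGE_SIZE);
}
```

For small indices both functions agree, which matches the observation that
the original expression is harmless while nkpt is small.  At i = 1 << 20 the
narrow version wraps to an offset of 0 while the wide version yields 1 << 32.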

Submitted by:	bret_ketchum@dell.com
Reported by:	Coverity
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D26141
2020-08-21 14:22:32 +00:00
Mark Johnston
72c7f24c8d Use pmap_mapbios() to map ACPI tables on amd64 and i386.
The ACPI table-mapping code used pmap_kenter_temporary() to create
mappings, which in turn uses the fixed-size crashdump map.  Moreover,
the code was not verifying that the table fits in this map, so when
mapping large tables we could clobber adjacent mappings.  This use of
pmap_kenter_temporary() appears to predate support in pmap_mapbios() for
creating early mappings, but that restriction no longer applies.

PR:		248746
Reviewed by:	kib, mav
Tested by:	gallatin, Curtis Villamizar <curtis@ipv6.occnc.com>
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26125
2020-08-20 00:52:53 +00:00
Alexander Motin
8f6355b51d Remove some noisy ACPI tables messages from verbose dmesg.
Those messages were printed hundreds of times during boot, often multiple
times for each table.  We already print information about the tables in
more organized form once to not duplicate it when random ACPI drivers are
attaching.

MFC after:	1 week
2020-08-19 16:09:36 +00:00
Mateusz Guzik
a125ed50a6 linux: add sysctl compat.linux.use_emul_path
This is a step towards facilitating jails with only Linux binaries.
Supporting emul_path adds path lookups which are completely spurious
if the binary at hand runs in a Linux-based root directory.

It defaults to on (== current behavior).

make -C /root/linux-5.3-rc8 -s -j 1 bzImage:

use_emul_path=1: 101.65s user 68.68s system 100% cpu 2:49.62 total
use_emul_path=0: 101.41s user 64.32s system 100% cpu 2:45.02 total
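
The relative saving implied by these two runs can be worked out with a few
lines of arithmetic.  The helpers below are illustrative only and use
nothing beyond the figures quoted above.

```c
/* Relative reduction, in percent, going from 'before' to 'after'. */
static double
pct_reduction(double before, double after)
{
	return ((before - after) / before * 100.0);
}

/* Convert a minutes:seconds wall-clock reading to seconds. */
static double
to_seconds(int minutes, double seconds)
{
	return (minutes * 60.0 + seconds);
}
```

pct_reduction(68.68, 64.32) is about 6.3, and
pct_reduction(to_seconds(2, 49.62), to_seconds(2, 45.02)) is about 2.7.  For
this single pair of runs, that is roughly 6.3% less system time and 2.7%
less wall-clock time.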
2020-08-18 22:04:22 +00:00
Mateusz Guzik
d5e3895ea4 linux: consistently use LFREEPATH instead of open-coding it
2020-08-18 22:03:55 +00:00
Peter Grehan
3a3f1e9dfa Export a routine to provide the TSC_AUX MSR value and use this in vmm.
Also, drop an unnecessary set of braces.

Requested by:	kib
Reviewed by:	kib
MFC after:	3 weeks
2020-08-18 11:36:38 +00:00
Peter Grehan
f5f5f1e7d6 Support guest rdtscp and rdpid instructions on Intel VT-x
Enable any of rdtscp and/or rdpid for bhyve guests on Intel-based hosts
that support the "enable RDTSCP" VM-execution control.

Submitted by:	adam_fenn.io
Reported by:	chuck
Reviewed by:	chuck, grehan, jhb
Approved by:	jhb (bhyve), grehan
MFC after:	3 weeks
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D26003
2020-08-18 07:23:47 +00:00
Peter Grehan
46567b4f5e Allow guest device MMIO access from bootmem memory segments.
Recent versions of UEFI have moved local APIC timer initialization into
the early SEC phase which runs out of ROM, prior to self-relocating
into RAM. This results in a hypervisor exit.

Currently bhyve prevents instruction emulation from segments that aren't
marked as "sysmem" aka guest RAM, with the vm_gpa_hold() routine failing.
However, there is no reason for this restriction: the hypervisor already
controls whether EPT mappings are marked as executable.

Fix by dropping the redundant check of sysmem.

MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D25955
2020-08-18 07:08:17 +00:00
Ruslan Bukin
c4cd699010 o Add machine/iommu.h and include MD iommu headers from it,
so we don't ifdef for every arch in busdma_iommu.c;
o No need to include specialreg.h for x86, remove it.

Requested by:	andrew
Reviewed by:	kib
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D25957
2020-08-05 19:11:31 +00:00
Alexander Motin
aba10e131f Allow swi_sched() to be called from NMI context.
For purposes of handling hardware errors reported via NMIs I need a way to
escape NMI context, which is too restrictive to do anything significant.

To do it this change introduces new swi_sched() flag SWI_FROMNMI, making
it careful about used KPIs.  On platforms allowing IPI sending from NMI
context (x86 for now) it immediately wakes clk_intr_event via new IPI_SWI,
otherwise it works just like SWI_DELAY.  To handle the delayed SWIs this
patch calls clk_intr_event on every hardclock() tick.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D25754
2020-07-25 15:19:38 +00:00
Alex Richardson
b798ef6490 Include TMPFS in all the GENERIC kernel configs
Being able to use tmpfs without kernel modules is very useful when building
small MFS_ROOT kernels without a real file system.
Including TMPFS also matches arm/GENERIC and the MIPS std.MALTA configs.

Compiling TMPFS only adds 4 .c files so this should not make much of a
difference to NO_MODULES build times (as we do for our minimal RISC-V
images).

Reviewed By: br (earlier version for riscv), brooks, emaste
Differential Revision: https://reviews.freebsd.org/D25317
2020-07-24 08:40:04 +00:00
Alexander Motin
ce53f590ca Untie nmi_handle_intr() from DEV_ISA.
The only part of nmi_handle_intr() depending on ISA is isa_nmi(), which is
already wrapped.  Entering debugger on NMI does not really depend on ISA.

MFC after:	2 weeks
2020-07-22 20:15:21 +00:00
Alexander Motin
f2b6da18de Avoid code duplication by using ipi_selected().
MFC after:	2 weeks
2020-07-21 17:18:38 +00:00
Konstantin Belousov
0675f4e1ca Simplify non-pti syscall entry on amd64.
Limit manipulations to use %rax as scratch to the pti portion of the
syscall entry code.

Submitted by:	alc
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25722
2020-07-19 17:47:55 +00:00
Konstantin Belousov
3ec7e1695c amd64 pmap: microoptimize local shootdowns for PCID PTI configurations
When pmap operates in PTI mode, we must reload %cr3 on return to
userspace.  In non-PCID mode the reload always flushes all non-global
TLB entries and we take advantage of it by only invalidating the KPT
TLB entries (there are no cached UPT entries at all).

In PCID mode, we flush both KPT and UPT TLB explicitly, but we can
take advantage of the fact that PCID mode command to reload %cr3
includes a flag to flush/not flush target TLB.  In particular, we can
avoid the flush for UPT, instead record that load of pc_ucr3 into %cr3
on return to usermode should be flushing.  This is done by providing
either all-1s or ~CR3_PCID_MASK in pc_ucr3_load_mask.  The mask is
automatically reset to all-1s on return to usermode.

Similarly, we can avoid flushing UPT TLB on context switch, replacing
it by setting pc_ucr3_load_mask.  This unifies INVPCID and non-INVPCID
PTI ifunc, leaving only 4 cases instead of 6.  This trick is also
applicable both to the TLB shootdown IPI handlers, since handlers
interrupt the target thread.

But then we need to check pc_curpmap in handlers, and this would
reopen the same race for INVPCID machines as was fixed in r306350 for
non-INVPCID.  To not introduce the same bug, unconditionally do
spinlock_enter() in pmap_activate().

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25483
2020-07-18 18:19:57 +00:00
Edward Tomasz Napierala
3e9a214260 Regen after r363304.
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-07-18 11:31:31 +00:00
Edward Tomasz Napierala
8d1d017175 Add a trivial linux(4) splice(2) implementation, which simply
returns EINVAL.  Fixes grep (grep-3.1-2build1).

PR:		kern/218699
Reported by:	avos
Reviewed by:	emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25636
2020-07-18 11:28:40 +00:00
Conrad Meyer
4ae224c663 Revert r240317 to prevent leaking pmap entries
Subsequent to r240317, kmem_free() was replaced with kva_free() (r254025).
kva_free() releases the KVA allocation for the mapped region, but no longer
clears the pmap (pagetable) entries.

An affected pmap_unmapdev operation would leave the still-pmap'd VA space
free for allocation by other KVA consumers.  However, this bug easily
avoided notice for ~7 years because most devices (1) never call
pmap_unmapdev and (2) on amd64, mostly fit within the DMAP and do not need
KVA allocations.  Other affected arch are less popular: i386, MIPS, and
PowerPC.  Arm64, arm32, and riscv are not affected.

Reported by:	Don Morris <dgmorris AT earthlink.net>
Submitted by:	Don Morris (amd64 part)
Reviewed by:	kib, markj, Don (!amd64 parts)
MFC after:	I don't intend to, but you might want to
Sponsored by:	Dell Isilon
Differential Revision:	https://reviews.freebsd.org/D25689
2020-07-16 23:29:26 +00:00
Mark Johnston
e64080e79c Switch from SCTP to SCTP_SUPPORT in GENERIC configs.
This removes SCTP from in-tree kernel configuration files.  Now, SCTP
can be enabled by simply loading the module, as discussed on
freebsd-net@.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25611
2020-07-16 15:09:04 +00:00
Mateusz Guzik
c4e64133d8 amd64: patch ffsl to use the compiler builtin
This shortens fdalloc by over 60 bytes. Correctness verified by running both
variants at the same time and comparing the result of each call.
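
The verification approach can be sketched in userland as follows.  This is
not the kernel code: ffsl_loop() is a hypothetical reference implementation,
and __builtin_ffsl() is the GCC/Clang builtin.

```c
#include <limits.h>

/* Reference: 1-based index of the least significant set bit, 0 if none. */
static int
ffsl_loop(long mask)
{
	unsigned long m = (unsigned long)mask;
	int bit;

	if (m == 0)
		return (0);
	for (bit = 1; (m & 1) == 0; bit++)
		m >>= 1;
	return (bit);
}

/* Variant that defers to the compiler builtin. */
static int
ffsl_builtin(long mask)
{
	return (__builtin_ffsl(mask));
}

/* Run both variants on a set of inputs; return 1 if they always agree. */
static int
ffsl_variants_agree(void)
{
	int nbits = (int)(sizeof(long) * CHAR_BIT);

	if (ffsl_loop(0) != ffsl_builtin(0))
		return (0);
	for (int b = 0; b < nbits; b++) {
		long one = (long)(1UL << b);		/* only bit b set */
		long high = (long)(~0UL << b);		/* bit b and above set */

		if (ffsl_loop(one) != ffsl_builtin(one) ||
		    ffsl_loop(high) != ffsl_builtin(high))
			return (0);
	}
	return (1);
}
```

Both variants return 4 for an input of 8 and 0 for an input of 0.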

Note someone(tm) should make a pass at converting everything else feasible.
2020-07-16 11:28:24 +00:00
Konstantin Belousov
a64fab1dd1 Grammar and typo fixes.
Submitted by:	alc
MFC after:	20 days
2020-07-15 09:48:36 +00:00
Konstantin Belousov
dc43978aa5 amd64: allow parallel shootdown IPIs
Stop using smp_ipi_mtx to protect global shootdown state, and
move/multiply the global state into pcpu.  Now each CPU can initiate
shootdown IPI independently from other CPUs.  Initiator enters
critical section, then fills its local PCPU shootdown info
(pc_smp_tlb_XXX), then clears scoreboard generation at location (cpu,
my_cpuid) for each target cpu.  After that IPI is sent to all targets
which scan for zeroed scoreboard generation words.  Upon finding such
word the shootdown data is read from corresponding cpu's pcpu, and
generation is set.  Meantime initiator loops waiting for all zeroed
generations in scoreboard to update.
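
The handshake described above can be modelled in a few lines.  This is a
single-threaded sketch under loud assumptions: the real code uses atomics,
real IPIs and per-CPU data, while here struct pcpu_model, scoreboard[][] and
the function names are all hypothetical and the "IPI" is just a direct call
to handler().

```c
#include <stdint.h>

#define NCPU	4

struct pcpu_model {
	uint64_t smp_tlb_addr;		/* stand-in for pc_smp_tlb_XXX */
	uint32_t smp_tlb_gen;		/* nonzero generation of the request */
	uint64_t last_seen_addr;	/* last address handled as a target */
};

static struct pcpu_model cpus[NCPU];
/* scoreboard[target][initiator]; zero means "request pending". */
static uint32_t scoreboard[NCPU][NCPU];

static void
model_init(void)
{
	for (int t = 0; t < NCPU; t++) {
		cpus[t] = (struct pcpu_model){ .smp_tlb_gen = 1 };
		for (int i = 0; i < NCPU; i++)
			scoreboard[t][i] = 1;
	}
}

/* Initiator: publish the request locally, then zero the target slots. */
static void
initiate(int me, unsigned targets, uint64_t addr)
{
	cpus[me].smp_tlb_addr = addr;
	if (++cpus[me].smp_tlb_gen == 0)	/* generation 0 is reserved */
		cpus[me].smp_tlb_gen = 1;
	for (int t = 0; t < NCPU; t++)
		if (targets & (1u << t))
			scoreboard[t][me] = 0;
	/* The real code would send the IPI to all targets here. */
}

/* Target: loop until no zeroed generation words remain in our row. */
static int
handler(int me)
{
	int handled = 0, found;

	do {
		found = 0;
		for (int i = 0; i < NCPU; i++) {
			if (scoreboard[me][i] != 0)
				continue;
			uint64_t addr = cpus[i].smp_tlb_addr;
			/* Generation is set before the invalidation. */
			scoreboard[me][i] = cpus[i].smp_tlb_gen;
			cpus[me].last_seen_addr = addr;	/* "invalidate" */
			handled++;
			found = 1;
		}
	} while (found);
	return (handled);
}

/* Initiator: done once every targeted slot is nonzero again. */
static int
initiator_done(int me, unsigned targets)
{
	for (int t = 0; t < NCPU; t++)
		if ((targets & (1u << t)) && scoreboard[t][me] == 0)
			return (0);
	return (1);
}
```

Because each initiator owns its own column of the scoreboard, two CPUs can
have requests outstanding against the same target at once.  In the model,
one handler() call on that target picks up both.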

Initiator does not disable interrupts, which should prevent
non-invalidation IPIs from deadlocking; it only needs to disable
preemption to pin itself to the instance of the pcpu smp_tlb data.

The generation is set before the actual invalidation is performed in
handler. It is safe because target CPU cannot return to userspace
before handler finishes. In principle only NMI can preempt the
handler, but NMI would see the kernel handler frame and not touch
not-invalidated user page table.

Handlers loop until they do not see zeroed scoreboard generations.
This, together with hardware keeping one pending IPI in LAPIC IRR
should prevent lost shootdowns.

Notes.
1. The code does not protect writes to LAPIC ICR with exclusion. I believe
   this is fine because we in fact do not send IPIs from interrupt
   handlers. Moreover, for !x2APIC mode, where ICR access for write requires
   two register writes, we disable interrupts around it. If considered
   incorrect, I can add per-cpu spinlock around ipi_send().
2. Scoreboard lines owned by given target CPU can be padded to the
   cache line, to reduce ping-pong.

Reviewed by:	markj (previous version)
Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D25510
2020-07-14 20:37:50 +00:00
Scott Long
ffc568ba8b Revert r362998, r326999 while a better compatibility strategy is devised.
2020-07-09 22:38:36 +00:00
Scott Long
b302c2e5c9 Migrate the feature of excluding RAM pages to use "excludelist"
as its nomenclature.

MFC after:	1 week
2020-07-07 20:33:11 +00:00
Andrew Turner
fcf7a48191 Rerun kernel ifunc resolvers after all CPUs have started
On architectures that use RELA relocations it is safe to rerun the ifunc
resolvers after all CPUs have started, but while they are still parked.

On arm64 with big.LITTLE this is needed as some SoCs have shipped with
different ID register values for the big and little clusters, meaning we were
unable to rely on the register values from the boot CPU.

Add support for rerunning the resolvers on arm64 and amd64 as these are
both RELA using architectures.

Reviewed by:	kib
Sponsored by:	Innovate UK
Differential Revision:	https://reviews.freebsd.org/D25455
2020-07-05 14:38:22 +00:00
Conrad Meyer
c74a3041f0 Add domain policy allocation for amd64 fpu_kern_ctx
Like other types of allocation, fpu_kern_ctx are frequently allocated per-cpu.
Provide the API and sketch some example consumers.

fpu_kern_alloc_ctx_domain() preferentially allocates memory from the
provided domain, and falls back to other domains if that one is empty
(DOMAINSET_PREF(domain) policy).
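
The "prefer one domain, fall back to the others" behaviour can be sketched
with a toy allocator.  Everything here is hypothetical: free_units[] stands
in for per-domain free memory, and the round-robin fallback order is an
assumption of the sketch, not a statement about how DOMAINSET_PREF() picks
the next domain.

```c
#define NDOMAINS	4

/* Toy model: number of free allocation units in each NUMA domain. */
static int free_units[NDOMAINS];

/*
 * Try the preferred domain first, then the remaining domains in
 * round-robin order.  Returns the domain that satisfied the request,
 * or -1 if every domain is empty.
 */
static int
alloc_pref(int domain)
{
	for (int i = 0; i < NDOMAINS; i++) {
		int d = (domain + i) % NDOMAINS;

		if (free_units[d] > 0) {
			free_units[d]--;
			return (d);
		}
	}
	return (-1);
}
```

A caller that allocates one context per CPU would pass that CPU's domain, so
the context normally lands in local memory and only spills to another domain
when the local one is exhausted.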

Maybe it makes more sense to just shove one of these in the DPCPU area
sooner or later -- left for future work.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D22053
2020-07-03 14:54:46 +00:00