The current mitigation for L1TF in bhyve flushes L1D either with an
explicit WRMSR of the flush command MSR, or by software reading enough
uninteresting data to fully populate all lines of L1D. If an NMI occurs
after either method completes, but before VM entry, L1D becomes polluted
with the cache lines touched by the NMI handler. The NMI handler itself
accesses no interesting data, but something sensitive might be co-located
on the same cache line, and then L1TF exposes it to a rogue guest.
Use the VM entry MSR load list to make the L1D flush atomic with VM
entry when updated microcode is loaded. If only the software flush
method is available, help the bhyve software flusher by also flushing
L1D on NMI exit to kernel mode.
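A minimal sketch of the MSR-load-list idea follows; the entry layout is
taken from the Intel SDM, while the structure, helper and variable names
are illustrative and not the committed bhyve code:

/*
 * Sketch only: append an IA32_FLUSH_CMD = L1D_FLUSH entry to the
 * VM-entry MSR-load area, so the flush happens as part of VM entry
 * itself and no NMI can run between the flush and the entry.
 */
#include <stdint.h>

struct vmx_msr_entry {			/* layout per the Intel SDM */
	uint32_t	index;		/* MSR number */
	uint32_t	reserved;
	uint64_t	value;		/* value loaded on VM entry */
};

#define	MSR_IA32_FLUSH_CMD	0x10b
#define	IA32_FLUSH_CMD_L1D	0x1

static void
vmx_add_l1d_flush_msr(struct vmx_msr_entry *load_area, uint32_t *count)
{
	load_area[*count].index = MSR_IA32_FLUSH_CMD;
	load_area[*count].reserved = 0;
	load_area[*count].value = IA32_FLUSH_CMD_L1D;
	(*count)++;
	/*
	 * The updated count must also be written to the VMCS
	 * VM-entry MSR-load count field, and the area address to the
	 * VM-entry MSR-load address field.
	 */
}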
Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16790
- In configurations with a pseudo devices section, move 'device crypto'
into that section.
- Use a consistent comment. Note that other things common in kernel
configs such as GELI also require 'device crypto', not just IPSEC.
Reviewed by: rgrimes, cem, imp
Differential Revision: https://reviews.freebsd.org/D16775
Ensure that valid PCID state is created for the proc0 pmap, since it
might be used by efirt enter() before the first context switch on the BSP.
Sponsored by: The FreeBSD Foundation
MFC after: 6 days
On guest entry in bhyve, flush the L1 data cache, using either the L1D
flush command MSR if available, or by reading enough uninteresting data
to fill the whole cache.
The flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.
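A rough sketch of the two methods, with illustrative names and a buffer
size chosen only to comfortably exceed current L1D sizes; the committed
code differs in detail:

#include <sys/param.h>
#include <machine/cpufunc.h>		/* wrmsr() */

#define	MSR_IA32_FLUSH_CMD	0x10b
#define	IA32_FLUSH_CMD_L1D	0x1
#define	L1D_FLUSH_BUF_SIZE	(64 * 1024)

static char l1d_flush_buf[L1D_FLUSH_BUF_SIZE];

static void
l1d_flush_sketch(int have_flush_cmd)
{
	volatile char sink;
	int i;

	if (have_flush_cmd) {
		/* Architectural flush via the L1D flush command MSR. */
		wrmsr(MSR_IA32_FLUSH_CMD, IA32_FLUSH_CMD_L1D);
		return;
	}
	/* Fallback: read enough data to displace every L1D line. */
	for (i = 0; i < L1D_FLUSH_BUF_SIZE; i += 64)
		sink = l1d_flush_buf[i];
}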
Security: CVE-2018-3646
Reviewed by:	emaste, jhb, Tony Luck <tony.luck@intel.com>
Sponsored by: The FreeBSD Foundation
We always zero the invalidated PTE/PDE for superpages, which means that
the L1TF CPU vulnerability (CVE-2018-3620) can only be used for reading
from the physical page at address zero.
Note that both i386 and amd64 exclude that page from the phys_avail[]
array, so this change is redundant, but I believe that phys_avail[] on
UEFI boots is not required to exclude it. Eventually the blacklisting
should be skipped on CPUs which report that they are not vulnerable to
L1TF.
Reviewed by:	emaste, jhb
Sponsored by: The FreeBSD Foundation
curpmap.
When performing a context switch on a machine without PCID, if the
current %cr3 equals the new pmap's %cr3, which is typical for
kernel_pmap vs. a kernel process, I neglected to update the PCPU
curpmap value. Remove the check for %cr3 differing from pm_cr3 when
deciding whether to do the update; it is believed that this case cannot
happen at all, due to other changes in this revision.
Also, do not set the very first curpmap to kernel_pmap; it should be
the vmspace0 pmap instead, to match curproc.
Move the common code that activates the initial pmap, both on the BSP
and on APs, into the pmap_activate_boot() helper.
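The non-PCID path now looks roughly like the sketch below; this is a
simplification, the real pmap_activate_sw() also handles PCID, PTI and
the IPI interlocks:

#include <sys/param.h>
#include <sys/pcpu.h>
#include <sys/proc.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <machine/cpufunc.h>

static void
pmap_activate_sw_sketch(struct thread *td)
{
	pmap_t pmap;

	pmap = vmspace_pmap(td->td_proc->p_vmspace);
	/* Update curpmap unconditionally, even when %cr3 does not change. */
	PCPU_SET(curpmap, pmap);
	if (pmap->pm_cr3 != rcr3())
		load_cr3(pmap->pm_cr3);
}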
Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by: alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16618
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel. Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method. For the time being only Intel updates are supported.
Microcode update files are passed to the kernel via loader(8). The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update. Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update. Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.
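For illustration, a hedged sketch of the per-update match test implied
by the SDM header layout; extended signature tables are ignored here and
the function name is made up:

#include <stdbool.h>
#include <stdint.h>

struct intel_ucode_header {		/* per SDM section 9.11 */
	uint32_t header_version;
	uint32_t update_revision;
	uint32_t date;			/* BCD mmddyyyy */
	uint32_t processor_signature;	/* CPUID(1).EAX of the target CPU */
	uint32_t checksum;
	uint32_t loader_revision;
	uint32_t processor_flags;	/* one bit per platform ID */
	uint32_t data_size;		/* 0 means 2000 bytes */
	uint32_t total_size;		/* 0 means 2048 bytes */
	uint32_t reserved[3];
};

static bool
ucode_intel_matches(const struct intel_ucode_header *hdr,
    uint32_t cpu_signature, uint32_t platform_id)
{
	return (hdr->processor_signature == cpu_signature &&
	    (hdr->processor_flags & (1u << platform_id)) != 0);
}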
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16370
efi_enter was needed here because dereferencing efi_runtime faults
outside of the EFI context, since the runtime table lives in runtime
service space. This may cause problems early in boot, though, so
instead access the table by converting its paddr to a KVA.
While here, remove the other direct PHYS_TO_DMAP calls and the explicit DMAP
requirement from efidev.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16591
This patch adds a new sysctl(8) knob, "security.jail.vmm_allowed",
which is disabled by default.
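A hedged sketch of the check a vmm device method might perform;
PR_ALLOW_VMM is assumed here to be the prison flag backing the sysctl:

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/jail.h>
#include <sys/proc.h>

static int
vmm_jail_check_sketch(struct thread *td)
{
	/* Deny access from jails unless the jail opted in. */
	if (jailed(td->td_ucred) &&
	    !prison_allow(td->td_ucred, PR_ALLOW_VMM))
		return (EPERM);
	return (0);
}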
Submitted by: Shawn Webb <shawn.webb____hardenedbsd.org>
Reviewed by: jamie@ and myself.
Relnotes: Yes.
Sponsored by: HardenedBSD and G2, Inc.
Differential Revision: https://reviews.freebsd.org/D16057
As noted in UPDATING, the new loader tunable efi.rt_disabled may be used
to disable EFIRT at runtime. It should have no effect if the system was
not booted via UEFI.
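For example, a sketch of how such a tunable is typically consumed; the
variable and function names are illustrative, only the tunable name
comes from this change:

#include <sys/param.h>
#include <sys/kernel.h>

static int efirt_disabled;

static int
efirt_enabled_sketch(void)
{
	/* Honor the loader tunable before touching EFI runtime state. */
	TUNABLE_INT_FETCH("efi.rt_disabled", &efirt_disabled);
	return (efirt_disabled == 0);
}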
MFC after: 6 weeks
Ifunc selectors dispatch the copyin(9) family to the suitable variant,
which sets rflags.AC around userspace access. The rflags.AC bit is
cleared unconditionally in all kernel entry points, even on machines
not supporting SMAP.
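A sketch of the mechanism using the plain compiler ifunc attribute; the
committed code uses the kernel's DEFINE_IFUNC wrapper and the real
SMAP/non-SMAP copyin implementations, so the names below are
illustrative:

#include <stddef.h>

int copyin_smap(const void *uaddr, void *kaddr, size_t len);
int copyin_nosmap(const void *uaddr, void *kaddr, size_t len);

extern unsigned int cpu_stdext_feature;	/* CPUID leaf 7 EBX */
#define	CPUID_STDEXT_SMAP	0x00100000	/* bit 20 */

static int (*copyin_resolver(void))(const void *, void *, size_t)
{
	/* Pick the variant once, at ifunc resolution time. */
	return ((cpu_stdext_feature & CPUID_STDEXT_SMAP) != 0 ?
	    copyin_smap : copyin_nosmap);
}

int copyin(const void *uaddr, void *kaddr, size_t len)
    __attribute__((ifunc("copyin_resolver")));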
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
There is no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS); they
are otherwise the same. We don't need both for any system we support
(there were some really old ARC systems that did have an ISA/EISA bus,
but we never ran on them and they are too old to ever grow support).
Differential Revision: https://reviews.freebsd.org/D16290
Do not use vm_map_remove() to release KVA back to the system. Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables. Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().
Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.
While here, port r329171 to i386.
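The core of the approach, as a hedged sketch; the committed
kmem_bootstrap_free() does additional bookkeeping beyond this:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/vmem.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_extern.h>
#include <vm/vm_kern.h>

static void
kva_release_sketch(vm_offset_t start, vm_size_t size)
{
	vm_offset_t end;

	end = round_page(start + size);
	start = trunc_page(start);
	/* No VM object backs this KVA, so unmap directly in the pmap... */
	pmap_remove(kernel_pmap, start, end);
	/* ...and hand the address range back to the kernel KVA arena. */
	vmem_free(kernel_arena, start, end - start);
}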
Reported by: alc
Reviewed by: alc, kib
X-MFC with: r336505
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16426
the AMD document 55449 'Revision Guide for AMD Family 17h Models
00h-0Fh Processors' rev 1.12.
The errata numbers are mentioned near each action.
It seems that newer BIOSes already include the required chicken-bit
settings, so the magic MSR updates are only needed when the BIOS cannot
be updated. On the other hand, MWAIT avoidance seems to be important.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
code never sees FPU pcb flags not consistent with the hardware state.
This was exposed by the eager FPU switch mode.
Analyzed, reviewed and tested by: gleb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data. This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded. Previously,
it would have remained unused.
Reviewed by: kib, royger
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16330
In order to set up an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.
This allows booting FreeBSD as a PVHv2 DomU and Dom0.
Sponsored by: Citrix Systems R&D
The PVHv2 entry point is fairly similar to the multiboot1 one. The
kernel is started in protected mode with paging disabled. More
information about the exact BSP state can be found in the pvh.markdown
document in the Xen tree.
This entry point is going to be joined with the native entry point at
hammer_time, and in order to do so the BSP needs to be bootstrapped
into long mode with the same set of page tables as used on bare metal.
Sponsored by: Citrix Systems R&D
This restores counter(9) operation.
Revert r336024. Improve the assertion on the pcpu size on x86.
Reviewed by: mmacy
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16163
Due to the way rtld creates mappings for shared objects, each dso
causes at least three guard map entries to be unmapped. For instance,
under a buildworld load, this change reduces the number of
pmap_remove() calls by 1/5.
Profiled by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16148
SMP systems by extending defined(SMP) to include defined(KLD_MODULE).
This is a regression from r335873.
Discussed with: mmacy@
Sponsored by: Mellanox Technologies
Apply a temporary fix to counter(9) until daylight hours.
The fact that the counter_u64_add assembly relied on sizeof(struct pcpu)
as the basis for its otherwise arbitrary offset never came up in D15933.
critical_{enter,exit} is now inline, so the only real added overhead is
the (mostly false) conditional branch in exit.
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to a non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity at
which the VM can allocate a slab for a given domain. If you have fewer
than PAGE_SIZE/8 counters on your system, some memory will be wasted,
but this is obviously something where you want the cache line to come
from the correct domain.
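A sketch of the resulting addressing scheme, similar in spirit to the
zpcpu_get() accessor; the helper name here is illustrative:

#include <sys/param.h>
#include <sys/pcpu.h>

/*
 * A counter allocated from the pcpu zone points into CPU 0's page;
 * every other CPU's copy lives at a fixed UMA_PCPU_ALLOC_SIZE
 * (== PAGE_SIZE) stride from that base.
 */
static inline uint64_t *
counter_cpu_ptr_sketch(uint64_t *base, int cpuid)
{
	return ((uint64_t *)((char *)base +
	    (uintptr_t)cpuid * UMA_PCPU_ALLOC_SIZE));
}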
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
It is possible that a fictitious unmanaged userspace mapping of a
superpage is created on x86, e.g. by pmap_object_init_pt(), with a
physical address outside the vm_page_array[] coverage.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
physical address, which is readily available after a successful
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
mapping, then it leaks the unlinked PV entry. This change eliminates that
leak, freeing the PV entry.
Reviewed by: kib, markj
X-MFC with: r335784
Differential Revision: https://reviews.freebsd.org/D16130
returning NULL.
vm_fault_quick_hold_pages() can be legitimately called on userspace
mappings backed by fictitious pages created by unmanaged device and sg
pagers.
Note that other architectures' pmap_extract_and_hold() might need a
similar fix, but I postponed that examination.
Reported by: bde
Discussed with: alc
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16085
The ADD, AND, OR, and SUB instructions take at most a 32-bit
sign-extended immediate operand. 64-bit constants that do not fit into
that constraint need to be loaded into a register. The 'i' constraint
tells the compiler it can pass any integer constant to the assembler,
whereas the 'e' constraint only permits constants that fit into a 32-bit
sign-extended value. This fixes using
atomic_add/clear/set/subtract_long/64 with constants that do not fit into
a 32-bit sign-extended immediate.
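A reduced illustration of the constraint change; the real atomic(9)
macros are more elaborate than this sketch:

#include <stdint.h>

static inline void
atomic_add_64_sketch(volatile uint64_t *p, uint64_t v)
{
	/*
	 * "e" accepts only immediates that fit a 32-bit sign-extended
	 * operand; anything larger is forced into a register by "r".
	 * The previous "i" constraint allowed the compiler to emit an
	 * immediate that ADDQ cannot encode.
	 */
	__asm __volatile("lock; addq %1,%0"
	    : "+m" (*p)
	    : "er" (v)
	    : "cc");
}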
Reported by: several folks
Tested by: Pete Wright <pete@nomadlogic.org>
MFC after: 2 weeks
- inline atomics in modules on i386 and amd64 (they were always
inline on other arches)
- allow modules to opt in to inlining locks by specifying
MODULE_TIED=1 in the makefile
Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
Doing so ensures that all threads sharing the pmap have a consistent
view of the mapping. This fixes the problem described in the commit
log messages for r329254 without the overhead of an extra fault in the
common case. Once other pmap_enter() implementations are similarly
modified, the workaround added in r329254 can be removed, reducing the
overhead of CoW faults.
With this change we can reuse the PV entry from the old mapping,
potentially avoiding a call to reclaim_pv_chunk(). Otherwise, there is
nothing preventing the old PV entry from being reclaimed. In rare
cases this could result in the PTE's page table page being freed,
leading to a use-after-free of the page when the updated PTE is written
following the allocation of the PV entry for the new mapping.
Reported and tested by: pho
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D16005
without error code. Doing so mis-aligned the stack.
Since the only consumer of SSE instructions with alignment
requirements is the AES-NI module, and since the FPU context cannot be
accessed in interrupts, the only situation where the alignment matters
is compat32 syscalls, as reported in the PR.
PR: 229222
Reported and tested by: dewayne@heuristicsystems.com.au
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The call to reclaim_pv_chunk() in reserve_pv_entries() may free a
PV chunk with free entries belonging to the current pmap. In this
case we must account for the free entries that were reclaimed, or
reserve_pv_entries() may return without having reserved the requested
number of entries.
Reviewed by: alc, kib
Tested by: pho (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15911
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.
The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
v = v0 * 1000000 + v1 * 1000 + v2;
The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
v = (((a) << 16) + ((b) << 8) + (c))
The Linux kernel uses the latter format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h
Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.
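A standalone worked example of the two encodings of 2.6.32:

#include <assert.h>

#define	LINUX_KERNVER(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

int
main(void)
{
	/* Old linux_map_osrel()-style encoding. */
	assert(2 * 1000000 + 6 * 1000 + 32 == 2006032);
	/* KERNEL_VERSION()-style encoding, now used by both paths. */
	assert(LINUX_KERNVER(2, 6, 32) == 0x020620);
	return (0);
}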
PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
Update the driver to use iflib in order to bring performance,
maintainability, and (hopefully) stability benefits.
The driver currently isn't completely ported; features that are missing:
- VF driver (ixlv)
- SR-IOV host support
- RDMA support
The plan is to have these re-added to the driver before the next FreeBSD release.
Reviewed by: gallatin@
Contributions by: gallatin@, mmacy@, krzysztof.galazka@intel.com
Tested by: jeffrey.e.pieper@intel.com
MFC after: 1 month
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15577
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.
Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files. We may need finer-grained control in the future, but this is
sufficient for now.
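The pattern looks roughly like this, using linux_open() as an
illustration; the body shown is a simplification, not the committed
handler:

#ifdef LINUX_LEGACY_SYSCALLS
int
linux_open(struct thread *td, struct linux_open_args *args)
{

	/*
	 * Legacy, non-*at flavour: funnel into the common *at
	 * implementation with AT_FDCWD.  (Simplified; the real handler
	 * also performs path translation.)
	 */
	return (linux_common_open(td, AT_FDCWD, args->path, args->flags,
	    args->mode));
}
#endif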
Reviewed by: andrew
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D15237