Specifically, devices that do not support PCI-e FLR and were not
gracefully shutdown by the guest OS could continue to issue DMA
requests after the VM was terminated. The changes in r305485 meant
that those DMA requests were completed against the host's memory which
could result in random memory corruption. Instead, leave ppt devices
that are not attached to a VM disabled in the IOMMU and only restore
the devices to the host domain if the ppt(4) driver is detached from a
device.
As an added safety belt, disable busmastering for a pass-through device
when before adding it to the host domain during ppt(4) detach.
PR: 222937
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: grehan
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D12661
While here cache-align chains.
This shortens longest found chain during poudriere -j 80 from 32 to 16.
Pushing this higher up will probably require allocation on boot.
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
Note that pv_list_locks is an array and currently it fits 2 locks per line.
Resizing it and/or putting more locks in different lines requires several tests.
MFC after: 1 week
All of the kernel dump implementations keep track of the current offset
("dumplo") within the dump device. However, except for textdumps, they
all write the dump sequentially, so we can reduce code duplication by
having the MI code keep track of the current offset. The new
dump_append() API can be used to write at the current offset.
This is needed to implement support for kernel dump compression in the
MI kernel dump code.
Also simplify dump_encrypted_write() somewhat: use dump_write() instead
of duplicating its bounds checks, and get rid of the redundant offset
tracking.
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11722
For processing, reclaim_pv_chunk() removes the pv_chunk from the lru
list, which makes pc_lru linkage invalid. Then the pmap lock is
released, which allows for other thread to free the last pv entry
allocated from the chunk and call free_pv_chunk(), which tries to
modify the invalid linkage.
Similarly, the chunk is inserted into the private tailq new_tail
temporary. Again, free_pv_chunk() might be run and corrupt the
linkage for the new_tail after the pmap lock is dropped.
This is a consequence of r299788 elimination of pvh_global_lock, which
allowed for reclaim to run in parallel with other pmap calls which
free pv chunks.
As a fix, do not remove the chunk from pc_lru queue, use a marker to
remember the position in the queue iteration. We can safely operate
on the chunks after the chunk's pmap is locked, we fetched the chunk
after the marker, and we checked that chunk pmap is same as we have
locked, because chunk removal from pc_lru requires both pv_chunk_mutex
and the pmap mutex owned.
Note that the fix lost an optimization which was present in the
previous algorithm. Namely, new_tail requeueing rotated the pv chunks
list so that reclaim didn't scan the same pv chunks that couldn't be
freed (because they contained a wired and/or superpage mapping) on
every invocation. An additional change is planned which would improve
this.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
allocated, when requested range of descriptors does not fit into
currently allocated LDT, or trim the return if the range fits
partially. Before, the function returned EINVAL.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
use LDT segments immediately.
If the i386_set_ldt() call created a first LDT descriptor (and
consequently created the LDT) for our address space, LDTR is currently
loaded only on the CPU executing the syscall. Other CPUs executing
threads sharing the address space, would only load LDTR after context
switch.
Uncomment set_user_ldt_rv() and call it on all CPUs. Remove critical
section inside set_user_ldt(), it is not needed in the context of call
from smp_rendezvous().
Set md_ldt after md_ldt_sd is initialized using the same code sequence
as in user_ldt_free(). Do the whole initialization in a critical
section, to not race with the context switching while we set LDT.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
cpu_switch.S uses curproc->p_md.md_ldt value as the flag indicating
presence of the process LDT. The flag is checked and then ldt segment
descriptor is copied into the CPU' GDT slot.
Disallow context switches around clearing of the curproc LDT state by
performing the cleanup in critical section. Ensure that the md_ldt
flag is cleared before md_ldt_sd descriptor content is destroyed by
inserting fence between the operations.
We depend on the x86 memory model strong ordering guarantees, in
particular, that cpu_switch.S observes the writes to md_ldt and
md_ldt_sd in the expected order.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Provide consistent snapshot of the requested descriptors by preventing
other threads from modifying LDT while we fetch the data, lock dt_lock
around the read. Copy the data into intermediate buffer, which is
copied out after the lock is dropped.
Use guaranteed atomic (aligned volatile) reads of the descriptors to
use same-size atomic as CPU update to set A bit in the descriptor type
field.
Improve overflow checking for the descriptors range calculations and
remove unneeded casts.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Compilers are allowed to combine plain reads into group operations,
e.g. 64bit element copies of one array into another can be
legitimately optimized back to a memcpy() call, which r323772 tried to
prevent.
Qualify accesses to LDT descriptors with volatile dereference to
ensure that each write indeed occurs. After that, our usual claim of
native-size aligned writes being atomic applies.
This is equivalent to atomic_store(memory_order_relaxed) C11 accesses,
but our machine/atomic.h does not provide corresponding primitive.
Noted and reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On i386, the function is used from the context switch code and needs
to be accessible externally. Amd64 MD context switch does not lock an
LDT spinlock and inlines switching in assembly.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This makes the LDT to use only one page with default settings,
avoiding the need to find contigous 2 pages in KVA. It seems that
most users are fine even with 512 segments.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
machine independent parts of the existing code to a new file that can be
shared between amd64 and arm64.
Reviewed by: kib (previous version), imp
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12434
Care must be taken when updating the active LDT, since parallel
threads might try to load a segment descriptor which is currently
updated. Since the results are undefined, this cannot be ignored by
claiming to be an application race.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12413
CAM_DEBUG_TRACE results in way too much debug output than needed now.
When debugging, it's always possible to turn on trace level using camcontrol.
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D12110
AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors. It exposes a simple
32-bit register space.
Reviewed by: avg (no +1), mjoras, truckman
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12217
This driver supports both NTB-to-NTB and NTB-to-Root Port modes (though
the second with predictable complications on hot-plug and reboot events).
I tested it with PEX 8717 and PEX 8733 chips, but expect it should work
with many other compatible ones too. It supports up to two NT bridges
per chip, each of which can have up to 2 64-bit or 4 32-bit memory windows,
6 or 12 scratchpad registers and 16 doorbells. There are also 4 DMA engines
in those chips, but they are not yet supported.
While there, rename Intel NTB driver from generic ntb_hw(4) to more specific
ntb_hw_intel(4), so now it is on par with this new ntb_hw_plx(4) driver and
alike to Linux naming.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
The actual cache line size has always been 64 bytes.
The 128 number arose as an optimization for Core 2 era Intel processors. By
default (configurable in BIOS), these CPUs would prefetch adjacent cache
lines unintelligently. Newer CPUs prefetch more intelligently.
The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400
series, "Dunnington"). If you are still using one of these CPUs, especially
in a multi-socket configuration, consider locating the "adjacent cache line
prefetch" option in BIOS and disabling it.
Reported by: mjg
Reviewed by: np
Discussed with: jhb
Sponsored by: Dell EMC Isilon
Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.
Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.
Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.
The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.
Reviewed by: jhb (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12023
- Use more relevant name 'signo' instead of 'i' for the local variable
which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
of the trap() function. Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.
Requested and reviewed by: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11603
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.
Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11584
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.
The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources. Reduced latency
was proven by, e.g., SPEC2008.
Submitted by: jeff@ (earlier version)
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10435
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.
This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.
Note this requires support from hardware (available since late Intel
Skylake CPUs).
Many thanks to Robert Watson for support and Konstantin Belousov
for code review.
Project wiki: https://wiki.freebsd.org/Intel_SGX.
Reviewed by: kib
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11113
It is quite useful when double fault is not caused by a stack overflow.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
reduces diff between amd64 and i386. Also, it fixes a regression introduced
in r322076, i.e., identify_hypervisor() failed to identify some hypervisors.
This function assumes cpu_feature2 is already initialized.
Reported by: dexuan
Tested by: dexuan