Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.
Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.
Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.
The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.
Reviewed by: jhb (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12023
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.
The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources. Reduced latency
was proven by, e.g., SPEC2008.
Submitted by: jeff@ (earlier version)
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10435
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.
This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.
Note this requires support from hardware (available since late Intel
Skylake CPUs).
Many thanks to Robert Watson for support and Konstantin Belousov
for code review.
Project wiki: https://wiki.freebsd.org/Intel_SGX.
Reviewed by: kib
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11113
libefivar expects opening /dev/efi to indicate if the we can make efi
runtime calls. With a null routine, it was always succeeding leading
efi_variables_supported() to return the wrong value. Only succeed if
we have an efi_runtime table. Also, while I'm hear, out of an
abundance of caution, add a likely redundant check to make sure
efi_systbl is not NULL before dereferencing it. I know it can't be
NULL if efi_cfgtbl is non-NULL, but the compiler doesn't.
Some C wrappers for x86 instructions do not touch global memory and only act
on their arguments; they can be marked __pure2, aka __const__. Without this
annotation, Clang 3.9.1 is not intelligent enough on its own to grok that
these functions are __const__.
Submitted by: Anton Rang <anton.rang AT isilon.com>
Sponsored by: Dell EMC Isilon
from the top of user memory to one page lower on machines with the
Ryzen (AMD Family 17h) CPU. This pushes ps_strings and the stack
down by one page as well. On Ryzen there is some sort of interaction
between code running at the top of user memory address space and
interrupts that can cause FreeBSD to either hang or silently reset.
This sounds similar to the problem found with DragonFly BSD that
was fixed with this commit:
https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20
but our signal trampoline location was already lower than the address
that DragonFly moved their signal trampoline to. It also does not
appear to be related to SMT as described here:
https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads?p=955498#post955498
"Hi, Matt Dillon here. Yes, I did find what I believe to be a
hardware issue with Ryzen related to concurrent operations. In a
nutshell, for any given hyperthread pair, if one hyperthread is
in a cpu-bound loop of any kind (can be in user mode), and the
other hyperthread is returning from an interrupt via IRETQ, the
hyperthread issuing the IRETQ can stall indefinitely until the
other hyperthread with the cpu-bound loop pauses (aka HLT until
next interrupt). After this situation occurs, the system appears
to destabilize. The situation does not occur if the cpu-bound
loop is on a different core than the core doing the IRETQ. The
%rip the IRETQ returns to (e.g. userland %rip address) matters a
*LOT*. The problem occurs more often with high %rip addresses
such as near the top of the user stack, which is where DragonFly's
signal trampoline traditionally resides. So a user program taking
a signal on one thread while another thread is cpu-bound can cause
this behavior. Changing the location of the signal trampoline
makes it more difficult to reproduce the problem. I have not
been because the able to completely mitigate it. When a cpu-thread
stalls in this manner it appears to stall INSIDE the microcode
for IRETQ. It doesn't make it to the return pc, and the cpu thread
cannot take any IPIs or other hardware interrupts while in this
state."
since the system instability has been observed on FreeBSD with SMT
disabled. Interrupts to appear to play a factor since running a
signal-intensive process on the first CPU core, which handles most
of the interrupts on my machine, is far more likely to trigger the
problem than running such a process on any other core.
Also lower sv_maxuser to prevent a malicious user from using mmap()
to load and execute code in the top page of user memory that was made
available when the shared page was moved down.
Make the same changes to the 64-bit Linux emulator.
PR: 219399
Reported by: nbe@renzel.net
Reviewed by: kib
Reviewed by: dchagin (previous version)
Tested by: nbe@renzel.net (earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11780
add support for explicitly requesting that pmap_enter() create a 2MB page
mapping. (Essentially, this feature allows the machine-independent layer to
create superpage mappings preemptively, and not wait for automatic promotion
to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.
Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_2mpage(), that is used to prefault 2MB read- and/or execute-only
mappings for execve(2), mmap(2), and shmat(2).
Submitted by: Yufeng Zhou <yz70@rice.edu> (an earlier version)
Reviewed by: kib, markj
Tested by: pho
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D11556
--Remove special-case handling of sparc64 bus_dmamap* functions.
Replace with a more generic mechanism that allows MD busdma
implementations to generate inline mapping functions by
defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This
is currently useful for sparc64, x86, and arm64, which all
implement non-load dmamap operations as simple wrappers
around map objects which may be bus- or device-specific.
--Remove NULL-checked bus_dmamap macros. Implement the
equivalent NULL checks in the inlined x86 implementation.
For non-x86 platforms, these checks are a minor pessimization
as those platforms do not currently allow NULL maps. NULL
maps were originally allowed on arm64, which appears to have
been the motivation behind adding arm[64]-specific barriers
to bus_dma.h, but that support was removed in r299463.
--Simplify the internal interface used by the bus_dmamap_load*
variants and move it to bus_dma_internal.h
--Fix some drivers that directly include sys/bus_dma.h
despite the recommendations of bus_dma(9)
Reviewed by: kib (previous revision), marius
Differential Revision: https://reviews.freebsd.org/D10729
from machine/proc.h, consistently on all architectures.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
prevents one from running eg clang built with debug; the new one is
arbitrary (equal to MAXDSIZ) and... well, should be quite future-proof.
Same fix might be applicable to other 64 bit architectures; I'll ask
their respective maintainers to make sure it won't break anything.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10758
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before. i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386. powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.
Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.
-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.
The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.
Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
On the original i386, %dr[4-5] were unimplemented but not very clearly
reserved, so debuggers read them to print them. i386 was still doing
this.
On the original athlon64, %dr[4-5] are documented as reserved but are
aliased to %dr[6-7] unless CR4_DE is set, when accessing them traps.
On 2 of my systems, accessing %dr[4-5] trapped sometimes. On my Haswell
system, the apparent randomness was because the boot CPU starts with
CR4_DE set while all other CPUs start with CR4_DE clear. FreeBSD
doesn't support the data breakpoints enabled by CR4_DE and it never
changes this flag, so the flag remains different across CPUs and
the behaviour seemed inconsistent except while booting when the CPU
doesn't change.
The invalid accesses broke:
- read access for printing the registers in ddb "show watches" on CPUs
with CR4_DE set
- read accesses in fill_dbregs() on CPUs with CR4_DE set. This didn't
implement panic(3) since the user case always skipped %dr[4-5].
- write accesses in set_dbregs(). This also didn't affect userland.
When it didn't trap, the aliasing made it fragile.
Don't print the dummy (zero) values of %dr[4-5] in "show watches" for
i386 or amd64. Fix style bugs near this printing.
amd64 also has space in the dbregs struct for the reserved %dr[8-15]
and already didn't print the dummy values for these, and never accessed
any of the 10 reserved debug registers.
Remove cpufuncs for making the invalid accesses. Even amd64 had these.
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
PG_PROMOTED, that indicates whether lingering 4KB page mappings might
need to be flushed on a PDE change that restricts or destroys a 2MB
page mapping. This flag allows the pmap to avoid range invalidations
that are both unnecessary and costly.
Reviewed by: kib, markj
MFC after: 6 weeks
Differential Revision: https://reviews.freebsd.org/D9665
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.
Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Previously these were only declared under #ifdef SMP in <machine/smp.h>.
However, these variables are defind in pmap.c unconditionally, and efirt.c
references them unconditionally. This fixes non-SMP kernel builds.
Discussed with: kib
MFC after: 1 week
Reduce the cost of TLB invalidation on x86 by using per-CPU completion flags
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
Reduce contention during TLB invalidation operations by using a per-CPU
completion flag, rather than a single atomically-updated variable.
On a Westmere system (2 sockets x 4 cores x 1 threads), dtrace measurements
show that smp_tlb_shootdown is about 50% faster with this patch; observations
with VTune show that the percentage of time spent in invlrng_single_page on an
interrupt (actually doing invalidation, rather than synchronization) increases
from 31% with the old mechanism to 71% with the new one. (Running a basic file
server workload.)
Submitted by: Anton Rang <rang at acm.org>
Reviewed by: cem (earlier version), kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8041
Runtime services require special execution environment for the call.
Besides that, OS must inform firmware about runtime virtual memory map
which will be active during the calls, with the SetVirtualAddressMap()
runtime call, done while the 1:1 mapping is still used. There are two
complication: the SetVirtualAddressMap() effectively must be done from
loader, which needs to know kernel address map in advance. More,
despite not explicitely mentioned in the specification, both 1:1 and
the map passed to SetVirtualAddressMap() must be active during the
SetVirtualAddressMap() call. Second, there are buggy BIOSes which
require both mappings active during runtime calls as well, most likely
because they fail to identify all relocations to perform.
On amd64, we can get rid of both problems by providing 1:1 mapping for
the duration of runtime calls, by temprorary remapping user addresses.
As result, we avoid the need for loader to know about future kernel
address map, and avoid bugs in BIOSes. Typically BIOS only maps
something in low 4G. If not runtime bugs, we would take advantage of
the DMAP, as previous versions of this patch did.
Similar but more complicated trick can be used even for i386 and 32bit
runtime, if and when the EFI boot on i386 is supported. We would need
a trampoline page, since potentially whole 4G of VA would be switched
on calls, instead of only userspace portion on amd64.
Context switches are disabled for the duration of the call, FPU access
is granted, and interrupts are not disabled. The later is possible
because kernel is mapped during calls.
To test, the sysctl mib debug.efi_time is provided, setting it to 1
makes one call to EFI get_time() runtime service, on success the efitm
structure is printed to the control terminal. Load efirt.ko, or add
EFIRT option to the kernel config, to enable code.
Discussed with: emaste, imp
Tested by: emaste (mac, qemu)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Note that lgdt() name is already used for function which, besides
loading GDT, also reloads segment descriptors cache, thus new function
is named bare_lgdt().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
amd64 pmap.
The new pmap_pinit_pml4() function initializes the level 4 page table
with entries for the kernel mappings. Both functions are needed for
upcoming EFI Runtime Services support.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is not very easy to do, since ddb didn't know when traps are
for single-stepping. It more or less assumed that traps are either
breakpoints or single-step, but even for x86 this became inadequate
with the release of the i386 in ~1986, and FreeBSD passes it other
trap types for NMIs and panics.
On x86, teach ddb when a trap is for single stepping using the %dr6
register. Unknown traps are now treated almost the same as breakpoints
instead of as the same as single-steps. Previously, the classification
of breakpoints was almost correct and everything else was unknown so
had to be treated as a single-step. Now the classification of single-
steps is precise, the classification of breakpoints is almost correct
(as before) and everything else is unknown and treated like a
breakpoint.
This fixes:
- breakpoints not set by ddb, including the main one in kdb_enter(),
were treated as single-steps and not stopped on when stepping
(except for the usual, simple case of a step with residual count 1).
As special cases, kdb_enter() didn't stop for fatal traps or panics
- similarly for "hardware breakpoints".
Use a new MD macro IS_SSTEP_TRAP(type, code) to code to classify
single-steps. This is excessively complicated for bug-for-bug and
backwards compatibilty. Design errors apparently started in Mach
in ~1990 or perhaps in the FreeBSD interface in ~1993. Common trap
types like single steps should have a unique MI code (like the TRAP*
codes for user SIGTRAP) so that debuggers don't need macros like
IS_SSTEP_TRAP() to decode them. But 'type' is actually an ambiguous
MD trap number, and code was always 0 (now it is (int)%dr6 on x86).
So it was impossible to determine the trap type from the args.
Global variables had to be used.
There is already a classification macro db_pc_is_single_step(), but
this just gets in the way. It is only used to recover from bugs in
IS_BREAKPOINT_TRAP(). On some arches, IS_BREAKPOINT_TRAP() just
duplicates the ambiguity in 'type' and misclassifies single-steps as
breakpoints. It defaults to 'false', which is the opposite of what is
needed for bug-for-bug compatibility.
When this is cleaned up, MI classification bits should be passed in
'code'. This could be done now for positive-logic bits, since 'code'
was always 0, but some negative logic is needed for compatibility so
a simple MI classificition is not usable yet.
After reading %dr6, clear the single-step bit in it so that the type
of the next debugger trap can be decoded. This is a little
ddb-specific. ddb doesn't understand the need to clear this bit and
doing it before calling kdb is easiest. gdb would need to reverse
this to support hardware breakpoints, but it just doesn't support
them now since gdbstub doesn't support %dr*.
Fix a bug involving %dr6: when emulating a single-step trap for vm86,
set the bit for it in %dr6. Userland debuggers need this. ddb now
needs this for vm86 bios calls. The bit gets copied to 'code' then
cleared again.
Fix related style bugs:
- when clearing bits for hardware breakpoints in %dr6, spell the mask
as ~0xf on both amd64 and i386 to get the correct number of bits
using sign extension and not need a comment about using the wrong
mask on amd64 (amd64 traps for invalid results but clearing the
reserved top bits didn't trap since they are 0).
- rewrite my old wrong comments about using %dr6 for ddb watchpoints.
The 'cpu' and 'cpu_class' variables were always set to the same value
on amd64 and are legacy holdovers from i386. Remove them entirely on
amd64.
Reviewed by: imp, kib (older version)
Differential Revision: https://reviews.freebsd.org/D7888
The flag specifies that the block which uses FPU must be executed in
critical section, i.e. take no context switches, and does not need an
FPU save area during the execution.
It is intended to be applied around fast and short code pathes where
save area allocation is impossible or undesirable, due to context or
due to the relative cost of calculation vs. allocation.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
for zeroing pages in idle where nontemporal writes are clearly best.
This is almost a no-op since zeroing in idle works does nothing good
and is off by default. Fix END() statement forgotten in previous
commit.
Align the loop in sse2_pagezero(). Since it writes to main memory,
the loop doesn't have to be very carefully written to keep up.
Unrolling it was considered useless or harmful and was not done on
i386, but that was too careless.
Timing for i386: the loop was not unrolled at all, and moved only 4
bytes/iteration. So on a 2GHz CPU, it needed to run at 2 cycles/
iteration to keep up with a memory speed of just 4GB/sec. But when
it crossed a 16-byte boundary, on old CPUs it ran at 3 cycles/
iteration so it gave a maximum speed of 2.67GB/sec and couldn't even
keep up with PC3200 memory. Fix the alignment so that it keep up with
4GB/sec memory, and unroll once to get nearer to 8GB/sec. Further
unrolling might be useless or harmful since it would prevent the loop
fitting in 16-bytes. My test system with an old CPU and old DDR1 only
needed 5+ GB/sec. My test system with a new CPU and DDR3 doesn't need
any changes to keep up ~16GB/sec.
Timing for amd64: with 8-byte accesses and newer faster CPUs it is
easy to reach 16GB/sec but not so easy to go much faster. The
alignment doesn't matter much if the CPU is not very old. The loop
was already unrolled 4 times, but needs 32 bytes and uses a fancy
method that doesn't work for 2-way unrolling in 16 bytes. Just
align it to 32-bytes.
Move msix_disable_migration under #ifdef SMP since it doesn't make sense
for !SMP kernels.
PR: 212014
Reported by: Glyn Grinstead <glyn@grinstead.org>
MFC after: 3 days
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.
It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D7148
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
The only current purpose of the pvh lock was explained there
On Wed, Jan 09, 2013 at 11:46:13PM -0600, Alan Cox wrote:
> Let me lay out one example for you in detail. Suppose that we have
> three processors and two of these processors are actively using the same
> pmap. Now, one of the two processors sharing the pmap performs a
> pmap_remove(). Suppose that one of the removed mappings is to a
> physical page P. Moreover, suppose that the other processor sharing
> that pmap has this mapping cached with write access in its TLB. Here's
> where the trouble might begin. As you might expect, the processor
> performing the pmap_remove() will acquire the fine-grained lock on the
> PV list for page P before destroying the mapping to page P. Moreover,
> this processor will ensure that the vm_page's dirty field is updated
> before releasing that PV list lock. However, the TLB shootdown for this
> mapping may not be initiated until after the PV list lock is released.
> The processor performing the pmap_remove() is not problematic, because
> the code being executed by that processor won't presume that the mapping
> is destroyed until the TLB shootdown has completed and pmap_remove() has
> returned. However, the other processor sharing the pmap could be
> problematic. Specifically, suppose that the third processor is
> executing the page daemon and concurrently trying to reclaim page P.
> This processor performs a pmap_remove_all() on page P in preparation for
> reclaiming the page. At this instant, the PV list for page P may
> already be empty but our second processor still has a stale TLB entry
> mapping page P. So, changes might still occur to the page after the
> page daemon believes that all mappings have been destroyed. (If the PV
> entry had still existed, then the pmap lock would have ensured that the
> TLB shootdown completed before the pmap_remove_all() finished.) Note,
> however, the page daemon will know that the page is dirty. It can't
> possibly mistake a dirty page for a clean one. However, without the
> current pvh global locking, I don't think anything is stopping the page
> daemon from starting the laundering process before the TLB shootdown has
> completed.
>
> I believe that a similar example could be constructed with a clean page
> P' and a stale read-only TLB entry. In this case, the page P' could be
> "cached" in the cache/free queues and recycled before the stale TLB
> entry is flushed.
TLBs for addresses with updated PTEs are always flushed before pmap
lock is unlocked. On the other hand, amd64 pmap code does not always
flushes TLBs before PV list locks are unlocked, if previously PTEs
were cleared and PV entries removed.
To handle the situations where a thread might notice empty PV list but
third thread still having access to the page due to TLB invalidation
not finished yet, introduce delayed invalidation. Comparing with the
pvh_global_lock, DI does not block entered thread when
pmap_remove_all() or pmap_remove_write() (callers of
pmap_delayed_invl_wait()) are executed in parallel. But _invl_wait()
callers are blocked until all previously noted DI blocks are leaved,
thus ensuring that neccessary TLB invalidations were performed before
returning from pmap_remove_all() or pmap_remove_write().
See comments for detailed description of the mechanism, and also for
the explanations why several pmap methods, most important
pmap_enter(), do not need DI protection.
Reviewed by: alc, jhb (turnstile KPI usage)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5747
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Compared to the r298933, this version uses 'struct _cpuset' in
<sys/bus.h> instead of 'cpuset_t' to avoid requiring <sys/param.h>
(<sys/_cpuset.h> still requires <sys/param.h> for MAXCPU even though
<sys/_bitset.h> does not after recent changes).
bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:
- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
the device when DEVICE_NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.
Device drivers which want to use per-CPU interrupts should start using
INTR_CPUS instead of simply assigning interrupts to all available CPUs.
In the future we may wish to add tunables to control the policy of
INTR_CPUS (e.g. should it be local-only or global, should it ignore
SMT threads or not).
The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.
The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and DEVICE_NUMA is enabled. They also and
the global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.
Reviewed by: wblock (manpage)
Differential Revision: https://reviews.freebsd.org/D5519
Simplify and unify placeholder type definitions.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5771
While here, move the common bits of <machine/cputypes.h> to
<x86/cputypes.h> as well.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D4670
This variable was added to sys/x86/include/x86_var.h recently.
This unbreaks building kernel source that #includes both md_var.h and x86_var.h
with gcc 4.2.1 on amd64
Differential Revision: https://reviews.freebsd.org/D4686
Reviewed by: kib
X-MFC with: r291949
Sponsored by: EMC / Isilon Storage Division
new headers x86/include x86_var.h and x86_smp.h.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4358
the PG_G global pte flag, pmap_invalidate_all() fails to flush global
TLB entries [*]. This is because TLB shootdown handler for such
configs reloads CR3, and on i386 pmap_invalidate_all() does the same
for the initiating CPU. Note that current code does not issue total
invalidation requests for the kernel_pmap.
Rename amd64 function invltlb_globpcid() to invltlb_glob(), it is not
specific for PCID for quite some time, and implement the same
functionality for i386. Use the function instead of invltlb() in
shootdown handlers and in i386 pmap_invalidate_all(), but only for the
kernel pmap (which maps pages with the PG_G attribute set), which
takes care of PG_G TLB entries on flush.
To detect the affected pmap in i386 TLB shootdown handler, pmap should
be passed to the smp_masked_invltlb() function, which makes amd64 and
i386 TLB shootdown code almost identical. Merge the code under x86/.
Noted by: jhb [*]
Reviewed by: cem, jhb, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4346
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.
One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.
A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.
The 'pcb_size' variable is used to index into the stoppcbs[] array.
The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.
While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved. Annotations for
other platforms will be added in the future.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3773
amd64 and i386 platform code contain very similar xen/xen-os.h
The only differences are:
- Functions/variables/types which were unused in i386/xen/xen-os.h:
* xen_xchg
* __xchg_dummy
* __xg
* __xchg
* atomic_t
* atomic_inc
* rdtscll
The functions/variables/types unused in xen-os.h can be dropped and there
is no more differences betwen amd64 and i386.
The new header is placed in x86/include/xen and each platform will have
dummy headers include x86/xen/*.h. This is to be able to include
machine/xen/*.h in the PV drivers.
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3880
Sponsored by: Citrix Systems R&D
The current Xen console driver is crashing very quickly when using it on
an ARM guest. This is because the console lock is recursive and it may
lead to recursion on the tty lock and/or corrupt the ring pointer.
Furthermore, the console lock is not always taken where it should be and has
to be released too early because of the way the console has been designed.
Over the years, code has been modified to support various new features but
the driver has not been reworked.
This new driver has been rewritten with the idea of only having a small set
of specific function to write either via the shared ring or the hypercall
interface.
Note that HVM support has been left aside for now because it requires
additional features which are not yet supported. A follow-up patch will be
sent with HVM guest support.
List of items that may be good to have but not mandatory:
- Avoid to flush for each character written when using the tty
- Support multiple consoles
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3698
Sponsored by: Citrix Systems R&D
Pull the latest headers for Xen which allow us to add support for ARM and
use new features in FreeBSD.
This is a verbatim copy of the xen/include/public so every headers which
don't exits anymore in the Xen repositories have been dropped.
Note the interface version hasn't been bumped, it will be done in a
follow-up. Although, it requires fix in the code to get it compiled:
- sys/xen/xen_intr.h: evtchn_port_t is already defined in the headers so
drop it.
- {amd64,i386}/include/intr_machdep.h: NR_EVENT_CHANNELS now depends on
xen/interface/event_channel.h, so include it.
- {amd64,i386}/{amd64,i386}/support.S: It's not neccessary to include
machine/intr_machdep.h. This is also fixing build compilation with the
new headers.
- dev/xen/blkfront/blkfront.c: The typedef for blkif_request_segmenthas
been dropped. So directly use struct blkif_request_segment
Finally, modify xen/interface/xen-compat.h to throw a preprocessing error if
__XEN_INTERFACE_VERSION__ is not set. This is allow us to catch any file
where xen/xen-os.h is not correctly included.
Submitted by: Julien Grall <julien.grall@citrix.com>
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D3805
Sponsored by: Citrix Systems R&D
use vtophys() directly instead of vtomach() and retire the no-longer-used
headers <machine/xenfunc.h> and <machine/xenvar.h>.
Reported by: bde (stale bits in <machine/xenfunc.h>)
Reviewed by: royger (earlier version)
Differential Revision: https://reviews.freebsd.org/D3266
reported, on APs. We already did this on BSP.
Otherwise, the userspace software which depends on the features
reported by the high CPUID levels is misbehaving. In particular, AVX
detection is non-functional, depending on which CPU thread happens to
execute when doing CPUID. Another victim is the libthr signal
handlers interposer, which needs to save full FPU extended state.
Reported and tested by: Andre Meiser <ortadur@web.de>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
ordering semantic of x86 CPUs makes only the compiler barrier
neccessary to give the acquire behaviour.
Existing implementation ensured sequentially consistent semantic for
load_acq, making much stronger guarantee than required by standard's
definition of the load acquire. Consumers which depend on the barrier
are believed to be identified and already fixed to use proper
operations.
Noted by: alc (long time ago)
Reviewed by: alc, bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
provide a semantic defined by the C11 fences with corresponding
memory_order.
atomic_thread_fence_acq() gives r | r, w, where r and w are read and
write accesses, and | denotes the fence itself.
atomic_thread_fence_rel() is r, w | w.
atomic_thread_fence_acq_rel() is the combination of the acquire and
release in single operation. Note that reads after the acq+rel fence
could be made visible before writes preceeding the fence.
atomic_thread_fence_seq_cst() orders all accesses before/after the
fence, and the fence itself is globally ordered against other
sequentially consistent atomic operations.
Reviewed by: alc
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
macros on amd64 and i386. Move the definition to machine/param.h.
kgdb defines INKERNEL() too, the conflict is resolved by renaming kgdb
version to PINKERNEL().
On i386, correct the lowest kernel address. After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]
Submitted by: Oliver Pinter [*]
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
atomic_load_acq(9), on it source, for x86.
Right now, atomic_load_acq() on x86 is sequentially consistent with
other atomics, code ensures this by doing store/load barrier by
performing locked nop on the source. Provide separate primitive
__storeload_barrier(), which is implemented as the locked nop done on
a cpu-private variable, and put __storeload_barrier() before load, to
keep seq_cst semantic but avoid introducing false dependency on the
no-modification of the source for its later use.
Note that seq_cst property of x86 atomic_load_acq() is not documented
and not carried by atomics implementations on other architectures,
although some kernel code relies on the behaviour. This commit does
not intend to change this.
Reviewed by: alc
Discussed with: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.
devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.
Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D2762
MFC after: 4 weeks
rev. 55. The modern CPUs cache and TLB descriptions looked quite
questionable without the update, e.g. Haswell i7 4770S reported:
Data TLB: 4 KB pages, 4-way set associative, 64 entries
L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
After the update, the report is:
Data TLB: 1 GByte pages, 4-way set associative, 4 entries
Data TLB: 4 KB pages, 4-way set associative, 64 entries
Instruction TLB: 2M/4M pages, fully associative, 8 entries
Instruction TLB: 4KByte pages, 8-way set associative, 64 entries
64-Byte prefetching
Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries
L2 cache: 256 kbytes, 8-way associative, 64 bytes/line
Some tags were apparently removed from the table 3-21, Vol. 2A. Keep
them around, but add a comment stating the removal.
Update the format line for cpu_stdext_feature according to the bits
from the SDM rev.55. It appears that Haswells do not store %cs and
%ds values in the FPU save area.
Store content of the %ecx register from the CPUID leaf 0x7
subleaf 0 as cpu_stdext_feature2 and print defined bits from it,
again acording to SDM rev. 55.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.
MFC after: 2 weeks
the Vahalia' "Unix Internals" section 15.12 "Other TLB Consistency
Algorithms". The same algorithm is already utilized by the MIPS pmap
to handle ASIDs.
The PCID for the address space is now allocated per-cpu during context
switch to the thread using pmap, when no PCID on the cpu was ever
allocated, or the current PCID is invalidated. If the PCID is reused,
bit 63 of %cr3 can be set to avoid TLB flush.
Each cpu has PCID' algorithm generation count, which is saved in the
pmap pcpu block when pcpu PCID is allocated. On invalidation, the
pmap generation count is zeroed, which signals the context switch code
that already allocated PCID is no longer valid. The implication is
the TLB shootdown for the given cpu/address space, due to the
allocation of new PCID.
The pm_save mask is no longer has to be tracked, which (significantly)
reduces the targets of the TLB shootdown IPIs. Previously, pm_save
was reset only on pmap_invalidate_all(), which made it accumulate the
cpuids of all processors on which the thread was scheduled between
full TLB shootdowns.
Besides reducing the amount of TLB shootdowns and removing atomics to
update pm_saves in the context switch code, the algorithm is much
simpler than the maintanence of pm_save and selection of the right
address space in the shootdown IPI handler.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
interacts with interrupts, query ACPI and use MWAIT for entrance into
Cx sleep states. Support C1 "I/O then halt" mode. See Intel'
document 302223-007 "Intelб╝ Processor Vendor-Specific ACPI Interface
Specification" for description.
Move the acpi_cpu_c1() function into x86/cpu_machdep.c and use
it instead of inlining "sti; hlt" sequence in several places.
In the acpi(4) man page, besides documenting the dev.cpu.N.cx_methods
sysctl, correct the names for dev.cpu.N.{cx_usage,cx_lowest,cx_supported}
sysctls.
Both jkim and avg have some other patches implementing the mwait
functionality; this work is unrelated. Linux does not rely on the
ACPI to provide correct tables describing Cx modes. Instead, the
driver has pre-defined knowledge of the CPU models, it was supplied by
Intel.
Tested by: pho (previous versions)
Sponsored by: The FreeBSD Foundation
This is done explicitly because a vcpu thread can be in a critical section
for the entire time slice alloted to it. This in turn can delay the handling
of the 'td_owepreempt'.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D2430
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.
The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.
Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.
Reviewed by: tychon
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D2428
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
sys/amd64/amd64/mp_machdep.c, to the new common x86 source
sys/x86/x86/mp_x86.c.
Proposed and reviewed by: jhb
Review: https://reviews.freebsd.org/D2347
Sponsored by: The FreeBSD Foundation
sys/i386/i386/machdep.c to new file sys/x86/x86/cpu_machdep.c. Most
of the code is related to the idle handling.
Discussed with: pluknet
Sponsored by: The FreeBSD Foundation
%rdi, %rsi, etc are inadvertently bypassed along with the check to
see if the instruction needs to be repeated per the 'rep' prefix.
Add "MOVS" instruction support for the 'MMIO to MMIO' case.
Reviewed by: neel
code segment base address.
Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.
Reviewed by: neel
translation. In particular, despite IO-APICs only take 8bit apic id,
IR translation structures accept 32bit APIC Id, which allows x2APIC
mode to function properly. Extend msi_cpu of struct msi_intrsrc and
io_cpu of ioapic_intsrc to full int from one byte.
KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid
bringing all dmar headers into interrupt code. The non-PCI(e) devices
which generate message interrupts on FSB require special handling. The
HPET FSB interrupts are remapped, while DMAR interrupts are not.
For each msi and ioapic interrupt source, the iommu cookie is added,
which is in fact index of the IRE (interrupt remap entry) in the IR
table. Cookie is made at the source allocation time, and then used at
the map time to fill both IRE and device registers. The MSI
address/data registers and IO-APIC redirection registers are
programmed with the special values which are recognized by IR and used
to restore the IRE index, to find proper delivery mode and target.
Map all MSI interrupts in the block when msi_map() is called.
Since an interrupt source setup and dismantle code are done in the
non-sleepable context, flushing interrupt entries cache in the IR
hardware, which is done async and ideally waits for the interrupt,
requires busy-wait for queue to drain. The dmar_qi_wait_for_seq() is
modified to take a boolean argument requesting busy-wait for the
written sequence number instead of waiting for interrupt.
Some interrupts are configured before IR is initialized, e.g. ACPI
SCI. Add intr_reprogram() function to reprogram all already
configured interrupts, and call it immediately before an IR unit is
enabled. There is still a small window after the IO-APIC redirection
entry is reprogrammed with cookie but before the unit is enabled, but
to fix this properly, IR must be started much earlier.
Add workarounds for 5500 and X58 northbridges, some revisions of which
have severe flaws in handling IR. Use the same identification methods
as employed by Linux.
Review: https://reviews.freebsd.org/D1892
Reviewed by: neel
Discussed with: jhb
Tested by: glebius, pho (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
hw.x2apic_enable tunable allows disabling it from the loader prompt.
To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications. This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered. This
may be changed later.
In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.
Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.
Reviewed by: neel
Tested by: pho (real hardware), neel (on bhyve)
Discussed with: jhb, grehan
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
KVM clock shares the same data structures between the guest and the host
as Xen so it makes sense to just have a single copy of this code.
Differential Revision: https://reviews.freebsd.org/D1429
Reviewed by: royger (eariler version)
MFC after: 1 month
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.
Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.
Tested by: tychon
MFC after: 1 week
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.
Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.
bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())
Differential Revision: https://reviews.freebsd.org/D1526
Reviewed by: grehan
MFC after: 2 weeks
Implement a subset of the multiboot specification in order to boot Xen
and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot
implementation is tailored to boot Xen and FreeBSD Dom0, and it will
most surely fail to boot any other multiboot compilant kernel.
In order to detect and boot the Xen microkernel, two new file formats
are added to the bootloader, multiboot and multiboot_obj. Multiboot
support must be tested before regular ELF support, since Xen is a
multiboot kernel that also uses ELF. After a multiboot kernel is
detected, all the other loaded kernels/modules are parsed by the
multiboot_obj format.
The layout of the loaded objects in memory is the following; first the
Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to
long mode by itself), after that the FreeBSD kernel is loaded as a RAW
file (Xen will parse and load it using it's internal ELF loader), and
finally the metadata and the modules are loaded using the native
FreeBSD way. After everything is loaded we jump into Xen's entry point
using a small trampoline. The order of the multiboot modules passed to
Xen is the following, the first module is the RAW FreeBSD kernel, and
the second module is the metadata and the FreeBSD modules.
Since Xen will relocate the memory position of the second
multiboot module (the one that contains the metadata and native
FreeBSD modules), we need to stash the original modulep address inside
of the metadata itself in order to recalculate its position once
booted. This also means the metadata must come before the loaded
modules, so after loading the FreeBSD kernel a portion of memory is
reserved in order to place the metadata before booting.
In order to tell the loader to boot Xen and then the FreeBSD kernel the
following has to be added to the /boot/loader.conf file:
xen_cmdline="dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga"
xen_kernel="/boot/xen"
The first argument contains the command line that will be passed to the Xen
kernel, while the second argument is the path to the Xen kernel itself. This
can also be done manually from the loader command line, by for example
typing the following set of commands:
OK unload
OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga
OK load kernel
OK load zfs
OK load if_tap
OK load ...
OK boot
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D517
For the Forth bits:
Submitted by: Julien Grall <julien.grall AT citrix.com>
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.
Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.
Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.
MFC after: 1 week
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.
PR: 193873
Differential Revision: https://reviews.freebsd.org/D904
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by: jhibbits (earlier version)
Sponsored by: EMC / Isilon Storage Division
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.
Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.
The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D1385
MFC after: 2 weeks
"hw.vmm.trace_guest_exceptions". To enable this feature set the tunable
to "1" before loading vmm.ko.
Tracing the guest exceptions can be useful when debugging guest triple faults.
Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.
Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".
Discussed with: grehan
MFC after: 2 weeks
It is automatically set when -fPIC is passed to the compiler.
Reviewed by: dim, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1179
In vt_efifb_init the framebuffer's physaddr is passed to PHYS_TO_DMAP
before the DMAP is setup. The result is not actually accessed until
after the mapping is setup, though. Loosen the assertion in PHYS_TO_DMAP
for now, to allow use when dmaplimit == 0.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1142
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
the 0x40000000 leaf to determine the hypervisor vendor ID. Export the
vendor ID and the highest supported hypervisor CPUID leaf via
hv_vendor[] and hv_high variables, respectively. The hv_vendor[]
array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in
the TSC code to identify VMWare.
Differential Revision: https://reviews.freebsd.org/D1010
Reviewed by: delphij, jkim, neel
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).
Discussed with: grehan
Place the code introduced in r268660 into a separate function that can be
called from uiomove_fromphys. Instead of pre-allocating two KVA pages use
vmem_alloc to allocate them on demand when needed. This prevents blocking if
a page fault is taken while physical addresses from outside the DMAP are
used, since the lock is now removed.
Also introduce a safety catch in PHYS_TO_DMAP and DMAP_TO_PHYS.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D947
amd64/amd64/pmap.c:
- Factor out the code to deal with non DMAP addresses from pmap_copy_pages
and place it in pmap_map_io_transient.
- Change the code to use vmem_alloc instead of a set of pre-allocated
pages.
- Use pmap_qenter and don't pin the thread if there can be page faults.
amd64/amd64/uio_machdep.c:
- Use pmap_map_io_transient in order to correctly deal with physical
addresses not covered by the DMAP.
amd64/include/pmap.h:
- Add the prototypes for the new functions.
amd64/include/vmparam.h:
- Add safety catches to make sure PHYS_TO_DMAP and DMAP_TO_PHYS are only
used with addresses covered by the DMAP.
This device is only attached to priviledged domains, and allows the
toolstack to interact with Xen. The two functions of the privcmd
interface is to allow the execution of hypercalls from user-space, and
the mapping of foreign domain memory.
Sponsored by: Citrix Systems R&D
i386/include/xen/hypercall.h:
amd64/include/xen/hypercall.h:
- Introduce a function to make generic hypercalls into Xen.
xen/interface/xen.h:
xen/interface/memory.h:
- Import the new hypercall XENMEM_add_to_physmap_range used by
auto-translated guests to map memory from foreign domains.
dev/xen/privcmd/privcmd.c:
- This device has the following functions:
- Allow user-space applications to make hypercalls into Xen.
- Allow user-space applications to map memory from foreign domains,
this is accomplished using the newly introduced hypercall
(XENMEM_add_to_physmap_range).
xen/privcmd.h:
- Public ioctl interface for the privcmd device.
x86/xen/hvm.c:
- Remove declaration of hypercall_page, now it's declared in
hypercall.h.
conf/files:
- Add the privcmd device to the build process.
forced invalidation of the cache range regardless of the presence of
self-snoop feature. Some recent Intel GPUs in some modes are not
coherent, and dirty lines in CPU cache must be flushed before the
pages are transferred to GPU domain.
Reviewed by: alc (previous version)
Tested by: pho (amd64)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.
Discussed with: grehan
code. There are only a handful of MSRs common between the two so there isn't
too much duplicate functionality.
The VT-x code has the following types of MSRs:
- MSRs that are unconditionally saved/restored on every guest/host context
switch (e.g., MSR_GSBASE).
- MSRs that are restored to guest values on entry to vmx_run() and saved
before returning. This is an optimization for MSRs that are not used in
host kernel context (e.g., MSR_KGSBASE).
- MSRs that are emulated and every access by the guest causes a trap into
the hypervisor (e.g., MSR_IA32_MISC_ENABLE).
Reviewed by: grehan
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.
Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.
Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.
Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan
instruction bytes in the VMCB on a nested page fault. This is useful because
it saves having to walk the guest page tables to fetch the instruction.
vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len'
that map directly to 'vie->inst[]' and 'vie->num_valid'.
The instruction emulation handler skips calling 'vmm_fetch_instruction()'
if 'vie->num_valid' is non-zero.
The use of this capability can be turned off by setting the sysctl/tunable
'hw.vmm.svm.disable_npf_assist' to '1'.
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Discussed with: grehan
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.
Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.
Reviewed by: grehan
resume that is a superset of a pcb. Move the FPU state out of the pcb and
into this new structure. As part of this, move the FPU resume code on
amd64 into a C function. This allows resumectx() to still operate only on
a pcb and more closely mirrors the i386 code.
Reviewed by: kib (earlier version)
<machine/md_var.h>.
- Move some CPU-related variables out of i386/i386/identcpu.c to
initcpu.c to match amd64.
- Move the declaration of has_f00f_hack out of identcpu.c to machdep.c.
- Remove a misleading comment from i386/i386/initcpu.c (locore zeros
the BSS before it calls identify_cpu()) and remove explicit zero
assignments to reduce the diff with amd64.
optional attributes field.
- Add a 'machdep.smap' sysctl that exports the SMAP table of the running
system as an array of the ACPI 3.0 structure. (On older systems, the
attributes are given a value of zero.) Note that the sysctl only
exports the SMAP table if it is available via the metadata passed from
the loader to the kernel. If an SMAP is not available, an empty array
is returned.
- Add a format handler for the ACPI 3.0 SMAP structure to the sysctl(8)
binary to format the SMAP structures in a readable format similar to
the format found in boot messages.
MFC after: 2 weeks
Eventually, the vmd_segs of the struct vm_domain should become bitset
instead of long, to allow arbitrary compile-time selected maximum.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.
As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.
Tested by: glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
forever in vm_handle_hlt().
This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.
Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.
Reported by: Leon Dang
Sponsored by: Nahanni Systems
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.
A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.
vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.
vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
instruction emulation [1].
Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.
Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.
Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].
Tested by: Leon Dang [1]
Reported by: Peter Grehan [2]
Sponsored by: Nahanni Systems
context into memory for the kernel threads which called
fpu_kern_thread(9). This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.
Apply the flag to padlock(4) and aesni(4). In aesni_cipher_process(),
do not leak FPU context state on error.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs
amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
- Remove lapic_ipi_vectored hook from cpu_ops, since it's now
implemented in the lapic hooks.
amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
- Use lapic_ipi_vectored directly, since it's now an inline function
that will call the appropiate hook.
x86/x86/local_apic.c:
- Prefix bare metal public lapic functions with native_ and mark them
as static.
- Define default implementation of apic_ops.
x86/include/apicvar.h:
- Declare the apic_ops structure and create inline functions to
access the hooks, so the change is transparent to existing users of
the lapic_ functions.
x86/xen/hvm.c:
- Switch to use the new apic_ops.
it implicitly in vmm.ko.
Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.
This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.
API function 'vie_calculate_gla()'.
While the current implementation is simplistic it forms the basis of doing
segmentation checks if the guest is in 32-bit protected mode.
of the guest linear address space. These APIs in turn use a new ioctl
'VM_GLA2GPA' to convert the guest linear address to guest physical.
Use the new copyin/copyout APIs when emulating ins/outs instruction in
bhyve(8).
'struct vm_guest_paging'.
Check for canonical addressing in vmm_gla2gpa() and inject a protection
fault into the guest if a violation is detected.
If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to
point to the root of the page tables.
indicate the faulting linear address.
If the guest PML4 entry has the PG_PS bit set then inject a page fault into
the guest with the PGEX_RSV bit set in the error_code.
Get rid of redundant checks for the PG_RW violations when walking the page
tables.