When a vCPU is HLTed, interrupts with a priority below the processor
priority (PPR) should not resume the vCPU while interrupts at or above
the PPR should. With posted interrupts, bhyve maintains a bitmap of
pending interrupts in PIR descriptor along with a single 'pending'
bit. This bit is checked by a CPU running in guest mode at various
places to determine if it should be checked. In addition, another CPU
can force a CPU in guest mode to check for pending interrupts by
sending an IPI to a special IDT vector reserved for this purpose.
bhyve had a bug in that it would only notify a guest vCPU of an
interrupt (e.g. by sending the special IPI or by resuming it if it was
idle due to HLT) if an interrupt arrived that was higher priority than
PPR and no interrupts were currently pending. This assumed that if
the 'pending' bit was set, any needed notification was already in
progress. However, if the first interrupt sent to a HLTed vCPU was
lower priority than PPR and the second was higher than PPR, the first
interrupt would set 'pending' but not notify the vCPU, and the second
interrupt would not notify the vCPU because 'pending' was already set.
To fix this, track the priority of pending interrupts in a separate
per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that
is above PPR and higher than any previously-received interrupt.
This was found and debugged in the bhyve port to SmartOS maintained by
Joyent. Relevant SmartOS bugs with more background:
https://smartos.org/bugview/OS-6829https://smartos.org/bugview/OS-6930https://smartos.org/bugview/OS-7354
Submitted by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: tychon, rgrimes
Obtained from: SmartOS / Joyent
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19299
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
The pmap_works variable is always true for amd64. Remove it, the
branch in the initialization taken when false, and corresponding
sysctl.
Remove pat_table[] local array, work on pat_index[] directly.
Collapse whole initialization to not override already assigned values.
Add comment explaining the choice for PAT4 and PAT7.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
MFC note: Leave the sysctl around
Differential revision: https://reviews.freebsd.org/D19225
Some argument validation error paths would return without releasing the
file reference obtained at the beginning of the function.
While here, fix some style bugs and remove trivial debug prints.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19214
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot
extremely badly and hasn't been fixed for the ~week since it was committed.
Please enable for GENERIC only when it doesn't do that.
Related fallout reported by: lwhsu, tuexen (pr 235611)
%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.
Reviewed by: markj
Approved by: so
Security: CVE-2019-5595
Sponsored by: The FreeBSD Foundation
MFC after: 0 minutes
This allows userspace to trace the kernel using the coverage sanitizer
found in clang. It will also allow other coverage tools to be built as
modules and attach into the same framework.
Sponsored by: DARPA, AFRL
iflib is already a module, but it is unconditionally compiled into the
kernel. There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.
Reviewed by: marius
Discussed with: erj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19041
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.
A new kernel conflig option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.
While here clean up the #include style a little.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18955
For parity with Intel hosts, which already mask out the CPUID feature
bits that indicate the presence of the SPEC_CTRL MSR, do the same on
AMD.
Eventually we may want to have a better support story for guests, but
for now, limit the damage of incorrectly indicating an MSR we do not yet
support.
Eventually, we may want a generic CPUID override system for
administrators, or for minimum supported feature set in heterogenous
environments with failover. That is a much larger scope effort than
this bug fix.
PR: 235010
Reported by: Rys Sommefeldt <rys AT sommefeldt.com>
Sponsored by: Dell EMC Isilon
vmm's CPUID emulation presented Intel topology information to the guest, but
disabled AMD topology information and in some cases passed through garbage.
I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but
guest CPUs can migrate between host threads, so the information presented
was not consistent. This could easily be observed with 'cpucontrol -i 0xfoo
/dev/cpuctl0'.
Slightly improve this situation by enabling the AMD topology feature flag
and presenting at least the CPUID fields used by FreeBSD itself to probe
topology on more modern AMD64 hardware (Family 15h+). Older stuff is
probably less interesting. I have not been able to empirically confirm it
is sufficient, but it should not regress anything either.
Reviewed by: araujo (previous version)
Relnotes: sure
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).
Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.
The first item in the buffer is the number of trace event that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.
The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses on entry per trace event, while the later uses 4. As
such they are incompatible so only a single mode may be enabled.
KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.
A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.
[1] https://github.com/google/syzkaller
Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by: kib
Testing by: tuexen
Sponsored by: DARPA, AFRL
Sponsored by: The FreeBSD Foundation (Mitchell Horne)
Differential Revision: https://reviews.freebsd.org/D14599
It is useful for inspecting tlb shootdown hangs. The smp_tlb_generation value
is available using regular ddb data inspection commands.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This KPI may in principle be used to create kernel mappings, in which
case we certainly should not be setting PG_U. In any case, PG_U must be
set on all layers in the page tables to grant user mode access, and we
were only setting it on leaf entries. Thus, this change should have no
functional impact.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Originally read value is still safely kept. Re-reading code was there
for previous iterations which were partially shared with i386.
Sponsored by: The FreeBSD Foundation
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory. The same
issue existed with the register files in procfs.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18421
See the review for sample test results.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18401
Handling sizes of > 32 backwards will be updated later.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18387
For non-ERMS case the code used handle possible trailing bytes with
movsb first and then followed it up with movsq. This also happened
to alter how calculations were done for other cases.
Handle the tail with regular movs, just like when copying forward.
Use leaq to calculate the right offset from the get go, instead of
doing separate add and sub.
This adjusts the offset for non-rep cases so that they can be used
to handle the tail.
The routine is still a work in progress.
Sponsored by: The FreeBSD Foundation
pmap_large_unmap() asserts that an unmapping request covers the
entirety of a 2M or 1G page. The logic in the asserts was out of date
with the loop logic. Correct the test to actually check that
destroying the current superpage mapping does not unmap addresses
beyond those requested by the caller.
Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by: alc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18345
We zero the whole structure; we don't need to zero the __spare__ field again.
Remove trailing whitespace.
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Mirror the fix for the native i386 implementation from r218327. This
code is compiled only when the non-default COMPAT_43 option is
configured.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18298
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.
But also, the document defines SSB_NO as bit 4, while FreeBSD used but
2 to detect the need to work-around Speculative Store Bypass
issue. Change code to use the bit from SDM.
Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.
In fact my Apollo Lake machine with latest ucode shows this:
IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>
Reviewed by: bwidawsk
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D18006
Instead of jumping to locations which store the exact number of bytes,
use displacement to move the destination.
In particular the following clears an area between 8-16 (inclusive)
branch-free:
movq %r10,(%rdi)
movq %r10,-8(%rdi,%rcx)
For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
Writing 8 bytes starting at that offset overlaps with 6 bytes written
previously and writes 2 new, giving 10 in total.
Provides a nice win for smaller stores. Other ones are erratic depending
on the microarchitecture.
General idea taken from NetBSD (restricted use of the trick) and bionic
string functions (use for various ranges like in this patch).
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17660
Include evdev support and drivers in the amd64 and i386 GENERIC and MINIMAL
kernels. Evdev is used by X and wayland to handle input devices, and this
change, together with upcomming changes in ports will make us handle input
devices better in graphical UIs.
Reviewed by: wulf, bapt, imp
Approved by: imp
Differential Revision: https://reviews.freebsd.org/D17912
We need to know actual value for the standard extended features before
ifuncs are resolved.
Reported and tested by: madpilot
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Both Intel manual and Agner Fog's docs suggest aligning to 16.
See the review for benchmark results.
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17661
linux_ioctl_(un)register_handler that allows other driver modules to
register ioctl handlers. The ioctl syscall implementation in each Linux
compat module iterates over the list of handlers and forwards the call to
the appropriate driver. Because the registration functions have the same
name in each module it is not possible for a driver to support both 32 and
64 bit linux compatibility.
Move the list of ioctl handlers to linux_common.ko so it is shared by
both Linux modules and all drivers receive both 32 and 64 bit ioctl calls
with one registration. These ioctl handlers normally forward the call
to the FreeBSD ioctl handler which can handle both 32 and 64 bit.
Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate
list for now and let the ioctl syscall iterate over that list first.
Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers
via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then
this separate list would disappear again. That is a much bigger effort
however and this commit is meant to be MFCable.
This enables linux64 support in x11/nvidia-driver*.
PR: 206711
Reviewed by: kib
MFC after: 3 days
Avoid using DELAY() since it can try to use spin locks on CPUs without
a P-state invariant TSC. For cpu_lock_delay(), always use the TSC if
it exists (even if it is not P-state invariant) to delay for a
microsecond. If the TSC does not exist, read from I/O port 0x84 to
delay instead.
PR: 228768
Reported by: Roger Hammerstein <cheeky.m@live.com>
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17851
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms. However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() doesn't use locks.
Reviewed by: kib
MFC after: 3 days
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'. Disable the new knob by default. Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob. However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.
Reviewed by: kib, avg
MFC after: 2 months
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D17768
On some Intel devices BIOS does not properly reserve memory (called
"stolen memory") for the GPU. If the stolen memory is claimed by the
OS, functions that depend on stolen memory (like frame buffer
compression) can't be used.
A function called pci_early_quirks that is called before the virtual
memory system is started was added. In Linux, this PCI early quirks
function iterates through all PCI slots to check for any device that
require quirks. While this more generic solution is preferable I only
ported the Intel graphics specific parts because I think my
implementation would be too similar to Linux GPL'd solution after
looking at the Linux code too much.
The code regarding Intel graphics stolen memory was ported from
Linux. In the case of Intel graphics stolen memory this
pci_early_quirks will read the stolen memory base and size from north
bridge registers. The values are stored in global variables that is
later read by linuxkpi_gplv2. Linuxkpi stores these values in a
Linux-specific structure that is read by the drm driver.
Relevant linuxkpi code is here:
https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37
For now, only amd64 arch is suppor ted since that is the only arch
supported by the new drm drivers. I was told that Intel GPUs are
always located on 0:2:0 so these values are hard coded for now.
Note that the structure and early execution of the detection code is
not required in its current form, but we expect that the code will be
added shortly which fixes the potential BIOS bugs by reserving the
stolen range in phys_avail[]. This must be done as early as possible
to avoid conflicts with the potential usage of the memory in kernel.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Reviewed by: bwidawsk, imp
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16719
Differential revision: https://reviews.freebsd.org/D17775
The loader tunable 'debug.verbose_sysinit' may be used to toggle verbosity.
This is added to the debugging section of these kernconfs to be turned off
in stable branches for clarity of intent.
MFC after: never