After the referenced commit, we did not set the x87 and SSE valid bits
in the xstate_bv bitmask of the initial FPU state (stored in memory)
when using XSAVE.
The state is loaded into the FPU register file to initialize the
process FPU state, and since both bits were clear, the hardware default
x87 and SSE states were loaded instead. By chance, the FreeBSD ABI SSE2
state is the same as the FPU initial state, so the bug is not visible
for 64bit processes. But on i386, the precision control should be set
to double (53bit mantissa) instead of the default double extended
(64bit mantissa). For 32bit processes on amd64, the kernel reloads the
control word with the right mask, which leaves only native i386
processes and native amd64 processes that use x87 affected.
Fix it by setting the minimal required xstate_bv mask.
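The effect of the two bits can be sketched as follows. This is only an
illustration: the bit positions follow the Intel SDM (x87 is state
component 0, SSE is component 1), the control word values are the usual
hardware default (0x037f) and the i386 ABI value (0x127f), and the helper
names are hypothetical, not the kernel's.

```c
#include <stdint.h>

/* State-component bits as laid out by the Intel SDM for XSTATE_BV. */
#define XFEATURE_X87	(UINT64_C(1) << 0)
#define XFEATURE_SSE	(UINT64_C(1) << 1)

/* x87 control words: hardware default vs. the i386 ABI initial value. */
#define FPUCW_HW_DEFAULT	0x037f	/* PC = 11b, 64bit mantissa */
#define FPUCW_I386_ABI		0x127f	/* PC = 10b, 53bit mantissa */

/*
 * Hypothetical helper: the minimal xstate_bv for a saved initial state.
 * With both bits set, XRSTOR loads the x87 and SSE areas from memory
 * instead of substituting the hardware init values.
 */
static uint64_t
initial_xstate_bv(void)
{
	return (XFEATURE_X87 | XFEATURE_SSE);
}

/* Extract the precision-control field (bits 9:8) of a control word. */
static unsigned
fpucw_precision(uint16_t cw)
{
	return ((cw >> 8) & 0x3);
}
```

With the bits clear, a process on i386 would start with precision control
3 (double extended) rather than the ABI-mandated 2 (double).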
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Some early PCIe chipsets are explicitly listed in the white-list to
enable use of the MMIO config space accesses, perhaps because ACPI
tables were not a reliable source of the base MCFG address at that time.
For those chipsets, the MCFG base was read from the known chipset MCFGbase
config register.
During the very early stage of boot, when access to the PCI config space
is performed (see e.g. pci_early_quirks.c), we cannot map 255MB of
registers because the method used with pre-boot pmap overflows initial
kernel page tables.
Move fallback to read MCFGbase to the attachment method of the
x86/legacy device, which removes code duplication, and results in the
use of io accesses until MCFG is parsed or legacy attach called.
For amd64, pre-initialize cfgmech with CFGMECH_1, since right now we
dynamically assign CFGMECH_1 to it anyway, and remove the checks for
CFGMECH_NONE.
There is a mention in the Intel documentation for corresponding
chipsets that OS must use either io port or MMIO access method, but we
already break this rule by reading MCFGbase register, so one more
access seems to be innocent.
Reported by: longwitz@incore.de
PR: 236838
Reviewed by: avg (other version), jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19833
For a PCI device (i.e. a child of a PCI bus), reset tries FLR if it is
implemented, and falls back to power reset if FLR is absent or fails.
For a PCIe bus (a child of a PCIe bridge or root port), reset
disables the PCIe link and then re-trains it, performing what is known
as a link-level reset.
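The fallback order for a PCI child can be sketched as below. The types,
the function names and the stub methods are hypothetical and exist only
to make the ordering concrete; the real bus methods look different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reset methods; each returns true on success. */
struct reset_ops {
	bool (*flr)(void *dev);		/* NULL if FLR is not implemented */
	bool (*power_reset)(void *dev);
};

enum reset_kind { RESET_NONE, RESET_FLR, RESET_POWER };

/* Stub methods, used only to exercise the sketch. */
static bool stub_ok(void *dev) { (void)dev; return (true); }
static bool stub_fail(void *dev) { (void)dev; return (false); }

/* Try FLR first; fall back to power reset if FLR is absent or fails. */
static enum reset_kind
reset_pci_child(const struct reset_ops *ops, void *dev)
{
	if (ops->flr != NULL && ops->flr(dev))
		return (RESET_FLR);
	if (ops->power_reset(dev))
		return (RESET_POWER);
	return (RESET_NONE);
}
```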
Reviewed by: imp (previous version), jhb (previous version)
Sponsored by: Mellanox Technologies
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19646
Remove redundant npxsave_core definition while here.
Suggested by: Anton Rang
Reviewed by: kib, Anton Rang <rang AT acm.org>
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19665
For the 32-bit Linuxulator, the ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for IPC. Enable them.
MFC after: 1 month
AMD64_SET_**BASE expects a pointer to a pointer, but we were just passing in the pointer value itself.
Set PCB_FULL_IRET for doreti to restore %fs, %gs and their corresponding bases.
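The class of bug can be illustrated with a small sketch. The handler
below is a hypothetical stand-in, not the sysarch(2) implementation; it
only mirrors the calling convention where the argument points at a
variable holding the new base.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for an AMD64_SET_*BASE style handler: `arg`
 * points at a variable that holds the new base address.
 */
static uint64_t
set_base(const void *arg)
{
	return (*(const uint64_t *)arg);
}

/*
 * Correct caller: keep the base in a variable and pass the address of
 * that variable.  The buggy form would be set_base((void *)base),
 * which makes the handler read whatever the base itself points at.
 */
static uint64_t
set_base_from_value(uint64_t base)
{
	return (set_base(&base));
}
```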
PR: 225105
Reported by: trasz@
MFC after: 1 month
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped. If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages. However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails. The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.
For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.
Reviewed by: kib
Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19670
On some machines, even if SGX is disabled in firmware, the driver
would still attach despite the EPC base and size being equal to
zero. Such behaviour causes a kernel panic when the
module is unloaded. Add a simple check to make sure we
only attach when these values are correctly set.
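A minimal sketch of such a check, assuming a hypothetical structure that
holds the probed EPC parameters (the names are illustrative and are not
the driver's):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probe result; the field names are illustrative. */
struct epc_info {
	uint64_t base;
	uint64_t size;
};

/* Attach only when both the EPC base and the EPC size are non-zero. */
static bool
epc_usable(const struct epc_info *epc)
{
	return (epc->base != 0 && epc->size != 0);
}
```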
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: br
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19595
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for an existing pmap; the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
veto reuse of the existing vmspace.
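The reuse rule can be sketched as a single comparison. The flag value
and the function signature here are assumptions made for illustration;
the real callback takes the process and vmspace and inspects the pmap.

```c
#include <stdbool.h>

#define P_MD_PTI	0x1	/* illustrative value only */

/*
 * Hypothetical form of the veto: the existing vmspace may be reused
 * only if the PTI mode of its pmap matches what the process flag asks
 * for on exec.
 */
static bool
vmspace_reusable(unsigned md_flags, bool pmap_has_pti)
{
	return (((md_flags & P_MD_PTI) != 0) == pmap_has_pti);
}
```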
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
When a pmap with pti disabled (i.e. pm_ucr3 == PMAP_NO_CR3) is
activated, tss.rsp0 was not updated. Any interrupt that happens before
the next context switch would use the pti trampoline stack for the
hardware frame, but fault and interrupt handlers are not prepared for
this. Correctly update tss.rsp0 for both PMAP_NO_CR3 and pti pmaps.
Note that this case, pti = 1 but pmap->pm_ucr3 == PMAP_NO_CR3, is not
used at the moment.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
and for %ecx after RDTSCP.
Initialize the TSC_AUX MSR with the CPU id. It allows userspace to cheaply
identify the CPU it was executed on some time ago, which is sometimes useful.
Note: The values returned might be changed in the future.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Research Unix, 7th Edition introduced TIMEZONE and DSTFLAG
compile-time constants in sys/param.h to communicate these values for
the machine. 4.2BSD moved from compile-time to run-time and
introduced these variables, using them for localtime() to return the
right offset from UTC (sometimes referred to as GMT; for this purpose
they are the same). 4.4BSD migrated to using the tzdata code/database
and these variables were basically unused.
FreeBSD removed the real need for these with adjkerntz in
1995. However, some RTC clocks continued to use these variables,
though they were largely unused otherwise. Later, phk centralized
most of the uses in utc_offset, but left it using both tz_minuteswest
and adjkerntz.
POSIX (IEEE Std 1003.1-2017) states in the gettimeofday specification
"If tzp is not a null pointer, the behavior is unspecified" so there's
no standards reason to retain it anymore. In fact, gettimeofday has
been marked as obsolescent, meaning it could be removed from a future
release of the standard. It is the only interface defined in POSIX
that references these two values. All other references come from the
tzdata database via tzset().
These were used to more faithfully implement early unix ABIs which
have been removed from FreeBSD. NetBSD completely eliminated
these variables years ago. Linux has migrated to tzdata as well,
though these variables technically still exist for compatibility
with unspecified older programs.
So, there's no real reason to have them these days. They are a
historical vestige that's no longer used in any meaningful way.
Reviewed By: jhb@, brooks@
Differential Revision: https://reviews.freebsd.org/D19550
When a vCPU is HLTed, interrupts with a priority below the processor
priority (PPR) should not resume the vCPU while interrupts at or above
the PPR should. With posted interrupts, bhyve maintains a bitmap of
pending interrupts in the PIR descriptor along with a single 'pending'
bit. This bit is checked by a CPU running in guest mode at various
places to determine if the bitmap should be checked. In addition,
another CPU can force a CPU in guest mode to check for pending
interrupts by
sending an IPI to a special IDT vector reserved for this purpose.
bhyve had a bug in that it would only notify a guest vCPU of an
interrupt (e.g. by sending the special IPI or by resuming it if it was
idle due to HLT) if an interrupt arrived that was higher priority than
PPR and no interrupts were currently pending. This assumed that if
the 'pending' bit was set, any needed notification was already in
progress. However, if the first interrupt sent to a HLTed vCPU was
lower priority than PPR and the second was higher than PPR, the first
interrupt would set 'pending' but not notify the vCPU, and the second
interrupt would not notify the vCPU because 'pending' was already set.
To fix this, track the priority of pending interrupts in a separate
per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that
is above PPR and higher than any previously-received interrupt.
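The notification rule can be sketched as below. This is a simplified
model, not the vmm code: the structure and function names are
hypothetical, the priority class is taken as the upper four bits of the
vector or PPR, and clearing the mask when the guest consumes pending
interrupts is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state: bit n set means a class-n interrupt
 * has been posted and not yet consumed. */
struct vcpu_pir {
	uint16_t pending_prio;
};

/* Priority class: the upper four bits of a vector or of the PPR. */
static unsigned
prio_class(uint8_t v)
{
	return (v >> 4);
}

/*
 * Record a posted interrupt and report whether the vCPU must be
 * notified: only if the interrupt is above PPR and higher than any
 * previously-received pending interrupt.
 */
static bool
post_interrupt(struct vcpu_pir *p, uint8_t vector, uint8_t ppr)
{
	unsigned cls = prio_class(vector);
	uint16_t bit = (uint16_t)(1u << cls);
	bool higher_than_pending = bit > p->pending_prio;

	p->pending_prio |= bit;
	return (cls > prio_class(ppr) && higher_than_pending);
}
```

In the failing sequence from the description (a low-priority interrupt
followed by a high-priority one), the second call now requests a
notification even though something was already pending.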
This was found and debugged in the bhyve port to SmartOS maintained by
Joyent. Relevant SmartOS bugs with more background:
https://smartos.org/bugview/OS-6829
https://smartos.org/bugview/OS-6930
https://smartos.org/bugview/OS-7354
Submitted by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: tychon, rgrimes
Obtained from: SmartOS / Joyent
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19299
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
The pmap_works variable is always true for amd64. Remove it, the
branch in the initialization taken when false, and corresponding
sysctl.
Remove pat_table[] local array, work on pat_index[] directly.
Collapse whole initialization to not override already assigned values.
Add comment explaining the choice for PAT4 and PAT7.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
MFC note: Leave the sysctl around
Differential revision: https://reviews.freebsd.org/D19225
Some argument validation error paths would return without releasing the
file reference obtained at the beginning of the function.
While here, fix some style bugs and remove trivial debug prints.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19214
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminishing over time as exploit authors
work out simple ASLR bypass techniques, it eliminates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintenance burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitative change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot
extremely badly and hasn't been fixed for the ~week since it was committed.
Please enable for GENERIC only when it doesn't do that.
Related fallout reported by: lwhsu, tuexen (pr 235611)
%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.
Reviewed by: markj
Approved by: so
Security: CVE-2019-5595
Sponsored by: The FreeBSD Foundation
MFC after: 0 minutes
This allows userspace to trace the kernel using the coverage sanitizer
found in clang. It will also allow other coverage tools to be built as
modules and attach into the same framework.
Sponsored by: DARPA, AFRL
iflib is already a module, but it is unconditionally compiled into the
kernel. There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.
Reviewed by: marius
Discussed with: erj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19041
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.
A new kernel config option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.
While here clean up the #include style a little.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18955
For parity with Intel hosts, which already mask out the CPUID feature
bits that indicate the presence of the SPEC_CTRL MSR, do the same on
AMD.
Eventually we may want to have a better support story for guests, but
for now, limit the damage of incorrectly indicating an MSR we do not yet
support.
Eventually, we may want a generic CPUID override system for
administrators, or for minimum supported feature set in heterogeneous
environments with failover. That is a much larger scope effort than
this bug fix.
PR: 235010
Reported by: Rys Sommefeldt <rys AT sommefeldt.com>
Sponsored by: Dell EMC Isilon
vmm's CPUID emulation presented Intel topology information to the guest, but
disabled AMD topology information and in some cases passed through garbage.
I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but
guest CPUs can migrate between host threads, so the information presented
was not consistent. This could easily be observed with 'cpucontrol -i 0xfoo
/dev/cpuctl0'.
Slightly improve this situation by enabling the AMD topology feature flag
and presenting at least the CPUID fields used by FreeBSD itself to probe
topology on more modern AMD64 hardware (Family 15h+). Older stuff is
probably less interesting. I have not been able to empirically confirm it
is sufficient, but it should not regress anything either.
Reviewed by: araujo (previous version)
Relnotes: sure
When building with KCOV enabled the compiler will insert function calls
to probes allowing us to trace the execution of the kernel from userspace.
These probes are on function entry (trace-pc) and on comparison operations
(trace-cmp).
Userspace can enable the use of these probes on a single kernel thread with
an ioctl interface. It can allocate space for the probe with KIOSETBUFSIZE,
then mmap the allocated buffer and enable tracing with KIOENABLE, with the
trace mode being passed in as the int argument. When complete KIODISABLE
is used to disable tracing.
The first item in the buffer is the number of trace events that have
happened. Userspace can write 0 to this to reset the tracing, and is
expected to do so on first use.
The format of the buffer depends on the trace mode. When in PC tracing just
the return address of the probe is stored. Under comparison tracing the
comparison type, the two arguments, and the return address are traced. The
former method uses one entry per trace event, while the latter uses 4. As
such they are incompatible so only a single mode may be enabled.
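A consumer could walk the buffer along these lines. The sketch assumes
64-bit buffer entries and uses hypothetical names for the modes and
helpers; it only models the layout described above and does not touch
the ioctl interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode names for the two layouts. */
enum trace_mode { MODE_TRACE_PC, MODE_TRACE_CMP };

/* Number of buffer entries consumed by one trace event in each mode. */
static size_t
entries_per_event(enum trace_mode mode)
{
	return (mode == MODE_TRACE_PC ? 1 : 4);
}

/*
 * Return a pointer to the record of event `idx`, or NULL if fewer
 * events have been recorded.  buf[0] holds the event count and the
 * records start at buf[1].  In comparison mode a record is: type,
 * first argument, second argument, return address.
 */
static const uint64_t *
event_record(const uint64_t *buf, enum trace_mode mode, size_t idx)
{
	if (idx >= buf[0])
		return (NULL);
	return (&buf[1 + idx * entries_per_event(mode)]);
}
```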
KCOV is expected to help fuzzing the kernel, and while in development has
already found a number of issues. It is required for the syzkaller system
call fuzzer [1]. Other kernel fuzzers could also make use of it, either
with the current interface, or by extending it with new modes.
A man page is currently being worked on and is expected to be committed
soon, however having the code in the kernel now is useful for other
developers to use.
[1] https://github.com/google/syzkaller
Submitted by: Mitchell Horne <mhorne063@gmail.com> (Earlier version)
Reviewed by: kib
Testing by: tuexen
Sponsored by: DARPA, AFRL
Sponsored by: The FreeBSD Foundation (Mitchell Horne)
Differential Revision: https://reviews.freebsd.org/D14599
It is useful for inspecting tlb shootdown hangs. The smp_tlb_generation value
is available using regular ddb data inspection commands.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This KPI may in principle be used to create kernel mappings, in which
case we certainly should not be setting PG_U. In any case, PG_U must be
set on all layers in the page tables to grant user mode access, and we
were only setting it on leaf entries. Thus, this change should have no
functional impact.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The originally read value is still safely kept. The re-reading code was
there for previous iterations which were partially shared with i386.
Sponsored by: The FreeBSD Foundation
On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory. The same
issue existed with the register files in procfs.
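One common way to avoid this class of disclosure is to clear the whole
structure before filling it in. The sketch below uses a hypothetical
register structure and fill routine; it is not taken from the change
itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical register dump with fields the fill routine never sets. */
struct regs_out {
	uint64_t r_rip;
	uint64_t r_rsp;
	uint32_t r_flags;
	uint32_t r_spare[3];
};

static void
fill_regs_out(struct regs_out *out, uint64_t rip, uint64_t rsp,
    uint32_t flags)
{
	/* Clear everything first so unwritten fields and padding are 0. */
	memset(out, 0, sizeof(*out));
	out->r_rip = rip;
	out->r_rsp = rsp;
	out->r_flags = flags;
}
```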
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18421
See the review for sample test results.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18401
Handling sizes of > 32 backwards will be updated later.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18387
For the non-ERMS case the code used to handle possible trailing bytes with
movsb first and then followed it up with movsq. This also happened
to alter how calculations were done for other cases.
Handle the tail with regular movs, just like when copying forward.
Use leaq to calculate the right offset from the get go, instead of
doing separate add and sub.
This adjusts the offset for non-rep cases so that they can be used
to handle the tail.
The routine is still a work in progress.
Sponsored by: The FreeBSD Foundation
pmap_large_unmap() asserts that an unmapping request covers the
entirety of a 2M or 1G page. The logic in the asserts was out of date
with the loop logic. Correct the test to actually check that
destroying the current superpage mapping does not unmap addresses
beyond those requested by the caller.
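The condition being asserted can be sketched as a containment test. The
helper name and signature are hypothetical, and address overflow is
ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M	(UINT64_C(1) << 21)
#define SZ_1G	(UINT64_C(1) << 30)

/*
 * Hypothetical check: the superpage of size `pgsz` that maps `va` must
 * lie entirely inside the requested range [sva, sva + len), so that
 * destroying it does not unmap addresses outside of the request.
 */
static bool
superpage_within_request(uint64_t va, uint64_t pgsz, uint64_t sva,
    uint64_t len)
{
	uint64_t start = va & ~(pgsz - 1);

	return (start >= sva && start + pgsz <= sva + len);
}
```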
Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Reviewed by: alc
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18345
We zero the whole structure; we don't need to zero the __spare__ field again.
Remove trailing whitespace.
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Mirror the fix for the native i386 implementation from r218327. This
code is compiled only when the non-default COMPAT_43 option is
configured.
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18298
SDM rev. 068 was released yesterday and it contains the description of
the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for
all bits present in the document, and decode them in the CPU
identification lines printed on boot.
But also, the document defines SSB_NO as bit 4, while FreeBSD used bit
2 to detect the need to work around the Speculative Store Bypass
issue. Change the code to use the bit from the SDM.
Similarly, the document describes bit 3 as an indicator that L1TF
issue is not present, in particular, no L1D flush is needed on
VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the
code to follow new spec from SDM.
In fact my Apollo Lake machine with latest ucode shows this:
IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO>
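The bit assignments discussed above can be sketched as follows. The
macro and function names are illustrative, and the kernel's real
decisions take more inputs than this one MSR value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of MSR 0x10a as described above (names are illustrative). */
#define ARCH_CAP_RDCL_NO		(UINT64_C(1) << 0)
#define ARCH_CAP_SKIP_L1DFL_VMENTRY	(UINT64_C(1) << 3)
#define ARCH_CAP_SSB_NO			(UINT64_C(1) << 4)

/* The SSB work-around is needed unless bit 4 (SSB_NO) is set. */
static bool
need_ssb_workaround(uint64_t caps)
{
	return ((caps & ARCH_CAP_SSB_NO) == 0);
}

/* The L1D flush on VMENTRY is needed unless bit 3 is set. */
static bool
need_l1d_flush_on_vmentry(uint64_t caps)
{
	return ((caps & ARCH_CAP_SKIP_L1DFL_VMENTRY) == 0);
}
```

For the value 0x19 quoted above, bits 0, 3 and 4 are set, so neither
mitigation is required under this reading.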
Reviewed by: bwidawsk
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D18006
Instead of jumping to locations which store the exact number of bytes,
use displacement to move the destination.
In particular the following clears an area between 8-16 (inclusive)
branch-free:
	movq	%r10,(%rdi)
	movq	%r10,-8(%rdi,%rcx)
For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
Writing 8 bytes starting at that offset overlaps with 6 bytes written
previously and writes 2 new, giving 10 in total.
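The same trick can be written in C as a rough equivalent of the two
stores. The function name is hypothetical, and memcpy() of a fixed
8-byte value stands in for each unaligned movq.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Zero `len` bytes at `dst`, for 8 <= len <= 16, without branching on
 * the exact length: one 8-byte store at the start and one 8-byte store
 * that ends exactly at dst + len.  The two stores overlap whenever
 * len < 16.
 */
static void
zero_8_to_16(unsigned char *dst, size_t len)
{
	const uint64_t zero = 0;

	memcpy(dst, &zero, sizeof(zero));		/* movq %r10,(%rdi) */
	memcpy(dst + len - 8, &zero, sizeof(zero));	/* movq %r10,-8(%rdi,%rcx) */
}
```

For len = 10 the second store starts at offset 2, rewrites 6 bytes that
are already zero and clears the remaining 2.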
Provides a nice win for smaller stores. Other ones are erratic depending
on the microarchitecture.
General idea taken from NetBSD (restricted use of the trick) and bionic
string functions (use for various ranges like in this patch).
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17660
Include evdev support and drivers in the amd64 and i386 GENERIC and MINIMAL
kernels. Evdev is used by X and wayland to handle input devices, and this
change, together with upcoming changes in ports will make us handle input
devices better in graphical UIs.
Reviewed by: wulf, bapt, imp
Approved by: imp
Differential Revision: https://reviews.freebsd.org/D17912
We need to know actual value for the standard extended features before
ifuncs are resolved.
Reported and tested by: madpilot
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Both Intel manual and Agner Fog's docs suggest aligning to 16.
See the review for benchmark results.
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17661
linux_ioctl_(un)register_handler that allows other driver modules to
register ioctl handlers. The ioctl syscall implementation in each Linux
compat module iterates over the list of handlers and forwards the call to
the appropriate driver. Because the registration functions have the same
name in each module it is not possible for a driver to support both 32 and
64 bit linux compatibility.
Move the list of ioctl handlers to linux_common.ko so it is shared by
both Linux modules and all drivers receive both 32 and 64 bit ioctl calls
with one registration. These ioctl handlers normally forward the call
to the FreeBSD ioctl handler which can handle both 32 and 64 bit.
Keep the special COMPAT_LINUX32 ioctl handlers in linux.ko in a separate
list for now and let the ioctl syscall iterate over that list first.
Later, COMPAT_LINUX32 support can be added to the 64 bit ioctl handlers
via a runtime check for ILP32 like is done for COMPAT_FREEBSD32 and then
this separate list would disappear again. That is a much bigger effort
however and this commit is meant to be MFCable.
This enables linux64 support in x11/nvidia-driver*.
PR: 206711
Reviewed by: kib
MFC after: 3 days
Avoid using DELAY() since it can try to use spin locks on CPUs without
a P-state invariant TSC. For cpu_lock_delay(), always use the TSC if
it exists (even if it is not P-state invariant) to delay for a
microsecond. If the TSC does not exist, read from I/O port 0x84 to
delay instead.
PR: 228768
Reported by: Roger Hammerstein <cheeky.m@live.com>
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17851
Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently
cpu_lock_delay() is defined to DELAY(1) on all platforms. However,
platforms with a DELAY() implementation that uses spin locks should
implement a custom cpu_lock_delay() that doesn't use locks.
Reviewed by: kib
MFC after: 3 days
Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic'
and make the calls to kdb_trap() in MD fatal trap handlers prior to
calling panic() conditional on this new knob instead of
'debugger_on_panic'. Disable the new knob by default. Developers who
wish to recover from a fatal fault by adjusting saved register state
and retrying the faulting instruction can still do so by enabling the
new knob. However, for the more common case this makes the user
experience for panics due to a fatal fault match the user experience
for other panics, e.g. 'c' in DDB will generate a crash dump and
reboot the system rather than being stuck in an infinite loop of fatal
fault messages and DDB prompts.
Reviewed by: kib, avg
MFC after: 2 months
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D17768
On some Intel devices BIOS does not properly reserve memory (called
"stolen memory") for the GPU. If the stolen memory is claimed by the
OS, functions that depend on stolen memory (like frame buffer
compression) can't be used.
A function called pci_early_quirks that is called before the virtual
memory system is started was added. In Linux, this PCI early quirks
function iterates through all PCI slots to check for any device that
requires quirks. While this more generic solution is preferable I only
ported the Intel graphics specific parts because I think my
implementation would be too similar to Linux GPL'd solution after
looking at the Linux code too much.
The code regarding Intel graphics stolen memory was ported from
Linux. In the case of Intel graphics stolen memory this
pci_early_quirks will read the stolen memory base and size from north
bridge registers. The values are stored in global variables that are
later read by linuxkpi_gplv2. Linuxkpi stores these values in a
Linux-specific structure that is read by the drm driver.
Relevant linuxkpi code is here:
https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37
For now, only amd64 arch is supported since that is the only arch
supported by the new drm drivers. I was told that Intel GPUs are
always located on 0:2:0 so these values are hard coded for now.
Note that the structure and early execution of the detection code is
not required in its current form, but we expect that the code will be
added shortly which fixes the potential BIOS bugs by reserving the
stolen range in phys_avail[]. This must be done as early as possible
to avoid conflicts with the potential usage of the memory in kernel.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Reviewed by: bwidawsk, imp
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16719
Differential revision: https://reviews.freebsd.org/D17775
The loader tunable 'debug.verbose_sysinit' may be used to toggle verbosity.
This is added to the debugging section of these kernconfs to be turned off
in stable branches for clarity of intent.
MFC after: never
Architectures Software Developer’s Manual Volume 3"). Add the document
to SEE ALSO in bhyve.8 (and pet manlint here a bit).
Reviewed by: jhb, rgrimes, 0mp
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D17531
Instead of finding the exact size to fit in we can just shift the target
by -8 + tail. Doing a blind write to a previously rep stosq'ed area comes
with a penalty so do it conditionally.
Sample win on EPYC when zeroing a 257 sized buffer (tail = 1) aligned to
16 bytes:
before: 44782846 ops/s
after: 46118614 ops/s
Idea stolen from NetBSD.
Sponsored by: The FreeBSD Foundation
This driver has been obsolete since FreeBSD 4.x. It should have
been removed then since the sym(4) driver had subsumed it. The driver
was commented out of GENERIC in 2000.
RelNotes: Yes
stg(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popular among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.
Relnote: Yes
nsp(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popular among pc98 hackers in
the 4.x time frame. It was also only enabled on i386.
Relnote: Yes
ncv(4) is marked as gone in 12. Remove it. There are no sightings of
it in the nycbug dmesg database. It was for an obscure SCSI card that
sold mostly in Japan, and was especially popular among pc98 hackers in
the 4.x time frame.
Relnote: Yes
We're planning on removing adv, adw, aha, aic, bt, ncv, nsp, and stg
soon. They have been tagged for removal in 12. At least get them out
of GENERIC.
MFC after: 3 days
Relnotes: yes
The knob allows selecting the flushing mode or turning it off/on. The
idea, as well as the list of the ignored syscall errors, were taken
from https://www.openwall.com/lists/kernel-hardening/2018/10/11/10 .
I was not able to measure a statistically significant difference between
flush enabled and disabled using syscall_timing getuid.
Reviewed by: bwidawsk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17536
When modifying an existing managed mapping, we should find a PV entry
for the old mapping. Verify this.
Before r335784 this would have been implicitly tested by the fact that
we always freed the PV entry for the old mapping.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17626
This makes the compiler less likely to reload the content from %gs.
The 'P' modifier drops all syntax prefixes and the 'n' constraint treats
the input as an immediate integer known at compile time.
Example reloading victim was spinlock_enter.
Stolen from: OpenBSD
Reported by: jtl
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17615
Apparently AMD machines cannot tolerate this. This was uncovered by
r339386, where cache flush started really flushing the requested range.
Introduce pmap_mapdev_pciecfg(), which, in contrast to pmap_mapdev(),
simply does not flush the cache. It assumes that the MCFG region was
never accessed through a cacheable mapping, which is most likely
true for the machine to boot at all.
Note that i386 does not need the change, since the architecture
handles access per-page due to the KVA shortage, and page remapping
already does not flush the cache.
Reported and tested by: mjg, Mike Tancsa <mike@sentex.net>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17612
The KPI allows mapping very large contiguous physical memory regions,
which are not covered by DMAP, into KVA.
I see, both with QEMU and with some real hardware that has started
shipping, that the regions for NVDIMMs might be very far apart from
normal RAM, and we expect that at least the initial users of NVDIMM
could install very large amounts of such memory. IMO it is not
reasonable to extend DMAP to cover those far-away regions, both because
it could overflow the existing 4T window for DMAP in KVA, and because
it costs page table page allocations, for the gap and for possibly
unused NV RAM.
Also, the KPI provides some special functionality for fast cache
flushing based on the knowledge of the NVRAM mapping use.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17070
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D17070
cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself. Fix this by explicitly reloading the host
LDT selector after each #VMEXIT. The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.
Reviewed by: kib
Tested by: Mike Tancsa <mike@sentex.net>
Approved by: re (gjb)
MFC after: 2 weeks
The vast majority of syscalls take 6 or fewer arguments. Move handling of
other cases to a fallback function. Similarly, special casing for the
_syscall and __syscall magic syscalls is moved away.
Return is almost always 0. The change replaces 3 branches with 1 in the common
case. Also the 'frame' variable convinces clang not to reload it on each access.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17542
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).
This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.
The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.
A man page update that documents these drivers is forthcoming in a separate
commit.
Reviewed by: sbruno@, kbowling@
Tested by: jeffrey.e.pieper@intel.com
Approved by: re (gjb@)
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
See r339205 for justification.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17526
The function tweaks CPU capabilities based on the VM platform and
tunables, which affected selection of the cache flush method before
ifuncs were used, and should affect the cache flush in the same way
after ifunc.
PR: 232081
Reported by: phk
Analyzed by: avg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.
Due to the instruction encoding peculiarity, also emulate SFENCE.
PR: 232081
Reported by: phk
Reviewed by: araujo, avg, jhb (all: previous version)
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17482
The reasoning is the same as with the memset change, see r339205
Reviewed by: kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17441
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.
Convert places which need them. Xen bits by royger.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17487
The VT-x VMCS only stores the base address of the GDTR and IDTR. As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR, losing the smaller limits set when the initial GDT is loaded
on each CPU during boot. Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.
Similarly, explicitly save and restore the LDT selector. VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.
PR: 230773
Reported by: John Levon <levon@movementarian.org>
Reviewed by: kib
Tested by: araujo
Approved by: re (rgrimes)
MFC after: 1 week
configuring kernels for i386, amd64, and arm64.
The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration
in r337967.
Approved by: re (kib@)
Reviewed by: ler@
Differential Revision: https://reviews.freebsd.org/D17458
Sponsored by: Netflix, Inc.
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel I investigated performance impact of just
regular movs.
The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a 256 breaking
point on which it falls back to rep stos. It provides a significant win
over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit even for multiple of 64
sizes.
See the review for benchmark data.
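The scheme can be sketched in C roughly as follows. This is only an
illustration: the committed code is assembly, the name memset_sketch
is hypothetical, libc memset() stands in for rep stos, and treating
sizes strictly above 256 as the fallback case is an assumption about
where exactly the breaking point sits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Breaking point taken from the text; exact comparison is assumed. */
#define	SKETCH_REP_THRESHOLD	256

static void *
memset_sketch(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	/* Replicate the fill byte into all 8 bytes of a 64-bit word. */
	const uint64_t pat = 0x0101010101010101ULL * (unsigned char)c;

	if (len > SKETCH_REP_THRESHOLD)
		return (memset(dst, c, len));	/* stand-in for rep stos */

	/* Main loop: four 8-byte stores, i.e. 32 bytes per iteration. */
	for (; len >= 32; p += 32, len -= 32) {
		memcpy(p, &pat, 8);
		memcpy(p + 8, &pat, 8);
		memcpy(p + 16, &pat, 8);
		memcpy(p + 24, &pat, 8);
	}
	/* Remainder of at most 31 bytes. */
	for (; len > 0; p++, len--)
		*p = (unsigned char)c;
	return (dst);
}
```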
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17398
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17377
Such data may later be unmapped. This occurs, for example, when a
loader-provided microcode update file is discarded.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17340
For PCID case, there is a dependency between pm_gen zeroing and
reading pm_active for IPI target selection, to ensure that the
invalidation is not missed.
Reported and tested by: mjg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
The function stopped swapping rdi and rsi, but the error handling
code was not updated with the new register name.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
This reverts part of r333368. The attempt to clear DR6 was occurring
too soon as trapsignal() does not pause to let the debugger notice the
SIGTRAP and query DR6. The signal exchange does not occur until much
later during ast(). As a result, GDB was no longer recognizing
hardware breakpoints and watchpoints on x86.
In addition, any userland programs that want to inspect DR6 in a
SIGTRAP handler don't have a way to do this if we clear DR6 in the
exception handler.
Instead of relying on the kernel to clear DR6, debuggers will have to
explicitly clear it after a trace trap (which they needed to do on
older kernels anyway).
Reviewed by: kib
Approved by: re (delphij)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17319
- remove a forward branch in the common case
- replace xchg + lodsb/stosb loop with simple movs
A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying
/foo/bar/baz in a loop goes from 295715863 ops/s to 465807408.
Further changes are pending.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17281
- move the PSL.AC comment to the fault handler
- stop testing for zero-sized ops. after several minutes of package
building there were no copyin calls with zero bytes and very few
copyout. the semantic of returning 0 in this case is preserved
- shorten exit paths by clearing %eax earlier
- replace xchg with 3 movs. this is what compilers do. a naive
benchmark on EPYC suggests about 1% increase in throughput thanks to
this change.
- remove the useless movb %cl,%al from copyout. it looks like a
leftover from many years ago
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17286
Both the in-kernel C variant and libc asm variant have very poor performance.
The former compiles to a single byte comparison loop, which breaks down even
for small sizes. The latter uses rep cmpsq/b which turn out to have very poor
throughput and are slower than a hand-coded 32-byte comparison loop.
Depending on size this is about 3-4 times faster than the current routines.
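A minimal C sketch of a 32-byte comparison loop, for illustration only.
The committed routine is assembly; the name memcmp_sketch and the
byte-wise scan used to locate the mismatch inside a differing chunk are
assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int
memcmp_sketch(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1, *p2 = b2;
	uint64_t w1[4], w2[4];
	size_t i;

	/* Compare four 8-byte words (32 bytes) per iteration. */
	while (len >= 32) {
		memcpy(w1, p1, sizeof(w1));
		memcpy(w2, p2, sizeof(w2));
		if (((w1[0] ^ w2[0]) | (w1[1] ^ w2[1]) |
		    (w1[2] ^ w2[2]) | (w1[3] ^ w2[3])) != 0)
			break;		/* mismatch is inside this chunk */
		p1 += 32;
		p2 += 32;
		len -= 32;
	}
	/* Locate the differing byte, or handle the short remainder. */
	for (i = 0; i < len; i++) {
		if (p1[i] != p2[i])
			return (p1[i] - p2[i]);
	}
	return (0);
}
```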
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17328
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.
Found with: syzkaller
Reviewed by: araujo, tychon (previous version)
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17192
dmaplimit is the first byte after the end of DMAP.
Reported by: "Johnson, Archna" <Archna.Johnson@netapp.com>
Reviewed by: alc, markj
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17318
For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel
pmaps, as it was done before the conversion to ifuncs. The reset is
useless but innocent for kernel_pmap. Coverity reported that cpuid is
used uninitialized in this case.
Reported by: cem
Reviewed by: alc, cem, markj
CID: 1395807
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17314
Split calculation of mask for shootdown IPI and local
invalidation. Reorder IPI before local.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D17277
- _fault handlers for both primitives are identical, provide just one
- change the copying scheme to match memcpy (in particular jump
avoidance for the most common case of multiple of 8)
- stop re-reading pcb address on exit, just store it locally (in r9)
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17265
If the size is 15 bytes or less avoid spinning up rep just to copy the 8
bytes. In my tests on EPYC and old Intel microarchs without ERMS (like
Westmere) it provided a nice win over the current version (e.g. for EPYC
memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all
while almost not pessimizing the other cases.
Data collected during package building shows that < 16 sizes are pretty
common.
Verified with the glibc test suite.
Approved by: re (kib)
Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a
cpu without ERMS and with size being a multiple of 8 a page fault would be
triggered resulting in EFAULT.
Pointy hat: mjg
Approved by: re (implicit)
A lot of functions have the following check:
cmpq %rax,%rdi /* verify address is valid */
ja fusufault
The label is present earlier in kernel .text, which means this is a jump
backwards. Absent any information in branch predictor, the cpu predicts it
as taken. Since it is almost never taken in practice, this results in a
completely avoidable misprediction.
Move it past all consumers, so that it is predicted as not taken.
Approved by: re (kib)
This simplifies the runtime logic and reduces the number of
runtime-constant branches.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16736
pm_pcid is unsigned.
Reviewed by: cem, markj
CID: 1395727
Noted by: cem
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D17235
Patch removes all checks for pti/pcid/invpcid from the context switch
path. I verified this by looking at the generated code, compiling with
the in-tree clang. The invpcid_works1 trick required inline attribute
for pmap_activate_sw_pcid_pti() to work.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Tested with the glibc test suite.
Approved by: re (kib)
This will be used in following conversion of pmap_activate_sw().
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is a braino in the non-erms variant which breaks the
functionality.
Will be fixed at a later time with a different patch.
Reported by: Manfred Antar
Approved by: re (implicit)
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Approved by: re (kib)
Intel docs claim such a memset (rep stosb + 4096 bytes) is
special-cased by microarchs. They also switched Linux to use
it for this purpose.
Approved by: re (gjb)
The stac/clac combo around each byte copy is causing a measurable
slowdown in benchmarks. Do it only before and after all data is
copied. While here reorder the code to avoid a forward branch in
the common case.
Note the copying loop (originating from copyinstr) is avoidably slow
and will be fixed later.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17063
Also this fixes the eflags.ac leak from copyin_smap() when the copied
data length is multiple of eight bytes.
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Non-PTI mode does not switch kcr3, which means that kcr3 is almost
always stale. This is important for the NMI handler, which reloads
%cr3 with PCPU(kcr3) if the value is different from PMAP_NO_CR3.
The end result is that curpmap in NMI handler does not match the page
table loaded into hardware. The manifestation was copyin(9) looping
forever when a usermode access page fault cannot be resolved by
vm_fault() updating a different page table.
Reported by: mmacy
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Approved by: re (gjb)
This appeared to be required to have EFI RT support and EFI RTC
enabled by default, because there are too many reports of faulting
calls on many different machines. The knob is added to leave the
exceptions unhandled to allow debugging the actual bugs.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
handling.
This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
trap_pfault() KPTI violation check.
EFI RT may set curpmap to NULL for the duration of the call for some
machines (PCID but no INVPCID). Since apparently EFI RT code must be
ready for exceptions from the calls, avoid dereferencing curpmap until
we know that this call does not come from usermode.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.
Based on the submission by: royger
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Approved by: re (marius)
Differential revision: https://reviews.freebsd.org/D16881
table allocation.
At the time that mp_bootaddress() is called, phys_avail[] array does
not reflect some memory reservations already done, like kernel
placement. Recent changes to DMAP protection which make kernel text
read-only in DMAP revealed this, where on some machines AP boot page
tables selection appears to intersect with the kernel itself.
Fix this by checking the addresses selected using the same algorithm
as bootaddr_rwx(). Also, try to chomp pages for the page table not
only at the start of the contiguous range, but also at the end. This
should improve robustness when the only suitable range is already
consumed by the kernel.
Reported and tested by: Michael Gmelin <freebsd@grem.de>
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16907
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
Add pmap_activate_boot() for i386, move the invocation on APs from MD
init_secondary() to x86 init_secondary_tail().
Suggested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16893
Revert r338177, r338176, r338175, r338174, r338172
After long consultations with re@, core members and mmacy, revert
these changes. Followup changes will be made to mark them as
deprecated and print a message about where to find the up-to-date
driver. Followup commits will be made to make this clear in the
installer. Followup commits to reduce POLA in ways we're still
exploring.
It's anticipated that after the freeze, this will be removed in
13-current (with the residual of the drm2 code copied to
sys/arm/dev/drm2 for the TEGRA port's use w/o the intel or
radeon drivers).
Due to the impending freeze, there was no formal core vote for
this. I've been talking to different core members all day, as well as
Matt Macey and Glen Barber. Nobody is completely happy, all are
grudgingly going along with this. Work is in progress to mitigate
the negative effects as much as possible.
Requested by: re@ (gjb, rgrimes)
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries. With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
The default lookup routine uses kobj, which is not functional at that
point.
- Apply all existing relocations during boot rather than filtering
IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
when loading a kernel module.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16749
As discussed on the MLs drm2 conflicts with the ports' version and there
is no upstream for most if not all of drm. Both have been merged in to
a single port.
Users on powerpc, 32-bit hardware, or with GPUs predating Radeon
and i915 will need to install the graphics/drm-legacy-kmod. All
other users should be able to use one of the LinuxKPI-based ports:
graphics/drm-stable-kmod, graphics/drm-next-kmod, graphics/drm-devel-kmod.
MFC: never
Approved by: core@
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)
Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825
If an exception or NMI occurs before CPU switched to a pmap different
from vmspace0, PCPU kcr3 is left zero for pti config, which causes
triple-fault in the handler.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D. If NMI occurs after either of the
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers. There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.
Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded. If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.
Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16790
- In configurations with a pseudo devices section, move 'device crypto'
into that section.
- Use a consistent comment. Note that other things common in kernel
configs such as GELI also require 'device crypto', not just IPSEC.
Reviewed by: rgrimes, cem, imp
Differential Revision: https://reviews.freebsd.org/D16775
Ensure that the valid PCID state is created for proc0 pmap, since it
might be used by efirt enter() before first context switch on the BSP.
Sponsored by: The FreeBSD Foundation
MFC after: 6 days
On the guest entry in bhyve, flush L1 data cache, using either L1D
flush command MSR if available, or by reading enough uninteresting
data to fill whole cache.
Flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.
Security: CVE-2018-3646
Reviewed by: emaste, jhb, Tony Luck <tony.luck@intel.com>
Sponsored by: The FreeBSD Foundation
We always zero the invalidated PTE/PDE for superpage, which means that
L1TF CPU vulnerability (CVE-2018-3620) can be only used for reading
from the page at zero.
Note that both i386 and amd64 exclude the page from phys_avail[]
array, so this change is redundant, but I think that phys_avail[] on
UEFI-boot does not need to do that. Eventually the blacklisting
should be made conditional on CPUs which report that they are not
vulnerable to L1TF.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
curpmap.
When performing a context switch on a machine without PCID, if the
current %cr3 equals the new pmap %cr3, which is typical for kernel_pmap
vs. kernel process, I neglected to update the PCPU curpmap value. Remove
the check for %cr3 not equal to pm_cr3 for doing the update. It is
believed that this case cannot happen at all, due to other changes in
this revision.
Also, do not set the very first curpmap to kernel_pmap, it should be
vmspace0 pmap instead to match curproc.
Move the common code to activate the initial pmap both on BSP and APs
into pmap_activate_boot() helper.
Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by: alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16618
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel. Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method. For the time being only Intel updates are supported.
Microcode update files are passed to the kernel via loader(8). The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update. Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update. Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16370
efi_enter here was needed because efi_runtime dereference causes a fault
outside of EFI context, due to runtime table living in runtime service
space. This may cause problems early in boot, though, so instead access it
by converting paddr to KVA for access.
While here, remove the other direct PHYS_TO_DMAP calls and the explicit DMAP
requirement from efidev.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16591
This patch adds a new sysctl(8) knob "security.jail.vmm_allowed",
by default this option is disabled.
Submitted by: Shawn Webb <shawn.webb____hardenedbsd.org>
Reviewed by: jamie@ and myself.
Relnotes: Yes.
Sponsored by: HardenedBSD and G2, Inc.
Differential Revision: https://reviews.freebsd.org/D16057
As noted in UPDATING, the new loader tunable efi.rt_disabled may be used to
disable EFIRT at runtime. It should have no effect if you are not booted via
UEFI boot.
MFC after: 6 weeks
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access. Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
There's no difference between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).
Differential Review: https://reviews.freebsd.org/D16290
Do not use vm_map_remove() to release KVA back to the system. Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables. Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().
Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.
While here, port r329171 to i386.
Reported by: alc
Reviewed by: alc, kib
X-MFC with: r336505
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16426
the AMD document 55449 'Revision Guide for AMD Family 17h Models
00h-0Fh Processors' rev 1.12.
The errata numbers are mentioned near each action.
It seems that newer BIOSes already include required chicken bits
settings, so the magic MSR updates are only needed when BIOS cannot be
updated. On the other hand, MWAIT avoidance seems to be important.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
code never sees FPU pcb flags not consistent with the hardware state.
This is uncovered by the eager FPU switch mode.
Analyzed, reviewed and tested by: gleb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data. This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded. Previously,
it would have remained unused.
Reviewed by: kib, royger
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16330
In order to set up an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.
This allows booting FreeBSD as a PVHv2 DomU and Dom0.
Sponsored by: Citrix Systems R&D
The PVHv2 entry point is fairly similar to the multiboot1 one. The
kernel is started in protected mode with paging disabled. More
information about the exact BSP state can be found in the pvh.markdown
document on the Xen tree.
This entry point is going to be joined with the native entry point at
hammer_time, and in order to do so the BSP needs to be bootstrapped
into long mode with the same set of page tables as used on bare metal.
Sponsored by: Citrix Systems R&D
This restores counters(9) operation.
Revert r336024. Improve assert of pcpu size on x86.
Reviewed by: mmacy
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16163
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries. For instance, in
the buildworld load, this change reduces the number of pmap_remove()
calls by 1/5.
Profiled by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16148
SMP systems by extending defined(SMP) to include defined(KLD_MODULE).
This is a regression issue after r335873 .
Discussed with: mmacy@
Sponsored by: Mellanox Technologies
Apply temporary fix to counter until daylight hours.
The fact that the assembly for counter_u64_add relied on
sizeof(struct pcpu), which was the basis for the otherwise arbitrary
offset, never came up in D15933.
critical_{enter,exit} is now inline so the only real added overhead is the
added (mostly false) conditional branch in exit.
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
It is possible that a fictitious unmanaged userspace mapping of
superpage is created on x86, e.g. by pmap_object_init_pt(), with the
physical address outside the vm_page_array[] coverage.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
physical address, which is readily available after successful
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
mapping, then it leaks the unlinked PV entry. This change eliminates that
leak, freeing the PV entry.
Reviewed by: kib, markj
X-MFC with: r335784
Differential Revision: https://reviews.freebsd.org/D16130
returning NULL.
vm_fault_quick_hold_pages() can be legitimately called on userspace
mappings backed by fictitious pages created by unmanaged device and sg
pagers.
Note that pmap_extract_and_hold() on other architectures might need a
similar fix, but I postponed the examination.
Reported by: bde
Discussed with: alc
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16085
The ADD, AND, OR, and SUB instructions take at most a 32-bit
sign-extended immediate operand. 64-bit constants that do not fit into
that constraint need to be loaded into a register. The 'i' constraint
tells the compiler it can pass any integer constant to the assembler,
whereas the 'e' constraint only permits constants that fit into a 32-bit
sign-extended value. This fixes using
atomic_add/clear/set/subtract_long/64 with constants that do not fit into
a 32-bit sign-extended immediate.
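The range check behind the 'e' constraint can be illustrated with a small sketch. The names fits_simm32(), atomic_add_64_sketch() and simm32_demo() are hypothetical and exist only for this example; the "er" constraint string is an assumption about how such an operand can be expressed, not a quote of the committed code. The non-x86 branch is a plain, non-atomic fallback so the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when v is representable as a 32-bit
 * sign-extended immediate, the range the 'e' constraint accepts.
 */
static bool
fits_simm32(int64_t v)
{
	return (v >= INT32_MIN && v <= INT32_MAX);
}

/*
 * Sketch of a 64-bit atomic add.  With "er" the compiler may emit an
 * immediate only when the constant fits and otherwise uses a register.
 * The #else branch is NOT atomic; it is only a portability fallback.
 */
static inline void
atomic_add_64_sketch(volatile uint64_t *p, uint64_t v)
{
#if defined(__x86_64__)
	__asm__ __volatile__("lock; addq %1,%0"
	    : "+m" (*p)
	    : "er" (v)
	    : "cc");
#else
	*p += v;
#endif
}

static void
simm32_demo(void)
{
	uint64_t x = 1;

	assert(fits_simm32(INT32_MAX));
	assert(!fits_simm32((int64_t)INT32_MAX + 1));
	/* 2^32 does not fit an immediate, so it must travel in a register. */
	atomic_add_64_sketch(&x, 0x100000000ULL);
	assert(x == 0x100000001ULL);
}
```

A constant such as 1UL << 32 is the kind of value that an 'i' constraint would hand to the assembler unchanged, even though it cannot be encoded as an immediate.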
Reported by: several folks
Tested by: Pete Wright <pete@nomadlogic.org>
MFC after: 2 weeks
- inline atomics in modules on i386 and amd64 (they were always
inline on other arches)
- allow modules to opt in to inlining locks by specifying
MODULE_TIED=1 in the makefile
Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
Doing so ensures that all threads sharing the pmap have a consistent
view of the mapping. This fixes the problem described in the commit
log messages for r329254 without the overhead of an extra fault in the
common case. Once other pmap_enter() implementations are similarly
modified, the workaround added in r329254 can be removed, reducing the
overhead of CoW faults.
With this change we can reuse the PV entry from the old mapping,
potentially avoiding a call to reclaim_pv_chunk(). Otherwise, there is
nothing preventing the old PV entry from being reclaimed. In rare
cases this could result in the PTE's page table page being freed,
leading to a use-after-free of the page when the updated PTE is written
following the allocation of the PV entry for the new mapping.
Reported and tested by: pho
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D16005
without error code. Doing so misaligned the stack.
Since the only consumer of SSE instructions with alignment
requirements is the AES-NI module, and since the FPU context cannot be
accessed in interrupts, the only situation where the alignment matters
is the compat32 syscalls, as reported in the PR.
PR: 229222
Reported and tested by: dewayne@heuristicsystems.com.au
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The call to reclaim_pv_chunk() in reserve_pv_entries() may free a
PV chunk with free entries belonging to the current pmap. In this
case we must account for the free entries that were reclaimed, or
reserve_pv_entries() may return without having reserved the requested
number of entries.
Reviewed by: alc, kib
Tested by: pho (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15911
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.
The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
v = v0 * 1000000 + v1 * 1000 + v2;
The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
v = (((a) << 16) + ((b) << 8) + (c))
The Linux kernel uses the latter format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h
Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.
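The mismatch between the two encodings can be checked with a short sketch. The function names kernver_decimal(), kernver_shift() and kernver_demo() are hypothetical; the formulas are the two given above.

```c
#include <assert.h>

/* Decimal packing, as the text describes for the old linux_map_osrel(). */
static int
kernver_decimal(int v0, int v1, int v2)
{
	return (v0 * 1000000 + v1 * 1000 + v2);
}

/* Shift packing, as in LINUX_KERNVER() and Linux KERNEL_VERSION(). */
static int
kernver_shift(int a, int b, int c)
{
	return ((a << 16) + (b << 8) + c);
}

static void
kernver_demo(void)
{
	/* The same release, 2.6.32, yields two unrelated integers. */
	assert(kernver_decimal(2, 6, 32) == 2006032);
	assert(kernver_shift(2, 6, 32) == 132640);
	assert(kernver_decimal(2, 6, 32) != kernver_shift(2, 6, 32));
}
```

Comparing a value produced by one packing against a threshold produced by the other gives meaningless results, which is why a single macro has to be used on both sides.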
PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
Update the driver to use iflib in order to bring performance,
maintainability, and (hopefully) stability benefits to the driver.
The driver currently isn't completely ported; features that are missing:
- VF driver (ixlv)
- SR-IOV host support
- RDMA support
The plan is to have these re-added to the driver before the next FreeBSD release.
Reviewed by: gallatin@
Contributions by: gallatin@, mmacy@, krzysztof.galazka@intel.com
Tested by: jeffrey.e.pieper@intel.com
MFC after: 1 month
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D15577
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.
Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files. We may need finer grained control in the future but this is
sufficient for now.
Reviewed by: andrew
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D15237
Give up and remove the almost useless informational message reporting
that a device-not-available exception occurred while our state tracking
indicates the current CPU has FPU context loaded for the current
thread.
It seems that this is a recurring bug with some VM monitors.
Sponsored by: The FreeBSD Foundation
With compilers making increasing use of vector instructions the
performance benefit of lazily switching FPU state is no longer a
desirable tradeoff. Linux switched to eager FPU context switch some
time ago, and the idea was floated on the FreeBSD-current mailing list
some years ago[1].
Enable eager FPU context switch by default on amd64, with a tunable/sysctl
available to turn it back off.
[1] https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055198.html
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.
(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)
This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.
Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interface to obtain executable pages.
Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
and also on apic in common and i386 files (except for xen it is optional
only on xenhvm), but it was not ifdefed except on apic in common and i386
files.
This is all that is left from an attempt to build a (sub-)minimal kernel
without any devices. The isa "option" is still used without ifdefs in many
standard files even on amd64. ISAPNP is not optional on at least i386.
ATPIC is not optional on i386 (it is used mainly for Xspuriousint). But
pci is now supposed to be optional on x86.
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.
While facially a reasonable cleanup, this change was motivated
by the need to work around a compiler bug.
core2_intr(cpu, tf) ->
pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
pmc_add_sample(cpu, ring, pm, tf, inuserspace)
In the process of optimizing the tail call the tf pointer was getting
clobbered:
(kgdb) up
at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709 pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
resulting in a crash in pmc_save_kernel_callchain.
memset fills the target buffer from a byte-sized value passed in as the
second argument.
The fully-sized (8 bytes) register containing it is named %rsi. Lower 4 bytes
can be referred to as %esi and finally the lowest byte is %sil.
The vast majority of callers just zero the target buffer and set it up by
doing xor %esi,%esi, which has the side effect of zeroing the upper parts of
the register as well. Some others do a word-sized move to %esi, which has the
same result.
However, there are callers which only fill %sil. This does *not* clear up
the rest of the register.
The value of %rsi is multiplied by $0x0101010101010101 to create an 8-byte sized
pattern for 8-byte stores.
Prior to the patch, the func just blindly took %rsi assuming the unwanted bytes
are zeroed out. Since this is not the case for the callers which only play with
%sil (the rest of the register can have absolutely anything), the resulting
pattern can be garbage.
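The effect of stale upper bytes on the multiplication can be modelled in C. This is only a sketch: the function names are hypothetical, and masking to the low byte is an assumption about one way to discard the unwanted bytes, since the text above only says they cannot be trusted.

```c
#include <assert.h>
#include <stdint.h>

#define	FILL_MULT	0x0101010101010101ULL

/* Behaviour described as pre-patch: trust all 64 bits of the register. */
static uint64_t
fill_pattern_unmasked(uint64_t rsi)
{
	return (rsi * FILL_MULT);
}

/* Keep only the low byte (the %sil part) before replicating it. */
static uint64_t
fill_pattern_masked(uint64_t rsi)
{
	return ((rsi & 0xff) * FILL_MULT);
}

static void
fill_pattern_demo(void)
{
	/* Clean upper bytes: the plain multiply gives the byte pattern. */
	assert(fill_pattern_unmasked(0xab) == 0xababababababababULL);
	/* Stale upper bytes: only the masked variant is still correct. */
	assert(fill_pattern_masked(0xdeadbe00000000abULL) ==
	    0xababababababababULL);
	assert(fill_pattern_unmasked(0xdeadbe00000000abULL) !=
	    0xababababababababULL);
}
```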
This has potential for funny bugs. One side effect (which was not amusing)
after enabling it instead of bzero was that the kernel was hanging on boot
as a xen domU.
Reported by: Trond Endrestøl <Trond.Endrestol fagskolen.gjovik.no>
Pointy hat: me
pagetables.
physmap[] can be inconsistent with the physical memory limit due to
a buggy BIOS, or to the hw.physmem tunable. Since bootstrap pagetables
are initialized by accesses through the DMAP, we must ensure that the DMAP
really covers the selected pages. This is only relevant when the machine
has less than 4G of RAM and a buggy BIOS, which is the combination on the
Acer Chromebook 720.
The call to mp_bootaddress() is moved later to have Maxmem initialized.
An alternative could be to always cover 4G for DMAP, but this change
seems to be simpler.
Reported and tested by: grembo
Reviewed by: royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15675
- increase pmc cpuid field from 8 to 12 bits
- add cpuid version string to initialize entry in the log
so that filter can identify which counter index an
event name maps to
- GC unused config flags
- make fixed counter assignment more robust as well as the
changes needed to be properly identified for filter
Currently all the primitives are waiting for a rewrite, tidy them up in the
meantime.
The vast majority of cases pass sizes which are a multiple of 8, which means
the following rep stosb/movsb has nothing to do. Turns out testing first if there
is anything to do is a big win across the board (cpus with and without ERMS,
Intel and AMD) while not pessimizing the case where there is work to do.
Sample results for zeroing 64 bytes (ops/second):
Ryzen Threadripper 1950X 91433212 -> 147265741
Intel(R) Xeon(R) CPU X5675 @ 3.07GHz 90714044 -> 121992888
bzero and bcopy are on their way out and were not modified. Nothing in the
tree uses them.
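The remainder test described above (checking whether any bytes are left before entering the byte-sized rep stosb/movsb path) can be sketched in portable C. The name memset_sketch() is hypothetical, and the loops stand in for the rep-prefixed instructions; in C the explicit test is redundant with the loop condition and is kept only to mark where the assembly skips the byte-sized path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *
memset_sketch(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	uint64_t pat = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;
	size_t words = len / 8;
	size_t tail = len % 8;

	/* Bulk part: 8-byte stores (memcpy avoids alignment assumptions). */
	for (; words > 0; words--, p += 8)
		memcpy(p, &pat, sizeof(pat));

	/* Only take the byte-sized path when a remainder exists. */
	if (tail != 0) {
		for (; tail > 0; tail--)
			*p++ = (unsigned char)c;
	}
	return (dst);
}

static void
memset_sketch_demo(void)
{
	unsigned char buf[24];

	memset(buf, 0xff, sizeof(buf));
	memset_sketch(buf, 0, 16);		/* multiple of 8: no tail */
	assert(buf[15] == 0 && buf[16] == 0xff);
	memset_sketch(buf + 16, 0x5a, 5);	/* tail-only case */
	assert(buf[20] == 0x5a && buf[21] == 0xff);
}
```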
file in /sys/conf, so was unavailable in configurations that don't use
modules, and was not testable or notable in NOTES. Its normal
configuration (not using a module) is still silently deprecated in
aout(4) by not mentioning it there.
Update i386 NOTES for COMPAT_AOUT. It is not i386-only, or even very MD.
Sort its entry better.
Finish gzip configuration (but not support) for amd64. gzip is really
gzipped aout. It is currently broken even for i386 (a call to vm fails).
amd64 has always attempted to configure and test it, but it depends on
COMPAT_AOUT (as noted). The bug that it depends on unconfigured files
was not detected since it is configured as a device. All other optional
image activators are configured properly using an option.
time, especially for SMP. If configured, it turns itself on at boot
time for calibration, so is fragile even if never otherwise used.
Both types of kernel profiling were supposed to use a global spinlock
in the SMP case. If hi-res profiling is configured (but not necessarily
used), this was supposed to be optimized by only using it when
necessary, and slightly more efficiently, in asm. But it was not done
at all for mcount entry where it is necessary. This caused crashes
in the SMP case when either type of profiling was enabled. For mcount
exit, it only caused wrong times. The times were wrongest with an
i8254 timer since using that requires exclusive access to the hardware.
The i8254 timer was too slow to use here 20 years ago and is much less
usable now, but it is the default for the SMP case since TSCs weren't
invariant when SMP was new. Do the locking in all hi-res SMP cases for
simplicity.
Calibration uses special asms, and the clobber lists in these were sort
of inverted. They contained the arg and return registers which are not
clobbered, but on amd64 they didn't contain the residue of the call-used
registers which may be clobbered (%r10 and %r11). This usually caused
hangs at boot time. This usually affected even the UP case.
kernel profiling remains broken).
memmove() was broken using ALTENTRY(). ALTENTRY() is only different from
ENTRY() in the profiling case, and its use in that case was sort of
backwards. The backwardness magically turned memmove() into memcpy()
instead of completely breaking it. Only the high resolution parts of
profiling itself were broken. Use ordinary ENTRY() for memmove().
Turn bcopy() into a tail call to memmove() to reduce complications.
This gives slightly different pessimizations and profiling lossage.
The pessimizations are minimized by not using a frame pointer for
bcopy().
Calls to profiling functions from exception trampolines were not
relocated. This caused crashes on the first exception. Fix this using
function pointers.
Addresses of exception handlers in trampolines were not relocated. This
caused unknown offsets in the profiling data. Relocate by abusing
setidt_disp as for pmc although this is slower than necessary and
requires namespace pollution. pmc seems to be missing some relocations.
Stack traces and lots of other things in debuggers need similar relocations.
Most user addresses were misclassified as unknown kernel addresses and
then ignored. Treat all unknown addresses as user. Now only user
addresses in the kernel text range are significantly misclassified (as
known kernel addresses).
The ibrs functions didn't preserve enough registers. This is the only
recent breakage on amd64. Although these functions are written in
asm, in the profiling case they call profiling functions which are
mostly for the C ABI, so they only have to save call-used registers.
They also have to save arg and return registers in some cases and
actually save them in all cases to reduce complications. They end up
saving all registers except %ecx on i386 and %r10 and %r11 on amd64.
Saving these is only needed for 1 caller on each of amd64 and i386.
Save them there. This is slightly simpler.
Remove saving %ecx in handle_ibrs_exit on i386. Both handle_ibrs_entry
and handle_ibrs_exit use %ecx, but only the latter needed to or did
save it. But saving it there doesn't work for the profiling case.
amd64 has more automatic saving of the most common scratch registers
%rax, %rcx and %rdx (its complications for %r10 are from unusual use
of %r10 by SYSCALL). Thus profiling of handle_ibrs_exit_rs() was not
broken, and I didn't simplify the saving by moving the saving of these
registers from it to the caller.
Intel now provides comprehensive tables for all performance counters
and the various valid configuration permutations as text .json files.
Libpmc has been converted to use these and hwpmc_core has been greatly
simplified by moving to passthrough of the table values.
The one gotcha is that said tables don't support Pentium Pro and Pentium
IV. There are very few users of hwpmc on _amd64_ kernels on new hardware. It is
unlikely that anyone is doing low level optimization on 15 year old Intel
hardware. Nonetheless, if someone feels strongly enough to populate the
corresponding tables for p4 and ppro I will reinstate the files into the
build.
Code for the K8 counters and !x86 architectures remains unchanged.
This is a follow-up to r321483, which disabled -Wmacro-redefined for
some lib/msun tests.
If an application included both fenv.h and ieeefp.h, several macros such
as __fldcw(), __fldenv() were defined in both headers, with slightly
different arguments, leading to conflicts.
Fix this by putting all the common macros in the machine-specific
versions of ieeefp.h. Where needed, update the arguments in places
where the macros are invoked.
This also slightly reduces the differences between the amd64 and i386
versions of ieeefp.h.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D15633
The copied data is accessed in part soon after, which results in additional
cache misses during a -j 1 buildkernel WITHOUT_CTF=yes KERNFAST=1, as measured
with pmc stat.
before:
256165411 cache-references # 0.003 refs/inst
15105408 cache-misses # 5.897%
20.70 real # 99.67% cpu
13.24 user # 63.94% cpu
7.40 sys # 35.73% cpu
after:
256764469 cache-references # 0.003 refs/inst
11913551 cache-misses # 4.640%
20.70 real # 99.67% cpu
13.19 user # 63.73% cpu
7.44 sys # 35.95% cpu
Note the real time did not change, but traffic to RAM was reduced (multiple
measurements performed with switching the implementation at runtime).
Since nobody else is using non-temporal for this and there is no apparent
benefit at least these days, don't use them either.
Side note is that pagecopy arguments should probably get reversed to not
have to flip them around in the primitive.
Discussed with: jeff
The TSC-s are checked and synchronized only if they were good
originally. That is, invariant, synchronized, etc.
This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second. The APs are in sync with each other. Not
sure if this is a hardware quirk or a firmware bug.
This is what I see after a resume with this change:
SMP: passed TSC synchronization test after adjustment
acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
Instead, construct an auxargs array and copy it out all at once.
Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happened
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.
Return errors on copyout() and suword() failures and handle them in the
caller.
Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARGS_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).
Reviewed by: kib
Comments from: emaste, jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15485
We certainly should clear PSL_T when calling the SIGTRAP signal
handler, which is already done by all x86 sendsig(9) ABI code. On the
other hand, there is no obvious reason why PSL_T needs to be cleared
when returning from the signal handler. For instance, Linux allows
userspace to set PSL_T and keep tracing enabled for the desired
period. There are userspace programs which would use PSL_T if we make
it possible, for instance sbcl.
Remember whether PSL_T was set by PT_STEP or PT_SETSTEP by means of the
TDB_STEP flag, and only clear it when the flag is set.
Discussed with: Ali Mashtizadeh
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15054
- Add constants for fields in DR6 and the reserved fields in DR7. Use
these constants instead of magic numbers in most places that use DR6
and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
recommended by the SDM dating all the way back to the 386. This
allows debuggers to determine the cause of each exception. For
kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
to other parts of the handler (namely, user_dbreg_trap()). For user
traps, wait until after trapsignal to clear DR6 so that userland
debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
in trapsignal().
Reviewed by: kib, rgrimes
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15189
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.
Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto
Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).
SSBD bit detection and control was verified with prerelease microcode.
Security: CVE-2018-3639
Tested by: emaste (previous version, without updated microcode)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
When we issue shootdown IPIs, we first assign zero to pm_gens to
indicate the need to flush on the next context switch in case our IPI
misses the context, next we read pm_active. On context switch we set
our bit in pm_active, then we read pm_gen. It is crucial that both
threads see the memory in the program order, otherwise invalidation
thread might read pm_active bit as zero and the context switching
thread might read pm_gen as zero.
IA32 allows both reads to see zero. We must use barriers between the
write and the read. The pm_active bit set is already locked, so
only the invalidation functions need one.
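The write-then-read pattern on both sides can be sketched with C11 atomics. This is a model, not the pmap code: the variables and function names are hypothetical stand-ins for one CPU's pm_gen slot and pm_active bit, and an explicit sequentially consistent fence is used on both sides, whereas the text above relies on the locked bit-set instruction for the context switch side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a single CPU's generation and active bit. */
static atomic_uint pm_gen_sketch = 1;
static atomic_uint pm_active_sketch = 0;

/* Invalidation side: zero the generation, fence, then read active. */
static bool
invalidate_sketch(void)
{
	atomic_store_explicit(&pm_gen_sketch, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&pm_active_sketch,
	    memory_order_relaxed) != 0);	/* true: send the IPI */
}

/* Context switch side: set the active bit, fence, then read the generation. */
static bool
context_switch_sketch(void)
{
	atomic_fetch_or_explicit(&pm_active_sketch, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&pm_gen_sketch,
	    memory_order_relaxed) == 0);	/* true: flush the TLB */
}

/* Single-threaded walk through the two sides, in order. */
static void
pm_barrier_demo(void)
{
	assert(!invalidate_sketch());		/* nobody active yet */
	assert(context_switch_sketch());	/* generation already zeroed */
	assert(invalidate_sketch());		/* now the bit is visible */
}
```

With the fences in place, two threads running one function each cannot both return false, which is the outcome the barrier is meant to rule out. The demo only exercises the sequential case.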
I never saw it in real life, or at least I do not have a good
reproduction case. I found this during code inspection when hunting
for the Xen TLB issue reported by cperciva.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D15506
This turns on support for kernel dump encryption and compression, and
netdump. arm and mips platforms are omitted for now, since they are more
constrained and don't benefit as much from these features.
Reviewed by: cem, manu, rgrimes
Tested by: manu (arm64)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D15465
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.
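The bounds check can be sketched as follows. The function vm_mem_read_sketch() and its parameters are hypothetical and model guest memory as a flat buffer; the actual device read path is not shown in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes starting at offset off of a memsize-byte image.
 * Returns 0 on success, or EFAULT when the request starts at or
 * extends past the end, instead of handing back zeroes.
 */
static int
vm_mem_read_sketch(const uint8_t *mem, size_t memsize, size_t off,
    void *buf, size_t len)
{
	if (off >= memsize || len > memsize - off)
		return (EFAULT);
	memcpy(buf, mem + off, len);
	return (0);
}

static void
vm_mem_read_demo(void)
{
	const uint8_t mem[4] = { 1, 2, 3, 4 };
	uint8_t buf[4] = { 0 };

	assert(vm_mem_read_sketch(mem, sizeof(mem), 0, buf, 4) == 0);
	assert(buf[3] == 4);
	/* Starting at the end, or running past it, is an error. */
	assert(vm_mem_read_sketch(mem, sizeof(mem), 4, buf, 1) == EFAULT);
	assert(vm_mem_read_sketch(mem, sizeof(mem), 2, buf, 4) == EFAULT);
}
```

A reader such as dd(1) then sees an error at the end of memory and stops instead of looping on zero-filled blocks.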
Reviewed by: imp, grehan, anish
Approved by: grehan
Differential Revision: https://reviews.freebsd.org/D15156
Kernel debuggers depend on symbol names to find stack frames with a
trapframe rather than a normal stack frame. The labels used for the
shared interrupt entry point for the PTI and non-PTI cases did not
match the existing patterns, confusing debuggers. Add the '.L' prefix
to mark these symbols as local so they are not visible in the symbol
table.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Chelsio Communications
From now on, linking amd64 kernel requires either lld or newer ld.bfd.
Reviewed by: jhb (as part of the large patch)
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
Adapt assembly generated by clang for memcmp and use it for <= 64 sized
compares (which are the vast majority).
Sample result of doing stats on Broadwell (% of samples):
before: 4.0 kernel bcmp cache_lookup
after : 0.7 kernel bcmp cache_lookup
The routine is most definitely still not optimal. Anyone interested in
spending time improving it is welcome to take over.
Reviewed by: kib