e4b8deb222 removed the last in-tree uses of PCPU_INC(). Its
potential benefit is also practically nonexistent. Non-x86
platforms already implement it as PCPU_ADD(..., 1), and according
to [0] there are no recent x86 processors for which the 'inc'
instruction provides a performance benefit over the equivalent
memory-operand form of the 'add' instruction. The only remaining
benefit of 'inc' is smaller instruction size, which in this case
is inconsequential given the limited number of per-CPU data consumers.
[0]: https://www.agner.org/optimize/instruction_tables.pdf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29308
This is a prerequisite to using these functions outside of ddb, but also
provides some cleanup and minor refactoring. This code is almost
entirely duplicated between the two implementations, the only
significant difference being the lack of dbreg synchronization on i386.
Cleanups are:
- demote some internal functions to static
- use the constant NDBREGS instead of a '4' literal
- remove K&R definitions
- some added comments
Reviewed by: kib, jhb
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29153
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.
We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.
[1]: https://github.com/freebsd/uefi-edk2/pull/9/
Originally by: scottph
MFC After: 1 week
Sponsored by: Intel Corporation
Reviewed by: grehan
Approved by: philip (mentor)
Differential Revision: https://reviews.freebsd.org/D24066
As follow-on work to e4b8deb222, move page table page
allocation and freeing into their own functions. Use these
functions to provide separate kernel vs. user page table page
accounting, and to wrap common tasks such as management of
zero-filled page state.
Requested by: markj, kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29151
POSIX states that new threads created via pthread_create() should
inherit the "floating point environment" from the creating thread.
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29204
- Use update_pcb_bases() when updating FS or GS base addresses to
permit use of FSBASE and GSBASE in Linux processes. This also sets
PCB_FULL_IRET. linux32 was setting PCB_32BIT which should be a
no-op (exec sets it).
- Remove write-only variables to construct unused segment descriptors
for linux32.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29026
Before the pcb is copied to the new thread during cpu_fork() and
cpu_copy_thread(), the kernel re-reads the current register values in
case they are stale. This is done by setting PCB_FULL_IRET in
pcb_flags.
This works fine for user threads, but the creation of kernel processes
and kernel threads do not follow the normal synchronization rules for
pcb_flags. Specifically, new kernel processes are always forked from
thread0, not from curthread, so adjusting pcb_flags via a simple
instruction without the LOCK prefix can race with thread0 running on
another CPU. Similarly, kthread_add() clones from the first thread in
the relevant kernel process, not from curthread. In practice, Netflix
encountered a panic where the pcb_flags in the first kthread of the
KTLS process were trashed due to update_pcb_bases() in
cpu_copy_thread() running from thread0 to create one of the other KTLS
threads racing with the first KTLS kthread calling fpu_kern_thread()
on another CPU. In the panicking case, the write to update pcb_flags
in fpu_kern_thread() was lost triggering an "Unregistered use of FPU
in kernel" panic when the first KTLS kthread later tried to use the
FPU.
Reported by: gallatin
Discussed with: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29023
This change converts most of the counters in the amd64 pmap from
global atomics to scalable counter(9) counters. Per discussion
with kib@, it also removes the handrolled per-CPU PCID save count
as it isn't considered generally useful.
The bulk of these counters remain guarded by PV_STATS, as it seems
unlikely that they will be useful outside of very specific debugging
scenarios. However, this change does add two new counters that
are available without PV_STATS. pt_page_count and pv_page_count
track the number of active physical-to-virtual list pages and page
table pages, respectively. These will be useful in evaluating
the memory footprint of pmap structures under various workloads,
which will help to guide future changes in this area.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28923
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29103
invltlb_invpcid_pti_handler() was requesting delayed TLB invalidation
even for processes that aren't using PTI. With an out-of-tree
change to avoid PTI for non-jailed root processes, this caused an
assertion failure in pmap_activate_sw_pcid_pti() when context-switching
between PTI and non-PTI processes.
Reviewed by: bdrewery kib tychon
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D29094
Otherwise during attach newbus prints "nexus0", which is not very
useful.
The generic nexus device is already quiet, as is nexus_acpi on arm64.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The purpose of these checks is to ensure that the address of the
next-level page table page is valid, since nothing is synchronizing with
a concurrent update of the large map and large map PTPs are freed to the
system. However, if PG_PS is set, there is no next level.
Reported by: rpokala
Reviewed by: kib
Tested by: rpokala
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28922
In some cases the DELAY implementation on amd64 can recurse on a spin
mutex in the i8254 early delay code. Detect when this is going to
happen and don't call delay in this case. It is safe to not delay here
with the only issue being KCSAN may not detect data races.
Reviewed by: kib
Tested by: arichardson
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D28895
Add it to the x86 GENERIC and MINIMAL kernels
Sponsored by: Ampere Computing LLC
Submitted by: Klara Inc.
Reviewed by: rpokala
Differential Revision: https://reviews.freebsd.org/D28738
Tested with glibc test suite.
The C variant in libkern performs excessive branching to find the zero
byte instead of using the bsfq instruction. The same code patched to use
it is still slower than the routine implemented here as the compiler
keeps neglecting to perform certain optimizations (like using leaq).
On top of that the routine can be used as a starting point for copyinstr
which operates on words intead of bytes.
The previous attempt had an instance of swapped operands to andq when
dealing with fully aligned case, which had a side effect of breaking the
code for certain corner cases. Noted by jrtc27.
Sample results:
$(perl -e "print 'A' x 3"):
stock: 211198039
patched:338626619
asm: 465609618
$(perl -e "print 'A' x 100"):
stock: 83151997
patched: 98285919
asm: 120719888
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D28779
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.
In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions. Abstracting this check reduces the diff relative
to FreeBSD. It is perhaps slightly more readable as well.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D28710
linux_shared_page_init() creates an object and grabs and maps a single
page to back the VDSO. When destroying the VDSO object, we failed to
destroy the mapping and free KVA. Fix this.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28696
Allow setting the bootmethod variable from the Xen PVH entry point, in
order to be able to correctly set the underlying firmware mode when
booted as a dom0.
Move the bootmethod variable to be defined in x86/cpu_machdep.c
instead so it can be shared by both i386 and amd64.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D28619
This reverts commit af366d353b.
Trips over '\xa4' byte and terminates early, as found in
lib/libc/gen/setdomainname_test:setdomainname_basic testcase
However, keep moving libkern/strlen.c out of conf/files.
Reported by: lwhsu
The C variant in libkern performs excessive branching to find the
non-zero byte instead of using the bsfq instruction. The same code
patched to use it is still slower than the routine implemented here
as the compiler keeps neglecting to perform certain optimizations
(like using leaq).
On top of that the routine can is a starting point for copyinstr
which operates on words instead of bytes.
Tested with glibc test suite.
Sample results (calls/s):
Haswell:
$(perl -e "print 'A' x 3"):
stock: 211198039
patched:338626619
asm: 465609618
$(perl -e "print 'A' x 100"):
stock: 83151997
patched: 98285919
asm: 120719888
AMD EPYC 7R32:
$(perl -e "print 'A' x 3"):
stock: 282523617
asm: 491498172
$(perl -e "print 'A' x 100"):
stock: 114857172
asm: 112082057
One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.
The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28238
After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28237
vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.
No functional change intended.
Reviewed by: grehan
Differential revision: https://reviews.freebsd.org/D28236
This is a tradeoff which saves jumps for smaller sizes while making
the 8-16 range slower (roughly in line with the other cases).
Tested with glibc test suite.
For example size 3 (most common with vfs namecache) (ops/s):
before: 407086026
after: 461391995
The regressed range of 8-16 (with 8 as example):
before: 540850489
after: 461671032
All page zeroing is using temporal stores with rep movs*, the routine is
unused for several years.
Should a need arise for zeroing using non-temporal stores, a more
optimized variant can be implemented with a more descriptive name.
The ptrace(2) Linux man page claims the syscall returns ESRCH,
if the tracee is not stopped; the native ptrace(2) returns EBUSY.
Sponsored by: The FreeBSD Foundation
Based on discussions on freebsd-arch@, enable KERN_TLS in
GENERIC on amd64, but leave it disabled via the
sysctl kern.ipc.tls.enable. Users wishing to enable
ktls must set kern.ipc.tls.enable=1
While here, fix wording in NOTES to mention that KERN_TLS
also does receive now.
Sponsored by: Netflix
Reviewed by: allanjude
Differential Revision: https://reviews.freebsd.org/D28163
usbhid(4) is disabled by default to avoid conflicts with existing USB HID
drivers. To enable it place following lines to /boot/loader.conf:
hw.usb.usbhid.enable=1
usbhid_load="YES"
Suggested by: jhb
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D28124
This is the superset of the nooptions found in the -DEBUG kernels.
Reviewed by: emaste, manu
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D28152
In particular, using GELI on a root filesystem will only use
accelerated software crypto drivers if they are available before the
root filesystem is mounted. While these modules can be loaded from
the loader, including them in GENERIC provides a better out-of-the-box
experience for users.
Both aesni(4) and armv8crypto(4) provide accelerated implementations
of the default cipher used by GELI (AES-XTS) in addition to other
ciphers.
Reviewed by: mhorne, allanjude, markj
Differential Revision: https://reviews.freebsd.org/D28100
Right now the routine leaves the current CPU in the map, later tripping
on an assert when filling in the scoreboard: panic: IPI scoreboard is
zero, initiator 1 target 1
Instead pre-check if all CPUs are present in the map and remember that
outcome for later.
Fixes: 7eaea04a5b ("amd64: compare TLB shootdown target to all_cpus")
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28111
On amd64, the pmap code passes all_cpus to
smp_targeted_tlb_shootdown() when unmapping from the
kernel pmap. This function has an optimized path to send IPIs
to all but itself, which it intends to do when the target
is all cpus. However, we need to compare the target cpu mask
with all_cpus, rather than using CPU_ISFULLSET(). Comparing with
CPU_ISFULLSET() will only work when we have MAXCPU cpus active in
the system, otherwise, we'll be sending repeated IPIs, rather than
a single IPI to all CPUs but ourself.
Fixing this should reduce the time spent in native_lapic_ipi_wait()
as we will be sending ipis in parallel, rather than one-by-one.
This is confirmed by dtrace.
Reviewed by: alc, jhb, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28102
Otherwise parallel pmap_allocpte_alloc() for nearby va might also fail
allocating page table page and free the page under us. The end result is
that we could dereference unmapped pte when doing cleanup after sleep.
Instead, on allocation failure, first free everything, only then we can
drop pmap mutex and sleep safely, right before returning to caller.
Split inner non-sleepable part of the pmap_allocpte_alloc() into a new
helper pmap_allocpte_nosleep().
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
The function performs actual allocation of pte, as opposed to
pmap_allocpte() that uses existing free pte if pt page is already
there. This also moves function out of namespace similar to a language
reserved.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
pmap_pdpe() might return NULL, check for it.
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows user-space to make use of the dm_op family of hypercalls,
which are used by device models.
Sponsored by: Citrix Systems R&D
When we already have the vm page in hand, use vm_page_domain() instead
of vm_phys_domain(). The former has a trivial constant-time
implementation whereas the latter iterates over the mem_affinity array.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D28005
Remove wi(4). pccard is going away, and wi only supports PC Card
devices, though it has a minor amount of glue to also support
PCI cards. However, removing the one without removing the other
is hard, so the whole driver is being removed.
Relnotes: Yes
This change implements hid_if.m methods for HID-over-USB protocol [1].
Also, this change adds USBHID_ENABLED kernel option which changes
device_probe() priority and adds/removes PnP records to prefer usbhid
over ums, ukbd, wmt and other USB HID device drivers and vice-versa.
The module is based on uhid(4) driver. It is disabled by default for
now due to conflicts with existing USB HID drivers.
[1] https://www.usb.org/sites/default/files/hid1_11.pdf
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D27893
This does an import of quirk stubs, debugging macros from USB code and
numerous usage constants used by dependent drivers.
Besides, this change renames some functions to get a better matching
with userland library and NetBSD/OpenBSD HID code. Namely:
- Old hid_report_size() renamed to hid_report_size_max()
- New hid_report_size() calculates size of given report rather than
maximum size of all reports.
- hid_get_data_unsigned() renamed to hid_get_udata()
- hid_put_data_unsigned() renamed to hid_put_udata()
Compat shim functions are provided in usbhid.h to make possible compile
of legacy code unmodified after this change.
Reviewed by: manu, hselasky
Differential revision: https://reviews.freebsd.org/D27887
It will be used by the upcoming HID-over-i2C implementation. Should be
no-op, except hid.ko module dependency is to be added to affected drivers.
Reviewed by: hselasky, manu
Differential revision: https://reviews.freebsd.org/D27867
These handlers could interrupt code which has interrupts disabled,
and if a spurious page fault occurs during exception handler run,
we get clobbered %cr2 in higher level stack.
This is mostly a speculation, but it is based on hints from good sources.
MFC after: 1 week
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27772
The existing values correspond to x86 exception vector numbers, but the
trap numbers used in the kernel do not match these 1-to-1. Prefer the
definitions from x86/trap.h, as they are what actually get passed to
kdb_trap(). This is of little consequence, as gdb_cpu_signal() only
reports the trap reason (signal number) to the gdb client.
This is limited to the subset of trap values for which kdb_trap() is
reachable.
Reviewed by: kib
Discussed with: jhb
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27645
This sysctl node can generate very verbose output, so don't trigger it
for sysctl -a or sysctl vm.pmap.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D27504
Similar to the recent patch to arm's gdb stub in r368414, allow GDB to
update the contents of most general purpose registers.
Reviewed by: cem, jhb, markj
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
NetApp PR: 44
Differential Revision: https://reviews.freebsd.org/D27642
When r362031 moved local TLB invalidation after shootdown IPI send, it
moved too much. In particular, PCID-mode clearing of the pm_gen
generation counters must occur before IPIs are send, which is in fact
described by the comment before seq_cst fence in the invalidation
functions.
Fix it by extracting pm_gen clearing into new helper
pmap_invalidate_preipi(), which is executed before a call to
smp_masked_tlb_shootdown().
Rest of the local invalidation callbacks is simplified as result, and
become very similar to the remote shootdown handlers (to be merged in
some future).
Move pin of the thread to pmap_invalidate_preipi(), and do unpin in
smp_masked_tlb_shootdown().
Reported and tested by: mjg (previous version)
Reviewed by: alc, cem (previous version), markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D227588
Ability to load-balance traffic over multiple path is a must-have thing for routers.
It may be used by the servers to balance outgoing traffic over multiple default gateways.
The previous implementation, RADIX_MPATH stayed in the shadow for too long.
It was not well maintained, which lead us to a vicious circle - people were using
non-contiguous mask or firewalls to achieve similar goals. As a result, some routing
daemons implementation still don't have multipath support enabled for FreeBSD.
Turning on ROUTE_MPATH by default would fix it. It will allow to reduce networking
feature gap to other operating systems. Linux and OpenBSD enabled similar support
at least 5 years ago.
ROUTE_MPATH does not consume memory unless actually used. It enables around ~1k LOC.
It does not bring any behaviour changes for userland.
Additionally, feature is (temporarily) turned off by the net.route.multipath sysctl
defaulting to 0.
Differential Revision: https://reviews.freebsd.org/D27428
The hme (Happy Meal Ethernet) driver was the onboard NIC in most
supported sparc64 platforms. A few PCI NICs do exist, but we have seen
no evidence of use on non-sparc systems.
Reviewed by: imp, emaste, bcr
Sponsored by: DARPA
In r367395 parts of machine dependent linux_dummy.c were moved to a new
machine independent file sys/compat/linux/linux_dummy.c and the existing
linux_dummy.c was renamed to linux_dummy_machdep.c.
Add linux_dummy_machdep.c to the linux module for i386.
Rename sys/amd64/linux32/linux_dummy.c for consistency.
Add the new linux_dummy.c to the linux module for i386.
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.
struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.
Differential Revision: https://reviews.freebsd.org/D27373
We use 4-level EPT pages, correct the upper bound.
Reviewed by: grehan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27402
There is no need for these to be function pointers since they are
never modified post-module load.
Rename AMD/Intel ops to be more consistent.
Submitted by: adam_fenn.io
Reviewed by: markj, grehan
Approved by: grehan (bhyve)
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D27375
where it originally was. The bug introduced in r366267.
o Remove options IOMMU from i386/MINIMAL as we don't have it in
i386/GENERIC.
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: kib
Sponsored by: Innovate DSbD
Differential Revision: https://reviews.freebsd.org/D27399
This is a relic from when these instructions weren't supported by the toolchain.
No functional change.
Submitted by: adam_fenn.io
Reviewed by: grehan
Approved by: grehan (bhyve)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D27130
i386 and the rest of supported architectures by defining KERNLOAD in the
vmparam.h and getting rid of magic constant in the linker script, which albeit
documented via comment but isn't programmatically accessible at a compile time.
Use KERNLOAD to eliminate another (matching) magic constant 100 lines down
inside unremarkable TU "copy.c" 3 levels deep in the EFI loader tree.
Reviewed by: markj
Approved by: markj
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D27355
This reduces some code duplication. One behavior change is that
ppt_assign_device() will now only succeed if the device is unowned.
Previously, a device could be assigned to the same VM multiple times,
but each time it was assigned, the device's state was reset.
Reviewed by: markj, grehan
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27301
Add a new ioctl to disable all MSI-X interrupts for a PCI passthrough
device and invoke it if a write to the MSI-X capability registers
disables MSI-X. This avoids leaving MSI-X interrupts enabled on the
host if a guest device driver has disabled them (e.g. as part of
detaching a guest device driver).
This was found by Chelsio QA when testing that a Linux guest could
switch from MSI-X to MSI interrupts when using the cxgb4vf driver.
While here, explicitly fail requests to enable MSI on a passthrough
device if MSI-X is enabled and vice versa.
Reported by: Sony Arpita Das @ Chelsio
Reviewed by: grehan, markj
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27212
This driver provides support for Realtek PCI SD card readers. It attaches
mmc(4) bus on card insertion and detaches it on card removal. It has been
tested with RTS5209, RTS5227, RTS5229, RTS522A, RTS525A and RTL8411B. It
should also work with RTS5249, RTL8402 and RTL8411.
PR: 204521
Submitted by: Henri Hennebert (hlh at restart dot be)
Reviewed by: imp, jkim
Differential Revision: https://reviews.freebsd.org/D26435
It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.
Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.
Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.
Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.
Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207
Currently EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs currently running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before executing the VM if the
cached generation number is not the most recent value.
pmap_invalidate_ept() issues IPIs to force each vCPU to stop executing
guest instructions and reload the generation number. However, it does
not actually wait for vCPUs to exit, potentially creating a window where
guests may continue to reference stale TLB entries.
Fix the problem by bracketing guest execution with an SMR read section
which is entered before loading the invalidation generation. Then,
pmap_invalidate_ept() increments the current write sequence before
loading pm_active and sending IPIs, and polls readers to ensure that all
vCPUs potentially operating with stale TLB entries have exited before
pmap_invalidate_ept() returns.
Also ensure that unsynchronized loads of the generation counter are
wrapped with atomic(9), and stop (inconsistently) updating the
invalidation counter and pm_active bitmask with acquire semantics.
Reviewed by: grehan, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26910
This provides an OpenCrypto driver for Intel QuickAssist devices. The
driver was initially ported from NetBSD and comes with a few
improvements:
- support for GMAC/AES-GCM, AES-CTR and AES-XTS, and support for
SHA/HMAC-authenticated encryption
- support for detaching the driver
- various bug fixes
- DH895X support
Discussed with: jhb
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
Differential Revision: https://reviews.freebsd.org/D26963
The amd64 kernel handles certain types of exceptions on a dedicated
stack. Currently the sizes of these stacks are all hard-coded to
PAGE_SIZE, but for at least NMI handling it can be useful to use larger
stacks. Add constants to intr_machdep.h to make this easier to tweak.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27076
VM_ALLOC_WAITOK and vm_page_unwire_noq(), have eliminated the need for
many of the #includes.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D27052
It is almost never needed and adds an avoidable branch.
While here do minior clean ups in preparation for larger changes.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27019
Foundation copyrights, approved by emaste@. It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.
Reviewed by: emaste, imp, gbe (manpages)
Differential Revision: https://reviews.freebsd.org/D26980
Linux execve() gets audited as AUE_EXECVE as well, we should also interpret
the return from this correctly for the same reasoning as in r367002.
MFC with: r367002
Improve the code reconstructing en_tw in struct fpreg32 from FXSAVE
results so that all register states are indicated correctly. The
previous code unconditionally mapped non-empty register state to
'normalized value' constant. The new code explicitly distinguishes
the 'zero value' and 'special value' constants as well. This improves
consistency between real FSAVE and translation from FXSAVE, and
ensures that tests using PT_GETFPREGS can rely on a single correct
value independently of the underlying implementation.
PR: 250454
Sponsored by: The FreeBSD Foundation
Obtained from: Moritz Systems
Submitted by: Michał Górny <mgorny@moritz.systems>
Discussed with: emaste
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26856
Currently, this supports SHA1 and SHA2-{224,256,384,512} both as plain
hashes and in HMAC mode on both amd64 and i386. It uses the SHA
intrinsics when present similar to aesni(4), but uses SSE/AVX
instructions when they are not.
Note that some files from OpenSSL that normally wrap the assembly
routines have been adapted to export methods usable by 'struct
auth_xform' as is used by existing software crypto routines.
Reviewed by: gallatin, jkim, delphij, gnn
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26821
Rewrite the code that maintains pm_active and invalidates EPTP-tagged
TLB entries in C. Previously this work was done in vmx_enter_guest(),
in assembly, but there is no good reason for that and it makes the TLB
invalidation algorithm for nested page tables harder to review.
No functional change intended. Now, an error from the invept
instruction results in a kernel panic rather than a vmexit. Such errors
should occur only as a result of VMM bugs.
Reviewed by: grehan, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26830
The former clobbers some registers that shouldn't be touched.
Reviewed by: kib (earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26406
on some more advanced C features.
This fixes gcc-toolchain build of exception.S.
Reported and tested by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Hiding this feature behind RB_VERBOSE is gratuitous. The tunable is enough
to limit its use to only those who explicitly request it.
Suggested by: kevans
RIght now PCB_KERNFPU is used both as indication that kernel prepared
hardware FPU context to use and that the thread is fpu-kern
thread. This also breaks fpu_kern_enter(FPU_KERN_NOCTX), since
fpu_kern_leave() then clears PCB_KERNFPU.
Introduce new flag PCB_KERNFPU_THR which indicates that the thread is
fpu-kern. Do not clear PCB_KERNFPU if fpu-kern thread leaves noctx
fpu region.
Reported and tested by: jhb (amd64)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25511
From Linux sources and several datasheets I looked at, it seems that
the workaround is only needed on families 0xf and 0x10. For instance,
Ryzens do not implement the accessed MSR at all, it is documented as
reserved. Also, hypervisors should not allow guest to put CPU into
idle state, so activate workaround only when on bare hardware.
While there, style the code:
move MSR defines to specialreg.h
move identification to initcpu.c
Reported by: whu
Reviewed by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26470
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.
Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741
On the target side TLB shootdown IPI handler, prevent the compiler
from performing a forward store optimization which may mask a
subsequent update to the scoreboard by the initiator.
Reported by: Max Laier, Anton Rang
Discussed with: kib
Sponsored by: Dell EMC Isilon
This patch has the driver for 10Gigabit Ethernet controller in AMD
SoC. This driver is written compatible to the Iflib framework. The
existing driver is for the old version of hardware. The submitted
driver here is for the recent versions of the hardware where the Ethernet
controller is PCI-E based.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D25793
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no
functional change. The mechanism can be disabled with
debug.fxrng_vdso_enable=0.
arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time. Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed. On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.
This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator. That change is
left for future discussion/work.
Reviewed by: kib (previous version)
Approved by: csprng (me -- only touching FXRNG here)
Differential Revision: https://reviews.freebsd.org/D22839
Now that config(8) has supported include for 19 years, transition to
including the NOTES files. include support didn't exist at the time,
nor did the envvar stuff recently added. Now that it does, eliminate
the building of LINT files by just including everything you need.
Note: This may cause conflicts with updating in some cases.
find sys -name LINT\* -rm
is suggested across this commit to remove the generated LINT
files.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D26540
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.
Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.
Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.
Reviewed by: tsoome
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26687
It is unlikely, but possible, that an unrecognized or unsupported
relocation type is encountered while trying to load a kernel module. If
this occurs we should offer the symbol index as a hint to the user.
While here, fix some small style issues.
Reviewed by: markj, kib (amd64 part, in D26701)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
If current process is 64bit, use rex-prefixed version of XSAVE
(XSAVE64). If current process is 32bit and CPU supports saving
segment registers cs/ds in the FPU save area, use non-prefixed variant
of XSAVE.
Reported and tested by: Michał Górny <mgorny@mgorny@moritz.systems>
PR: 250043
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26643
After r354889 stack got struct nmi_pcpu at top, which makes IST top
not page-aligned. Since pmap_pti_add_kva() truncates/rounds up
addresses, it erronously entered a page mapped before double fault
stack into the pti page table.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Per the Intel manuals, CPUID is supposed to unconditionally zero the
upper 32 bits of the involved (rax/rbx/rcx/rdx) registers.
Previously, the emulation would cast pointers to the 64-bit register
values down to `uint32_t`, which while properly manipulating the lower
bits, would leave any garbage in the upper bits uncleared. While no
existing guest OSes seem to stumble over this in practice, the bhyve
emulation should match x86 expectations.
This was discovered through alignment warnings emitted by gcc9, while
testing it against SmartOS/bhyve.
SmartOS bug: https://smartos.org/bugview/OS-8168
Submitted by: Patrick Mooney
Reviewed by: rgrimes
Differential Revision: https://reviews.freebsd.org/D24727
This is mostly needed for a common arm64/amd64 iommu code.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D26587
The NOTES files have a bunch of hint lines that are removed when
generating LINT. However, we can achieve the same effect by prepending
each of the lines with 'envvar' so the NOTES files become standard
config(8) files. No functional changes as the sed script to generate
the LINT files filters these either way.
Suggested by: kevans
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).
Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.
Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.
Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129
Possible in LA57 pmap config.
Noted by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26492
- Move assertions out of the main loop to avoid duplicate conditional
expressions, and improve assertion messages.
- Fix va_next updates. In some cases we were not doing the wraparound
check before continuing the loop.
- Use the right va_next. In pmap_advise() and pmap_copy() we would step
through 1G pages 2M at a time.
- Copy 1G mappings in pmap_copy().
Reviewed by: alc, kib
MFC with: r365518
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26463
Fix the logic so it works as it appears.
Reported by: Coverity
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: D26211 (in progress, so omitting full URL)
Intercept and report #UD to VM on SVM/AMD in case VM tried to execute an
SVM instruction. Otherwise, SVM allows execution of them, and instructions
operate on host physical addresses despite being executed in guest mode.
Reported by: Maxime Villard <max@m00nbsd.net>
admbug: 972
CVE: CVE-2020-7467
Reviewed by: grehan, markj
Differential revision: https://reviews.freebsd.org/D26313
The flag requests entry of non-managed superpage mapping of size
pagesizes[psind] into the page table.
Pmap supports fake wiring of the largepage mappings. Only attributes
of the largepage mapping can be changed by calling pmap_enter(9) over
existing mapping, physical address of the page must be unchanged.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
Currently we use a single bit to indicate whether the virtual page is
part of a superpage. To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.
The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.
For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.
Reviewed by: alc, kib
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26238
This allows privileged userspace processes to find information about the
physical page backing a given mapping. It is useful in applications
such as DPDK which perform some of their own memory management.
Reviewed by: kib, jhb (previous version)
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D26237
If the call to _pmap_allocpte() is not sleepable, it is possible that
allocation of PML4 or PDP page is successful but either PDP or PD page
is not. Restructured code in _pmap_allocpte() leaves zero-referenced
page in the paging structure.
Handle it by checking refcount of the page one level above failed
alloc and free that page if its reference count is zero.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26293
We can switch into long mode directly with LA57 enabled.
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273
Since LA57 was moved to the main SDM document with revision 072, it
seems that we should have a support for it, and silicons are coming.
This patch makes pmap support both LA48 and LA57 hardware. The
selection of page table level is done at startup, kernel always
receives control from loader with 4-level paging. It is not clear how
UEFI spec would adapt LA57, for instance it could hand out control in
LA57 mode sometimes.
To switch from LA48 to LA57 requires turning off long mode, requesting
LA57 in CR4, then re-entering long mode. This is somewhat delicate
and done in pmap_bootstrap_la57(). AP startup in LA57 mode is much
easier, we only need to toggle a bit in CR4 and load right value in CR3.
I decided to not change kernel map for now. Single PML5 entry is
created that points to the existing kernel_pml4 (KML4Phys) page, and a
pml5 entry to create our recursive mapping for vtopte()/vtopde().
This decision is motivated by the fact that we cannot overcommit for
KVA, so large space there is unusable until machines start providing
wider physical memory addressing. Another reason is that I do not
want to break our fragile autotuning, so the KVA expansion is not
included into this first step. Nice side effect is that minidumps are
compatible.
On the other hand, (very) large address space is definitely
immediately useful for some userspace applications.
For userspace, numbering of pte entries (or page table pages) is
always done for 5-level structures even if we operate in 4-level mode.
The pmap_is_la57() function is added to report the mode of the
specified pmap, this is done not to allow simultaneous 4-/5-levels
(which is not allowed by hw), but to accomodate for EPT which has
separate level control and in principle might not allow 5-leve EPT
despite x86 paging supports it. Anyway, it does not seems critical to
have 5-level EPT support now.
Tested by: pho (LA48 hardware)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273
Coverity has identified the line in this change as "Potential integer
overflowing expression" due to the variable i declared as an int
and used in an expression with vm_paddr_t, a 64bit variable.
This change has very little effect as when this line is execute
nkpt is small and phys_addr is a the beginning of physical memory.
But there is no explicit protection that the above is true.
Submitted by: bret_ketchum@dell.com
Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26141
The ACPI table-mapping code used pmap_kenter_temporary() to create
mappings, which in turn uses the fixed-size crashdump map. Moreover,
the code was not verifying that the table fits in this map, so when
mapping large tables we could clobber adjacent mappings. This use of
pmap_kenter_temporary() appears to predate support in pmap_mapbios() for
creating early mappings, but that restriction no longer applies.
PR: 248746
Reviewed by: kib, mav
Tested by: gallatin, Curtis Villamizar <curtis@ipv6.occnc.com>
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26125
Those messages were printed hundreds of times during boot, often multiple
times for each table. We already print information about the tables in
more organized form once to not duplicate it when random ACPI drivers are
attaching.
MFC after: 1 week
This is a step towards facilitating jails with only Linux binaries.
Supporting emul_path adds path lookups which are completely spurious
if the binary at hand runs in a Linux-based root directory.
It defaults to on (== current behavior).
make -C /root/linux-5.3-rc8 -s -j 1 bzImage:
use_emul_path=1: 101.65s user 68.68s system 100% cpu 2:49.62 total
use_emul_path=0: 101.41s user 64.32s system 100% cpu 2:45.02 total
Recent versions of UEFI have moved local APIC timer initialization into
the early SEC phase which runs out of ROM, prior to self-relocating
into RAM. This results in a hypervisor exit.
Currently bhyve prevents instruction emulation from segments that aren't
marked as "sysmem" aka guest RAM, with the vm_gpa_hold() routine failing.
However, there is no reason for this restriction: the hypervisor already
controls whether EPT mappings are marked as executable.
Fix by dropping the redundant check of sysmem.
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D25955
so we don't ifdef for every arch in busdma_iommu.c;
o No need to include specialreg.h for x86, remove it.
Requested by: andrew
Reviewed by: kib
Sponsored by: DARPA/AFRL
Differential Revision: https://reviews.freebsd.org/D25957
For purposes of handling hardware error reported via NMIs I need a way to
escape NMI context, being too restrictive to do something significant.
To do it this change introduces new swi_sched() flag SWI_FROMNMI, making
it careful about used KPIs. On platforms allowing IPI sending from NMI
context (x86 for now) it immediately wakes clk_intr_event via new IPI_SWI,
otherwise it works just like SWI_DELAY. To handle the delayed SWIs this
patch calls clk_intr_event on every hardclock() tick.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D25754
Being able to use tmpfs without kernel modules is very useful when building
small MFS_ROOT kernels without a real file system.
Including TMPFS also matches arm/GENERIC and the MIPS std.MALTA configs.
Compiling TMPFS only adds 4 .c files so this should not make much of a
difference to NO_MODULES build times (as we do for our minimal RISC-V
images).
Reviewed By: br (earlier version for riscv), brooks, emaste
Differential Revision: https://reviews.freebsd.org/D25317
The only part of nmi_handle_intr() depending on ISA is isa_nmi(), which is
already wrapped. Entering debugger on NMI does not really depend on ISA.
MFC after: 2 weeks
Limit manipulations to use %rax as scratch to the pti portion of the
syscall entry code.
Submitted by: alc
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25722
When pmap operates in PTI mode, we must reload %cr3 on return to
userspace. In non-PCID mode the reload always flushes all non-global
TLB entries and we take advantage of it by only invalidating the KPT
TLB entries (there is no cached UPT entries at all).
In PCID mode, we flush both KPT and UPT TLB explicitly, but we can
take advantage of the fact that PCID mode command to reload %cr3
includes a flag to flush/not flush target TLB. In particular, we can
avoid the flush for UPT, instead record that load of pc_ucr3 into %cr3
on return to usermode should be flushing. This is done by providing
either all-1s or ~CR3_PCID_MASK in pc_ucr3_load_mask. The mask is
automatically reset to all-1s on return to usermode.
Similarly, we can avoid flushing UPT TLB on context switch, replacing
it by setting pc_ucr3_load_mask. This unifies INVPCID and non-INVPCID
PTI ifunc, leaving only 4 cases instead of 6. This trick is also
applicable both to the TLB shootdown IPI handlers, since handlers
interrupt the target thread.
But then we need to check pc_curpmap in handlers, and this would
reopen the same race for INVPCID machines as was fixed in r306350 for
non-INVPCID. To not introduce the same bug, unconditionally do
spinlock_enter() in pmap_activate().
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D25483
Subsequent to r240317, kmem_free() was replaced with kva_free() (r254025).
kva_free() releases the KVA allocation for the mapped region, but no longer
clears the pmap (pagetable) entries.
An affected pmap_unmapdev operation would leave the still-pmap'd VA space
free for allocation by other KVA consumers. However, this bug easily
avoided notice for ~7 years because most devices (1) never call
pmap_unmapdev and (2) on amd64, mostly fit within the DMAP and do not need
KVA allocations. Other affected arch are less popular: i386, MIPS, and
PowerPC. Arm64, arm32, and riscv are not affected.
Reported by: Don Morris <dgmorris AT earthlink.net>
Submitted by: Don Morris (amd64 part)
Reviewed by: kib, markj, Don (!amd64 parts)
MFC after: I don't intend to, but you might want to
Sponsored by: Dell Isilon
Differential Revision: https://reviews.freebsd.org/D25689
This removes SCTP from in-tree kernel configuration files. Now, SCTP
can be enabled by simply loading the module, as discussed on
freebsd-net@.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25611
This shortens fdalloc by over 60 bytes. Correctness verified by running both
variants at the same time and comparing the result of each call.
Note someone(tm) should make a pass at converting everything else feasible.
Stop using smp_ipi_mtx to protect global shootdown state, and
move/multiply the global state into pcpu. Now each CPU can initiate
shootdown IPI independently from other CPUs. Initiator enters
critical section, then fills its local PCPU shootdown info
(pc_smp_tlb_XXX), then clears scoreboard generation at location (cpu,
my_cpuid) for each target cpu. After that IPI is sent to all targets
which scan for zeroed scoreboard generation words. Upon finding such
word the shootdown data is read from corresponding cpu' pcpu, and
generation is set. Meantime initiator loops waiting for all zeroed
generations in scoreboard to update.
Initiator does not disable interrupts, which should allow
non-invalidation IPIs from deadlocking, it only needs to disable
preemption to pin itself to the instance of the pcpu smp_tlb data.
The generation is set before the actual invalidation is performed in
handler. It is safe because target CPU cannot return to userspace
before handler finishes. In principle only NMI can preempt the
handler, but NMI would see the kernel handler frame and not touch
not-invalidated user page table.
Handlers loop until they do not see zeroed scoreboard generations.
This, together with hardware keeping one pending IPI in LAPIC IRR
should prevent lost shootdowns.
Notes.
1. The code does protect writes to LAPIC ICR with exclusion. I believe
this is fine because we in fact do not send IPIs from interrupt
handlers. More for !x2APIC mode where ICR access for write requires
two registers write, we disable interrupts around it. If considered
incorrect, I can add per-cpu spinlock around ipi_send().
2. Scoreboard lines owned by given target CPU can be padded to the
cache line, to reduce ping-pong.
Reviewed by: markj (previous version)
Discussed with: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D25510
On architectures that use RELA relocations it is safe to rerun the ifunc
resolvers on after all CPUs have started, but while they are sill parked.
On arm64 with big.LITTLE this is needed as some SoCs have shipped with
different ID register values the big and little clusters meaning we were
unable to rely on the register values from the boot CPU.
Add support for rerunning the resolvers on arm64 and amd64 as these are
both RELA using architectures.
Reviewed by: kib
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D25455
Like other types of allocation, fpu_kern_ctx are frequently allocated per-cpu.
Provide the API and sketch some example consumers.
fpu_kern_alloc_ctx_domain() preferentially allocates memory from the
provided domain, and falls back to other domains if that one is empty
(DOMAINSET_PREF(domain) policy).
Maybe it makes more sense to just shove one of these in the DPCPU area
sooner or later -- left for future work.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22053
Take advantage of Warner's nice new real GEOM aliasing system and use it for
aliased partition names that actually work.
Our canonical EBR partition name is the weird, not-default-on-x86-prior-to-
this-revision "da1p4+00001234." However, if compatibility mode (tunable
kern.geom.part.ebr.compat_aliases) is enabled (1, default), we continue to
provide the alias names like "da1p5" in addition to the weird canonical
names.
Naming partition providers was just one aspect of the COMPAT knob; in
addition it limited mutability, in part because it did not preserve existing
EBR header content aside from that of LBA 0. This change saves the EBR
header for LBA 0, as well as for every EBR partition encountered. That way,
when we write out the EBR partition table on modification, we can restore
any bootloader or other metadata in both LBA0 (the first data-containing EBR
may start after 0) as well as every logical EBR we read from the disk, and
only update the geometry metadata and linked list pointers that describe the
actual partitioning.
(This change does not add support for the 'bootcode' verb to EBR.)
PR: 232463
Reported by: Manish Jain <bourne.identity AT hotmail.com>
Discussed with: ae (no objection)
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D24939
This effectively mirrors our libc implementation, but with minor fudging --
name needs to be copied in from userspace, so we just copy it straight into
stack-allocated memfd_name into the correct position rather than allocating
memory that needs to be cleaned up.
The sealing-related fcntl(2) commands, F_GET_SEALS and F_ADD_SEALS, have
also been implemented now that we support them.
Note that this implementation is still not quite at feature parity w.r.t.
the actual Linux version; some caveats, from my foggy memory:
- Need to implement SHM_GROW_ON_WRITE, default for memfd (in progress)
- LTP wants the memfd name exposed to fdescfs
- Linux allows open() of an fdescfs fd with O_TRUNC to truncate after dup.
(?)
Interested parties can install and run LTP from ports (devel/linux-ltp) to
confirm any fixes.
PR: 240874
Reviewed by: kib, trasz
Differential Revision: https://reviews.freebsd.org/D21845
in vanilla Linux git tree.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25385
If userspace has a newer bhyve than the kernel, it may be able to decode
and emulate some instructions vmm.ko is unaware of. In this scenario,
reset decoder state and try again.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24464
It turns out relocating the symbol table itself can cause issues, like fbt
crashing because it applies the offsets to the kernel twice.
This had been previously brought up in rS333447 when the stoffs hack was
added, but I had been unaware of this and reimplemented symtab relocation.
Instead of relocating the symbol table, keep track of the relocation base
in ddb, so the ddb symbols behave like the kernel linker-provided symbols.
This is intended to be NFC on platforms other than PowerPC, which do not
use fully relocatable kernels. (The relbase will always be 0)
* Remove the rest of the stoffs hack.
* Remove my half-baked displace_symbol_table() function.
* Extend ddb initialization to cope with having a relocation offset on the
kernel symbol table.
* Fix my kernel-as-initrd hack to work with booke64 by using a temporary
mapping to access the data.
* Fix another instance of __powerpc__ that is actually RELOCATABLE_KERNEL.
* Change the behavior or X_db_symbol_values to apply the relocation base
when updating valp, to match link_elf_symbol_values() behavior.
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D25223
FreeBSD madvise(2) directly. While some of the flag values match,
most don't.
PR: kern/230160
Reported by: markj
Reviewed by: markj
Discussed with: brooks, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25272
The Intel Instruction Set Reference says this about the XSAVE instruction:
Use of a destination operand not aligned to 64-byte boundary
(in either 64-bit or 32-bit modes) results in a general-protection
(#GP) exception.
This alignment happens naturally when all malloc buckets are powers
of two. However, this change is necessary on some systems when
certain non-power-of-two (and non-multiple of 64) malloc buckets
are defined.
Reviewed by: cem; kib; earlier version by jhb
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D25098
In particular, uma_zcreate creates sysctl oids, which locks an sx lock,
which uses IPIs under contention. IPIs tend not to work very well
when interrupts are disabled. Who knew, right?
Reviewed by: cem kib
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D25098
Right now code first flushes all local TLB entries that needs to be
flushed, then signals IPI to remote cores, and then waits for
acknowledgements while spinning idle. In the VMWare article 'Don’t
shoot down TLB shootdowns!' it was noted that the time spent spinning
is lost, and can be more usefully used doing local TLB invalidation.
We could use the same invalidation handler for local TLB as for
remote, but typically for pmap == curpmap we can use INVLPG for locals
instead of INVPCID on remotes, since we cannot control context
switches on them. Due to that, keep the local code and provide the
callbacks to be called from smp_targeted_tlb_shootdown() after IPIs
are fired but before spin wait starts.
Reviewed by: alc, cem, markj, Anton Rang <rang at acm.org>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D25188
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames.
Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001
The ice(4) driver is the driver for the Intel E8xx series Ethernet
controllers; currently with codenames Columbiaville and
Columbia Park.
These new controllers support 100G speeds, as well as introducing
more queues, better virtualization support, and more offload
capabilities. Future work will enable virtual functions (like
in ixl(4)) and the other functionality outlined above.
For full functionality, the kernel should be compiled with
"device ice_ddp" like in the amd64 NOTES file, and/or
ice_ddp_load="YES" should be added to /boot/loader.conf so that
the DDP package file included in this commit can be downloaded
to the adapter. Otherwise, the adapter will fall back to a single
queue mode with limited functionality.
A man page for this driver will be forthcoming.
MFC after: 1 month
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21959
There was an off-by-one in the GDT descriptor size field used by the
early Xen boot code. The GDT descriptor size should be the size of the
GDT minus one. No functional change expected as a result of this
change.
Sponsored by: Citrix Systems R&D
This reapplies logical r360944 and r360946 (reverting r360955), with fixed
copystr() stand-in replacement macro. Eventually the goal is to convert
consumers and kill the macro, but for a first step it helps if the macro is
correct.
Prior commit message:
Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults. It's just
an older incarnation of the now-more-common strlcpy().
Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.
Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy (with correction from brooks@ -- thanks).
Remove N redundant MI implementations of copystr. For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.
Reviewed by: jhb (earlier version)
Discussed with: brooks (thanks!)
Differential Revision: https://reviews.freebsd.org/D24672
The flush is needed to prevent cross-process ret2spec, which is not handled
on kernel entry if IBPB is enabled but SMEP is present.
While there, add i386 RSB flush.
Reported by: Anthony Steinhauser <asteinhauser@google.com>
Reviewed by: markj, Anthony Steinhauser
Discussed with: philip
admbugs: 961
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Expose the special kernel LAPIC, IOAPIC, and HPET devices to userspace
for use in, e.g., fallback instruction emulation (when userspace has a
newer instruction decode/emulation layer than the kernel vmm(4)).
Plumb the ioctl through libvmmapi and register the memory ranges in
bhyve(8).
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24525
In recent Linux (5.3+) and OpenBSD (6.6+) kernels, and with hosts that
support CPUID 0x15, the local APIC frequency is determined directly
from the reported crystal clock to avoid calibration against the 8254
timer.
However, the local APIC frequency implemented by bhyve is 128MHz, where
most h/w systems report frequencies around 25MHz. This shows up on
OpenBSD guests as repeated keystrokes on the emulated PS2 keyboard
when using VNC, since the kernel's timers are now much shorter.
Fix by reporting all-zeroes for CPUID 0x15. This allows guests to fall
back to using the 8254 to calibrate the local APIC frequency.
Future work could be to compute values returned for 0x15 that would
match the host TSC and bhyve local APIC frequency, though all dependencies
on this would need to be examined (for example, Linux will start using
0x16 for some hosts).
PR: 246321
Reported by: Jason Tubnor (and tested)
Reviewed by: jhb
Approved by: jhb, bz (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D24837
This function is responsible for setting pc_domain in each pcpu
structure. Call it from the main function that starts APs, rather than
a separate SYSINIT. This makes it easier to close the window where
UMA's per-CPU slab allocator may be called while pc_domain is
uninitialized. In particular, the allocator uses pc_domain to allocate
domain-local pages, so allocations before this point end up using domain
0 for everything.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24757
A fictitious page can have a physical address beyond the end of the RAM.
In the NUMA case there is some special code to handle such pages, but in
the other case the pages are handled the same as normal pages. So, we
cannot assert that the physical address is within RAM addresses.
Suggested by: kib
Reviewed by: kib
X-MFC note: NUMA support has not been MFC-ed
Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults. It's just
an older incarnation of the now-more-common strlcpy().
Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.
Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy.
Remove N redundant MI implementations of copystr. For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D24672
Otherwise the initial call to set_top_of_stack(), which occurs before
fpuinit() sets the correct value for cpu_max_ext_state_size, leaves the
stack base at an incorrect location. Then, when the full area is
zeroed, we end up erroneously zeroing part of the following page.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24754
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed. In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken). A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.
To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.
While the current implementation is useful for several uses cases, it
has a few limitations. The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system). In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions. The file format also does not currently support
versioning of individual chunks of state. As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files. The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility. As a result, the current implementation is not enabled
by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.
Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes: yes
Sponsored by: University Politehnica of Bucharest
Sponsored by: Matthew Grooms (student scholarships)
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D19495
The comment referenced a non-existent function, and these minidump
implementations already buffer discontiguous physical data pages by
mapping them into a single VA range that gets passed to the dump device,
so there is no real advantage in batching calls to blk_write().
The RISC-V and MIPS minidump implementations still write a page at a
time and so would benefit from some form of batching.
MFC after: 2 weeks
Sponsored by: Juniper Networks, Klara Inc.
As a short term solution for the problem reported by Shawn Webb re: r359950,
bump the maximum number of memmaps per VM. This structure is 40 bytes, and the
additional four (fixed array embedded in the struct vm) members increase the
size of struct vm by 3%.
(The vast majority of struct vm is the embedded struct vcpu array, which
accounts for 84% of the size -- over 4 kB.)
Reported by: Shawn Webb <shawn.webb AT hardenedbsd.org>
Reviewed by: grehan
X-MFC-With: r359950
Differential Revision: https://reviews.freebsd.org/D24507
The static assertions were added (with size and offsets from gdb) and verified
with a build prior to marking the holes explicitly.
This is in preparation for a subsequent revision, pending in phabricator, that
makes use of some of these unused bits without impacting the ABI.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24461
Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.
Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446
Permit instruction decoding logic to be compiled outside of the kernel for
rapid iteration and validation.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24439
Copy the CP, PTRIN, etc macros from freebsd32.h into a sys/abi_compat.h
and replace existing definitation with includes where required. This
eliminates duplicate code and allows Linux and FreeBSD compatability
headers to be included in the same files.
Input from: cem, jhb
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24275
Add QUEUE_MACRO_DEBUG_TRACE and QUEUE_MACRO_DEBUG_TRASH as proper kernel
options. While here, alpha-sort the debug section of sys/conf/options.
Enable QUEUE_MACRO_DEBUG_TRASH in amd64 GENERIC (but not GENERIC-NODEBUG)
kernels. It is similar in nature and cost to other use-after-free pointer
trashing we do in GENERIC. It is probably reasonable to enable in any arch
GENERIC kernel that defines INVARIANTS.
Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.
PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837
The goal of this change is to make the atomic_load_acq_{8,16},
atomic_testandset{,_acq}_long, and atomic_testandclear_long primitives
available in MI-namespace.
The second goal is to get this draft out of my local tree, as anything that
requires a full tinderbox is a big burden out of tree. MD specifics can be
refined individually afterwards.
The generic implementations may not be ideal for your architecture; feel
free to implement better versions. If no subword_atomic definitions are
needed, the include can be removed from your arch's machine/atomic.h.
Generic definitions are guarded by defined macros of the same name. To
avoid picking up conflicting generic definitions, some macro defines are
added to various MD machine/atomic.h to register an existing implementation.
Include _atomic_subword.h in arm and arm64 machine/atomic.h.
For some odd reason, KCSAN only generates some versions of primitives.
Generate the _acq variants of atomic_load.*_8, atomic_load.*_16, and
atomic_testandset.*_long. There are other questionably disabled primitives,
but I didn't run into them, so I left them alone. KCSAN is only built for
amd64 in tinderbox for now.
Add atomic_subword implementations of atomic_load_acq_{8,16} implemented
using masking and atomic_load_acq_32.
Add generic atomic_subword implementations of atomic_testandset_long(),
atomic_testandclear_long(), and atomic_testandset_acq_long(), using
atomic_fcmpset_long() and atomic_fcmpset_acq_long().
On x86, add atomic_testandset_acq_long as an alias for
atomic_testandset_long.
Reviewed by: kevans, rlibby (previous versions both)
Differential Revision: https://reviews.freebsd.org/D22963
When I implemented MD DYNAMIC parsing, I was originally passing a
linker_file_t so that the MD code could relocate pointers.
However, it turns out this isn't even filled in until later, so it was
always 0.
Just pass the load base (ef->address) directly, as that's really the only
thing we were interested in in the first place.
This fixes a crash on RB800 where it was trying to write to an unmapped
address when updating the GOT.
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D24105
This speeds up Windows guests tremendously.
The patch does:
Add a new tuneable 'hw.vmm.vmx.use_tpr_shadowing' to disable TLP shadowing.
Also add 'hw.vmm.vmx.cap.tpr_shadowing' to be able to query if TPR shadowing is used.
Detach the initialization of TPR shadowing from the initialization of APIC virtualization.
APIC virtualization still needs TPR shadowing, but not vice versa.
Any CPU that supports APIC virtualization should also support TPR shadowing.
When TPR shadowing is used, the APIC page of each vCPU is written to the VMCS_VIRTUAL_APIC field of the VMCS
so that the CPU can write directly to the page without intercept.
On vm exit, vlapic_update_ppr() is called to update the PPR.
Submitted by: Yamagi Burmeister
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22942
Port aacraid driver to big-endian (BE) hosts.
The immediate goal of this change is to make it possible to use the
aacraid driver on PowerPC64 machines that have Adaptec Series 8 SAS
controllers.
Adapters supported by this driver expect FIB contents in little-endian
(LE) byte order. All FIBs have a fixed header part as well as a data
part that depends on the command being issued to the controller.
In this way, on BE hosts, the FIB header and all FIB data structures
used in aacraid.c and aacraid_cam.c need to be converted to LE before
being sent to the adapter and converted to BE when coming from it.
The functions to convert each struct are on aacraid_endian.c.
For little-endian (LE) targets, they are macros that expand
to nothing.
In some cases, when only a few fields of a large structure are used,
the fields are converted inline, by the code using them.
PR: 237463
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D23887
Following previous revision, apply the same minor optimization to
hand-rolled atomic_fcmpset_128 in pmap.c.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23870
Previously the pattern to extract status flags from inline assembly
blocks was to use setcc in the block to write the flag to a register.
This was suboptimal in a few ways:
- It would lead to code like: sete %cl; test %cl; jne, i.e. a flag
would just be loaded into a register and then reloaded to a flag.
- The setcc would force the block to use an additional register.
- If the client code didn't care for the flag value then the setcc
would be entirely pointless but could not be eliminated by the
optimizer.
A more modern inline asm construct (since gcc 6 and clang 9) allows for
"flag output operands", where a C variable can be written directly from
a flag. The optimizer can then use this to produce direct code where
the flag does not take a trip through a register.
In practice this makes each affected operation sequence shorter by five
bytes of instructions. It's unlikely this has a measurable performance
impact.
Reviewed by: kib, markj, mjg
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23869
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
When turning IBRS mitigation using sysctl, as opposed to loader tunable,
send IPI to tweak MSR on all cores. Right now code only performed MSR write
onr the CPU where sysctl was run.
Properly report hw.ibrs_active for IBRS_ALL. Split hw_ibrs_ibpb_active out
from ibrs_active, to keep the current semantic of guiding kernel entry and
exit handlers.
Reported and tested by: mav
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in reclaim_pv_chunk_domain(), when we switch to a new target pmap from which
we are trying to reclaim a pv chunk, always update the current PTE bitmasks
to match.
Reviewed by: kib, markj
Approved by: imp (mentor)
Sponsored by: Netflix
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked). Use it in
preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23625
X-Generally looks fine: jhb
Returned value has type based on the argument, meaning consumers no longer
have to cast in the commmon case.
This commit keeps the kernel compilable without patching the rest.
this replaces the following near the syscall exit:
cmp $0x39,%rax
ja 0xffffffff8108f82c
movabs $0x200001800060005,%rcx
bt %rax,%rcx
jae 0xffffffff8108f82c
with:
test %edi,%edi
jne 0xffffffff8091a49c
This reverts r177661. The change is no longer very useful since
out-of-tree KLDs will be built to target SMP kernels anyway. Moveover
it breaks the KBI in !SMP builds since cpuset_t's layout depends on the
value of MAXCPU, and several kernel interfaces, notably
smp_rendezvous_cpus(), take a cpuset_t as a parameter.
PR: 243711
Reviewed by: jhb, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23512
Submitted by: Bora Özarslan <borako.ozarslan@gmail.com>
Submitted by: Yang Wang <2333@outlook.jp>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19917
messages for some of the unimplemented syscalls, in particular
the AIO-related ones.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23231
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.
Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.
Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.
Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355
Borrow the trick from memset and memmove and use the scale/index/base addressing
to avoid branches.
If a mismatch is found, the routine has to calculate the difference. Make sure
there is always up to 8 bytes to inspect. This replaces the previous loop which
would operate over up to 16 bytes with an unrolled list of 8 tests.
Speed varies a lot, but this is a net win over the previous routine with probably
a lot more to gain.
Validated with glibc test suite.
The Linux32 system call argument fetcher places each argument (passed in
registers in the Linux x86 system call convention) into an entry in the
generic system call args array. Each member of this array is 8 bytes
wide, so this approach is broken for system calls that take off_t
arguments.
Fix the problem by splitting l_loff_t arguments in the 32-bit system
call descriptions, the same as we do for FreeBSD32. Change entry points
to handle this using the PAIR32TO64 macro.
Move linux_ftruncate64() into compat/linux.
PR: 243155
Reported by: Alex S <iwtcex@gmail.com>
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23210
As a new x86 CPU vendor, Chengdu Haiguang IC Design Co., Ltd (Hygon)
is a joint venture between AMD and Haiguang Information Technology Co.,
Ltd., aims at providing x86 processors for China server market.
The first generation Hygon processor(Dhyana) shares most architecture
with AMD's family 17h, but with different CPU vendor ID("HygonGenuine")
and PCI vendor ID(0x1d94) and family series number 18h(Hygon negotiated
with AMD to confirm that only Hygon use family 18h).
To enable Hygon Dhyana support in FreeBSD, add new definitions
HYGON_VENDOR_ID("HygonGenuine") and X86_VENDOR_HYGON(0x1d94) to identify
Hygon Dhyana CPU.
Initialize the CPU features(topology, local APIC ext, MSI, TSC, hwpstate,
MCA, DEBUG_CTL, etc) for amd64 and i386 mode by sharing the code path of
AMD family 17h.
The changes have been applied on FreeBSD 13.0-CURRENT and tested
successfully on Hygon Dhyana processor.
References:
[1] Linux kernel patches for Hygon Dhyana, merged in 4.20:
https://git.kernel.org/tip/c9661c1e80b609cd038db7c908e061f0535804ef
[2] MSR and CPUID definition:
https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
Submitted by: Pu Wen <puwen@hygon.cn>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23163
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.
Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.
Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.
Reviewed by: brooks, imp
Nice!: emaste
Differential Revision: https://reviews.freebsd.org/D23197
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.
Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.
Reviewed by: brooks
Discussed with: sjg
Differential Revision: https://reviews.freebsd.org/D23099
vm.kvm_size and vm.kvm_free are read only and marked as MPSAFE on i386
already. Mark them as that on amd64 and arm64 too to avoid locking Giant.
Reviewed by: kib (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D23039
mapping to the old read-only page with a mapping to the new read-write page.
To destroy the old mapping, pmap_enter() must destroy its page table and PV
entries and invalidate its TLB entry. This change simply invalidates that
TLB entry a little earlier, specifically, on amd64 and arm64, before the PV
list lock is held.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D23027
syscall is to query the CPU number and the NUMA domain the calling
thread is currently running on. The third argument is ignored.
It doesn't do anything regarding scheduling - it's literally
just a way to query the current state, without any guarantees
you won't get rescheduled an opcode later.
This unbreaks Java from CentOS 8
(java-11-openjdk-11.0.5.10-0.el8_0.x86_64).
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22972
copy_file_range(2) is implemented natively since r350315, make it available
for Linux binaries too.
Reviewed by: kib (mentor), trasz (previous version)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D22959
incrementing (and decrementing) the ref_count on kernel page table pages.
They should not do this. Kernel page table pages are expected to have a
fixed ref_count. Address this problem by refactoring pmap_alloc{_l2,pde}()
and their callers. This also eliminates some duplicated code from the
callers.
Correctly implement PMAP_ENTER_NOREPLACE in pmap_enter_{l2,pde}() on kernel
mappings.
Reduce code duplication by defining a function, pmap_abort_ptp(), for
handling a common error case.
Handle a possible page table page leak in pmap_copy(). Suppose that we are
determining whether to copy a superpage mapping. If we abort because there
is already a mapping in the destination pmap at the current address, then
simply decrementing the page table page's ref_count is correct, because the
page table page must have a ref_count > 1. However, if we abort because we
failed to allocate a PV entry, this might be a just allocated page table
page that has a ref_count = 1, so we should call pmap_abort_ptp().
Simplify error handling in pmap_enter_quick_locked().
Reviewed by: kib, markj (an earlier)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22763
than "/compat/linux". Useful when you have several compat directories
with different Linux versions and you don't want to clash with files
installed by linux-c7 packages.
Reviewed by: bcr (manpages)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22574
over the usual fsync(2).
This silences some warnings when running "apt-get upgrade".
Reviewed by: brooks, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22371
- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().
- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.
- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.
Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.
- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.
Reviewed by: markj (earlier version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D20309
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.
This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).
Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695
... after the initial common TSS is copied into its final location
during PCPU reallocation.
Reported by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Use the power of variable to avoid spelling out source and generated
files too many times. The previous Makefiles were hard to read, hard to
edit, and badly formatted.
Reviewed by: kevans, emaste
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22714
They're in both the old and new places in HEAD for the moment for
discussion and transition. The old locations will be garbage collected
in 4 weeks. MFCs to 12 an 11 will keep the old and new for transition
purposes.
Reviewed by: kib
MFC after: 4 weeks
Sponsored by: Intel
Differential Revision: https://reviews.freebsd.org/D22590
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.
Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.
Reviewed by: alc, kib, markj
Differential Revision: https://reviews.freebsd.org/D22611
No need to log all the commands in command ring but only the last one for which completion failed.
Reported by: np@freebsd.org
Reviewed by: np, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22566
Update the NetBSD Kernel Concurrency Sanitizer (KCSAN) runtime to work in
the FreeBSD kernel. It is a useful tool for finding data races between
threads executing on different CPUs.
This can be enabled by enabling KCSAN in the kernel config, or by using the
GENERIC-KCSAN amd64 kernel. It works on amd64 and arm64, however the later
needs a compiler change to allow -fsanitize=thread that KCSAN uses.
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D22315
Typical reasons for doublefault faults are either kernel stack
overflow or bugs in the code that manipulates protection CPU state.
The later code is the code which often has to set up gsbase for
kernel. Switching to explicit load of GSBASE MSR in the fault handler
makes it more probable to output a useful information.
Now all IST handlers have nmi_pcpu structure on top of their stacks.
It would be even more useful to save gsbase value at the moment of the
fault. I did not this because I do not want to modify PCB layout now.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
flua is bootstrapped as part of the build for those on older
versions/revisions that don't yet have flua installed. Once upgraded past
r354833, "make sysent" will again naturally work as expected.
Reviewed by: brooks
Differential Revision: https://reviews.freebsd.org/D21894
The purpose of this option is to make it easier to track down memory
corruption bugs by reducing the number of malloc(9) types that might
have recently been associated with a given chunk of memory. However, it
increases fragmentation and is disabled in release kernels.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This CVE has already been announced in FreeBSD SA-19:26.mcu.
Mitigation for TAA involves either turning off TSX or turning on the
VERW mitigation used for MDS. Some CPUs will also be self-mitigating
for TAA and require no software workaround.
Control knobs are:
machdep.mitigations.taa.enable:
0 - no software mitigation is enabled
1 - attempt to disable TSX
2 - use the VERW mitigation
3 - automatically select the mitigation based on processor
features.
machdep.mitigations.taa.state:
inactive - no mitigation is active/enabled
TSX disable - TSX is disabled in the bare metal CPU as well as
- any virtualized CPUs
VERW - VERW instruction clears CPU buffers
not vulnerable - The CPU has identified itself as not being
vulnerable
Nothing in the base FreeBSD system uses TSX. However, the instructions
are straight-forward to add to custom applications and require no kernel
support, so the mitigation is provided for users with untrusted
applications and tenants.
Reviewed by: emaste, imp, kib, scottph
Sponsored by: Intel
Differential Revision: 22374
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.
Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355
This driver allows to usage of the paravirt SCSI controller
in VMware products like ESXi. The pvscsi driver provides a
substantial performance improvement in block devices versus
the emulated mpt and mps SCSI/SAS controllers.
Error handling in this driver has not been extensively tested
yet.
Submitted by: vbhakta@vmware.com
Relnotes: yes
Sponsored by: VMware, Panzura
Differential Revision: D18613
from usermode.
If CPU supports RDFSBASE, the flag also means that userspace fsbase
and gsbase are already written into pcb, which might be not true when
we handle #gp from kernel.
The offender is rdmsr_safe(), and the visible result is corrupted
userspace TLS base.
Reported by: pstef
Sponsored by: The FreeBSD Foundation
MFC after: 3 days