There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
- Use ptoa() instead of the archaic ctob().
- Use pagezero() to zero a PDP page.
- Remove PA_MIN_ADDRESS, orphaned by r351742.
- Remove unneeded parens and an unnecessary control flow statement.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21495
from recent Ubuntu versions. Without it they segfault on startup.
Reviewed by: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20687
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose. One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.
To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.
Reviewed by: kib
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21491
The sysctl is called vm.pmap.kernel_maps. It dumps address ranges
and their corresponding protection and mapping mode, as well as
counts of 2MB and 1GB pages in the range.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21380
The bhyve virtual local APIC uses an instance-global flag to indicate
when an error LVT is being delivered to prevent infinite recursion.
Use a function argument instead to reduce the amount of instance-global
state.
This was inspired by reviewing the bhyve save/restore work, which
saves a copy of the instance-global state for each vlapic.
Smart OS bug: https://smartos.org/bugview/OS-7777
Submitted by: Patrick Mooney
Reviewed by: markj, rgrimes
Obtained from: SmartOS / Joyent
Differential Revision: https://reviews.freebsd.org/D20365
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source. Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
- LK macro (conditional on SMP for the lock prefix) is unused
- SETLK unnecessarily performs xchg. obtained value is never used and the
implicit lock prefix adds avoidable cost. Barrier provided by it does
not appear to be of any use.
- the lock waited for is almost never blocked, yet the loop starts with
a pause. Move it out of the common case.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19563
Use DOMAINSET_PREF() instead of DOMAINSET_FIXED(), to gracefully
fallback in case of memory-less domain.
Reported and tested by: bcran
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
existing one.
Allocation failure is possible for instance when cpu domain has no memory.
Reported and tested by: bcran
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Move pcpu KVA out of .bss into dynamically allocated VA at
pmap_bootstrap(). This avoids demoting superpage mapping .data/.bss.
Also it makes possible to use pmap_qenter() for installation of
domain-local pcpu page on NUMA configs.
Refactor pcpu and IST initialization by moving it to helper functions.
Reviewed by: markj
Tested by: pho
Discussed with: jeff
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21320
All these stacks are used only once (doublefault, boot) or very rare
(mce).
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21320
NUMA domain that the pages describe. Patch original from gallatin.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21252
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
Such speculations could use user-controlled %gs base, esp. since
FreeBSD supports WRGSBASE instructions.
Place LFENCEs on entry for each basic block after the test for
previous kernel/user mode on the kernel entry, which prevents the
speculation. Code accesses %gs-based PCPU before any serialization
instructions are executed, like %cr3 reload for KPTI.
With pti disabled, on haswell i7-4770S machine, "syscall_timings getppid"
shows when no lfence is added to syscall path:
test loop time iterations periteration
getppid 0 1.040918865 4643611 0.000000224
getppid 1 1.004985962 4481816 0.000000224
getppid 2 1.005196483 4482363 0.000000224
with lfence:
getppid 0 1.043701091 4554779 0.000000229
getppid 1 1.016930328 4438094 0.000000229
getppid 2 1.023223117 4466640 0.000000229
and ministat reports 'No difference proven at 95.0% confidence.'
Security: CVE-2019-1125
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
PTI-mode entry points were coded to set up the environment identical
to non-PTI entry and then fall-through to non-PTI handlers, mostly.
This has the drawback of requiring two more SWAPGS, first to access
PCPU, and then to return to the state expected by the non-PTI entry
point.
Eliminate the duplication by doing more in entry stubs both for PTI
and non-PTI, and adjusting the common code to expect that SWAPGS and
some minimal registers saving is done by entries.
Some less often used entries, in particular, #GP, #NP, and #SS, which
can fault on doreti, are left as is because there are basically four
variants of entrance, and they are not performance-critical,
esp. comparing with e.g. #PF or interrupts.
Reviewed by: markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
physical destination mode.
This is mostly a nop, because the vmm initializes all vCPUs up to
vm_maxcpus, so even if the target CPU is not active, lapic/vlapic code
still has the valid data to use. As John notes, dropping such
interrupts more closely matches the real harware, which ignores all
interrupts for not started APs.
Reviewed by: jhb
admbugs: 837
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Previously, AcpiOsMemory was using pmap_mapbios which would always map
the requested address Write-Back (WB). For several AMD Ryzen laptops,
the BIOS uses AcpiOsMemory to directly access the PCI MCFG region in
order to access PCI config registers. This has the side effect of
remapping the MCFG region in the direct map as WB instead of UC
hanging the laptops during boot.
On the one laptop I examined in detail, the _PIC global method used to
switch from 8259A PICs to I/O APICs uses a pair of PCI config space
registers at offset 0x84 in the device at 0:0:0 to as a pair of
address/data registers to access an indirect register in the chipset
and clear a single bit to switch modes.
To fix, alter the semantics of pmap_mapbios() such that it does not
modify the attributes of any existing mappings and instead uses the
existing attributes. If a new mapping is created, this new mapping
uses WB (the default memory attribute).
Special thanks to the gentleman whose name I don't have who brought
two affected laptops to the hacker lounge at BSDCan. Direct access to
the affected systems permitted finding the root cause within an hour
or so.
PR: 231760, 236899
Reviewed by: kib, alc
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20327
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.
Reviewed by: jhb, kib, markj, Patrick Mooney
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21036
This effectively makes the stack base on the csu _start entry
randomized.
The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.
Only amd64 for now.
Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081
mapping and then destroy one of the 4 KB page mappings so that there is a
potential trigger for repromotion. Currently, we destroy the first 4 KB
page mapping that falls within the (current) superpage mapping or the
virtual address range [sva, eva). However, I have found empirically that
destroying the last 4 KB mapping produces slightly better results,
specifically, more promotions and fewer failed promotion attempts.
Accordingly, this revision changes pmap_advise() to destroy the last 4 KB
page mapping. It also replaces some nearby uses of boolean_t with bool.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21115
It is assembled using "${CC} -x assembler-with-cpp", which by convention
(bsd.suffixes.mk) uses the .asm extension.
This is a portion of the review referenced below (D18344). That review
also renamed linux_support.s to .S, but that is a functional change
(using the compiler's integrated assembler instead of as) and will be
revisited separately.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18344
The current implementation of gzipped a.out support was based
on a very old version of InfoZIP which ships with an ancient
modified version of zlib, and was removed from the GENERIC
kernel in 1999 when we moved to an ELF world.
PR: 205822
Reviewed by: imp, kib, emaste, Yoshihiro Ota <ota at j.email.ne.jp>
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D21099
if a demotion succeeds, then all of the 4KB page mappings within the
superpage-sized region must be valid, so there is no point in testing the
validity of the 4KB page mapping that is going to be write protected.
Deindent the nearby code.
Reviewed by: kib, markj
Tested by: pho (amd64, i386)
X-MFC after: r350004 (this change depends on arm64 dirty bit emulation)
Differential Revision: https://reviews.freebsd.org/D21027
Use 'struct bintime' instead of 'sbintime_t' to manage times in vPIT
to postpone rounding to final results rather than intermediate
results. In tests performed by Joyent, this reduced the error measured
by Linux guests by 59 ppm.
Smart OS bug: https://smartos.org/bugview/OS-6923
Submitted by: Patrick Mooney
Reviewed by: rgrimes
Obtained from: SmartOS / Joyent
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20335
syscallret() doesn't use error anymore. Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D2090
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely. Otherwise, rogue userspace livelocks the
kernel.
To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparision or store failed, instead of relying
on the oldval == *oldvalp comparison. The primitive no longer retries
the operation if it failed spuriously. Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.
On x86, despite cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.
Reviewed by: andrew (arm64, previous version), markj
Tested by: pho
Reported by: https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20772
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.
PR: 234134
MFC after: 1 week
Differential Revision: http://reviews.freebsd.org/D20924
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
in some cases (strace -f man id > /dev/null).
Reviewed by: dchagin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20691
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18774
simple change to the control flow. Replace an unnecessary test by a
KASSERT. Add a comment explaining an obscure test.
Reviewed by: kib, markj
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D20812
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
Discussed with: imp, jhb
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
when, in fact, we are write protecting the page and the PTE has PG_M set.
However, pmap_protect_pde() was always calling vm_page_dirty() when the PDE
has PG_M set. So, adding PG_NX to a writeable PDE could result in
unnecessary (but harmless) calls to vm_page_dirty().
Simplify the loop calling vm_page_dirty() in pmap_protect_pde().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20793
This adds emulation for:
test r/m16, imm16
test r/m32, imm32
test r/m64, imm32 sign-extended to 64
OpenBSD guests compiled with clang 8.0.0 use TEST directly against a
Local APIC register instead of separate read via MOV followed by a
TEST against the register.
PR: 238794
Submitted by: jhb
Reported by: Jason Tubnor jason@tubnor.net
Tested by: Jason Tubnor jason@tubnor.net
Reviewed by: markj, Patrick Mooney patrick.mooney@joyent.com
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20755
Use it to indicate whether the page may be safely freed following
its removal from the object. Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.
This is a step towards an implementation of an atomic reference counter
for each physical page structure.
Reviewed by: alc, dougm, kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20758
When pmap_pkru_on_remove() is called, the sva argument value was
advanced. Clear PKRU earlier when sva still specifies the start of
the region.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 3 days