The calculation of Maxmem was skipping the last phys_avail segment,
because of a wrong stop condition.
This was detected when using QEMU/PowerNV with Radix MMU and low
memory (2G). In this case opal_pci would allocate a DMA window that
was too small to cover all physical memory, resulting in reading all
zeroes from disk when using memory that was not inside the allocated
window.
Reviewed by: jhibbits
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D33449
MFC after: 2 weeks
When using QEMU PowerNV with latest op-build release (v2.7), its
kexec transfers control to FreeBSD kernel in BE mode, causing an
instant exception on LE kernels. Make kboot able to detect and
swap endian to fix this.
Reviewed by: imp
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D33104
When constructing the set of dumpable pages, use the bitset provided by
the state argument, rather than assuming vm_page_dump invariably. For
normal kernel minidumps this will be a pointer to vm_page_dump, but when
dumping the live system it will not.
To do this, the functions in vm_dumpset.h are extended to accept the
desired bitset as an argument. Note that this provided bitset is assumed
to be derived from vm_page_dump, and therefore has the same size.
Reviewed by: kib, markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31992
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).
Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
setup in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:
- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler
cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().
Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
Reviewed by: alfredo, kib, markj
Triage help from: andrew, markj
Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision: https://reviews.freebsd.org/D32797
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986
As Radix MMU with superpages enabled is now stable, make it the
default choice on supported hardware (POWER9 and above), since its
performance is greater than that of HPT MMU.
Reviewed by: alfredo, jhibbits
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D30797
Current implementation of Radix MMU doesn't support mapping
arbitrary virtual addresses, such as the ones generated by
"direct mapping" I/O addresses. This caused the system to hang, when
early I/O addresses, such as those used by OpenFirmware Frame Buffer,
were remapped after the MMU was up.
To avoid having to modify mmu_radix_kenter_attr just to support this
use case, this change makes early I/O map use virtual addresses from
KVA area instead (similar to what mmu_radix_mapdev_attr does), as
these can be safely remapped later.
Reviewed by: alfredo (earlier version), jhibbits (in irc)
MFC after: 2 weeks
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D31232
The function is identical in each minidump implementation, so move it to
vm_phys.c. The only slight exception is powerpc where the function was
public, for use in moea64_scan_pmap().
Reviewed by: kib, markj, imp (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31884
When running in a virtualized environment, TLB invalidations can only
be performed on process scope, as only the hypervisor is allowed to
invalidate a global scope, or else a Program Interrupt is triggered.
Since we are here, also make sure that the register process table
hypercall returns success.
Reviewed by: jhibbits
MFC after: 2 weeks
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D31775
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).
Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830
ISA 3.0 allows for nested radix translations with minimal to no
involvement of the hypervisor. This should make pseries signficantly
faster on POWER9 pseries instances, as fewer hypercalls are needed to
manage pmap now.
MFC after: 2 weeks
Relnotes: yes
amd64 and 32-bit ARM already had assertions to this effect. Add them to
other pmaps.
Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31171
Invalidate the last page of a demoted superpage mapping, instead of the
first page, as it results in slightly more promotions and fewer
failures. While here, replace 'boolean_t's with 'bool's in
mmu_radix_advise().
Simplify pmap_clear_modify() a bit, by assuming that since the superpage
demotion succeeded, all 4k mappings from it are valid. Deindent the
surrounding code, as there are no 'else' branches in the code anyway.
Summary:
Some methods are split between DMAP and non-DMAP, conditional on
hw_direct_map variable. Rather than checking this variable every time,
use it to install different functions via IFUNCs.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D30071
Summary:
Since PCPU can live in a GPR for a while longer, let it, rather than
re-getting it in yet another register. MFSPR is an expensive operation,
12 clock latency on POWER9, so the fewer operations we need, the better.
Since the check is tightly coupled to the fetch, by reducing the number
of fetch+check, we reduce the stalls, and improve the performance
marginally. Buildworld was measured at a ~5-7% improvement on a single
run.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D30003
Radix MMU code was missing TLB invalidations when some Level 3 PDEs were
modified. This caused TLB multi-hit machine check interrupts when
superpages were enabled.
Reviewed by: jhibbits
MFC after: 2 weeks
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D29511
This only works on single-CPU G4 systems, and more work is needed for
dual-CPU systems. That said, platform sleep does not work, and this is
currently only used for PMU-based CPU speed change.
The elimination of the platform_smp_timebase_sync() call is so that the
timebase sync rendezvous can be enhanced to perform better
synchronization, which requires a full rendezvous. This would be
impossible to do on this single-threaded run.
Rename cpu_sleep() to mpc745x_sleep() to denote what it's actually
intended for. This function is very G4-specific, and will not work on
any other CPU. This will afterward eliminate a
platform_smp_timebase_sync() call by directly updating the timebase
instead.
--Eliminate a big ifdef that encompassed all currently-supported
architectures except mips and powerpc32. This applied to the case
in which we've allocated a superpage but the pager-populated range
is insufficient for a superpage mapping. For platforms that don't
support superpages the check should be inexpensive as we shouldn't
get a superpage in the first place. Make the normal-page fallback
logic identical for all platforms and provide a simple implementation
of pmap_ps_enabled() for MIPS and Book-E/AIM32 powerpc.
--Apply the logic for handling pmap_enter() failure if a superpage
mapping can't be supported due to additional protection policy.
Use KERN_PROTECTION_FAILURE instead of KERN_FAILURE for this case,
and note Intel PKU on amd64 as the first example of such protection
policy.
Reviewed by: kib, markj, bdragon
Differential Revision: https://reviews.freebsd.org/D29439
Now that superpages for HPT MMU has landed, finish implementation of
pmap_mincore by adding support for superpages.
Submitted by: Fernando Eckhardt Valle <fernando.valle@eldorado.org.br>
Reviewed by: bdragon, luporl
MFC after: 1 week
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D29230
PowerISA 2.07B says that the low-order p-12 bits of the real page number
contained in ARPN and LP fields of a PTE must be 0s and are ignored
by the hardware (Book III-S, 5.7.7.1), where 2^p is the actual page size
in bytes, but we were clearing only the LP field.
This worked on bare metal and QEMU with KVM, that ignore these bits,
but caused a kernel panic on QEMU with TCG, that expects them to be
cleared.
This fixes running FreeBSD with HPT superpages enabled on QEMU
with TCG.
MFC after: 2 weeks
Sponsored by: Eldorado Research Institute (eldorado.org.br)
In r361544, the pmap drivers were converted to ifuncs. When doing so,
this changed the call type of pmap functions to be called via the
secure-plt stubs.
These stubs depend on the TOC base being loaded to r30 to run properly.
On SMP AIM (i.e. a dual processor G4 or running 32-bit on G5), since the
APs were being started up from the reset vector instead of going
through __start, they had never had r30 initialized properly, so when the
cpu_reset code in trap_subr32.S attempted to branch to
pmap_cpu_bootstrap(), it was loading the target from the wrong location.
Ensure r30 is set up directly in the cpu_reset trap code, so we can make
PLT calls as normal.
Fixes boot on my SMP G4.
Reviewed by: jhibbits
MFC after: 3 days
Sponsored by: Tag1 Consulting, Inc.
This macro returns true if a provided virtual address is contained
in the kernel's clean submap.
In CHERI kernels, the buffer cache and transient I/O map are allocated
as separate regions. Abstracting this check reduces the diff relative
to FreeBSD. It is perhaps slightly more readable as well.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D28710
In 78599c32ef, CFI endproc decoration was
added to locore64.S. However, it missed the subtle detail that
__restartkernel_virtual() falls through to __restartkernel(). This was
causing boot failure on PowerMac G5, as it tried to execute the
epilogue as code.
Fix this by branching to __restartkernel() instead of intentionally
running off the end of the function.
While here, add some additional notes on how the virtual mode restart
works.
MFC after: 3 days
Follow-up to r353959 and r368070: do the same for other architectures.
arm32 already seems to use its own .fnstart/.fnend directives, which
appear to be ARM-specific variants of the same thing. Likewise, MIPS
uses .frame directives.
Reviewed by: arichardson
Differential Revision: https://reviews.freebsd.org/D27387
It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.
Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.
Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.
Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.
Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207
After r367417, both mmu_oea64 and mmu_radix were defining the vm.pmap
sysctl node, resulting in the later definition hiding the properties of
the previous one. Avoid this issue by defining vm.pmap in a common
source file and declaring it where needed.
This change also standardizes the tunable name used to enable superpages
and change its default to disabled on radix MMU, because it still has some
issues with superpages.
Reviewed by: bdragon, jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D27156
There were many, many endianness fixes needed for Radix MMU. The Radix
pagetable is stored in BE (as it is read and written to by the MMU hw),
so we need to convert back and forth every time we interact with it when
running in LE.
With these changes, I can successfully boot with radix enabled on POWER9 hw.
Reviewed by: luporl, jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D27181
The HPT is always stored in big-endian, as it is accessed directly by the
hardware as well as the kernel. As such, it is necessary to convert values
to and from native endian when running on LE.
Some unconverted accesses snuck in accidentally with r367417.
Apply the appropriate conversions to fix boot hanging on powerpc64le.
Sponsored by: Tag1 Consulting, Inc.
Fix build errors introduced by r367417 and r367390:
- Guard label reached only by powerpc64
- Guard vm_reserv_level_iffullpop call, that is not defined on powerpc
variants that don't support superpages
- Add missing hwpmc file, for when hwpmc is built into kernel
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.
While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.
In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25237
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.
Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741
OPAL unconditionally enters secondary CPUs with only HV and SF set.
I tried writing a secondary entry point instead, but OPAL rejected it
and I am unsure why, so I resorted to making the system reset interrupt
endian-flexible.
This means we take a slight performance hit on wakeup on LE, but it is
a good stopgap until we can figure out a reliable way to make OPAL enter
where we want it to.
It probably makes sense to have it around anyway, because I can imagine
scenarios where the cpu resets itself to BE and does a software reset.
Sponsored by: Tag1 Consulting, Inc.
It's much easier to implement this in an endian-independent way when we
don't also have to worry about masking half of the dword off.
Given that this code ran on a machine that ran a poudriere bulk with no
kernel oddities, I am relatively certain it is correctly implemented. ;)
This should be a minor performance boost on BE as well.
Sponsored by: Tag1 Consulting, Inc.
For a body of code that had its endian conversion bits written blind without
the ability to test, moea64 was VERY close to being correct.
There were only four instances where the existing code was getting it wrong.
Sponsored by: Tag1 Consulting, Inc.