This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.
While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.
In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25237
More endian conversion.
* Install TCEs correctly (i.e. in big endian)
* Convert to big endian and back when setting up queue pages and IRQs.
Sponsored by: Tag1 Consulting, Inc.
Since OPAL runs in big endian, any data being passed back and forth
via memory instead of registers needs to be byteswapped.
From my notes during development:
"A good way to find candidates is to look for vtophys() in opal_call()
parameters. The memory being passed will be written into in BE."
Sponsored by: Tag1 Consulting, Inc.
When running without a hypervisor, we need to set the ILE bit in the LPCR
ourselves.
For the boot processor, handle it in powernv_attach() like we do for other
LPCR bits.
No change for the APs, as they will use the lpcr global to set up their own
LPCR when they do their own cpudep_ap_early_bootstrap() and pick up this
automatically.
Sponsored by: Tag1 Consulting, Inc.
OPAL runs in big endian, so we need to rfid into it to switch endian
atomically when branching to it, and we need to do the
RETURN_TO_NATIVE_ENDIAN dance when it returns to us.
Sponsored by: Tag1 Consulting, Inc.
The TCG implementation of mtmsrd in qemu blindly copies the entire register
to the MSR, instead of the specific bit positions listed in the ISA.
This means that qemu will prematurely switch endian out from under the
running code instead of waiting for the rfid, causing an immediate trap
as it attempts to interpret the next instruction in the wrong endianness.
To work around this, ensure PSL_LE is still set before doing the mtmsrd.
In the future, we may wish to just turn off translation and unconditionally
use rfid to switch to the ofmsr instead of quasi-switching to the ofmsr.
Add a new platform option so this can be disabled. (And so that we can
conditonalize additional QEMU-specific hacks in the platform code.)
Sponsored by: Tag1 Consulting, Inc.
There were multiple bugs in the OPAL RTC code which had never been
discovered, as the default configuration of OPAL machines is to
have the BMC / FSP control the RTC.
* Fix calling convention for setting the time -- the variables are passed
directly in CPU registers, not via memory.
* Fix bug in the bcd encoding routines. (from jhibbits)
Tested on POWER9 Talos II (BE) and POWER9 Blackbird (LE).
Reviewed by: jhibbits (in irc)
Sponsored by: Tag1 Consulting, Inc.
When emulating a single thread system for testing reasons, mp_maxid can
be 0. This trips up our math for calculating the order.
Account for this to fix xive attachment when emulating a single-thread
core on qemu powernv (a configuration that doesn't exist in the real world.)
Sponsored by: Tag1 Consulting, Inc.
This fixes spurious "XIVE[ IC 00 ] ISN 1 lead to invalid IVE !" messages
generated by OPAL when running with the debug level cranked up.
Discussed with jhibbits.
Sponsored by: Tag1 Consulting, Inc.
* Only read the DPCPU pointer once per xive_dispatch call.
* Optimize HE decoding for the common cases.
Reported by: jhibbits (in irc)
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D25545
vmem quantum cache is only needed when doing a lot of concurrent allocations,
which doesn't happen when allocating MSIs. This wastes memory for the cache
zones. Avoid this waste and don't use the quantum cache.
Reported by: markj
If the POWER firmware detects a bad CPU core, it will "GUARD" it out,
marking it disabled. Any attempt to spin up a bad CPU will trigger a panic
later on when waiting for threads on said core to wake up. Support limping
along on fewer cores instead.
If NUMA is not enabled in the kernel config, or is disabled at boot, this
function should just return domain 0 regardless of what's in the device
tree.
Fixes a panic in iflib with NUMA disabled.
Reported by: luporl
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
On some POWER8 machines, 'ibm,associativity' property may have 6
cells, which would overflow the 5 cells buffer being used.
There was also an issue with the "check if node is root" part,
that have been fixed too.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D23414
Summary:
Consolidate the NUMA associativity handling into a platform function.
Non-NUMA platforms will just fall back to the default (0). Currently
only implemented for powernv, which uses a lookup table to map the
device tree associativity into a system NUMA domain.
Fixes hangs on powernv after r356534, and corrects a fairly longstanding
bug in powernv's NUMA handling, which ended up using domains 1 and 2 for
devices and memory on power9, while CPUs were bound to domains 0 and 1.
Reviewed by: bdragon, luporl
Differential Revision: https://reviews.freebsd.org/D23220
It may be possible to make this completely lock free, but for now it's using
a statically allocated bounce buffer in the softc, so it needs to be
guarded.
In IRC, sfs_ finally managed to get a good trace of a kernel panic that was
happening when attempting to use webengine.
As it turns out, we were using vtophys() from interrupt context on an idle
thread in opal_hmi_handler2().
Since this involves locking the kernel pmap on PPC64 at the moment, this
ended up tripping a KASSERT in mtx_lock(), which then caused a parallel
panic stampede.
So, avoid this by preallocating the flags variable and storing it in PCPU.
Fixes "panic: mtx_lock() by idle thread 0x... on sleep mutex kernelpmap".
Differential Revision: https://reviews.freebsd.org/D22962
The Nest MMU manages address translation for accelerators on the POWER9. To
do so, it needs a page table, so export the system page table to the Nest
MMU. This will quietly fail on pre-POWER9 systems that do not have a NMMU.
The NMMU is currently unused, so this change is currently effectively a NOP,
but the NMMU and VAS will eventually be used.
This change makes it possible to use OPAL console as a GDB debug port.
Similar to uart and uart_phyp debug ports, it has to be enabled by
setting the hw.uart.dbgport variable to the serial console node
of the device tree.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22649
According to the OPAL documentation, only the POWER8 (PHB3) should use
the register write TCE reset method. All others should use the OPAL
call.
On POWER9 the call is semantically identical to the register write, with
a wait for completion.
Add a very basic NVRAM driver for OPAL which can be used by the IBM
powerpc-utils nvram utility, not to be confused with the base nvram utility,
which only operates on powermac_nvram.
The IBM utility handles all partitions itself, treating the nvram device as
a plain store.
An alternative would be to manage partitions in the kernel, and augment the
base nvram utility to deal with different backing stores, but that
complicates the driver significantly. Instead, present the same interface
IBM's utlity expects, and we get the usage for free.
Tested by: bdragon
Freeze clearing needs to heppen any time OPAL reads return either an error
(except OPAL_HARDWARE), AND any time it returns 0xff for all bytes.
For cfgwrite, any error that's not OPAL_HARDWARE should be cleaned up.
Only clear an EEH freeze if an error occurs. However, if an OPAL_HARDWARE
error is returned, this indicates a hardware failure which cannot be
unfrozen, and instead needs a hardware reset. Attempting to unfreeze a
broken PCH will result in console spam for each attempt. To avoid the spam,
just don't do it.
When an HMI occurs a message event also gets created with the details of the
exception. Hook into the messaging framework to retrieve the HMI message.
Nothing is done with it yet, except to panic on unhandled exception.
vmem_xalloc() cannot be called while holding a nonblocking mutex, warned
by WITNESS. The lock may not be necessary in general, but it avoids
superfluous concurrent OPAL calls for the same sensor.
Reported by: pkubaj
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
It's possible for a Hypervisor Maintenance Interrupt (HMI) to occur while in
the pmap code, holding locks. This can cause WITNESS to panic due to lock
errors in calling pmap_kextract(). Since we don't yet handle the flags
returned by OPAL_HANDLE_HMI2, just stop using it, so that we don't call into
pmap_kextract().
Reported by: pkubaj
This way its children can attach earlier if needed, and some subsystems are
attached earlier, like the asynchronous token management.
MFC after: 2 weeks
Since writes don't necessarily need to be on erase-block boundaries, we can
relax the block size and alignments down to sector size. If it needs to be
erased, opalflash_erase() will check proper alignment and size.
If the OPAL flash driver supports writing without erase, it adds a
'no-erase' property to the flash device node. Honor that property and don't
bother erasing if it exists.
Summary:
Initial NUMA support:
- associate CPU with domain
- associate memory ranges with domain
- identify domain for devices
- limit device interrupt binding to appropriate domain
- Additionally fixes a bug in the setting of Maxmem which led to
only memory attached to the first socket being enabled for DMA
A pmap variant can opt in to numa support by by calling `numa_mem_regions`
at the end of pmap_bootstrap - registering the corresponding ranges with the
VM.
This yields a ~20% improvement in build times of llvm on dual socket POWER9
over non-NUMA.
Original patch by mmacy.
Differential Revision: https://reviews.freebsd.org/D17933
* The BIO bio_data may not be page aligned. Only the base address of each
page worth of data is extracted to pass to OPAL. Without page alignment
it can scribble over random memory when finishing the page read. Fix this
by short-reading the first page to properly align for full page reads.
* Fix the definition of OPAL_FLASH_ERASE.
* Properly handle the async message result, as now returned from r345974.
* Properly return the full opal_msg from an async completion.
* Don't keep bugging OPAL, wait 100us or so. With some minor changes to
DELAY() to drop to very low priority, the thread won't hog the CPU while
polling for the async completion.
Since OPAL_GET_MSG does not discriminate between message types, asynchronous
completion events may be received in the OPAL_GET_MSG call, which dequeues
them from the list, thus preventing OPAL_CHECK_ASYNC_COMPLETION from
succeeding. Handle this case by integrating with the messaging framework.
Summary:
OPAL needs to be kicked periodically in order for the firmware to make
progress on its tasks. To do so, create a heartbeat thread to perform this task
every N milliseconds, defined by the device tree. This task is also a central
location to handle all messages received from OPAL.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D19743
Summary:
With a sufficiently large TOC, it's possible to index out of range, as
the immediate load instructions only permit 16-bit indices, allowing up
to 64kB range (signed) from the base pointer. Allow +/- 2GB range, with
the medium code model TOC accesses in asm.
Patch originally by Brandon Bergren. The issue appears to impact ELFv2
more than ELFv1.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D19708
Attempting to build www/firefox on POWER9 resulted in a HMI exception being
thrown, a fatal trap currently. This is typically caused by timer facility
errors, but examination of the Hypervisor Maintenance Exception Register
(HMER) yielded only that an exception had recovered, with no information of
the actual exception cause.
When an HMI occurs, OPAL_HANDLE_HMI or OPAL_HANDLE_HMI2 must be called to
handle the exception at the firmware level. If the exception is handled, we
can continue.
This adds only the preliminary handler, enough to prevent package building
from panicking. An enhancement in the future is to use the flags returned
by OPAL_HANDLE_HMI2 to print more useful error messages, and log maintenance
events.
Reviewed by: luporl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19634
Firmware needed by petitboot, for example, GPU firmware, can be installed to
a partition in the flash filesystem. This driver exposes the full flash
given by the device tree, letting the user manage firmware, etc, from
FreeBSD.
To use the partitions provided by the flash module, the fdt_slicer module is
needed, but the module isn't needed for raw access, so there's no direct
dependency link in here.
MFC after: 2 weeks
The OPAL firmware only supports a finite number of in-flight asynchronous
operations. Rather than have each subsystem try to manage its own, use a
central management service to hand out tokens.
More work can be done to improve asynchronous behavior, such as funneling
things through a future OPAL heartbeat handler, but capabilities will be
added as needed.
Augment the existing consumers (i2c and sensors) to use this new API.
MFC after: 4 weeks
The XIVE (External Interrupt Virtualization Engine) is a new interrupt
controller present in IBM's POWER9 processor. It's a very powerful,
very complex device using queues and shared memory to improve interrupt
dispatch performance in a virtualized environment.
This yields a ~10% performance improvment over the XICS emulation mode,
measured in both buildworld, and 'dd' from nvme to /dev/null.
Currently, this only supports native access.
MFC after: 1 month
The XICS and XIVE need extra data beyond irq and vector. Rather than
performing a separate search, it's better for the general interrupt facility
to hold a private pointer, since the search already must be done anyway at
that level.
In r342771, I introduced a regression in Power by abusing the platform
smp_topo() method as a shortcut for providing the MI information needed for
the stated sysctls. The smp_topo() method was already called later by
sched_ule (under the name cpu_topo()), and initializes a static array of
scheduler topology information. I had skimmed the smp_topo_foo() functions
and assumed they were idempotent; empirically, they are not (or at least,
detect re-initialization and panic).
Do the cleaner thing I should have done in the first place and add a
platform method specifically for core- and thread-count probing.
Reported by: luporl via jhibbits
Reviewed by: luporl
X-MFC-With: r342771
Differential Revision: https://reviews.freebsd.org/D18777
With new sysctls (to the best of our ability do detect them). Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.
Reviewed by: markj, jhibbits (ppc parts)
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18322
It seems this tag is causing problems on POWER9 systems. Since no POWER9 user
has encountered the problem fixed by r339589 just restrict it to POWER8 for now.
A better fix will likely be to update powerpc/busdma_machdep.c to handle the
window correctly.
Reported by: mmacy, others
Further investigation of issues with 32-bit DMA on PowerNV revealed that
its window is hardcoded by OPAL (at least in skiboot version 5.4.9) and
cannot be changed by the OS.
Thus, now jhb suggestion of limiting the range in PCI DMA tag seems
the best way to deal with it.
Reviewed by: jhibbits, nwhitehorn, sbruno
Approved by: jhibbits(mentor)
Differential Revision: https://reviews.freebsd.org/D17601