This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
Summary:
moea64_pte_sync_native() and moea64_pte_unset_native() don't need the
full PTE created; they only need to check that the PVO's PTE matches the
PTE in the page table. Don't waste time creating the full PTE in this
case.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D22341
Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similarly to the
existing code. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.
On Book-E the page array is set up at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, saving up to one TLB miss per
array access.
Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449
ENOENT is a leftover from mmu_oea.c's moea_pvo_enter(), where it's used to
syncicache() on the first new mapping of a page. This sync is done
differently in OEA64.
Fix wrong section ordering that was causing a ".got is not contiguous with
other relro sections" lld error. This also brings ldscript.powerpc and
ldscript.powerpcspe closer to ldscript.powerpc64.
Also, remove unnecessary text relocs from the ppc32 AIM trap code.
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D22349
PowerISA 3.0 eliminated the 64-bit bridge mode which allowed 32-bit kernels
to run on 64-bit AIM/Book-S hardware. Therefore only a 64-bit kernel can
run on this hardware, and since 64-bit native always has the direct map,
there is no need to guard it.
In some scenarios, the 4K trapstk may overflow, corrupting tmpstk.
This was observed during remote debugging, with the following steps:
At remote host (R):
- enter kdb during boot
- switch to gdb backend
At local host (L):
- attach gdb to R
- try to read an invalid memory position
At R:
- a DSI trap occurs and kdb restarts (all this occurs on trapstk)
- while printing the stacktrace, trapstk overflows and corrupts tmpstk
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22200
moea_pvo_remove() might remove the last mapping of a page, in which case
it is clearly no longer writeable. This can happen via pmap_remove(),
or when a CoW fault removes the last mapping of the old page.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22044
The VM_PAGE_OBJECT_BUSY_ASSERT() in the pmap_enter() implementation should
only be asserted when the code is executed as a result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and the assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
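In the shared backend this reduces to roughly:

    if ((flags & PMAP_ENTER_QUICK_LOCKED) == 0)
            VM_PAGE_OBJECT_BUSY_ASSERT(m);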
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22041
callers hold it.
This simplifies pmap code and removes a dependency on the object lock.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21596
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
Based on the POWER9BSD implementation, with all POWER9-specific code
removed and new methods added to the PPC64 MMU interface, to isolate
platform-specific code. Currently, the new methods are implemented on
pseries and PowerNV
(D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551
As pointed out by mjg, without the parentheses the calculations done against
these macros are incorrect, resulting in only 1/3 of locks being used.
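For illustration, with hypothetical macro names, the failure mode is:

    #define PV_LOCK_COUNT   PA_LOCK_COUNT * 3       /* buggy: no parens */
    /* pa_index(pa) % PV_LOCK_COUNT then expands to
     * (pa_index(pa) % PA_LOCK_COUNT) * 3, so only every third lock in
     * the pool is ever selected. The fix: */
    #define PV_LOCK_COUNT   (PA_LOCK_COUNT * 3)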
Reported by: mjg
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
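The pmap_extract_and_hold() pattern becomes, roughly:

    m = PHYS_TO_VM_PAGE(pa);
    if (m != NULL && !vm_page_wire_mapped(m))
            m = NULL;       /* raced with unmapping; fall back to vm_fault() */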
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
We only call alloc_pvo_entry() with M_WAITOK from one location. However,
this can be called while holding nonsleepable locks. Rather than passing
M_WAITOK down, use vm_wait() and loop.
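A minimal sketch of the loop, assuming alloc_pvo_entry() takes a flags
argument whose zero value means a nonsleeping allocation:

    for (;;) {
            pvo = alloc_pvo_entry(0);       /* may fail under pressure */
            if (pvo != NULL)
                    break;
            vm_wait(NULL);                  /* sleep until pages are freed */
    }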
Summary:
MOEA64_PTE_REPLACE() is called often with the pmap lock held, and
sometimes with the page pv lock held. The less work done while holding
a lock, the better. Since we are intending to replace the same PTE
(same hash index), we don't need to recalculate anything, just flat
replace the PTE. This cuts more than 200 instructions off the
invalidating code path. In addition, we don't need to replace a PTE
that's not occupied by this PVO.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21515
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
Summary:
Although it's convenient to reuse the pvo_plist for deletion, RB_TREE
insertion and removal is not free, and can result in a lot of extra work
to rebalance the tree. Instead, use a SLIST as a LIFO delete queue,
which gives us almost free insertion, deletion, and traversal.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21061
Added allocation retry loop in alloc_pvo_entry(), to wait for
memory to become available if the caller specifies the M_WAITOK flag.
Also, the loop in moea64_enter() was removed, as moea64_pvo_enter()
never returns ENOMEM. It is the memory allocation in alloc_pvo_entry()
that can fail and must be retried.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21035
Summary:
It turns out statistics accounting is very expensive in the pmap driver,
and doesn't seem necessary in the common case. Make this optional
behind a MOEA64_STATS #define, which can be set if statistics are
really needed.
This saves ~7-8% on buildworld time on a POWER9.
Found by bdragon.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D20903
oldpvo is never explicitly NULL'd by moea64_pvo_enter(), so don't key any
behavior off a NULL check; check only the error return.
PR: 239372
Reported by: Francis Little
Summary:
Instead of searching for a PVO entry before adding, take advantage of
the fact that RB_INSERT() returns NULL when it inserts, and the existing
entry, without inserting a new one, when one already exists. This saves
an extra tree
traversal in the cases where the PVO does not exist.
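The insertion-side pattern, roughly (tree and head names assumed):

    old = RB_INSERT(pvo_tree, &pmap->pmap_pvo, pvo);
    if (old != NULL)
            return (EEXIST);        /* entry existed; nothing was inserted */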
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D20944
The only consumer of moea64_pvo_remove_from_page_locked() already has the
page in hand, so there is no need to search for the page while holding the
lock. Drop the wrapper, and rename _moea64_pvo_remove_from_page_locked().
Reported by: alc
Summary:
Since the 'page pv' lock is one of the most highly contended locks, we
need to try to do as much work outside of the lock as we can. The
moea64_pvo_remove_from_page() path is low-hanging fruit, where we can
do some heavy work (PHYS_TO_VM_PAGE()) outside of the lock if needed.
In one path, moea64_remove_all(), the PV lock is already held and can't
be swizzled, so we provide two ways to perform the locked operation: one
that can call PHYS_TO_VM_PAGE outside the lock, and one for callers that
already hold the lock.
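Sketched with hypothetical names (pvo_paddr() standing in for extracting
the PVO's physical address), the unlocked entry point becomes:

    pg = PHYS_TO_VM_PAGE(pvo_paddr(pvo));   /* heavy work, lock not held */
    PV_LOCK(pvo_paddr(pvo));
    moea64_pvo_remove_from_page_locked(mmu, pvo, pg);
    PV_UNLOCK(pvo_paddr(pvo));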
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D20694
Summary:
If an illegal instruction is encountered in a process running on a
powerpc64 kernel, the kernel attempts to sync the cache before retrying
the instruction "just in case". However, since curpmap is not set, when
moea64_sync_icache() attempts to lock the pmap, it locks a NULL pointer,
triggering a panic. Fix this by adding an (assumed unnecessary) fallback
to curthread's pmap in moea64_sync_icache().
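The fallback amounts to roughly:

    pm = PCPU_GET(curpmap);
    if (__predict_false(pm == NULL))
            pm = vmspace_pmap(curthread->td_proc->p_vmspace);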
Reported by: alfredo.junior_eldorado.org.br
Reviewed by: luporl, alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20911
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
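A sketch of the caller-side pattern after the change (exact arguments
vary by consumer):

    n = vm_fault_quick_hold_pages(map, va, len, VM_PROT_READ, ma, count);
    if (n < 0)
            return (EFAULT);
    /* ... operate on the wired pages ... */
    for (i = 0; i < n; i++)
            vm_page_unwire(ma[i], PQ_ACTIVE);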
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
Although PPC SLB code doesn't handle allocation failures,
which are rare, in most places it asserts that the pointer
returned by uma_zalloc() is not NULL, making it easier to
identify the failure and avoiding an invalid pointer dereference.
This change simply adds a missing KASSERT in SLB code.
r348783 changed the behavior of the kernel mappings and broke booting on G5.
- Split the kernel mapping logic out so that the case where we are
running from the wrong memory space is handled using identity
mappings, and the case where we are not using a DMAP is handled by
forcibly mapping the kernel into the dmap range as intended by
r348783.
Reported by: Mikael Urankar
Reviewed by: luporl
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D20608
This set of changes make it possible to run FreeBSD for PowerPC64/pseries,
under QEMU/KVM, without requiring the host to make hugepages available to the
guest.
While there was already this possibility, by means of setting
hw_direct_map to 0, on PowerPC64 a couple of issues/wrong assumptions
prevented this from working before this changelist.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20522
Summary:
moea64_insert_pteg_native()'s invalidation only works by happenstance.
The purpose of the shifts and XORs is to extract the VSID in order to
reverse-engineer the lower bits of the VPN. Currently a segment size is 256MB
(2**28), and ADDR_API_SHFT64 is 16, so ADDR_PIDX_SHIFT is equivalent. However,
it's semantically incorrect, in that we don't want to shift by the page shift
size, we want to shift to get to the VSID.
Tested by: bdragon
Differential Revision: https://reviews.freebsd.org/D20467
It was found during building llvm that the page pv lock pool was seeing very
high contention. Since the pmap is already NUMA aware, it was surmised that
the domains were referencing similar pages in the different domains. This
reduces contention to the point of noise in a lockstat(8) run (~51% down to
under 5%), reducing build times by up to 20%.
This doesn't do a perfect domain alignment, just a best-guess based on
hardware available, that the domain is roughly specified in the upper bits
of the PA. Trying to be more clever would more than likely cost more in
added work than it would gain in performance.
MFC after: 2 weeks
Since we now have a much larger KVA on powerpc64, it's possible to get SLB
traps earlier in boot, possibly even before the HIOR is properly configured
for us. Move the HIOR setup to immediately after reset, so that we use our
exception handlers instead of Open Firmware's.
PR: 233863
Submitted by: Mark Millard (partial)
Reported by: Mark Millard
MFC after: 2 weeks
The POWER8NVL (POWER8 NVLink) architecturally behaves identically to the
POWER8, with a different PVR identifier. Mark it as such, so it shows up
appropriately to the user.
Reported by: Alexey Kardashevskiy
MFC after: 2 weeks
mtmsr and mtsr require context synchronizing instructions to follow. Without
a CSI, there's a chance for a machine check exception. This reportedly does
occur on an MPC750 (PowerMac G3).
Reported by: Mark Millard
Summary:
Initial NUMA support:
- associate CPU with domain
- associate memory ranges with domain
- identify domain for devices
- limit device interrupt binding to appropriate domain
- Additionally fixes a bug in the setting of Maxmem which led to
only memory attached to the first socket being enabled for DMA
A pmap variant can opt in to NUMA support by calling numa_mem_regions()
at the end of pmap_bootstrap(), registering the corresponding ranges with
the VM.
This yields a ~20% improvement in build times of llvm on dual socket POWER9
over non-NUMA.
Original patch by mmacy.
Differential Revision: https://reviews.freebsd.org/D17933
Summary:
With a sufficiently large TOC, it's possible to index out of range, as
the immediate load instructions only permit 16-bit indices, allowing up
to a 64kB range (signed) from the base pointer. Allow a +/- 2GB range by
using medium code model TOC accesses in asm.
Patch originally by Brandon Bergren. The issue appears to impact ELFv2
more than ELFv1.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D19708
* Cache moea64_need_lock in a local variable; gcc generates slightly better
code this way, as it doesn't need to reload the value from memory on each
read.
* VPN cropping is only needed on PowerPC ISA 2.02 and older cores, a subset
of those that need serialization, so move this under the need_lock check,
so those that don't need the lock don't even need to check this.
r345402 fixed the bug that led to the split of the ISA 3.0 HPT handling from
the existing manager. The cause of the bug was gcc moving the register
holding VPN to a different register (not r0), which triggered bizarre
behaviors. With the fix, things work, so they can be re-merged. No
performance is lost with the merge.
By happenstance gcc4 puts 'vpn' into r0 in all uses of TLBIE(), but modern
gcc does not. Also, the single-argument form of tlbie zeros all unused
arguments, making the modern tlbie instruction use r0 as the RS field
(LPID).
The vpn argument has the bottom 12 bits cleared (the input having been
left-shifted by 12 bits), which just so happens, on the POWER9 and previous
incarnations, to be the number of LPID bits supported. With those bits
being zero, the instruction:
tlbie r0, r0
will invalidate the VPN in r0, in LPAR 0 (ignoring the upper bits of r0 for
the RS field). One build with gcc8 yields:
tlbie r9, r0
with r0 having arbitrary contents, not equal to r9. This leads to strange
crashes, behaviors, and panics, due to the requested TLB entry not actually
being invalidated.
As moea64_native must work on both old and new CPUs, we explicitly zero
out r0 so that it works with only the single argument, built with both
base gcc and modern gcc. isa3_hashtb takes a different approach, encoding
the two-argument form, so as not to explicitly clobber r0, instead
letting the compiler decide.
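A simplified sketch of the moea64_native approach (the real TLBIE() also
brackets this with ptesync/eieio/tlbsync):

    /* Pin r0 to zero so the one-operand encoding gets RS = 0 (LPID 0). */
    __asm __volatile("li 0, 0; tlbie %0" :: "r"(vpn) : "r0", "memory");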
Reported by: Brandon Bergren
Tested by: Brandon Bergren
MFC after: 1 week
The check for early exit should be checking the SLB entry itself. As
currently written it was checking the address of the SLB, which is always
non-zero, so it would always go through the kernel SR restore loop.
Submitted by: mmacy
MFC after: 2 weeks
At moea64_sync_icache(), when the 'va' argument has page size
alignment, round_page() will return the same value as 'va'.
This would cause 'len' to be 0 and thus an infinite loop.
With this change, 'lim' will always point to the next page boundary.
This issue occurred especially during debugging sessions, when a breakpoint
was placed on an exact page-aligned offset, for instance.
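The fix amounts to roughly:

    lim = round_page(va + 1);       /* always the next page boundary */
    len = MIN(lim - va, sz);        /* nonzero even for page-aligned va */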
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19149
Currently, the trap code switches to the temporary stack in the dbtrap
section. This works in most cases, but at the beginning of execution the
temporary stack is already in use, starting in the powerpc_init() code.
In this scenario, the stack is overwritten, which causes the return from
breakpoint() to take an abnormal execution path.
This patchset creates a small stack for use by the dbtrap: code path,
avoiding corruption of the temporary stack.
PR: 224872
Submitted by: breno.leitao_gmail.com
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D14484
This change adds a hypervisor trap handler for exception 0x1500 (soft patch),
normalizing all VSX registers and returning.
This avoids a kernel panic due to an unknown exception.
Change made with the collaboration of leonardo.bianconi_eldorado.org.br,
who found out that this is a hypervisor exception and not a supervisor
one, and fixed this in the code.
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D17806
A related future change, which changes KERNBASE for Book-E, for some
reason causes a "KERNBASE redefined" error with assym.inc, even though it
only changes the value of KERNBASE and nothing else. Since
machine/vmparam.h is already
included in booke/locore.S, and the requisite guards are already in place for
properly handling KERNBASE in vmparam.h, just remove it from genassym, and
include vmparam.h in the AIM locore files.
Maxmem is the highest address for physical memory in the system. It's
measured in pages which, since max() returns a u_int, should allow for up to
2^44 bytes of memory addressable by the system. However, on POWER9 systems
at least, memory addressed by additional socketed CPUs begins at addresses
far above the 2^44 mark, causing issues with memory accesses and DMA, when
memory is addressed on the auxiliary CPUs. Use the MAX() macro instead,
which doesn't convert arguments, so retains Maxmem and all calculations as
its defined long type (64-bit on powerpc64), keeping the maximum address
correct.
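For illustration (the exact expression differs; region_end is a stand-in):

    /* max() is u_int max(u_int, u_int): page counts above 2^32 truncate. */
    Maxmem = max(Maxmem, btoc(region_end));
    /* MAX() is a plain macro and evaluates in the operands' own type. */
    Maxmem = MAX(Maxmem, btoc(region_end));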
Submitted by: mmacy
r279252 inverted the logic in moea64_scan_init, such that instead of
terminating when reaching a dead page, it terminates when reaching a live
page, ostensibly preserving exactly one page of KVA.
r330610 relocated the DMAP from the base of memory to the base of the fourth
quadrant of memory. This broke synthetic traps, such as KDB forced
breakpoints. Use GET_TOCBASE() so the DMAP offset is handled.
Submitted by: git_bdragon.rkt0.net
Differential Revision: https://reviews.freebsd.org/D15973
PowerISA 3.0 makes several changes to not only the format of the HPT but
also the behavior surrounding it. For instance, TLBIE no longer requires
serialization. Removing this lock cuts buildworld time in half on an
18-core/72-thread POWER9 system, demonstrating that this lock is highly
contended on such a system.
There was odd behavior observed trying to make this change in a
backwards-compatible manner in moea64_native.c, so the best option was to
fully split it, and largely revert the original changes adding POWER9
support to the original file.
Suggested by: nwhitehorn
On very large memory systems 'size' can become 2GB or larger, resulting in a
negative value being formatted. Also, moea64_pteg_count is already a long, so
format it as such.
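For illustration:

    printf("moea64: PTEG count: %ld, PTEG table size: %lu bytes\n",
        moea64_pteg_count, (u_long)size);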
POWER9 supports Radix page tables in addition to Hashed page tables. When
Radix page tables are in use, the TLB is cut in half, so that half of the
TLB is used for the page walk cache. This is the default behavior, however
FreeBSD currently does not support Radix tables. Clear this bit so that we
can use the full TLB. Do this in the MMU logic so that configuration can be
localized to the specific translation format. Once we do support Radix
tables, the setup for that will be localized to the Radix MMU kobj.
Summary:
PowerISA 2.03 and later require bits 14:65 in the RB register argument,
which is the full value of the vpn argument post-shift. Only POWER4, POWER4+,
and PPC970* need the upper 16 bits cropped.
With this change FreeBSD can boot to multi-user on POWER9.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15581
Summary:
POWER9 systems use a new interrupt controller, XIVE, managed through OPAL
firmware calls. The OPAL firmware includes support for emulating the previous
generation XICS presentation layer in addition to a new "XIVE Exploitation"
mode. As a stopgap until we have XIVE exploitation mode, enable XICS emulation
mode so that we at least have an interrupt controller.
Since the CPPR is local to the current CPU, it cannot be updated for APs when
initializing on the BSP. This adds a new function, directly called by the
powernv platform code, to initialize the CPPR on AP bringup.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15492
Summary:
Some hypervisor exceptions on the POWER architecture only save state to
HSRR0/HSRR1.
Until we have bhyve on POWER, use a lightweight exception frontend which copies
HSRR0/HSRR1 into SRR0/SRR1, and run the normal trap handler.
The first user of this is the Hypervisor Virtualization Interrupt, which targets
the XIVE interrupt controller on POWER9.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15487
r333273 and partially reverted with r333594.
Older CPUs implement addition of offsets into the page table by a
bitwise OR rather than actual addition, which only works if the table is
aligned at a multiple of its own size (they also require it to be aligned
at a multiple of 256KB). Newer ones do not have that requirement, but it
hardly matters to enforce it anyway.
The original code was failing on newer systems with huge amounts of RAM
(> 512 GB), in which the page table was 4 GB in size. Because the
bootstrap memory allocator took its alignment parameter as an int, this
turned into a 0, removing any alignment constraint at all and making
the MMU fail. The first round of this patch (r333273) fixed this case by
aligning it at 256 KB, which broke older CPUs. Fix this instead by widening
the alignment parameter.
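The truncation is easy to see, and the widened signature is sketched
below (assuming the allocator is moea64_bootstrap_alloc()):

    u_int align32 = (u_int)0x100000000UL;   /* 4 GB alignment becomes 0 */

    /* Widened: the alignment no longer passes through an int. */
    static void *moea64_bootstrap_alloc(vm_size_t size, vm_size_t align);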
The POWER9 MMU (PowerISA 3.0) is slightly different from current
configurations, using a partition table even for hypervisor mode, and
dropping the SDR1 register. Key off the newly early-enabled CPU features
flags for the new architecture, and configure the MMU appropriately.
The POWER9 MMU ignores the "PSIZ" field in the PTCR, and expects a 64kB
table. As we are enabled for powernv (hypervisor mode, no VMs), only
initialize partition table entry 0, and zero out the rest. The actual
contents of the register are identical to SDR1 from previous architectures.
Along with this, fix a bug in the page table allocation with very large
memory. The table can be allocated on any 256k boundary. The
bootstrap_alloc alignment argument is an int, and with large amounts of
memory passing the size of the table as the alignment will overflow an
integer. Hard-code the alignment at 256k as wider alignment is not
necessary.
Reviewed by: nwhitehorn
Tested by: Breno Leitao
Relnotes: Yes
POWER8 and POWER9 have similar configuration requirements for hypervisor setup,
and in the cases here they're identical. Add the POWER9 constant to the POWER8
list so it's initialized correctly.
Reviewed by: nwhitehorn
Summary:
POWER9 also contains 32 SLB entries, as explained by the POWER9 User Manual:
"For HPT translation, the POWER9 core contains a unified (combined for both
instruction and data), 32-entry, fully-associative SLB per thread"
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D15128
region marked "available" by firmware is contained entirely in the kernel.
This had a tendency to happen with FDTs passed by loader, though it could
happen for other reasons as well, and would result in the kernel slowly
cannibalizing
itself for other purposes, eventually resulting in a crash.
A similar fix is needed for mmu_oea.c and should probably just be rolled
at that point into some generic code in platform.c for taking a mem_region
list and removing chunks.
PR: 226974
Submitted by: leandro.lupori@gmail.com
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D15121
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more, so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
When the kernel can be in real mode in early boot, we can execute from
high addresses aliased to the kernel's physical memory. If that high
address has the first two bits set to 1 (0xc...), those addresses will
automatically become part of the direct map. This reduces page table
pressure from the kernel and it sets up the kernel to be used with
radix translation, for which it has to be up here.
This is accomplished by exploiting the fact that all PowerPC kernels are
built as position-independent executables and relocate themselves
on start. Before this patch, the kernel runs at 1:1 VA:PA, but that
VA/PA is random and set by the bootloader. Very early, it processes
its ELF relocations to operate wherever it happens to find itself.
This patch uses that mechanism to re-enter and re-relocate the kernel
a second time with a new base address set up in the early parts of
powerpc_init().
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D14647
accomplishes a few things:
- Makes NULL an invalid address in the kernel, which is useful for catching
bugs.
- Lays groundwork for radix-tree translation on POWER9, which requires the
direct map be at high memory.
- Similarly lays groundwork for a direct map on 64-bit Book-E.
The new base address is chosen as the base of the fourth radix quadrant
(the minimum kernel address in this translation mode) and because all
supported CPUs ignore at least the first two bits of addresses in real
mode, allowing direct-map addresses to be used in real-mode handlers.
This is required by Linux and is part of the architecture standard
starting in POWER ISA 3, so can be relied upon.
Reviewed by: jhibbits, Breno Leitao
Differential Revision: https://reviews.freebsd.org/D14499
When a processor thread enters a power-save state it releases resources
shared with other CPU threads, which makes the other threads work much
faster.
This patch also implements saving and restoring registers that might get
corrupted in power-save state.
Submitted by: Patryk Duda <pdk@semihalf.com>
Obtained from: Semihalf
Reviewed by: jhibbits, nwhitehorn, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14330
Make vm_wait() take the vm_object argument which specifies the domain
set to wait on for the min condition to pass. If there is no object
associated with the wait, use curthread's policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() are supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait on for the min condition to pass.
Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.
Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.
Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.
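Usage after the conversion:

    vm_wait(obj);           /* wait using obj's domainset policy */
    vm_wait(NULL);          /* wait using curthread's policy */
    vm_wait_domain(domain); /* wait for one specific domain */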
Sketched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D14384
This is part of a long-term goal of merging Book-E and AIM into a single GENERIC
kernel. As more work is done, the struct may be optimized further.
Reviewed by: nwhitehorn
threads from compile-time defines to global variables. This removes a
significant amount of duplicated runtime patching of the compile-time
defines, centralizing the conditional logic in the early startup code.
Reviewed by: jhibbits
It turns out that under some circumstances we can get DSI or DSE before we
set LPCR and LPID, so we should set them as early as possible.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
used with hashed page tables on AIM and place it into a new, modular pmap
function called pmap_decode_kernel_ptr(). This function is the inverse
of pmap_map_user_ptr(). With POWER9 radix tables, which mapping to use
becomes more complex than just AIM/BOOKE and it is best to have it in
the same place as pmap_map_user_ptr().
Reviewed by: jhibbits
to which it is specific, rather than in the generic AIM startup code. This
will be required to support the radix-table-based MMU introduced with POWER9.
buffers into a new pmap-module function pmap_map_user_ptr() that can
be implemented by the respective modules. This is required to implement
non-segment-based AIM-ish MMU systems such as the radix-tree page tables
introduced by POWER ISA 3.0 and present on POWER9.
Reviewed by: jhibbits
using a new macro PHYS_TO_DMAP, which deliberately has the same name as the
equivalent macro on amd64. This also sets the stage for moving the direct
map to another base address.
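The macros reduce to roughly the following sketch (the base was 0 at this
point; r330610 later moved it to the fourth quadrant):

    #define PHYS_TO_DMAP(pa)        ((pa) | DMAP_BASE_ADDRESS)
    #define DMAP_TO_PHYS(va)        ((va) & ~DMAP_BASE_ADDRESS)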
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.
Well-behaved clients can achieve perfect locality with no performance
penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
is used as the bootloader on a number of PPC64 platforms. This involves the
following pieces:
- Making the first instruction a valid kernel entry point, since kexec
ignores the ELF entry value. This requires a separate section and linker
magic to prevent the linker from filling the beginning of the section
with stubs.
- Adding an entry point at 0x60 past the first instruction for systems
lacking firmware CPU shutdown support (notably PS3).
- Linker script changes to support the above.
MFC after: 1 month
If these are not aligned, the linker has to emit a different type of
relocation that the early boot self-relocation code cannot handle, even
in principle, resulting in them being set to zero and the kernel crashing.
MFC after: 1 week
Mainly focus on files that use the BSD 2-Clause license; however, the tool
I was using misidentified many licenses, so this was mostly a manual,
error-prone task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
PowerPC kernels in r6 is actually metadata from loader(8) or gibberish
left in r6, which is not required to be anything under the
PAPR/ePAPR/CHRP/OF standards, by another boot loader.
Note that, as a result, systems need a new boot loader to boot PPC kernels
after this revision without ending up at a mountroot prompt. New boot
loaders are backwards compatible and can boot older kernels.
Reviewed by: jhibbits
MFC after: 2 months