Since qemu does not implement the L2 cache, we get stuck forever waiting
for a bit to be set when trying to invalidate it.
To prevent that, we should bail out if the L2 cache is missing.
One easy way to check this is L2CFG0 == 0 (since L2CSIZE always has at
least one bit set in a valid implementation)
(tested on qemu, rb800, and x5000)
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D25225
After r361988 fixed the reference count leak on booke64, it became possible
for an iteration somewhere in the middle of a page to become stale, with the
page vanishing (correctly) due to all PTEs on that page going away.
pte_find_next() would start at that iterator, and move along 'higher' order
directory pages until it finds a valid one, without zeroing out the lower
order pages. For instance:
/* Find next pte at or above 0x10002000. */
pte = pte_find_next(pmap, &(0x10002000));
pte_remove(pmap, pte);
/* This pte was the last reference in the page table page, page is
* gone.
*/
pte = pte_find_next(pmap, 0x10002000);
/* pte_find_next will see 0x10002000's page is gone, and jump to the
* next one, but starting iteration at the '0x2000' slot, skipping
* 0x0000 and 0x1000.
*/
This caused some processes, like git, to trip the KASSERT() in
pmap_release().
Fix this by zeroing all lower order iterators at each level.
Properly handle reference counts in the 64-bit pmap page directories.
Otherwise all page table pages would leak due to over-referencing. This
would cause a quick enter to swap on a desktop system (AmigaOne X5000) when
quitting and rerunning applications, or just building world.
Add an INVARIANTS check to validate no leakage at pmap release time.
Summary:
Radix on AIM, and all of Book-E (currently), can do direct addressing of
user space, instead of needing to map user addresses into kernel space.
Take advantage of this to optimize the copy(9) functions for this
behavior, and avoid effectively NOP translations.
Test Plan: Tested on powerpcspe, powerpc64/booke, powerpc64/AIM
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D25129
With IFUNC support in the kernel, we can finally get rid of our poor-man's
ifunc for pmap, utilizing kobj. Since moea64 uses a second tier kobj as
well, for its own private methods, this adds a second pmap install function
(pmap_mmu_init()) to perform pmap 'post-install pre-bootstrap'
initialization, before the IFUNCs get initialized.
Reviewed by: bdragon
Kernel page tables actually start at index 4096, given kernel base address
of 0xc008000000000000, not index 0, which would yield 0xc000000000000000.
Fix this by indexing at the real base, instead of the assumed base.
Summary:
POWER9 supports two MMU formats: traditional hashed page tables, and Radix
page tables, similar to what's presesnt on most other architectures. The
PowerISA also specifies a process table -- a table of page table pointers--
which on the POWER9 is only available with the Radix MMU, so we can take
advantage of it with the Radix MMU driver.
Written by Matt Macy.
Differential Revision: https://reviews.freebsd.org/D19516
Summary:
Some machine checks are process-recoverable, others are not. Let a
CPU-specific handler decide what to do.
This works around a machine check error hit while building www/firefox
and mail/thunderbird, which would otherwise cause the build to fail.
More work is needed to handle all possible machine check conditions, but
this is sufficient to unblock some ports building.
Differential Revision: https://reviews.freebsd.org/D23731
Summary:
This reduces the precious TLB1 entry consumption (64 possible in
existing 64-bit cores), by adjusting the size and alignment of a device
mapping to a power of 2, to encompass the full mapping and its
surroundings.
One caveat with this: If a mapping really is smaller than a power of 2,
it's possible to get a machine check or hang if the 'missing' physical
space is accessed. In practice this should not be an issue for users,
as devices overwhelmingly have physical spaces on power-of-two sizes and
alignments, and any design that includes devices which don't follow this
can be addressed by undefining the POW2_MAPPINGS guard.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D24248
Summary:
Iterating over VM_MIN_ADDRESS->VM_MAXUSER_ADDRESS can take a very long
time iterating one page at a time (2**(log_2(SIZE)-12) operations),
yielding possibly several days or even weeks on 64-bit Book-E, even for
a largely empty, which can happen when swapping out a process by
vmdaemon. Speed this up by instead finding the next PTE at or equal to
the given VA.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D24238
Summary:
The existing page table is fraught with errors, since it creates a hole
in the address space bits. Fix this by taking a cue from the POWER9
radix pmap, and make the page table 4 levels, 52 bits.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D24220
Summary:
The support was added almost a decade ago, and never completed. Just axe
it. It was also inadvertently broken 5 years ago, and nobody noticed.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D23753
Summary:
This is largely a straight-forward cleave of the 32-bit and 64-bit page
table specifics, along with the mmu_booke_*() functions that are wholely
different between the two implementations.
The ultimate goal of this is to make it easier to reason about and
update a specific implementation without wading through the other
implementation details. This is in support of further changes to the 64-bit
pmap.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D23983
Since powerpc64 has such a large virtual address space, significantly larger
than its physical address space, take advantage of this, and create yet
another DMAP-like instance for the device mappings. In this case, the
device mapping "DMAP" is in the 0x8000000000000000 - 0xc000000000000000
range, so as not to overlap the physical memory DMAP.
This will allow us to add TLB1 entry coalescing in the future, especially
useful for things like the radeonkms driver, which maps parts of the GPU at
a time, but eventually maps all of it, using up a lot of TLB1 entries (~40).
ptbl_alloc() is expected to return with the pvh_global_lock and pmap
lock held. However, it will return with them unlocked if nosleep is
specified.
Along with this, fix lock ordering of pvh_global_lock with respect to
the pmap lock in other places.
Differential Revision: https://reviews.freebsd.org/D23692
In r356767, memcpy/memmove/bcopy optimizations were added to libc to
improve performance.
This exposed an existing kernel issue in VSX handling. The PSL_VSX flag was
not being excluded from the psl_userstatic set, which meant that any thread
that used these and then called swapcontext(3) would get an EINVAL error.
Fixing this exposed a second issue - in r344123, the FPU was being forced
off in set_mcontext(). However, this was neglecting to ensure VSX was turned
off at the same time.
While here, add some code comments to explain what's going on.
Reviewed by: jhibbits, luporl (earlier rev), pkubaj (earlier rev)
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D23497
It turns out the maximum TLB1 page size on e5500 is 4G, despite the format
being defined for up to 1TB.
So, we need to clamp the DMAP TLB1 entries to not attempt to create 16G or
larger entries.
Fixes boot on my X5000 in which I just installed 16G of RAM.
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D23244
Fix multiple problems in the powerpcspe floating point code.
* Endianness handling of the SPEFSCR in fenv.h was completely broken.
* Ensure SPEFSCR synchronization requirements are being met.
The __r.__d -> __r transformations were written by jhibbits.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22526
Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similar to
existing. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.
On Book-E the page array is setup at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.
Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449
Use the right formats for the types given (vm_offset_t and vm_size_t are
both uint32_t on 32-bit platforms, and uint64_t on 64-bit platforms, and
match size_t in size, so we can use the size_t format as we do in other
similar code).
These were found by clang.
r354266 changed the type of bp_kernload to vm_paddr_t in platform_mpc85xx.c,
but not the variable itself in locore.S. This caused the AP to not come up,
due to overwriting the following variable (bp_virtaddr). Also, properly
load bp_kernload into MAS3 and MAS7. Prior to r354266, we required loading
into the low 4GB, but now we can load from anywhere in memory that ubldr can
access.
Switch from "evaddumiaaw 0,0" to "evmwumiaa 0,0,0" when persisting the
accumulator. This has the benefit of actually being implemented in QEMU
as it is the form Linux uses for the same task.
Both instructions are functionally equivilent, as we are using them for
their side effect of copying the accumulator to GPRs rather than for the
actual math operation that they are performing.
Reviewed by: jhibbits
'tlbilxpid' is 'tlbilx 1, 0', while the existing form is 'tlbilx 0, 0',
which translates to 'tlbilxlpid', invalidating a LDPID. This effectively
invalidates the entire TLB, causing unnecessary reloads.
save_vec_int() for SPE saves off only the high word of the register, leaving
the low word as "garbage", but really containing whatever was in the kernel
register at the time. This leaks into core dumps, and in a near future
commit also into ptrace. Instead, save the GPR in the low word in
save_vec_nodrop(), which is used only for core dumps and ptrace.
Modern gcc errors that "'vec[0]' is used uninitialized in this function"
without us telling it that vec is clobbered. Neither clang nor gcc 4.2.1
error on the existing construct.
Submitted by: bdragon
The memory range between VM_MAXUSER_ADDRESS and VM_MIN_KERNEL_ADDRESS is
reserved for devices currently, which are always mapped in TLB1, and
therefore do not exist in the kernel page table. Any page fault in this
range is therefore automatically a fatal fault.
Since TLB_MAXNEST is 3, the insert mask should only be 2 bits. Given that 2
bits counts to 4, and that we already have plenty of space wasted in
padding, make the nest level 4 to match the mask.
Also, fix pmap_change_attr() to ignore non-kernel mappings.
* Fix a masking bug in mmu_booke_mapdev_attr() which caused it to align
mappings to the smallest mapping alignment, instead of the largest. This
caused mappings to be potentially pessimally aligned, using more TLB
entries than necessary.
* Return existing mappings from mmu_booke_mapdev_attr() that span more than
one TLB1 entry. The drm-current-kmod drivers map discontiguous segments
of the GPU, resulting in more than one TLB entry being used to satisfy the
mapping.
* Ignore non-kernel mappings in mmu_booke_change_attr(). There's a bug in
the linuxkpi layer that causes it to actually try to change physical
address mappings, instead of virtual addresses. amd64 doesn't encounter
this because it ignores non-kernel mappings.
With this it's possible to use drm-current-kmod on Book-E.
tlb1_mapin_region() and pmap_mapdev_attr() do roughly the same thing -- map
a chunk of physical address space(memory or MMIO) into virtual, but do it in
differing ways. Unify the code, settling on pmap_mapdev_attr()'s algorithm,
to simplify and unify the logic. This fixes a bug with growing the kernel
mappings in mmu_booke_bootstrap(), where part of the mapping was not getting
done, leading to a hang when the unmapped VAs were accessed.
This involved several changes:
* Since lld does not like text relocations, replace SMP boot page text relocs
in booke/locore.S with position-independent math, and track the virtual base
in the SMP boot page header.
* As some SPRs are interpreted differently on clang due to the way it handles
platform-specific SPRs, switch m*dear and m*esr mnemonics out for regular
m*spr. Add both forms of SPR_DEAR to spr.h so the correct encoding is selected.
* Change some hardcoded 32 bit things in the boot page to be pointer-sized, and
fix alignment.
* Fix 64-bit build of booke/pmap.c when enabling pmap debugging.
Additionally, I took the opportunity to document how the SMP boot page works.
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D21999
It's possible, with per-CPU mappings, for TLB1 indices to get out of sync.
This presents a problem when trying to insert an entry into TLB1 of all
CPUs. Currently that's done by assuming (hoping) that the TLBs are
perfectly synced, and inserting to the same index for all CPUs. However,
with aforementioned private mappings, this can result in overwriting
mappings on the other CPUs.
An example:
CPU0 CPU1
<setup all mappings> <idle>
3 private mappings
kick off CPU 1
initialize shared mappings (3 indices low)
Load kernel module, triggers 20 new mappings
Sync mappings at N-3
initialize 3 private mappings.
At this point, CPU 1 has all the correct mappings, while CPU 0 is missing 3
mappings that were shared across to CPU 1. When CPU 0 tries to access
memory in one of the overwritten mappings, it hangs while tripping through
the TLB miss handler. Device mappings are not stored in any page table.
This fixes by introducing a '-1' index for tlb1_write_entry_int(), so each
CPU searches for an available index private to itself.
MFC after: 3 weeks
r353489 added minidump support for powerpc64, but it added a dependency on
the dump_avail array. Leaving it uninitialized caused breakage in late
boot. Initialize dump_avail, even though the 64-bit booke pmap doesn't yet
support minidumps, but will in the future.
MAS8 is hypervisor privileged, defining the logical partition (VM) to
operate on for TLB accesses. It's already guaranteed to be cleared when
booting bare metal (bootloader needs it zeroed to work), and we can't touch
it from a guest. Assume that if/when we eventually port bhyve to PowerPC
(and Book-E) the hypervisor module will take care of managing MAS8. This
saves several (tens) of clocks on each TLB miss.
MFC after: 2 weeks
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
The VM_PAGE_OBJECT_BUSY_ASSERT() in pmap_enter() implementation should
be only asserted when the code is executed as result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22041
callers hold it.
This simplifies pmap code and removes a dependency on the object lock.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21596
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
There are cases where there's no vm_page_t structure for a given physical
address, such as the CCSR. In this case, trying to obtain the
md.page_tracked struct member would lead to a NULL dereference, and panic.
Tighten this up by checking for kernel_pmap AND that the page structure
actually exists before dereferencing. The flag can only be set when it's
tracked in the kernel pmap anyway.
MFC after: 3 weeks
Clang9/LLD9 appears to get quite confused with the instruction stream used
to obtain the tmpstack pointer, almost as though it thinks this is a C
function, so tries to optimize it. Since the AIM64 method doesn't use the
TOC to obtain the tmpstack, just follow that model, and lld won't get
confused.
Reported by: bdragon
MFC after: 2 weeks
Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.
Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486