MAS8 is hypervisor privileged, defining the logical partition (VM) to
operate on for TLB accesses. It's already guaranteed to be cleared when
booting bare metal (bootloader needs it zeroed to work), and we can't touch
it from a guest. Assume that if/when we eventually port bhyve to PowerPC
(and Book-E) the hypervisor module will take care of managing MAS8. This
saves several (tens of) clocks on each TLB miss.
MFC after: 2 weeks
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).
It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).
The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up
remains in debugnet.c, to the extent that it is protocol-independent. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.
Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.
The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.
No other functional change intended.
Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421
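The high-water-mark rule described in the debugnet change above can be
sketched roughly as follows. This is only an illustration: struct dn_pool,
dn_pool_link_up(), and the use of calloc() are hypothetical stand-ins, not
the names or allocator used in debugnet.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the preallocated panic-time buffer pool. */
struct dn_pool {
	size_t	nmbufs;		/* high-water mark: number of buffers */
	size_t	bufsize;	/* high-water mark: size of each buffer */
	void	*mem;
};

/*
 * Called when a NIC reports link-up along with its buffer requirements.
 * Returns 1 if the pool was grown, 0 if the existing allocation already
 * covers the request, and -1 if growing failed (the old pool is kept).
 */
static int
dn_pool_link_up(struct dn_pool *p, size_t nmbufs, size_t bufsize)
{
	size_t want_n, want_sz;
	void *mem;

	if (nmbufs <= p->nmbufs && bufsize <= p->bufsize)
		return (0);
	want_n = nmbufs > p->nmbufs ? nmbufs : p->nmbufs;
	want_sz = bufsize > p->bufsize ? bufsize : p->bufsize;
	mem = calloc(want_n, want_sz);
	if (mem == NULL)
		return (-1);
	free(p->mem);
	p->mem = mem;
	p->nmbufs = want_n;
	p->bufsize = want_sz;
	return (1);
}
```

A later NIC with smaller requirements leaves the pool untouched, which is
the "otherwise, we leave the pre-panic allocation alone" case.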
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
moea_pvo_remove() might remove the last mapping of a page, in which case
it is clearly no longer writeable. This can happen via pmap_remove(),
or when a CoW fault removes the last mapping of the old page.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22044
The VM_PAGE_OBJECT_BUSY_ASSERT() in pmap_enter() implementation should
be only asserted when the code is executed as result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22041
Summary:
The AmigaOne platform, encompassing the X5000 and A1222 at this time, is
based on the mpc85xx platform, but includes some things not listed in
the device tree. Some custom devices, like CPLD, could be added to the
device tree with an overlay, or other means. However, some cannot
easily be done, such as the power button interrupt.
The directory will also become a location to add AmigaOne platform drivers,
such as the aforementioned CPLD, and its children.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D21829
callers hold it.
This simplifies pmap code and removes a dependency on the object lock.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21596
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551
There are cases where there's no vm_page_t structure for a given physical
address, such as the CCSR. In this case, trying to obtain the
md.page_tracked struct member would lead to a NULL dereference, and panic.
Tighten this up by checking for kernel_pmap AND that the page structure
actually exists before dereferencing. The flag can only be set when it's
tracked in the kernel pmap anyway.
MFC after: 3 weeks
This adds two implementations for each atomic_fcmpset_ and atomic_cmpset_
short and char functions, selectable at compile time for the target
architecture. By default, it uses a generic shift-and-mask to perform atomic
updates to sub-components of 32-bit words from <sys/_atomic_subword.h>.
However, if ISA_206_ATOMICS is defined it uses the ll/sc instructions for
halfword and bytes, introduced in PowerISA 2.06. These instructions are
supported by all IBM processors from POWER7 on, as well as the Freescale/NXP
e6500 core. Although the e5500 and e500mc both implement PowerISA 2.06 they
do not implement these instructions.
As part of this, clean up the atomic_(f)cmpset_acq and _rel wrappers, by
using macros to reduce code duplication.
ISA_206_ATOMICS requires clang or newer binutils (2.20 or later).
Differential Revision: https://reviews.freebsd.org/D21682
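For readers unfamiliar with the generic shift-and-mask approach mentioned
above, a minimal userland sketch using C11 atomics might look like this. It
is not the code from <sys/_atomic_subword.h>; subword_cmpset_8() and its
lane parameter are hypothetical, and the mapping from a byte address to a
lane (which depends on endianness) is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare-and-set one byte lane of an aligned 32-bit word.  'lane' is
 * 0..3, counted from the least significant byte.
 */
static bool
subword_cmpset_8(_Atomic uint32_t *word, unsigned lane, uint8_t expect,
    uint8_t newval)
{
	const unsigned shift = lane * 8;
	const uint32_t mask = (uint32_t)0xff << shift;
	uint32_t old, want;

	old = atomic_load(word);
	do {
		/* Fail if the target byte does not hold the expected value. */
		if (((old & mask) >> shift) != expect)
			return (false);
		/* Splice the new byte in, leaving the other lanes alone. */
		want = (old & ~mask) | ((uint32_t)newval << shift);
	} while (!atomic_compare_exchange_weak(word, &old, want));
	return (true);
}
```

The loop retries whenever a neighbouring byte changed underneath it, which
is the cost the native halfword and byte ll/sc instructions avoid.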
As pointed out by mjg, without the parentheses the calculations done against
these macros are incorrect, resulting in only 1/3 of locks being used.
Reported by: mjg
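The effect of the missing parentheses can be reproduced with a small
stand-alone example. The macro names and the values 8 and 3 are made up for
illustration; they are not the macros touched by the commit.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_LOCKS	8			/* hypothetical */
#define NLOCKS_BAD	BASE_LOCKS * 3		/* missing parentheses */
#define NLOCKS_GOOD	(BASE_LOCKS * 3)

/* x % 8 * 3 parses as (x % 8) * 3, so only multiples of 3 come out. */
static size_t lock_index_bad(size_t x)  { return (x % NLOCKS_BAD); }
static size_t lock_index_good(size_t x) { return (x % NLOCKS_GOOD); }

/* Count how many of the 24 lock slots are hit over a range of inputs. */
static size_t
slots_used(size_t (*index)(size_t))
{
	unsigned char seen[NLOCKS_GOOD] = { 0 };
	size_t n = 0;

	for (size_t x = 0; x < 1000; x++) {
		size_t i = index(x);

		if (!seen[i]) {
			seen[i] = 1;
			n++;
		}
	}
	return (n);
}
```

With these values the unparenthesized macro reaches 8 of 24 slots, i.e. one
third, while the parenthesized one reaches all 24.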
Clang9/LLD9 appears to get quite confused with the instruction stream used
to obtain the tmpstack pointer, almost as though it thinks this is a C
function, so tries to optimize it. Since the AIM64 method doesn't use the
TOC to obtain the tmpstack, just follow that model, and lld won't get
confused.
Reported by: bdragon
MFC after: 2 weeks
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap(). MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.
This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.
Change the delivered signal for accesses to mapped area after the
backing object was truncated. According to POSIX description for
mmap(2):
The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified
portions of the last page of an object which are beyond its
end. References within the address range starting at pa and
continuing for len bytes to whole pages following the end of an
object shall result in delivery of a SIGBUS signal.
An implementation may generate SIGBUS signals when a reference
would cause an error in the mapped object, such as out-of-space
condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.
For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.
vm_fault_hold() is renamed to vm_fault(). Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos. Removed
unneeded truncations of the fault addresses reported by hardware.
PR: 211924
Reviewed by: alc
Discussed with: jilles, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21566
Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.
Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768
Both IBM and Freescale programming examples presume the cmpset operands will
favor equal, and pessimize the non-equal case instead. Do the same for
atomic_cmpset_* and atomic_fcmpset_*. This slightly pessimizes the failure
case, in favor of the success case.
MFC after: 3 weeks
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
Add a very basic NVRAM driver for OPAL which can be used by the IBM
powerpc-utils nvram utility, not to be confused with the base nvram utility,
which only operates on powermac_nvram.
The IBM utility handles all partitions itself, treating the nvram device as
a plain store.
An alternative would be to manage partitions in the kernel, and augment the
base nvram utility to deal with different backing stores, but that
complicates the driver significantly. Instead, present the same interface
IBM's utility expects, and we get the usage for free.
Tested by: bdragon
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accommodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
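The idea of an atomically updated counter with flag bits, as described in
the reference counting change above, can be sketched in userland with C11
atomics. Everything here is an assumption made for illustration: struct
page, the REF_* constants, and the function names are hypothetical and
simplified, and the real vm_page code handles many more cases.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REF_OBJREF	0x80000000u	/* hypothetical: object's reference */
#define REF_BLOCKED	0x40000000u	/* hypothetical: new wirings blocked */
#define REF_COUNT(c)	((c) & ~(REF_OBJREF | REF_BLOCKED))

struct page {
	_Atomic uint32_t ref_count;
};

/* Take a wiring via a mapping; fails while the page is being unmapped. */
static bool
page_wire_mapped(struct page *m)
{
	uint32_t old = atomic_load(&m->ref_count);

	do {
		if ((old & REF_BLOCKED) != 0)
			return (false);
	} while (!atomic_compare_exchange_weak(&m->ref_count, &old, old + 1));
	return (true);
}

/* Drop a wiring; true if the counter reached zero with no flag bits set. */
static bool
page_unwire(struct page *m)
{
	uint32_t old = atomic_fetch_sub(&m->ref_count, 1);

	assert(REF_COUNT(old) > 0);
	return (old == 1);
}

static void
page_block_wirings(struct page *m)
{
	atomic_fetch_or(&m->ref_count, REF_BLOCKED);
}

static void
page_unblock_wirings(struct page *m)
{
	atomic_fetch_and(&m->ref_count, ~REF_BLOCKED);
}
```

Because the blocked bit lives in the same word as the count, a single
compare-and-swap both checks the bit and takes the reference, so no
external lock is needed on that path.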
We only call alloc_pvo_entry() with M_WAITOK from one location. However,
this can be called while holding nonsleepable locks. Rather than passing
M_WAITOK down, use vm_wait() and loop.
Summary:
MOEA64_PTE_REPLACE() is called often with the pmap lock held, and
sometimes with the page pv lock held. The less work done while holding
a lock, the better. Since we are intending to replace the same PTE
(same hash index), we don't need to recalculate anything, just flat
replace the PTE. This cuts more than 200 instructions off the
invalidating code path. In addition, we don't need to replace a PTE
that's not occupied by this PVO.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21515
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows removing the copies of __pcpu spread around the
source. Also, on x86 it makes it possible to remove workarounds like
OFFSETOF_CURTHREAD or clang-specific warning suppressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
Many arm kernel configs bogusly specified WERROR=-Werror. There's no
reason for this because the default is that and there's no reason to
override. These date from a time when we needed to add additional
warning->error suppression. They are obsolete and were cut and paste
propagated from file to file.
Comment out all the WERROR=.... lines in powerpc. They aren't bogus,
but were appropriate for the old defaults for gcc4.2.1. Now that we've
made the policy decision to suppress -Werror by default on these
platforms, it is appropriate to comment these out. People wishing to
fix these errors can still uncomment them, or say WERROR=-Werror
on the command line.
Fix two instances (cut and paste propagation) of hard-coded -Werror
in x86 code. Replace with ${WERROR} instead. This is a no-op change
except for people who build WERROR=-Wno-error :).
This should fix tinderbox / CI breakage.
Summary:
Reduce the diff between AIM and Book-E even more. This also cleans up
vmparam.h significantly.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21301
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
The only thing blocking UMA_MD_SMALL_ALLOC from working on 64-bit booke
powerpc was a missing check in pmap_kextract(). Adding DMAP handling into
pmap_kextract(), we can now use UMA_MD_SMALL_ALLOC. This should improve
performance and stability a bit, since DMAP is always mapped in TLB1, so
this relieves pressure on TLB0.
MFC after: 3 weeks
tcpratelimit isn't supported as there's no atomic_add_64, so add it to the exclusion list
Add comment for why PPC_PROBE_CHIPSET is on the list
Remove UKBD_DFLT_KEYMAP now that ukbd works on all platforms.
Move the floppy driver to the x86 specific notes file.
Reviewed by: jhb, manu, jhibbits, emaste
Differential Revision: https://reviews.freebsd.org/D21208
x86 needs sc, as does sparc64. powerpc doesn't use it by default, but some old
powermac notebooks do not work with vt yet for reasons unknown. Even so, I've
removed it from powerpc LINT. It's not in daily use there, and the intent is to
100% switch to vt now that it works for that platform to limit support burden.
All the other architectures omit some or all of the screen savers from their
lint config. Move them to the x86 NOTES files and remove the exclusions. This
slightly reduces the number of savers sparc64 compiles, but since they are in
GENERIC, the coverage is adequate and if someone really wants to sort them out
in sparc64 they can sweat the details and the testing.
Reviewed by: jhb (earlier version), manu (earlier version), jhibbits
Differential Revision: https://reviews.freebsd.org/D21233
Avoid empty structs, which have undefined behavior in C99 and
make compilers complain
(an empty struct has size 0 in C, size 1 in C++).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21231
Fixed trap handler logic, in order to make it save FPU registers,
if FPU is enabled, before enabling VSX. Without this change, FPU
register contents were being lost when set before VSX was enabled.
This is part 2 of r347078, pulling the page directory out of the Book-E
pmap. This breaks KBI for anything that uses struct pmap (such as vm_map)
so any modules that access this must be rebuilt.
There is no need for the 64-bit pmap to have a fixed number of page table
buffers. Since the 64-bit pmap has a DMAP, we can effectively have user
page tables limited only by total RAM size.
The last few changes needed before 32-bit AIM builds with secure-PLT with
base GCC. Because ofwcall32.S and swtch32.S were branching to the GOT it
could not use secure PLT.
The flag handling was committed commented out 7 years ago. It works, and is
needed for LinuxKPI-based DRM drivers.
Also mark a local as potentially unusable, as it's only really used when KTR
is enabled.
Submitted by: mmacy
Freeze clearing needs to happen any time an OPAL read returns an error
(except OPAL_HARDWARE), AND any time it returns 0xff for all bytes.
For cfgwrite, any error that's not OPAL_HARDWARE should be cleaned up.
Only clear an EEH freeze if an error occurs. However, if an OPAL_HARDWARE
error is returned, this indicates a hardware failure which cannot be
unfrozen, and instead needs a hardware reset. Attempting to unfreeze a
broken PCH will result in console spam for each attempt. To avoid the spam,
just don't do it.
Summary:
Although it's convenient to reuse the pvo_plist for deletion, RB_TREE
insertion and removal is not free, and can result in a lot of extra work
to rebalance the tree. Instead, use a SLIST as a LIFO delete queue,
which gives us almost free insertion, deletion, and traversal.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21061
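The SLIST-as-LIFO-queue pattern mentioned above looks roughly like the
following in userland, using the SLIST macros from <sys/queue.h>. The
struct pvo here is a toy with hypothetical fields, and free() stands in for
whatever the pmap does to release an entry.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

struct pvo {
	int			id;
	SLIST_ENTRY(pvo)	dlink;	/* linkage on the delete queue */
};
SLIST_HEAD(pvo_dlist, pvo);

/* Defer a removal: O(1) push onto the head, nothing to rebalance. */
static void
pvo_defer_free(struct pvo_dlist *q, struct pvo *p)
{
	SLIST_INSERT_HEAD(q, p, dlink);
}

/* Drain the queue in LIFO order; returns the number of entries freed. */
static int
pvo_drain(struct pvo_dlist *q)
{
	struct pvo *p;
	int n = 0;

	while ((p = SLIST_FIRST(q)) != NULL) {
		SLIST_REMOVE_HEAD(q, dlink);
		free(p);
		n++;
	}
	return (n);
}
```

Both the push and the pop touch only the list head, which is why this is
cheaper than inserting into and removing from a balanced tree.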
Added allocation retry loop in alloc_pvo_entry(), to wait for
memory to become available if the caller specifies the M_WAITOK flag.
Also, the loop in moea64_enter() was removed, as moea64_pvo_enter()
never returns ENOMEM. It is alloc_pvo_entry() memory allocation that
can fail and must be retried.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21035
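The shape of such a retry loop, under the assumption that the allocator is
tried in non-sleeping mode and the caller waits between attempts, is
sketched below. The function-pointer hooks are hypothetical stand-ins for
the kernel allocator and for the wait primitive; flaky_alloc() and
count_wait() exist only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hooks: a non-sleeping allocator and a "wait for memory". */
typedef void *(*alloc_nowait_fn)(size_t);
typedef void (*wait_fn)(void);

static void *
alloc_entry(size_t size, bool can_wait, alloc_nowait_fn try_alloc,
    wait_fn wait_for_memory)
{
	void *p;

	for (;;) {
		p = try_alloc(size);
		/* Callers that cannot wait get the failure immediately. */
		if (p != NULL || !can_wait)
			return (p);
		wait_for_memory();
	}
}

/* Test doubles: the allocator fails until 'fail_budget' is used up. */
static int fail_budget, waits;

static void *
flaky_alloc(size_t n)
{
	return (fail_budget-- > 0 ? NULL : malloc(n));
}

static void
count_wait(void)
{
	waits++;
}
```

Keeping the wait outside the allocator means the allocator itself is never
asked to sleep.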
Summary:
It turns out statistics accounting is very expensive in the pmap driver,
and doesn't seem necessary in the common case. Make this optional
behind a MOEA64_STATS #define, which one can set if they really need
statistics.
This saves ~7-8% on buildworld time on a POWER9.
Found by bdragon.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D20903
oldpvo is never explicitly NULL'd by moea64_pvo_enter(), so don't check for
NULL to do anything, only check error.
PR: 239372
Reported by: Francis Little
Summary:
Instead of searching for a PVO entry before adding, take advantage of
the fact that RB_INSERT() returns NULL if it inserts, and the existing entry if
an entry exists, without inserting a new entry. This saves an extra tree
traversal in the cases where the PVO does not exist.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D20944
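The "insert, and learn from the return value whether it already existed"
contract can be illustrated with a tiny unbalanced binary tree. This is a
stand-in chosen to keep the example self-contained; it is not RB_INSERT()
itself, and struct node and tree_insert() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t	 key;
	struct node	*left, *right;
};

/*
 * Insert 'n' unless an element with the same key is present.  Returns
 * NULL when 'n' was linked in, or the existing element otherwise, so a
 * single traversal serves as both lookup and insertion.
 */
static struct node *
tree_insert(struct node **root, struct node *n)
{
	struct node **pp = root;

	while (*pp != NULL) {
		if (n->key == (*pp)->key)
			return (*pp);
		pp = n->key < (*pp)->key ? &(*pp)->left : &(*pp)->right;
	}
	n->left = n->right = NULL;
	*pp = n;
	return (NULL);
}
```

A caller checks the return value: NULL means the new entry is in the tree,
anything else is the entry that was already there.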
EFSCFD (floating point single convert from double) emulation requires saving
the high word of the register, which uses SPE instructions. Enable the SPE
to avoid an SPV Unavailable exception.
MFC after: 1 week
'=' asm constraint marks a variable as write-only. Because of this, gcc
throws away the initialization of 'res', causing garbage to be returned if
the CAS was successful. Use '+' to mark res as read/write, so that the
initialization stays in the generated asm. Also, fix the reservation
clearing stwcx store index register in casueword32, and only do the dummy
store when needed, skip it if the real store has already succeeded.
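The difference between the two constraint modifiers can be shown with a
portable toy, assuming a GCC- or clang-compatible compiler. The empty asm
template below stands in for the real CAS sequence, which only writes the
result on some paths; cas_result_readwrite() is a hypothetical name.

```c
#include <assert.h>

/*
 * "+r" tells the compiler that the asm both reads and writes 'res', so
 * the initializer has to be in the register before the asm runs.  With
 * "=r" the compiler may treat the initializer as dead and drop it, in
 * which case an asm path that does not write 'res' returns garbage.
 */
static int
cas_result_readwrite(void)
{
	int res = 1;

	__asm__ __volatile__("" : "+r" (res));
	return (res);
}
```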
syscallret() doesn't use error anymore. Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D2090
The only consumer of moea64_pvo_remove_from_page_locked() already has the
page in hand, so there is no need to search for the page while holding the
lock. Drop the wrapper, and rename _moea64_pvo_remove_from_page_locked().
Reported by: alc
Summary:
Since the 'page pv' lock is one of the most highly contended locks, we
need to try to do as much work outside of the lock as we can. The
moea64_pvo_remove_from_page() path is a low hanging fruit, where we can
do some heavy work (PHYS_TO_VM_PAGE()) outside of the lock if needed.
In one path, moea64_remove_all(), the PV lock is already held and can't
be swizzled, so we provide two ways to perform the locked operation, one
that can call PHYS_TO_VM_PAGE outside the lock, and one that calls with
the lock already held.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D20694
Summary:
If an illegal instruction is encountered on a process running on a
powerpc64 kernel it would attempt to sync the cache before retrying the
instruction "just in case". However, since curpmap is not set, when
moea64_sync_icache() attempts to lock the pmap, it's locking on a NULL pointer,
triggering a panic. Fix this by adding an (assumed unnecessary) fallback to
curthread's pmap in moea64_sync_icache().
Reported by: alfredo.junior_eldorado.org.br
Reviewed by: luporl, alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20911
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely. Otherwise, rogue userspace livelocks the
kernel.
To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparison or store failed, instead of relying
on the oldval == *oldvalp comparison. The primitive no longer retries
the operation if it failed spuriously. Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.
On x86, although cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.
Reviewed by: andrew (arm64, previous version), markj
Tested by: pho
Reported by: https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20772
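The division of labour described in the casueword(9) change (a primitive
that makes one attempt, and callers that own the retry policy) can be
modelled in userland with the weak C11 compare-exchange, which is also
allowed to fail spuriously. The names cas_once(), cas_retry(), and
should_stop() are hypothetical; the real callers live in kern_umtx.c and
check for stops and termination requests instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Single-attempt CAS modelled on the new contract: 0 on success, 1 when
 * the comparison or the store failed.  *oldvalp gets the value observed.
 */
static int
cas_once(_Atomic uint32_t *p, uint32_t oldval, uint32_t *oldvalp,
    uint32_t newval)
{
	uint32_t seen = oldval;
	bool ok = atomic_compare_exchange_weak(p, &seen, newval);

	*oldvalp = seen;
	return (ok ? 0 : 1);
}

/*
 * Caller-side retry: loop only on a spurious failure (the value matched
 * but the store did not happen), polling should_stop() between tries.
 * Returns 0 on success, 1 on a real mismatch, -1 if asked to stop.
 */
static int
cas_retry(_Atomic uint32_t *p, uint32_t oldval, uint32_t newval,
    bool (*should_stop)(void))
{
	uint32_t seen;

	for (;;) {
		if (cas_once(p, oldval, &seen, newval) == 0)
			return (0);
		if (seen != oldval)
			return (1);
		if (should_stop())
			return (-1);
	}
}

static bool
never_stop(void)
{
	return (false);
}
```

Since the loop lives in the caller, every iteration is a point where the
caller can bail out, so another thread rewriting the cache line cannot keep
it spinning forever.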
Summary:
Running a 32-bit process on a 64-bit POWER CPU may still use all 64-bits
in calculations, while ignoring the upper 32 bits for addressing
storage. It so happens that some processes end up with r1 (SP) having
bit 31 set in some cases (33-bit address). Writing out to this 33-bit
address obviously fails. Since the CPU ignores the upper bits, we should
as well.
sendsig() and cpu_fetch_syscall_args() appear to be the only functions
that actually rely on userspace register values for copy in/out, and
cpu_fetch_syscall_args() doesn't seem to be bitten in practice yet.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D20896
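In code terms, the fix amounts to truncating the register value before
using it as a user address. The helper below is a hypothetical
illustration, not the code added to sendsig().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Effective user address for a register value: a 32-bit process only
 * addresses storage with the low 32 bits, so drop everything above them.
 */
static uint64_t
user_addr(uint64_t reg, bool is_32bit_proc)
{
	return (is_32bit_proc ? (reg & 0xffffffffULL) : reg);
}
```

For example, a stack pointer of 0x1ffffd000 in a 32-bit process is treated
as 0xffffd000, matching what the CPU does when it accesses storage.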
On POWER9/pseries, QEMU passes several regions of memory,
instead of a single region containing all memory, as the
code was expecting.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20857
sv_maxuser specifies the maximum addressable space for user space. Presently
this is all 64-bits worth, which is impossible for a 32-bit process.
This bug has existed since the initial import of powerpc64 in 2010.
MFC after: 2 weeks
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
Although PPC SLB code doesn't handle allocation failures,
which are rare, in most places it asserts that the pointer
returned by uma_zalloc() is not NULL, making it easier to
identify the failure and avoiding an invalid pointer dereference.
This change simply adds a missing KASSERT in SLB code.
There was an issue in pseries llan driver, that resulted in the first 2 bytes
of the MAC address getting stripped, and the last 2 being always 0.
In most cases the network interface still worked, despite the MAC being
different from what was specified to QEMU, but when some other host or DHCP
server expected a specific MAC, this would fail.
This change fixes this by shifting the local-mac-address read from the
device tree right by 2 bytes if its length is 6 instead of 8, as observed in
the QEMU DT, which always presents a 6-byte value for this property.
PR: 237471
Reported by: Alfredo Dal'Ava Junior
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20843
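A sketch of the length handling follows. It assumes, based on the symptoms
described above, that the 8-byte form of the property carries two bytes of
padding in front of the address and that the driver takes the MAC from
bytes 2..7 of its buffer; llan_extract_mac() is a hypothetical helper, not
the function in the llan driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * 'prop' is the raw property as read from the device tree and 'len' its
 * length.  Returns 0 on success, -1 on an unexpected length.
 */
static int
llan_extract_mac(const uint8_t *prop, size_t len, uint8_t mac[6])
{
	uint8_t buf[8] = { 0 };

	if (len == 8)
		memcpy(buf, prop, 8);
	else if (len == 6)
		memcpy(buf + 2, prop, 6);	/* shift right by two bytes */
	else
		return (-1);
	memcpy(mac, buf + 2, 6);
	return (0);
}
```

Without the shift, a 6-byte value copied to the start of the buffer loses
its first two bytes and gains two trailing zeros, which matches the
reported symptom.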
Misaligned floating point loads and stores are already handled for AIM, but
use the DSISR to obtain the necessary data. Book-E does not have the DSISR,
so these fixups are not performed, leading to a SIGBUS on misaligned FP
loads or stores. Obtain the necessary data on the Book-E side, similar to
how it is done for SPE.
MFC after: 1 week
Summary:
PowerPC has two PLT models: BSS-PLT and Secure-PLT. BSS-PLT uses runtime
code generation to generate the PLT stubs. Secure-PLT was introduced with
GCC 4.1 and Binutils 2.17 (base has GCC 4.2.1 and Binutils 2.17), and is a
more secure PLT format, using a read-only linkage table, with the dynamic
linker populating a non-executable index table.
This is the libc, rtld, and kernel support only. The toolchain and build
parts will be updated separately.
Reviewed By: nwhitehorn, bdragon, pfg
Differential Revision: https://reviews.freebsd.org/D20598
MFC after: 1 month
r348783 changed the behavior of the kernel mappings and broke booting on G5.
- Split the kernel mapping logic out so that the case where we are
running from the wrong memory space is handled using identity
mappings, and the case where we are not using a DMAP is handled by
forcibly mapping the kernel into the dmap range as intended by
r348783.
Reported by: Mikael Urankar
Reviewed by: luporl
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D20608
When building a kernel supporting PSERIES but not POWERNV,
the compiler would complain about an error variable being
possibly used before being initialized.
In practice, however, this should never happen. In any case, it
is now initialized to an error value.
Before this change, OFW initrd (as md) handling code was simulating an ofwbus
device. But as there isn't really a Device Tree (DT) node representing OFW
initrd (it is specified in 2 properties under /chosen), its driver was in fact
stealing other driver's DT node. This was noticed after MD_ROOT_MEM became
default and QEMU's USB keyboard stopped working under VNC.
This change consists in simplifying the process of detection and mapping of
initrd memory, turning it into a simple startup step, instead of trying to
simulate a device.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20553
When an HMI occurs a message event also gets created with the details of the
exception. Hook into the messaging framework to retrieve the HMI message.
Nothing is done with it yet, except to panic on unhandled exception.
vmem_xalloc() cannot be called while holding a nonblocking mutex, warned
by WITNESS. The lock may not be necessary in general, but it avoids
superfluous concurrent OPAL calls for the same sensor.
Reported by: pkubaj
This set of changes makes it possible to run FreeBSD for PowerPC64/pseries,
under QEMU/KVM, without requiring the host to make hugepages available to the
guest.
While there was already this possibility, by means of setting hw_direct_map to
0, on PowerPC64 there were a couple of issues/wrong assumptions that prevented
this from working, before this changelist.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20522
Summary:
moea64_insert_pteg_native()'s invalidation only works by happenstance.
The purpose of the shifts and XORs is to extract the VSID in order to
reverse-engineer the lower bits of the VPN. Currently a segment size is 256MB
(2**28), and ADDR_API_SHFT64 is 16, so ADDR_PIDX_SHIFT is equivalent. However,
it's semantically incorrect, in that we don't want to shift by the page shift
size, we want to shift to get to the VSID.
Tested by: bdragon
Differential Revision: https://reviews.freebsd.org/D20467
Having this option enabled by default on PowerPC64 kernels makes
booting ISO images much easier when on PowerNV.
With it, the ISO may simply be given to the -i flag of kexec.
Better yet, the ISO may be loop mounted on PetitBoot and its
kernel may be used to load itself.
Without this option, booting ISOs on remote PPC64 machines usually
involves preparing a separate kernel, with this option enabled.
DBCR0, according to the Freescale EREF, is guaranteed to be updated, and
changes take effect, after an isync plus change of MSR[DE] from 0 to 1.
Otherwise it's guaranteed to be updated "eventually". Use the expected
synchronization sequence to write it for resetting.
This prevents "Reset failed" from being printed immediately before the CPU
resets.
MFC after: 2 weeks
Actually set the source and destination VA's before using them. Fixes a
bizarre panic on 32-bit Book-E. Not sure why this wasn't caught by the
compiler.
The MSR[EE] bit does not require synchronization when changing. This is a
trivial micro-optimization, removing the trailing isync from mtmsr().
MFC after: 1 week
This allows replacing "sys/eventhandler.h" includes with "sys/_eventhandler.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
It was found during building llvm that the page pv lock pool was seeing very
high contention. Since the pmap is already NUMA aware, it was surmised that
the domains were referencing similar pages in the different domains. This
reduces contention to the point of noise in a lockstat(8) run (~51% down to
under 5%), reducing build times by up to 20%.
This doesn't do a perfect domain alignment, just a best-guess based on
hardware available, that the domain is roughly specified in the upper bits
of the PA. Trying to be more clever would more than likely result in
reduced performance just on the work needed.
MFC after: 2 weeks
Since we now have a much larger KVA on powerpc64, it's possible to get SLB
traps earlier in boot, possibly even before the HIOR is properly configured
for us. Move the HIOR setup to immediately after reset, so that we use our
exception handlers instead of Open Firmware's.
PR: 233863
Submitted by: Mark Millard (partial)
Reported by: Mark Millard
MFC after: 2 weeks
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This causes all CPUs to stall on the same cacheline, as it
bounces between different CPUs.
Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. It's my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.
Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure the racoon ports will load the ipsec.ko module.
Reviewed by: cem, cy, delphij, gnn, jhb, jpaetzel
Differential Revision: https://reviews.freebsd.org/D20163
* Make mmu_booke_sync_icache() use the DMAP on 64-bit processes, no need to
map the page into the user's address space. This removes the
pvh_global_lock from the equation on 64-bit.
* Don't map the page with user-readability on 32-bit. I don't know what the
chance of a given user process being able to access the NULL page when
another process's page is added there, but it doesn't seem like a good
idea to map it to NULL with user read permissions.
* Only sync as much as we need to. There are only two significant places
where pmap_sync_icache is used: proc_rwmem(), and the SIGILL second-chance
for powerpc. The SIGILL second chance is likely the most common, and only
syncs 4 bytes, so avoid the other 127 loop iterations (4096 / 32 byte
cacheline) in __syncicache().
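The savings from the last item can be sketched as below. This is only an illustration of the arithmetic, assuming the 32-byte cache line from the message; sketch_lines_to_sync() is a hypothetical helper and not the kernel's __syncicache().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 32-byte cache lines, as in the message above. */
#define SKETCH_CACHELINE	32u

/*
 * Count the cache lines a sync loop has to visit to cover
 * [addr, addr + len).  Hypothetical helper, not kernel code.
 */
static size_t
sketch_lines_to_sync(uintptr_t addr, size_t len)
{
	uintptr_t mask = ~(uintptr_t)(SKETCH_CACHELINE - 1);
	uintptr_t first, last;

	assert(len > 0);
	first = addr & mask;			/* line holding the first byte */
	last = (addr + len - 1) & mask;		/* line holding the last byte */
	return ((size_t)((last - first) / SKETCH_CACHELINE) + 1);
}
```

A 4-byte sync inside one line visits a single line, while syncing a whole 4096-byte page visits 128, which is where the "other 127 loop iterations" come from.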
Reduce the surface area of the TLB locks. Unfortunately the same trick for
serializing the tlbie instruction on OEA64 cannot be used here to reduce the
scope of the tlbivax mutex to the tlbsync only, as the mutex also serializes
the TLB miss lock as a side effect, so contention on this lock may not be
reducible any further.
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
(MFC commentary)
Whether this gets MFC'd depends on how many bugs in tun(4)/tap(4) are
discovered after this, and how critical they are. Changes after this are
likely easily MFC'd without taking this merge, but taking the merge will
make them easier.
I have no plans to do this MFC as of now.
Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from: melifaro
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20044
Since the DMAP is only available on powerpc64, and is *always* available on
Book-E powerpc64, don't penalize either side (32-bit or 64-bit) by always
checking hw_direct_map to perform operations. This saves 5-10% time on
various ports builds, and on buildworld+buildkernel on Book-E hardware.
MFC after: 3 weeks
Use the nitems() macro instead of the expansion, a'la r298352. Also, fix the
location of this check to after initializing availmem_regions_sz, so that the
check isn't always against 0, thus always failing (nitems(phys_avail) is always
more than 0).
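A minimal sketch of why the ordering matters, assuming a hypothetical 32-entry array and a hypothetical helper name; only the nitems() expansion itself matches what sys/param.h provides.

```c
#include <assert.h>
#include <stddef.h>

/* Same expansion as the nitems() macro in sys/param.h. */
#ifndef nitems
#define nitems(x)	(sizeof((x)) / sizeof((x)[0]))
#endif

/* Hypothetical stand-in for phys_avail[]; the size is an assumption. */
#define SKETCH_PHYS_AVAIL_ENTRIES	32
unsigned long sketch_phys_avail[SKETCH_PHYS_AVAIL_ENTRIES];

/* Returns 1 when the array cannot hold 'region_count' entries. */
static int
sketch_phys_avail_too_small(size_t region_count)
{
	return (nitems(sketch_phys_avail) < region_count);
}

/* Stand-in for the bootstrap-time check (the kernel would panic instead). */
static void
sketch_bootstrap_check(size_t region_count)
{
	assert(!sketch_phys_avail_too_small(region_count));
}
```

If the check runs while the count is still 0, sketch_phys_avail_too_small(0) can never report a problem, which is the situation the message describes.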
Summary:
A few ports fail to build due to missing pmap-related definitions, which are
specific per-pmap type. This tries to appease those ports, by merging all
pmaps together.
A future change will move the inline page directory out of the Book-E pmap,
to eliminate the last #ifdefs in pmap.h and complete the merge.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D20119
Use it wherever COMPAT_FREEBSD11 is currently specified, like r309749.
Reviewed by: imp, jhb, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20120
It's possible for a Hypervisor Maintenance Interrupt (HMI) to occur while in
the pmap code, holding locks. This can cause WITNESS to panic due to lock
errors in calling pmap_kextract(). Since we don't yet handle the flags
returned by OPAL_HANDLE_HMI2, just stop using it, so that we don't call into
pmap_kextract().
Reported by: pkubaj
Unconditional writing to MAS7, which doesn't exist on the e500v1 core, in a
TLB miss handler has been in the code for several years now. Since this has
gone unnoticed for so long, it's easily concluded that e500v1 is not in use
with FreeBSD. Simplify the code path a bit, by unconditionally zeroing MAS7
instead of calling a subroutine to do it.
r18 is used to hold the old PCB flags, but cpu_throw doesn't populate r18
with PCB flags, since the old thread is gone. This can lead to panics on
cores that don't have the registers guarded by these flags.
This change makes it easier to enable/disable the inclusion of
OPAL flash in the kernel.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20098
This way its children can attach earlier if needed, and some subsystems are
attached earlier, like the asynchronous token management.
MFC after: 2 weeks
Add support to enable, save, and restore the following facilities:
* Target Address Register (bctar) -- seemingly just another register to
branch to.
* Event-based branching -- an interrupt-like userspace event handler
subsystem.
* Load-monitored facility -- A facility that allows monitoring a range of
physical memory, and triggering an event on access. Targeted to garbage
collection software features.
The Data Stream Control Register (DSCR) is privileged on POWER7, but
unprivileged (different register) on POWER8 and later. However, it's now
guarded by a new register, the Facility Status and Control Register, instead of
the MSR like other pre-existing facilities (FPU, Altivec). The FSCR must be
managed explicitly, since it's effectively an extension of the MSR.
Tested by: Brandon Bergren
The POWER8NVL (POWER8 NVLink) architecturally behaves identically to the
POWER8, with a different PVR identifier. Mark it as such, so it shows up
appropriately to the user.
Reported by: Alexey Kardashevskiy
MFC after: 2 weeks
Since the non-volatile registers are restored at the end of cpu_switchin (of
the new thread) they're free for us to use for our own purposes. Load the
PCB_FLAGS into a non-volatile register so it's preserved across the C
function calls that manage FPU and altivec state. This removes 4 loads from
each file. Might be a trivial performance improvement (~12 clock cycles per
context switch).
MFC after: 3 weeks
mtmsr and mtsr require context synchronizing instructions to follow. Without
a CSI, there's a chance for a machine check exception. This reportedly does
occur on a MPC750 (PowerMac G3).
Reported by: Mark Millard
As mphyp_pte_unset() can also remove PTE entries, and as this can
happen in parallel with PTEs evicted by mphyp_pte_insert(), there
is a (rare) chance the PTE being evicted gets removed before
mphyp_pte_insert() is able to do so. Thus, the KASSERT should
check whether the result is H_SUCCESS or H_NOT_FOUND, to avoid
panics if the situation described above occurs.
More details about this issue can be found in PR 237470.
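The relaxed check can be sketched as follows. The numeric return codes here are illustrative assumptions (the real values come from the hypervisor call definitions), and the helper names are hypothetical; assert() stands in for KASSERT.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative return codes only; treat the values as assumptions. */
#define SKETCH_H_SUCCESS	0
#define SKETCH_H_PARAMETER	(-4)
#define SKETCH_H_NOT_FOUND	(-7)

/* An eviction is fine if it succeeded or if someone else removed the PTE. */
static int
sketch_evict_result_ok(int64_t result)
{
	return (result == SKETCH_H_SUCCESS || result == SKETCH_H_NOT_FOUND);
}

/* Stand-in for the KASSERT after evicting a PTE. */
static void
sketch_check_evict(int64_t result)
{
	assert(sketch_evict_result_ok(result));
}
```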
PR: 237470
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20012
Summary: when using the pseries-llan driver, the Opkts and Oerrs counters
(netstat -i) are always zero. This patch adds a small amount of error
handling to increment these counters.
Submitted by: alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20009
Some hypervisor calls, such as H_SEND_LOGICAL_LAN, take more arguments than
are traditionally passed in registers. The HCALL ABI will accept these
arguments in r11 and r12. With ELFv2 ABI, these arguments are 2
double-words lower than ELFv1 ABI, as two double-words in the stack frame
are no longer used, and therefore removed from the frame. Fix the offsets
for loading the registers for the HCALL. This fixes the phyp_llan driver
with ELFv2 kernel.
Submitted by: alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20008
Since writes don't necessarily need to be on erase-block boundaries, we can
relax the block size and alignments down to sector size. If it needs to be
erased, opalflash_erase() will check proper alignment and size.
If the OPAL flash driver supports writing without erase, it adds a
'no-erase' property to the flash device node. Honor that property and don't
bother erasing if it exists.
Summary:
Initial NUMA support:
- associate CPU with domain
- associate memory ranges with domain
- identify domain for devices
- limit device interrupt binding to appropriate domain
- Additionally fixes a bug in the setting of Maxmem which led to
only memory attached to the first socket being enabled for DMA
A pmap variant can opt in to NUMA support by calling `numa_mem_regions`
at the end of pmap_bootstrap - registering the corresponding ranges with the
VM.
This yields a ~20% improvement in build times of llvm on dual socket POWER9
over non-NUMA.
Original patch by mmacy.
Differential Revision: https://reviews.freebsd.org/D17933
Forgot to add the changes for DELAY(), which lowers priority during the
delay period. Also, mark the timebase read as volatile so newer GCC does
not optimize it away, as it reportedly does currently.
MFC after: 2 weeks
MFC with: r346144
PowerISA 2.07 and PowerISA 3.0 both specify special NOPs for priority
adjustments, with "medium" priority being normal. We had been setting
medium-low as our normal priority. Rather than guess each time as to what
we want and the right NOP, wrap them in inline functions, and replace the
occurrences of the NOPs with the functions. Also, make DELAY() drop to very
low priority while waiting, so we don't burn CPU.
Coupled with r346143, this shaves off a modest 5-8% on buildworld times with
-j72. There may be more room for improvement with judicious use of these
NOPs.
MFC after: 2 weeks
The POWER9 documentation specifies that levels 0-3 are the 'lightest' sleep
levels, meaning lowest latency and no state loss. However, state 3 is
not implemented, and is instead reserved for future chips. This now
properly configures the PSSCR, specifying state 2 as the lowest level to
enter, but requesting level 0 for the quickest sleep level. If the OCC
determines that the CPU can enter states 1 or 2 it will trigger the
transition to those states on demand.
MFC after: 1 week
* The BIO bio_data may not be page aligned. Only the base address of each
page worth of data is extracted to pass to OPAL. Without page alignment
it can scribble over random memory when finishing the page read. Fix this
by short-reading the first page to properly align for full page reads.
* Fix the definition of OPAL_FLASH_ERASE.
* Properly handle the async message result, as now returned from r345974.
* Properly return the full opal_msg from an async completion.
* Don't keep bugging OPAL, wait 100us or so. With some minor changes to
DELAY() to drop to very low priority, the thread won't hog the CPU while
polling for the async completion.
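The short first read from the first item can be sketched like this, assuming 4 KB pages; sketch_first_chunk() is a hypothetical helper, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Size of the first transfer for a buffer at 'addr' with 'resid' bytes
 * left: stop at the next page boundary so that every later transfer
 * starts page aligned.
 */
static size_t
sketch_first_chunk(uintptr_t addr, size_t resid)
{
	size_t to_boundary;

	assert(resid > 0);
	to_boundary = SKETCH_PAGE_SIZE - (size_t)(addr & SKETCH_PAGE_MASK);
	return (resid < to_boundary ? resid : to_boundary);
}
```

An already aligned buffer yields a full page, so the aligned case behaves as before.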
The e5500 has an FPU, but lacks the optional fsqrt instruction. This
instruction gets emulated in the kernel, but the emulation uses stale data,
from the last switch out, and does not return the result of the operation
immediately. Fix both of these conditions by saving and restoring the FPRs
around the emulation point.
MFC after: 1 week
MFC with: r345829
This fix was committed less than 2 months after the code was forked into the
powerpc kernel. Though powerpc doesn't use quad-precision floating point,
or need it for emulation, the changes do look like correctness fixes
overall.
This was found while trying to get fsqrt emulation working on e5500, which
does have a real FPU, but lacks the fsqrt instruction. This is not the
complete fix, the rest is to be committed separately.
MFC after: 1 week
Since OPAL_GET_MSG does not discriminate between message types, asynchronous
completion events may be received in the OPAL_GET_MSG call, which dequeues
them from the list, thus preventing OPAL_CHECK_ASYNC_COMPLETION from
succeeding. Handle this case by integrating with the messaging framework.
Summary:
OPAL needs to be kicked periodically in order for the firmware to make
progress on its tasks. To do so, create a heartbeat thread to perform this task
every N milliseconds, defined by the device tree. This task is also a central
location to handle all messages received from OPAL.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D19743
Summary:
With a sufficiently large TOC, it's possible to index out of range, as
the immediate load instructions only permit 16-bit indices, allowing up
to 64kB range (signed) from the base pointer. Allow +/- 2GB range, with
the medium code model TOC accesses in asm.
Patch originally by Brandon Bergren. The issue appears to impact ELFv2
more than ELFv1.
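The range limit can be illustrated with a little arithmetic. The sketch below assumes the usual split of a 32-bit offset into a sign-adjusted high half and a signed low half, as used by two-instruction medium-model accesses; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_toc_split {
	int32_t ha;	/* high-adjusted half, applied shifted left by 16 */
	int32_t lo;	/* signed low 16 bits */
};

/* A single 16-bit signed displacement reaches only -32768..32767. */
static int
sketch_fits_simm16(int64_t off)
{
	return (off >= -32768 && off <= 32767);
}

/* Split 'off' so that ha * 65536 + lo == off, with lo in -32768..32767. */
static struct sketch_toc_split
sketch_split_offset(int64_t off)
{
	struct sketch_toc_split s;
	int64_t lo;

	lo = ((off & 0xffff) ^ 0x8000) - 0x8000;	/* sign-extend low half */
	assert((off - lo) % 65536 == 0);
	s.lo = (int32_t)lo;
	s.ha = (int32_t)((off - lo) / 65536);
	return (s);
}
```

As long as the high half also fits in 16 signed bits, the pair covers roughly +/- 2GB around the base pointer.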
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D19708
* Cache moea64_need_lock in a local variable; gcc generates slightly better
code this way, it doesn't need to reload the value from memory each read.
* VPN cropping is only needed on PowerPC ISA 2.02 and older cores, a subset
of those that need serialization, so move this under the need_lock check,
so those that don't need the lock don't even need to check this.
Attempting to build www/firefox on POWER9 resulted in a HMI exception being
thrown, a fatal trap currently. This is typically caused by timer facility
errors, but examination of the Hypervisor Maintenance Exception Register
(HMER) yielded only that an exception had recovered, with no information of
the actual exception cause.
When an HMI occurs, OPAL_HANDLE_HMI or OPAL_HANDLE_HMI2 must be called to
handle the exception at the firmware level. If the exception is handled, we
can continue.
This adds only the preliminary handler, enough to prevent package building
from panicking. An enhancement in the future is to use the flags returned
by OPAL_HANDLE_HMI2 to print more useful error messages, and log maintenance
events.
Reviewed by: luporl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19634
r345402 fixed the bug that led to the split of the ISA 3.0 HPT handling from
the existing manager. The cause of the bug was gcc moving the register
holding VPN to a different register (not r0), which triggered bizarre
behaviors. With the fix, things work, so they can be re-merged. No
performance lost with the merge.
By happenstance gcc4 puts 'vpn' into r0 in all uses of TLBIE(), but modern
gcc does not. Also, the single-argument form of tlbie zeros all unused
arguments, making the modern tlbie instruction use r0 as the RS field
(LPID).
The vpn argument has the bottom 12 bits cleared (the input having been
left-shifted by 12 bits), which just so happens, on the POWER9 and previous
incarnations, to be the number of LPID bits supported. With those bits
being zero, the instruction:
tlbie r0, r0
will invalidate the VPN in r0, in LPAR 0 (ignoring the upper bits of r0 for
the RS field). One build with gcc8 yields:
tlbie r9, r0
with r0 having arbitrary contents, not equal to r9. This leads to strange
crashes, behaviors, and panics, due to the requested TLB entry not actually
being invalidated.
As the moea64_native must work on both old and new, we explicitly zero out
r0 so that it can work with only the single argument, built with base gcc
and modern gcc. isa3_hashtb takes a different approach, encoding the
two-argument form, so as not to explicitly clobber r0, and instead let the
compiler decide.
Reported by: Brandon Bergren
Tested by: Brandon Bergren
MFC after: 1 week
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add an MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
veto reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
Registers visible from 'show reg' don't always match the registers from the
offending trap frame. Knowing the frame address lets one examine the
registers manually.
MFC after: 1 week
The check for early exit should be checking the SLB entry itself. As
currently written it was checking the address of the SLB, which is always
non-zero, so would go through the kernel SR restore loop regardless.
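The difference between the two checks, as a C sketch. The structure layout and function names are hypothetical and only illustrate the shape of the bug, not the actual code that was changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SLB cache entry for illustration. */
struct sketch_slb {
	uint64_t slbv;
	uint64_t slbe;
};

/* Buggy shape: tests the address of the table, which is never zero. */
static int
sketch_check_address(const struct sketch_slb *cache)
{
	return (cache != NULL);
}

/* Fixed shape: tests the contents of the first entry. */
static int
sketch_check_entry(const struct sketch_slb *cache)
{
	assert(cache != NULL);
	return (cache[0].slbe != 0);
}
```

With an all-zero first entry the address test still reports work to do, so the early exit is never taken.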
Submitted by: mmacy
MFC after: 2 weeks
The second statements on the lines are not guarded by the `if' condition.
This triggers a warning with newer gcc. It's relatively harmless given the
usage, but incorrect. Instead, wrap the statements so they're properly
guarded.
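A sketch of the pattern, with hypothetical names. The unguarded form appears only in a comment; the macro shows one common way to wrap several statements so that all of them sit under the condition.

```c
#include <assert.h>

/*
 * Buggy shape (shown only as a comment):
 *	if (cond) a = 1; b = 1;
 * Only "a = 1" is controlled by the if; "b = 1" always runs.
 */
#define SKETCH_SET_BOTH(cond, a, b) do {	\
	if (cond) {				\
		(a) = 1;			\
		(b) = 1;			\
	}					\
} while (0)

/* Returns a + 2 * b after applying the guarded update. */
static int
sketch_apply(int cond)
{
	int a = 0, b = 0;

	SKETCH_SET_BOTH(cond, a, b);
	assert(a == b);		/* both updated, or neither */
	return (a + 2 * b);
}
```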
Reported by: powerpc64-gcc xtoolchain
MFC after: 1 week
On very large powerpc64 systems (2x22x4 power9) it's very easy to run out of
available IRQs and crash the system at boot. Scale the count by mp_ncpus,
similar to x86, so this doesn't happen. Scaling the I/O IRQs as well is left
for future work.
Submitted by: mmacy
MFC after: 3 weeks
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
Firmware needed by petitboot, for example, GPU firmware, can be installed to
a partition in the flash filesystem. This driver exposes the full flash
given by the device tree, letting the user manage firmware, etc, from
FreeBSD.
To use the partitions provided by the flash module, the fdt_slicer module is
needed, but the module isn't needed for raw access, so there's no direct
dependency link in here.
MFC after: 2 weeks
The OPAL firmware only supports a finite number of in-flight asynchronous
operations. Rather than have each subsystem try to manage its own, use a
central management service to hand out tokens.
More work can be done to improve asynchronous behavior, such as funneling
things through a future OPAL heartbeat handler, but capabilities will be
added as needed.
Augment the existing consumers (i2c and sensors) to use this new API.
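The idea of a central token pool can be sketched with a small bitmap allocator. The pool size and all names below are assumptions made for illustration; the real limit is whatever the firmware supports, and the real service also needs locking, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pool size; the real limit comes from the firmware. */
#define SKETCH_MAX_TOKENS	8u

static uint32_t sketch_used;	/* bit i set means token i is in flight */

/* Hand out the lowest free token, or -1 if all are in flight. */
static int
sketch_token_alloc(void)
{
	unsigned int i;

	for (i = 0; i < SKETCH_MAX_TOKENS; i++) {
		if ((sketch_used & (1u << i)) == 0) {
			sketch_used |= 1u << i;
			return ((int)i);
		}
	}
	return (-1);
}

/* Return a token to the pool once its operation has completed. */
static void
sketch_token_free(int token)
{
	assert(token >= 0 && (unsigned int)token < SKETCH_MAX_TOKENS);
	assert((sketch_used & (1u << token)) != 0);
	sketch_used &= ~(1u << token);
}
```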
MFC after: 4 weeks
Summary:
To safely synchronize timebase we need to disable the timebase on all
cores, set timebase, and resynchronize. This adds two new devices, mutually
exclusive, which attach on the SoC simplebus, to freeze and unfreeze the
timebase. The devices are singletons, and platform-specific, so no reason
to make them optional and in separate files.
This was found to be necessary for top(1) to work correctly on an AmigaOne
X5000 (P5020 SoC). It also fixes bufdaemon and bufspacedaemon hangs at
shutdown.
Test Plan: Regression test on various Book-E hardware.
Reviewed by: nwhitehorn
Tested by: Brandon Bergren (git_bdragon.rtk0.net)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19208
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
When sigreturn() restored a thread's context, SRR1 was being restored
to its previous value, but pcb_flags was not being touched.
This could cause a mismatch between the thread's MSR and its pcb_flags.
For instance, when the thread used the FPU for the first time inside
the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
Then, the thread would return with the FPU bit cleared in MSR and,
the next time it tried to use the FPU, it would fail on a KASSERT
that checked if the FPU was disabled.
This change clears the FPU bit in both pcb_flags and frame->srr1,
as the code that restores the context expects to use the FPU trap
to re-enable it.
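A sketch of keeping the two pieces of state consistent. The structure, names, and bit values are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions. */
#define SKETCH_PSL_FP	0x2000u		/* FPU-enable bit in the saved MSR */
#define SKETCH_PCB_FPU	0x1u		/* "FPU in use" flag in the PCB */

struct sketch_thread_state {
	uint64_t srr1;		/* MSR to be restored on return */
	uint32_t pcb_flags;
};

/* Clear the FPU state in both places so the next FPU use traps again. */
static void
sketch_sigreturn_fixup(struct sketch_thread_state *st)
{
	assert(st != 0);
	st->srr1 &= ~(uint64_t)SKETCH_PSL_FP;
	st->pcb_flags &= ~SKETCH_PCB_FPU;
}

/* The state is consistent when both bits agree. */
static int
sketch_fpu_state_consistent(const struct sketch_thread_state *st)
{
	return (((st->srr1 & SKETCH_PSL_FP) != 0) ==
	    ((st->pcb_flags & SKETCH_PCB_FPU) != 0));
}
```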
PR: 234539
Reported by: sbruno
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D19166
Newer cores have the 'tlbilx' instruction, which doesn't broadcast over
CoreNet. This is significantly faster than walking the TLB to invalidate
the PID mappings. tlbilx with the arguments given takes 131 clock cycles to
complete, as opposed to 512 iterations through the loop plus tlbre/tlbwe at
each iteration.
MFC after: 3 weeks
At moea64_sync_icache(), when the 'va' argument has page size
alignment, round_page() will return the same value as 'va'.
This would cause 'len' to be 0 and thus an infinite loop.
With this change, 'lim' will always point to the next page boundary.
This issue occurred especially during debugging sessions, when a breakpoint
was placed on an exact page-aligned offset, for instance.
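The failure mode and one possible way to compute the limit can be sketched as below, assuming 4 KB pages. The helper names are hypothetical and the exact expression used by the fix may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_PAGE_MASK	((uintptr_t)SKETCH_PAGE_SIZE - 1)

/* Rounds up, but returns 'va' unchanged when it is already aligned. */
static uintptr_t
sketch_round_page(uintptr_t va)
{
	return ((va + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK);
}

/* Always strictly greater than 'va': the start of the next page. */
static uintptr_t
sketch_next_page(uintptr_t va)
{
	return ((va & ~SKETCH_PAGE_MASK) + SKETCH_PAGE_SIZE);
}

/* Walk [va, va + sz) in chunks that never cross a page boundary. */
static size_t
sketch_sync_chunks(uintptr_t va, size_t sz)
{
	size_t len, n = 0;

	while (sz > 0) {
		len = (size_t)(sketch_next_page(va) - va);
		assert(len > 0);	/* guarantees forward progress */
		if (len > sz)
			len = sz;
		va += len;
		sz -= len;
		n++;
	}
	return (n);
}
```

With sketch_round_page() as the limit, an aligned 'va' gives a zero-length chunk and the loop never advances.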
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19149
SoCs with e500v2 chips only have at most 2 cores, and there are no plans to
release any more e500v2-based SoCs. Clamping MAXCPU down to 2 saves 5MB of
data, and 1.5MB bss.
For direct mapped kernel addresses, ppc64 function was not
performing the dmap to physical conversion, before jumping
to the code that fetched the value from physical memory.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19086
The QorIQ SoCs don't actually support multicast interrupts, and the
references state explicitly that multicast is undefined behavior. Avoid the
undefined behavior by binding to only a single CPU, using a quirk to
determine if this is necessary.
MFC after: 3 weeks
When running several builders in parallel, on QEMU, with 8GB of
memory, a fatal kernel trap (0x300 (data storage interrupt))
caused by llan driver is sometimes observed, when the system
starts to run out of swap space.
This happens because, at llan_intr(), a phyp call to add a
logical LAN buffer is always made when llan_add_rxbuf() fails,
even if it fails to allocate a new buffer.
PR: 235489
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19084
It appears idling via 'wait' on e5500 causes strange behaviors, such as
top(1) simply hanging sporadically, until input. Until this can possibly be
sorted out (interrupt issue?), just don't idle on this hardware. The SoCs
are low power already, and the wait state doesn't save much anyway.
Currently, the trap code switches to the temporary stack in the dbtrap
section. This works in most cases, but early in execution, starting in the
powerpc_init() code, the temp stack is already in use.
In that scenario the stack gets overwritten, which causes the return from
breakpoint() to take an abnormal execution path.
This patchset creates a small stack for use by the dbtrap: code path,
avoiding the corruption of the temporary stack.
PR: 224872
Submitted by: breno.leitao_gmail.com
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D14484
The XIVE (External Interrupt Virtualization Engine) is a new interrupt
controller present in IBM's POWER9 processor. It's a very powerful,
very complex device using queues and shared memory to improve interrupt
dispatch performance in a virtualized environment.
This yields a ~10% performance improvement over the XICS emulation mode,
measured in both buildworld, and 'dd' from nvme to /dev/null.
Currently, this only supports native access.
MFC after: 1 month
iflib is already a module, but it is unconditionally compiled into the
kernel. There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.
Reviewed by: marius
Discussed with: erj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19041
The powerpc_intr structure is not zero-initialized, so an INVARIANTS
build would panic in the xics driver with an invalid pointer. Also fix the
xics driver to share the private data setup code between xics_enable() and
xics_bind().
Reported by: Leonardo Bianconi
If fsqrts is emulated with +INF as its argument, the 0 return value causes a
NULL pointer dereference, panicking the system. Follow the PowerISA and
return +INF with no FP exception.
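The special case can be sketched on raw single-precision bit patterns. This is an illustration of the IEEE 754 behavior for sqrt(+INF) and sqrt(+/-0) only; the function name is hypothetical and it is not the emulator's code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_F32_POS_INF	0x7f800000u	/* +INF in single precision */

/*
 * Handle operands whose square root needs no computation.
 * Returns 1 and stores the result bits when 'bits' is such a case,
 * otherwise returns 0 and leaves '*out' untouched.
 */
static int
sketch_fsqrts_special(uint32_t bits, uint32_t *out)
{
	assert(out != 0);
	if (bits == SKETCH_F32_POS_INF) {
		*out = bits;		/* sqrt(+INF) = +INF, no exception */
		return (1);
	}
	if ((bits & 0x7fffffffu) == 0) {
		*out = bits;		/* sqrt(+0) = +0, sqrt(-0) = -0 */
		return (1);
	}
	return (0);
}
```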
MFC after: 1 week
Don't clobber the low part of the register when restoring its high component.
This could lead to very bad behavior if it's an ABI-affected register.
While here, also mark the asm volatile in the SPE high save case, to match
the load case.
Reported by: Brandon Bergren (git_bdragon.rtk0.net)
MFC after: 1 week
Summary:
I was working on implementing ifuncs on powerpc64 elfv2 today, and I suddenly
realized that the reason I was having so much trouble with AT_HWCAP and
AT_HWCAP2 is they are missing from the sysentvec.
After adding them, the auxv is being filled like it should.
Submitted by: Brandon Bergren (git_bdragon.rtk0.net)
Differential Revision: https://reviews.freebsd.org/D18575
The IPI vector is static, and happens to be the most common interrupt by far
on some systems. Rather than searching for the interrupt every time, cache
the index.
This appears to yield a small performance boost, of about 8% reduction in
buildworld times, on my POWER9 system, when paired with r342975.
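Caching the index can be sketched as follows. The table, its size, and all names are hypothetical; the search counter exists only to make the effect visible.

```c
#include <assert.h>

#define SKETCH_NIRQS	8	/* assumed table size */

struct sketch_intr {
	int irq;
	int vector;
};

static struct sketch_intr sketch_table[SKETCH_NIRQS];
static int sketch_ipi_index = -1;	/* cached slot, -1 until found */
static unsigned int sketch_searches;	/* counts full table walks */

/* Linear search for a vector; returns its index or -1. */
static int
sketch_lookup(int vector)
{
	int i;

	sketch_searches++;
	for (i = 0; i < SKETCH_NIRQS; i++) {
		if (sketch_table[i].vector == vector)
			return (i);
	}
	return (-1);
}

/* The IPI vector never moves, so search once and reuse the index. */
static int
sketch_ipi_lookup(int ipi_vector)
{
	assert(ipi_vector != 0);
	if (sketch_ipi_index < 0)
		sketch_ipi_index = sketch_lookup(ipi_vector);
	return (sketch_ipi_index);
}
```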
The XICS and XIVE need extra data beyond irq and vector. Rather than
performing a separate search, it's better for the general interrupt facility
to hold a private pointer, since the search already must be done anyway at
that level.
In r342771, I introduced a regression in Power by abusing the platform
smp_topo() method as a shortcut for providing the MI information needed for
the stated sysctls. The smp_topo() method was already called later by
sched_ule (under the name cpu_topo()), and initializes a static array of
scheduler topology information. I had skimmed the smp_topo_foo() functions
and assumed they were idempotent; empirically, they are not (or at least,
detect re-initialization and panic).
Do the cleaner thing I should have done in the first place and add a
platform method specifically for core- and thread-count probing.
Reported by: luporl via jhibbits
Reviewed by: luporl
X-MFC-With: r342771
Differential Revision: https://reviews.freebsd.org/D18777
With new sysctls (to the best of our ability to detect them). Restructured
smp.4 slightly for clarity (keep relevant stuff closer to the top) while
documenting.
Reviewed by: markj, jhibbits (ppc parts)
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18322
Previous commits have made VM_MIN_KERNEL_ADDRESS its own separate entity,
and rebased the kernel around that address instead of KERNBASE. This commit
pulls the trigger to rebase KERNBASE to a physical load address. The
eventual goal is to align the address with the AIM KERNBASE, but at this
time that's not an option.
Currently a Book-E kernel must be loaded on a 64MB boundary, due to size
issues. The common load address is at the 64MB mark (0x04000000), so simply
make that the default KERNBASE.
As of this commit, Book-E kernels can be loaded and booted with ubldr.
MFC after: 3 weeks
Optimize the exception handler to only save and load the upper word of the
GPRs used in the emulating instruction. This reduces the save/load
overhead, and as a side effect does not overwrite the upper word of any
temporary register.
With this commit I am now able to run editors/abiword and math/gnumeric on a
e500-based system.
MFC after: 1 week
MFC With: r341752,r341751
The code was a near exact copy of the code in startup, but it doesn't need
the complexity since the kernel is already relocated. With
VM_MIN_KERNEL_ADDRESS as currently set to KERNBASE, this doesn't cause a
problem, because it's a zero offset. However, when KERNBASE is changed to a
physical load address, it then has a non-zero offset, and ends up with an
invalid stack pointer, causing the AP to hang.
This change adds a hypervisor trap handler for exception 0x1500 (soft patch),
normalizing all VSX registers and returning.
This avoids a kernel panic due to unknown exception.
Change made with the collaboration of leonardo.bianconi_eldorado.org.br,
that found out that this is a hypervisor exception and not a supervisor one,
and fixed this in the code.
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D17806
mpr as a module for powerpc or mips. An upcoming commit will cause these
drivers to rely on the presence of 64bit atomic operations. Discussed
with jhibbits.
The POWER9 does not recognize 'or 27,27,27' as a thread priority NOP. On
earlier POWER architectures, this NOP would note to the processor to give up
resources if able, to improve performance of other threads.
All processors that support the thread priority NOPs recognize the
'or 31,31,31' NOP as very low priority, so use this to perform a similar
function, and not burn cycles on POWER9.
The jump slot is a function pointer, not a descriptor pointer, in ELFv2. Just
write the pointer itself over, not the contents of the pointer, which would be
the first instruction of the function.
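The difference between the two ABIs can be sketched as below. The structure layout follows the usual three-doubleword ELFv1 function descriptor; the function names are hypothetical and this is not the kernel linker's code.

```c
#include <assert.h>
#include <stdint.h>

/* ELFv1 function descriptor: entry point, TOC base, environment. */
struct sketch_fdesc {
	uintptr_t entry;
	uintptr_t toc;
	uintptr_t env;
};

/* ELFv1: the slot holds a descriptor, so copy what the target points at. */
static void
sketch_jmpslot_elfv1(struct sketch_fdesc *where,
    const struct sketch_fdesc *target)
{
	assert(where != 0 && target != 0);
	*where = *target;
}

/* ELFv2: the slot holds a plain function address, so store the address. */
static void
sketch_jmpslot_elfv2(uintptr_t *where, uintptr_t target)
{
	assert(where != 0);
	*where = target;
}
```

Dereferencing the target under ELFv2, as the ELFv1 path does, would store the function's first instruction bytes in the slot instead of its address.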
The 'interrupts' property is actually 2 words, not one, on macgpio child
nodes. Open Firmware's getprop function might be returning the value
copied, not the total size of the property, but FDT's returns the total
size. Prior to this patch, this would cause the SYS_RES_IRQ resource list
to not be populated when running with the 'usefdt' loader variable set, to
convert the OFW device tree to a FDT. Since the property is always 2 words,
read both words, and ignore the second.
Tested by: Dennis Clarke (previous attempt)
MFC after: 2 weeks
Mark some buses as BUS_PASS_BUS, and some resources as BUS_PASS_RESOURCE.
This also decouples some resource attachment orderings from races decided by
device tree ordering, instead relying on the bus pass to provide the
ordering.
This was originally intended to support multipass suspend/resume, but it's
also needed on PowerMacs when using fdt, as the device tree seems to get
created in reverse of the OFW tree.
Reviewed by: nwhitehorn (long ago)
Differential Revision: https://reviews.freebsd.org/D918
The same behavior was moved to machdep.c, paired with AIM's relocation,
making this redundant. With this, it's now possible to boot FreeBSD with
ubldr on a uboot Book-E platform, even with a
KERNBASE != VM_MIN_KERNEL_ADDRESS.