The MSR[EE] bit does not require synchronization when changing. This is a
trivial micro-optimization, removing the trailing isync from mtmsr().
MFC after: 1 week
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
It was found during building llvm that the page pv lock pool was seeing very
high contention. Since the pmap is already NUMA aware, it was surmised that
the domains were referencing similar pages in the different domains. This
reduces contention to the point of noise in a lockstat(8) run (~51% down to
under 5%), reducing build times by up to 20%.
This doesn't do a perfect domain alignment, just a best-guess based on
hardware available, that the domain is roughly specified in the upper bits
of the PA. Trying to be more clever would more than likely result in
reduced performance just on the work needed.
MFC after: 2 weeks
Since we now have a much larger KVA on powerpc64, it's possible to get SLB
traps earlier in boot, possibly even before the HIOR is properly configured
for us. Move the HIOR setup to immediately after reset, so that we use our
exception handlers instead of Open Firmware's.
PR: 233863
Submitted by: Mark Millard (partial)
Reported by: Mark Millard
MFC after: 2 weeks
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3)). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This cause all CPUs to stall on the same cacheline, as it
bounces between different CPUs.
Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. Its my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.
Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure the the racoon ports will load the ipsec.ko module
Reviewed by: cem, cy, delphij, gnn, jhb, jpaetzel
Differential Revision: https://reviews.freebsd.org/D20163
* Make mmu_booke_sync_icache() use the DMAP on 64-bit prcoesses, no need to
map the page into the user's address space. This removes the
pvh_global_lock from the equation on 64-bit.
* Don't map the page with user-readability on 32-bit. I don't know what the
chance of a given user process being able to access the NULL page when
another process's page is added there, but it doesn't seem like a good
idea to map it to NULL with user read permissions.
* Only sync as much as we need to. There are only two significant places
where pmap_sync_icache is used: proc_rwmem(), and the SIGILL second-chance
for powerpc. The SIGILL second chance is likely the most common, and only
syncs 4 bytes, so avoid the other 127 loop iterations (4096 / 32 byte
cacheline) in __syncicache().
Reduce the surface area of the TLB locks. Unfortunately the same trick for
serializing the tlbie instruction on OEA64 cannot be used here to reduce the
scope of the tlbivax mutex to the tlbsync only, as the mutex also serializes
the TLB miss lock as a side effect, so contention on this lock may not be
reducible any further.
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
(MFC commentary)
This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.
I have no plans to do this MFC as of now.
Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from: melifaro
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20044
Since the DMAP is only available on powerpc64, and is *always* available on
Book-E powerpc64, don't penalize either side (32-bit or 64-bit) by always
checking hw_direct_map to perform operations. This saves 5-10% time on
various ports builds, and on buildworld+buildkernel on Book-E hardware.
MFC after: 3 weeks
Use the nitems() macro instead of the expansion, a'la r298352. Also, fix the
location of this check to after initializing availmem_regions_sz, so that the
check isn't always against 0, thus always failing (nitems(phys_avail) is always
more than 0).
Summary:
A few ports fail to build due to missing pmap-related definitions, which are
specific per-pmap type. This tries to appease those ports, by merging all
pmaps together.
A future change will move the inline page directory out of the Book-E pmap,
to eliminate the last #ifdefs in pmap.h and complete the merge.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D20119
Use it wherever COMPAT_FREEBSD11 is currently specified, like r309749.
Reviewed by: imp, jhb, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20120
It's possible for a Hypervisor Maintenance Interrupt (HMI) to occur while in
the pmap code, holding locks. This can cause WITNESS to panic due to lock
errors in calling pmap_kextract(). Since we don't yet handle the flags
returned by OPAL_HANDLE_HMI2, just stop using it, so that we don't call into
pmap_kextract().
Reported by: pkubaj
Unconditional writing to MAS7, which doesn't exist on the e500v1 core, in a
TLB miss handler has been in the code for several years now. Since this has
gone unnoticed for so long, it's easily concluded that e500v1 is not in use
with FreeBSD. Simplify the code path a bit, by unconditionally zeroing MAS7
instead of calling a subroutine to do it.
r18 is used to hold the old PCB flags, but cpu_throw doesn't populate r18
with PCB flags, since the old thread is gone. This can lead to panics on
cores that don't have the registers guarded by these flags.
This change makes it easier to enable/disable the inclusion of
OPAL flash in the kernel.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20098
This way its children can attach earlier if needed, and some subsystems are
attached earlier, like the asynchronous token management.
MFC after: 2 weeks
Add support to enable, save, and restore the following facilities:
* Target Address Register (bctar) -- seemingly just another register to
branch to.
* Event-based branching -- an interrupt-like userspace event handler
subsystem.
* Load-monitored facility -- A facility that allows monitoring a range of
physical memory, and triggering an event on access. Targeted to garbage
collection software features.
The Data Stream Control Register (DSCR) is privileged on POWER7, but
unprivileged (different register) on POWER8 and later. However, it's now
guarded by a new register, the Facility Status and Control Register, instead of
the MSR like other pre-existing facilities (FPU, Altivec). The FSCR must be
managed explicitly, since it's effectively an extension of the MSR.
Tested by: Brandon Bergren
The POWER8NVL (POWER8 NVLink) architecturally behaves identically to the
POWER8, with a different PVR identifier. Mark it as such, so it shows up
appropriately to the user.
Reported by: Alexey Kardashevskiy
MFC after: 2 weeks
Since the non-volatile registers are restored at the end of cpu_switchin (of
the new thread) they're free for us to use for our own purposes. Load the
PCB_FLAGS into a non-volatile register so it's preserved across the C
function calls that manage FPU and altivec state. This removes 4 loads from
each file. Might be a trivial performance improvement (~12 clock cycles per
context switch).
MFC after: 3 weeks
mtmsr and mtsr require context synchronizing instructions to follow. Without
a CSI, there's a chance for a machine check exception. This reportedly does
occur on a MPC750 (PowerMac G3).
Reported by: Mark Millard
As mphyp_pte_unset() can also remove PTE entries, and as this can
happen in parallel with PTEs evicted by mphyp_pte_insert(), there
is a (rare) chance the PTE being evicted gets removed before
mphyp_pte_insert() is able to do so. Thus, the KASSERT should
check wether the result is H_SUCCESS or H_NOT_FOUND, to avoid
panics if the situation described above occurs.
More details about this issue can be found in PR 237470.
PR: 237470
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D20012
Summary: when using pseries-llan driver, Opkts and Oerrs counters (netstat
-i) are always zero. This patch adds an small error handling to increment
these counters.
Submitted by: alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20009
Some hypervisor calls, such as H_SEND_LOGICAL_LAN, take more arguments than
are traditionally passed in registers. The HCALL ABI will accept these
arguments in r11 and r12. With ELFv2 ABI, these arguments are 2
double-words lower than ELFv1 ABI, as two double-words in the stack frame
are no longer used, and therefore removed from the frame. Fix the offsets
for loading the registers for the HCALL. This fixes the phyp_llan driver
with ELFv2 kernel.
Submitted by: alfredo.junior_eldorado.org.br
Differential Revision: https://reviews.freebsd.org/D20008
Since writes don't necessarily need to be on erase-block boundaries, we can
relax the block size and alignments down to sector size. If it needs to be
erased, opalflash_erase() will check proper alignment and size.
If the OPAL flash driver supports writing without erase, it adds a
'no-erase' property to the flash device node. Honor that property and don't
bother erasing if it exists.
Summary:
Initial NUMA support:
- associate CPU with domain
- associate memory ranges with domain
- identify domain for devices
- limit device interrupt binding to appropriate domain
- Additionally fixes a bug in the setting of Maxmem which led to
only memory attached to the first socket being enabled for DMA
A pmap variant can opt in to numa support by by calling `numa_mem_regions`
at the end of pmap_bootstrap - registering the corresponding ranges with the
VM.
This yields a ~20% improvement in build times of llvm on dual socket POWER9
over non-NUMA.
Original patch by mmacy.
Differential Revision: https://reviews.freebsd.org/D17933
Forgot to add the changes for DELAY(), which lowers priority during the
delay period. Also, mark the timebase read as volatile so newer GCC does
not optimize it away, as it reportedly does currently.
MFC after: 2 weeks
MFC with: r346144
PowerISA 2.07 and PowerISA 3.0 both specify special NOPs for priority
adjustments, with "medium" priority being normal. We had been setting
medium-low as our normal priority. Rather than guess each time as to what
we want and the right NOP, wrap them in inline functions, and replace the
occurrances of the NOPs with the functions. Also, make DELAY() drop to very
low priority while waiting, so we don't burn CPU.
Coupled with r346143, this shaves off a modest 5-8% on buildworld times with
-j72. There may be more room for improvement with judicious use of these
NOPs.
MFC after: 2 weeks
The POWER9 documentation specifies that levels 0-3 are the 'lightest' sleep
level, meaning lowest latency and with no state loss. However, state 3 is
not implemented, and is instead reserved for future chips. This now
properly configures the PSSCR, specifying state 2 as the lowest level to
enter, but request level 0 for quickest sleep level. If the OCC determines
that the CPU can enter states 1 or 2 it will trigger the transition to those
states on demand.
MFC after: 1 week
* The BIO bio_data may not be page aligned. Only the base address of each
page worth of data is extracted to pass to OPAL. Without page alignment
it can scribble over random memory when finishing the page read. Fix this
by short-reading the first page to properly align for full page reads.
* Fix the definition of OPAL_FLASH_ERASE.
* Properly handle the async message result, as now returned from r345974.
* Properly return the full opal_msg from an async completion.
* Don't keep bugging OPAL, wait 100us or so. With some minor changes to
DELAY() to drop to very low priority, the thread won't hog the CPU while
polling for the async completion.
The e5500 has an FPU, but lacks the optional fsqrt instruction. This
instruction gets emulated in the kernel, but the emulation uses stale data,
from the last switch out, and does not return the result of the operation
immediately. Fix both of these conditions by saving and restoring the FPRs
around the emulation point.
MFC after: 1 week
MFC with: r345829
This fix was committed less than 2 months after the code was forked into the
powerpc kernel. Though powerpc doesn't use quad-precision floating point,
or need it for emulation, the changes do look like correctness fixes
overall.
This was found while trying to get fsqrt emulation working on e5500, which
does have a real FPU, but lacks the fsqrt instruction. This is not the
complete fix, the rest is to be committed separately.
MFC after: 1 week
Since OPAL_GET_MSG does not discriminate between message types, asynchronous
completion events may be received in the OPAL_GET_MSG call, which dequeues
them from the list, thus preventing OPAL_CHECK_ASYNC_COMPLETION from
succeeding. Handle this case by integrating with the messaging framework.
Summary:
OPAL needs to be kicked periodically in order for the firmware to make
progress on its tasks. To do so, create a heartbeat thread to perform this task
every N milliseconds, defined by the device tree. This task is also a central
location to handle all messages received from OPAL.
Reviewed By: luporl
Differential Revision: https://reviews.freebsd.org/D19743
Summary:
With a sufficiently large TOC, it's possible to index out of range, as
the immediate load instructions only permit 16-bit indices, allowing up
to 64kB range (signed) from the base pointer. Allow +/- 2GB range, with
the medium code model TOC accesses in asm.
Patch originally by Brandon Bergren. The issue appears to impact ELFv2
more than ELFv1.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D19708
* Cache moea64_need_lock in a local variable; gcc generates slightly better
code this way, it doesn't need to reload the value from memory each read.
* VPN cropping is only needed on PowerPC ISA 2.02 and older cores, a subset
of those that need serialization, so move this under the need_lock check,
so those that don't need the lock don't even need to check this.