On very large memory systems 'size' can become 2GB or larger, resulting in a
negative value being formatted. Also, moea64_pteg_count is already a long, so
format it as such.
There is a type promotion that transform count = -1 into a unsigned int causing
the default TCE SEG SIZE not being returned on a Boston POWER9 machine.
This machine does not have the 'ibm,supported-tce-sizes' entries, thus, count
is set to -1, and the function continue to execute instead of returning.
Reviewed by: jhibbits, wma
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D15763
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.
While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.
core2_intr(cpu, tf) ->
pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
pmc_add_sample(cpu, ring, pm, tf, inuserspace)
In the process of optimizing the tail call the tf pointer was getting
clobbered:
(kgdb) up
at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709 pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
resulting in a crash in pmc_save_kernel_callchain.
Changed excise_initrd_region to support both 32- and 64-bit
values for linux,initrd-start and linux,initrd-end.
This fixes the boot problem on some machines after rS334485.
Submitted by: Luis Pires <lffpires@ruabrasil.org>
Reviewed by: jhibbits, leitao
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D15667
Summary: Included VSX registers in powerpc core dumps (both kernel and gcore)
Submitted by: Luis Pires
Differential Revision: https://reviews.freebsd.org/D15512
Summary:
Added ptrace support for getting/setting the remaining part of the VSX registers
(the part that's not already covered by FPR or VR registers).
This is necessary to add support for VSX registers in debuggers.
Submitted by: Luis Pires
Differential Revision: https://reviews.freebsd.org/D15458
This will let us use much more KVA for ZFS ARC where needed. This may be
incresed in the future if memory requirements increase.
Discussed with: nwhitehorn
Recently a change was made which broke loading 32-bit binaries on powerpc64,
with an assertion in ld-elf32.so.1:
ld-elf32.so.1: assert failed:
/usr/local/poudriere/jails/ppc64/usr/src/libexec/rtld-elf/rtld.c:390
It turns out Elf32_AuxInfo was broken for a very long time on powerpc64, as
it uses long and pointers, which are both 64 bits on powerpc64, and only
manifested with the recent work on auxargs.
Currently kexec loads an initrd file into the main memory but does not
mark that region as reserved, thus the area is not protected.
If any initrd/md file is loaded from kexec/petitboot, the region might become
corarupted/overwritten since FreeBSD does not know the region is 'reserved'.
This patch simply adds the initrd area as a reserved memory region.
Approved by: jhibbits
Differential Revision: https://reviews.freebsd.org/D15610
Summary:
Coupled with r334365, this makes PCI work on POWER9. There is still more to
do to fully exploit the hardware capabilities, but this is sufficient to
enable USB and ethernet controllers on a POWER9 Talos II system.
Reviewed by: nwhitehorn, leitao
Differential Revision: https://reviews.freebsd.org/D15566
This reduces the CPU cycle wastage on power9, which is SMT4. Any idle
thread that's spinning is simply starving working threads on the same core
of valuable resources.
This can be reduced further by taking more advantage of the PSSCR supported
states, as well as permitting state loss, as is currently done for power8.
The currently implemented stop state is the lowest latency, which may still
consume resources.
POWER9 supports Radix page tables in addition to Hashed page tables. When
Radix page tables are in use, the TLB is cut in half, so that half of the
TLB is used for the page walk cache. This is the default behavior, however
FreeBSD currently does not support Radix tables. Clear this bit so that we
can use the full TLB. Do this in the MMU logic so that configuration can be
localized to the specific translation format. Once we do support Radix
tables, the setup for that will be localized to the Radix MMU kobj.
Summary:
PowerISA 2.03 and later require bits 14:65 in the RB register argument,
which is the full value of the vpn argument post-shift. Only POWER4, POWER4+,
and PPC970* need the upper 16 bits cropped.
With this change FreeBSD can boot to multi-user on POWER9.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15581
IPMI access on PowerNV systems is done through the OPAL firmware. This adds a
simple attachment for communicating with the FSP/BMC on these machines. This
has been tested on a Talos POWER9 workstation, only in the bootup phase, noting
the successful attachment messages:
...
ipmi0: IPMI device rev. 0, firmware rev. 2.00, version 2.0, device support mask 0
ipmi0: Number of channels 2
...
The ipmi device has not been added to GENERIC64, but may be after further
testing. It may also eventually be added to the ipmi module at that point.
Summary:
PowerNV architectures (in the test case POWER9) export sensors via the device
tree, which are accessed via OPAL calls. This adds sysctl nodes for each
device in a generic fashion. New sysctl nodes are:
dev.opal_sensor.N.sensor
dev.opal_sensor.N.sensor_min
dev.opal_sensor.N.sensor_max
dev.opal_sensor.N.type
dev.opal_sensor.N.label
These are rooted at a parent attachment under opal, called opalsens. This does
not add support for the "sensor groups" defined in the device tree.
Reviewed by: breno.leitao_gmail.com
Differential Revision: https://reviews.freebsd.org/D15362
Summary:
POWER9 systems use a new interrupt controller, XIVE, managed through OPAL
firmware calls. The OPAL firmware includes support for emulating the previous
generation XICS presentation layer in addition to a new "XIVE Exploitation"
mode. As a stopgap until we have XIVE exploitation mode, enable XICS emulation
mode so that we at least have an interrupt controller.
Since the CPPR is local to the current CPU, it cannot be updated for APs when
initializing on the BSP. This adds a new function, directly called by the
powernv platform code, to initialize the CPPR on AP bringup.
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15492
This turns on support for kernel dump encryption and compression, and
netdump. arm and mips platforms are omitted for now, since they are more
constrained and don't benefit as much from these features.
Reviewed by: cem, manu, rgrimes
Tested by: manu (arm64)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D15465
Summary:
Some hypervisor exceptions on POWER architecture only save state to HSRR0/HSRR1.
Until we have bhyve on POWER, use a lightweight exception frontend which copies
HSRR0/HSRR1 into SRR0/SRR1, and run the normal trap handler.
The first user of this is the Hypervisor Virtualization Interrupt, which targets
the XIVE interrupt controller on POWER9.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15487
Summary:
Add additional OPAL PCI definitions and expand the code to use them in order to
ease the OPAL interface process for new comers.
These definitions came directly from the OPAL code and they are the same for
both PHB3 (POWER8) and PHB4 (POWER9).
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D15432
On some POWER9 systems, 'reg' denotes the full memory in the system, while
'linux,usable-memory' denotes the usable memory. Some memory is reserved for
NVLink usage, so is partitioned off.
Submitted by: Breno Leitao
r333273 and partially reverted with r333594.
Older CPUs implement addition of offsets into the page table by a
bitwise OR rather than actual addition, which only works if the table is
aligned at a multiple of its own size (they also require it to be aligned
at a multiple of 256KB). Newer ones do not have that requirement, but it
hardly matters to enforce it anyway.
The original code was failing on newer systems with huge amounts of RAM
(> 512 GB), in which the page table was 4 GB in size. Because the
bootstrap memory allocator took its alignment parameter as an int, this
turned into a 0, removing any alignment constraint at all and making
the MMU fail. The first round of this patch (r333273) fixed this case by
aligning it at 256 KB, which broke older CPUs. Fix this instead by widening
the alignment parameter.
Summary:
There were 2 issues that were preventing correct symbol resolution
on PowerPC/pseries:
1- memory corruption at chrp_attach() - this caused the inital
part of the symbol table to become zeroed, which would cause
the kernel linker to fail to parse it.
(this was probably zeroing out other memory parts as well)
2- DDB symbol resolution wasn't working because symtab contained
not relocated addresses but it was given relocated offsets.
Although relocating the symbol table fixed this, it broke the
linker, that already handled this case.
Thus, the fix for this consists in adding a new DDB macro:
DB_STOFFS(offs) that converts a (potentially) relocated offset
into one that can be compared with symbol table values.
PR: 227093
Submitted by: Leandro Lupori <leandro.lupori_gmail.com>
Differential Revision: https://reviews.freebsd.org/D15372
riscv and powerpc have nearly identical bcopy.c that's
supposed to be mostly MI. Move it to the MI libkern.
Differential Revision: https://reviews.freebsd.org/D15374
Summary:
chrp_cpuref_init() was relying on the boot strap processor to be
the first child of /cpus. That was not always the case, specially
on pseries with FDT.
This change uses the "reg" property of each CPU instead and also
adds several sanity checks to avoid unexpected behavior (maybe
too many panics?).
The main observed symptom was interrupts being missed by the main
processor, leading to timeouts and the kernel aborting the boot.
Submitted by: Leandro Lupori
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15174
The POWER9 MMU (PowerISA 3.0) is slightly different from current
configurations, using a partition table even for hypervisor mode, and
dropping the SDR1 register. Key off the newly early-enabled CPU features
flags for the new architecture, and configure the MMU appropriately.
The POWER9 MMU ignores the "PSIZ" field in the PTCR, and expects a 64kB
table. As we are enabled for powernv (hypervisor mode, no VMs), only
initialize partition table entry 0, and zero out the rest. The actual
contents of the register are identical to SDR1 from previous architectures.
Along with this, fix a bug in the page table allocation with very large
memory. The table can be allocated on any 256k boundary. The
bootstrap_alloc alignment argument is an int, and with large amounts of
memory passing the size of the table as the alignment will overflow an
integer. Hard-code the alignment at 256k as wider alignment is not
necessary.
Reviewed by: nwhitehorn
Tested by: Breno Leitao
Relnotes: Yes
The new POWER9 MMU configuration is slightly different from current setups.
Rather than special-casing on POWER9, move the initialization of cpu_features
and cpu_features2 to as early as possible, so that platform and MMU
configuration can be based upon CPU features instead of specific CPUs if at all
possible.
Reviewed by: nwhitehorn
POWER8 and POWER9 have similar configuration requirements for hypervisor setup,
and in the cases here they're identical. Add the POWER9 constant to the POWER8
list so it's initialized correctly.
Reviewed by: nwhitehorn
This code caused more problems than it should have fixed (boot failures) on
the machines I tested, so has been commented out for a while now. Remove
it, and assume the errata fixups were done by the bootloader where they
belong.
Discussing with others, this needs to be at least 20 to boot on some POWER9
nodes. Linux made a similar change for the same reason, so increase to 32
to give us some extra breathing room as well. The input and output arrays
are sized at 256, so much greater than the increase in the property array
size.
sysentvec::sv_hwcap/sv_hwcap2 are pointers to u_long, so cpu_features* need
to be u_long to use the pointers. This also requires a temporary cast in
printing the bitfields, which is fine because the feature flag fields are
only 32 bits anyway.
FreeBSD exports the AT_HWCAP* auxvec items if provided by the ELF sysentvec
structure. Add the CPU features to be exported, so user space can more
easily check for them without using the hw.cpu_features and hw.cpu_features2
sysctls.
Not all feature flags are synced. Those for processors we don't currently
support are ignored currently. Those that are supported are synced best I
can tell. One flag was renamed to match the Linux flag name
(PPC_FEATURE2_VCRYPTO -> PPC_FEATURE2_VEC_CRYPTO).
Summary:
POWER9 also contains 32 slbs entries as explained by the POWER9 User Manual:
"For HPT translation, the POWER9 core contains a unified (combined for both
instruction and data), 32-entry, fully-associative SLB per thread"
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D15128
Summary:
Powerpc64 has support for a register called Data Stream Control Register
(DSCR), which basically controls how the hardware controls the caching and
prefetch for stream operations.
Since mfdscr and mtdscr are privileged instructions, we need to emulate them,
and
keep the custom DSCR configuration per thread.
The purpose of this feature is to change DSCR depending on the operation, set
to DSCR Default Prefetch Depth to deepest on string operations, as memcpy.
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D15081
region marked "available" by firmware is contained entirely in the kernel.
This had a tendency to happen with FDTs passed by loader, though could for
other reasons as well, and would result in the kernel slowly cannibalizing
itself for other purposes, eventually resulting in a crash.
A similar fix is needed for mmu_oea.c and should probably just be rolled
at that point into some generic code in platform.c for taking a mem_region
list and removing chunks.
PR: 226974
Submitted by: leandro.lupori@gmail.com
Reviewed by: jhibbits
Differential Revision: D15121
This will allow to hook a ddb script to "kdb.enter.trap" event.
Previously there was no specific name for this event, so it could only
be handled by either "kdb.enter.unknown" or "kdb.enter.default" hooks.
Both are very unspecific.
Having a specific event is useful because the fatal trap condition is
very similar to panic but it has an additional property that the current
stack frame is the frame where the trap occurred. So, both a register
dump and a stack bottom dump have additional information that can help
analyze the problem.
I have added the event only on architectures that have trap_fatal()
function defined. I haven't looked at other architectures. Their
maintainers can add support for the event later.
Sample script:
kdb.enter.trap=bt; show reg; x/aS $rsp,20; x/agx $rsp,20
Reviewed by: kib, jhb, markj
MFC after: 11 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D15093
Half of implementations always failed (returned (-1)) and they were
previously used in only one place.
Reviewed by: kib, andrew
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15102
This makes it more consistent with FreeBSD norms, rather than using Linux's
norms. Now, instead of needing an environment variable
video-mode=fslfb:1280x1024@60
Now one would use a hint:
hint.fb.0.mode=1280x1024@60
Most other architectures already re-enter KDB on faults, powerpc and mips
are the only outliers. Correct this for powerpc, so that now bad addresses
can be handled gracefully instead of panicking.
Make int_external_input, int_decrementer, and int_performance_counter all
now use trap_common, just like on AIM. The effects of this are:
* All traps are now properly displayed in ddb. Previously traps from
external input, decrementer, and performance counters, would display as
just basic stack traces. Now the frame is displayed.
* External interrupts are now handled with interrupts enabled, so handling
can be preempted. This seems to fix a hang found post-r329882.
Change OF_getencprop_alloc semantics to be combination of malloc and
OF_getencprop and return size of the property, not number of elements
allocated.
For the use cases where number of elements is preferred introduce
OF_getencprop_alloc_multi helper function that copies semantics
of OF_getencprop_alloc prior to this change.
This is to make OF_getencprop_alloc and OF_getencprop_alloc_multi
function signatures consistent with OF_getencprop_alloc and
OF_getencprop_alloc_multi.
Functionality-wise this patch is mostly rename of OF_getencprop_alloc
to OF_getencprop_alloc_multi except two calls in ofw_bus_setup_iinfo
where 1 was used as a block size.
OF_getprop_alloc takes element size argument and returns number of
elements in the property. There are valid use cases for such behavior
but mostly API consumers pass 1 as element size to get string
properties. What API users would expect from OF_getprop_alloc is to be
a combination of malloc + OF_getprop with the same semantic of return
value. This patch modifies API signature to match these expectations.
For the valid use cases with element size != 1 and to reduce
modification scope new OF_getprop_alloc_multi function has been
introduced that behaves the same way OF_getprop_alloc behaved prior to
this patch.
Reviewed by: ian, manu
Differential Revision: https://reviews.freebsd.org/D14850
Summary:
This code adds the basic infrastructure for the facility subsystem. A facility
trap is raised when an unavailable instruction is executed. One example is
executing a Hardware Transactional Memory instruction while the MSR[TM] is
disabled. In the past, there was a specific interrupt for it (FP, VEC), but the
new instructions seem to be multiplexed on this facility interrupt.
The root cause of the trap is provided on Facility Status and Control Register
(FSCR) register.
Submitted by: Breno Leitao
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D14566
Summary:
Print current MSR on printtrap(). Currently, printtrap just prints srr1, which
contains part of the MSR prior to the exception. I find useful to dump the
current value of the MSR, since it changes when there is an interruption.
With this patch, this is the new printtrap model:
handled user trap:
exception = 0x700 (program)
srr0 = 0x100008a0 (0x100008a0)
srr1 = 0x800000000002f032
current msr = 0x8000000000009032
lr = 0x1000089c (0x1000089c)
curthread = 0x7a50000
pid = 714, comm = ttrap2
Submitted by: Breno Leitao
Reviewed by: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D14600
Summary:
It is not necessary to call isync() after calling mtmsr() function, mainly
because the mtmsr() calls 'isync' internally to synchronize the machine state
register. Other than that, isync() just calls the 'isync' instruction, thus,
the 'isync' instruction is being called twice, and that seems to be unnecessary.
This patch just remove the unecessary calls to isync() after mtmsr().
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D14583
Summary:
Currently ofw_real_bounce_alloc() is requesting memory, using WAITOK, holding a
non-sleepable locks, called 'OF Bounce Page'.
Fix this by allocating the pages outside of the lock, and only updating the
global variables while holding the lock.
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D14955
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
TLB1 can handle ranges up to 4GB (through e5500, larger in e6500), but
ilog2() took a unsigned int, which maxes out at 4GB-1, but truncates
silently. Increase the input range to the largest supported, at least for
64-bit targets. This lets the DMAP be completely mapped, instead of only
1GB blocks with it assuming being fully mapped.
As with AIM64, map the DMAP at the beginning of the fourth "quadrant" of
memory, and move the KERNBASE to the the start of KVA.
Eventually we may run the kernel out of the DMAP, but for now, continue
booting as it has been.
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
OF_finddevices returns ((phandle_t)-1) in case of failure. Some code
in existing drivers checked return value to be equal to 0 or
less/equal to 0 which is also wrong because phandle_t is unsigned
type. Most of these checks were for negative cases that were never
triggered so trhere was no impact on functionality.
Reviewed by: nwhitehorn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D14645
When the kernel can be in real mode in early boot, we can execute from
high addresses aliased to the kernel's physical memory. If that high
address has the first two bits set to 1 (0xc...), those addresses will
automatically become part of the direct map. This reduces page table
pressure from the kernel and it sets up the kernel to be used with
radix translation, for which it has to be up here.
This is accomplished by exploiting the fact that all PowerPC kernels are
built as position-independent executables and relocate themselves
on start. Before this patch, the kernel runs at 1:1 VA:PA, but that
VA/PA is random and set by the bootloader. Very early, it processes
its ELF relocations to operate wherever it happens to find itself.
This patch uses that mechanism to re-enter and re-relocate the kernel
a second time witha new base address set up in the early parts of
powerpc_init().
Reviewed by: jhibbits
Differential Revision: D14647
accomplishes a few things:
- Makes NULL an invalid address in the kernel, which is useful for catching
bugs.
- Lays groundwork for radix-tree translation on POWER9, which requires the
direct map be at high memory.
- Similarly lays groundwork for a direct map on 64-bit Book-E.
The new base address is chosen as the base of the fourth radix quadrant
(the minimum kernel address in this translation mode) and because all
supported CPUs ignore at least the first two bits of addresses in real
mode, allowing direct-map addresses to be used in real-mode handlers.
This is required by Linux and is part of the architecture standard
starting in POWER ISA 3, so can be relied upon.
Reviewed by: jhibbits, Breno Leitao
Differential Revision: D14499
correctly for the data contained on each memory page.
There are several components to this change:
* Add a variable to indicate the start of the R/W portion of the
initial memory.
* Stop detecting NX bit support for each AP. Instead, use the value
from the BSP and, if supported, activate the feature on the other
APs just before loading the correct page table. (Functionally, we
already assume that the BSP and all APs had the same support or
lack of support for the NX bit.)
* Set the RW and NX bits correctly for the kernel text, data, and
BSS (subject to some caveats below).
* Ensure DDB can write to memory when necessary (such as to set a
breakpoint).
* Ensure GDB can write to memory when necessary (such as to set a
breakpoint). For this purpose, add new MD functions gdb_begin_write()
and gdb_end_write() which the GDB support code can call before and
after writing to memory.
This change is not comprehensive:
* It doesn't do anything to protect modules.
* It doesn't do anything for kernel memory allocated after the kernel
starts running.
* In order to avoid excessive memory inefficiency, it may let multiple
types of data share a 2M page, and assigns the most permissions
needed for data on that page.
Reviewed by: jhb, kib
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14282
Add I2C OPAL driver and a set of dummy-ones to allow
all I2C things on Power8 to attach.
TODO: better async token management
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
A reservation granule on PowerPC is a cache line.
On e500mc and derivatives a cacheline size is 64 bytes, not 32. Allocate
the maximum size permitted, but only utilize the size that is needed. On
e500v1 and e500v2 the reservation granule will still be 32 bytes.
These interfaces were put in place to let QorIQ SoCs dictate CPU idling
semantics, in order to support capabilities such as NAP mode and deep sleep.
However, this never stabilized, and the idling support reverted back to
CPU-level rather than SoC level. Move this code back to cpu.c instead. If
at a later date the lower power modes do come to fruition, it should be done
by overriding the cpu_idle_hook instead of this platform hook.
NVMe support is ready and should be compiled-in
to the ppc64 kernel.
Submitted by: Wojciech Macek <wma@semihalf.org>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
We don't support float in the boot loaders, so don't include
interfaces for float or double in systems headers. In addition, take
the unusual step of spiking double and float to prevent any more
accidental seepage.
When processor enters power-save state it releases resources shared with other
cpu threads which makes other cores working much faster.
This patch also implements saving and restoring registers that might get
corrupted in power-save state.
Submitted by: Patryk Duda <pdk@semihalf.com>
Obtained from: Semihalf
Reviewed by: jhibbits, nwhitehorn, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14330
Add function which can store RTC values to OPAL.
Submitted by: Wojciech Macek <wma@semihalf.org>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Summary:
This compartmentalizes the CPU-specific trap components into its own
function, rather than littering the general printtrap() with various checks.
This will let us replace a series of #ifdef's with a runtime conditional check
in the future.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D14416
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass. If there is no object
associated with the wait, use curthread' policy domainset. The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.
Eliminate pagedaemon_wait(). vm_domain_clear() handles the same
operations.
Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.
Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.
Scetched and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation, Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D14384
On POWER8 architecture there is a timer with 512Mhz frequency.
It has about 1,95ns period, but it is rounded to 1ns which is not accurate.
Submitted by: Patryk Duda <pdk@semihalf.com>
Obtained from: Semihalf
Reviewed by: wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14433
Currently Hypervisor Emulation Assistance interrupt is unhandled.
Executing an undefined instruction in userland triggers kernel panic.
Handle this the same way as Facility Unavailable Interrupt - send
SIGILL signal to userspace.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: nwhitehorn, pdk@semihalf.com, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14437
zero, matching the IEEE 1275 standard. Since these internal error paths
have never, to my knowledge, been taken, behavior is unchanged.
Reported by: gonzo
MFC after: 2 weeks
This is part of a long-term goal of merging Book-E and AIM into a single GENERIC
kernel. As more work is done, the struct may be optimized further.
Reviewed by: nwhitehorn
Summary:
After revision rS328534('PPC64: use hwref instead of cpuid'), FreeBSD on
powerpc64 virtual machine panics since it is unable to read the
timebase, showing the following error:
get-property for timebase-frequency on zero phandle
panic: Unable to determine timebase frequency!
With the change above, cpuref->cr_hwref does not contain the phandle
anymore, thus, it never reads the proper CPU entry in OF.
Submitted by: Breno Leitao
Differential Revision: https://reviews.freebsd.org/D14204
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.
Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273
We don't support older compilers. Most of the code in these files is
for pre-3.0 gcc, which is at least 15 years obsolete. Move to using
phk's sys/_stdargs.h for all these platforms.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D14323
r328113 and affecting SMP systems.
The way the time is set on PowerMacs is racy and relies on all the
CPUs in the system setting a register simultaneously in a rendezvous. A
few-cycle delay can result in out-of-sync times, which can break the
scheduler and result in calls like mtx_sleep() and pause() never timing out
if the thread is migrated while sleeping. r328113 added a call to a no-op
function between the beginning of the rendezvous and setting the time that
was only called on APs and added enough cycles to cause a problematic offset.
For some reason, the fan-management code was the first place this appeared.
Clue from: andreast
Reported by: many
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
The L3 cache controller (Corenet Platform Cache) is listed with one of its
compatible strings as "cache", which this driver can't attach to. Restrict
to a known list of primary cache controller strings, as found in the l2cache
devicetree binding.
threads from compile-time defines to global variables. This removes a
significant amount of duplicated runtime patches to the compile-time
defines, centralizing the conditional logic in the early startup code.
Reviewed by: jhibbits
It turns out that under some circumstances we can get DSI or DSE before we set
LPCR and LPID so we should set it as early as possible.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
On CHRP and PowerNV, use the interrupt server number in the cpuref and pcpu
hwref field instead of the device-tree phandle and make the CPU IDs reported
to the scheduler dense and with the BSP at 0.
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14011
Cleaning up AP startup routines. This is a mix of changes
required to make PowerNV running and to modify the code
to be more robust. Previously, some races were seen if more
than 90CPUs were online.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14026
used with hashed page tables on AIM and place it into a new, modular pmap
function called pmap_decode_kernel_ptr(). This function is the inverse
of pmap_map_user_ptr(). With POWER9 radix tables, which mapping to use
becomes more complex than just AIM/BOOKE and it is best to have it in
the same place as pmap_map_user_ptr().
Reviewed by: jhibbits
There's no reason not to build modules for 64-bit QorIQ devices. This
config has evolved to be analogous to the AIM GENERIC64 kernel, so will grow
to match it in more ways as well.
Summary:
Rather than duplicating the checks for programmatic traps all over the code, put
it all in one function. This helps to remove some of the #ifdefs between AIM
and Book-E.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D14082
In a corner case we could fall into OOB error.
Authored by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
MSI/MSI-x interrupts are edge-triggered. If an interrupt
arrives when IRQ line is masked, it will be lost and will
never recover. Perform MSI_EOI always after unmask to give
a chance for PHB/XICS to send an interrupt again if MSI/MSI-x
pending bit is set in MSI/MSI-x BAR space.
Submitted by: Wojciech Macek <wma@semihalf.org>
Obtained from: Semihalf
Sponsored by: IBM, QCM Technologies
Commits r326203 and r326978 broke 64-bit booke kernels by introducing a 1MB
zero-pad between the ELF header and the start of the kernel. This didn't
cause a build failure, but caused kernels to need to be loaded into memory
1MB lower, which could easily break scripts expecting previous behavior.
This change matches the similar change made to AIM in r327358.
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
In platform_smp_ap_init we are doing some crucial code (eg. set LPCR register)
which have influence over further execution.
Practiculary in PowerNV platform we have experienced Data Storage Interrupt
before we set apropriate LPCR. It caused code execution from location which was
legal in bootloader (petitboot based on linux) but illegal in FreeBSD
Set stack pointer to correct value after thread's stack pointer restore
Restoring new thread's stack pointer caused stack corruption because
restored stack pointer didn't point to callee (cpu_switch) stack frame but
caller stack frame.
As a result we had mysterious errors in caller function (sched_switch).
Solution: simply set stack pointer to correct value
Also, initialize TOC to a valid pointer once the thread is being
created.
Created by: Patryk Duda <pdk@semihalf.com>
Submitted by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Reviewed by: nwhitehorn
Differential revision: https://reviews.freebsd.org/D13947
Sponsored by: QCM Technologies
Zero BSS always. The only case when this operation is
ommitted is when booting on BookE.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, nwhitehorn
Differential revision: https://reviews.freebsd.org/D13948
Sponsored by: QCM Technologies
Add missing little-endian 64-bit read and write. Since there
is no direct ASM opcode for this, perform byte swap if
necessary.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: QCM Technologies
Use current userspace address for segment mapping. Previously,
there was a bug which made the funciton constantly using the userspace
base address which could cause data integrity issues.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: QCM Technologies
Add CXGBE driver which is required for PowerNV system.
Also, remove AHCI which does not work in BigEndian.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: QCM Technologies
FreeBSD prints text char-by-char, which is not what OPAL
is designed to. Poll events more frequently to avoid buffer
overflow and loosing data.
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: QCM Technologies
Fixes:
- map all devices to PE0
- use 1:1 TCE mapping
- provide the same TCE mapping for all PEs (not only PE0)
- add TCE reset and alignment (required by OPAL)
Created by: Wojciech Macek <wma@semihalf.com>
Obtained from: Semihalf
Sponsored by: QCM Technologies
Make XICS to be OPAL-aware.
Created by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@semihalf.com>
Sponsored by: FreeBSD Foundation
P1022 SATA controller may set the wrong CCR bit for a command completion.
This would previously cause an interrupt storm. Solve this by marking all
commands complete, and letting the end_transaction deal with the successes.
Causes no problems on P5020.
While here, fix a minor bug in collision detection. The Freescale SATA
controller only has 16 slots, not 32.
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
to which it is specific, rather than in the generic AIM startup code. This
will be required to support the radix-table-based MMU introduced with POWER9.
buffers into a new pmap-module function pmap_map_user_ptr() that can
be implemented by the respective modules. This is required to implement
non-segment-based AIM-ish MMU systems such as the radix-tree page tables
introduced by POWER ISA 3.0 and present on POWER9.
Reviewed by: jhibbits
using a new macro PHYS_TO_DMAP, which deliberately has the same name as the
equivalent macro on amd64. This also sets the stage for moving the direct
map to another base address.
Some PowerQUICC and QorIQ platforms have a L2 cache managed via the
memory-mapped configuration registers, and appear as a node in the device
tree. This adds basic support to enable the cache.
allocated with a tag to come from the specified domain if it meets the
other constraints provided by the tag. Automatically create a tag at
the root of each bus specifying the domain local to that bus if
available.
Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13545
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Provide initial support for PCIe host controller as
well as for IOMMU mapping. This commit allows proper
bus enumeration, but does not guarantee DMA operations
are working.
Created by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@semihalf.com>
Sponsored by: FreeBSD Foundation
Avoid the lock in vtophys() by providing a static direct-mapped
spinlock- protected output buffer to use when the console driver
cannot acquire locks for some reason. This allows the idle thread
to use printf() (e.g. the SMP startup messages) without crashing
the kernel.
Created by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@freebsd.org>
Sponsored by: FreeBSD Foundation
Make sure to set LPCR[LPES] so that external interrupts set SRR0 and SRR1
instead of HSRR0 and HSRR1. Without this, external interrupt handlers would
get the wrong MSR value when executing, causing eventual madness.
Created by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@freebsd.org>
Sponsored by: FreeBSD Foundation
Fix AP startup, which was broken.
Created by: Nathan Whitehorn <nwhitehorn@freebsd.org>
Submitted by: Wojciech Macek <wma@freebsd.org>
Sponsored by: FreeBSD Foundation
Add basic power control (reset, power off) and bind
ttyuX to opal console so that init will start login there.
Created by: Nathan Whitehorn <nw@freebsd.org>
Submitted by: Wojciech Macek <wma@freebsd.org>
Sponsored by: FreeBSD Foundation
OPAL is a dedicated firmware acting as a hypervisor.
Add generic functions to provide all access.
Created by: Nathan Whitehorn <nw@freebsd.org>
Submitted by: Wojciech Macek <wma@freebsd.org>
- Call resource_int_value() once during attach, rather than within the
pci_(read|write)_config() code path; this avoids taking a blocking mutex
to read kenv variables.
- Use a spin lock to protect non-atomic config space accesses; this matches
the behavior of Darwin's AppleMacRiscPCI driver.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D13839
supported on newer POWER hardware and in graphical VMs run on the same,
which are typically XHCI-only. The 32-bit GENERIC kernel, which
does not run on hardware made in the last decade and is unlikely to
encounter XHCI devices, is left unchanged.
PR: kern/224940
Submitted by: Gustavo Romero
MFC after: 1 week
The data segement was too big.
Add a fix-up function like on ia32 for MAXDSIZ.
While here, bring also the MAXSSIZ closer to amd64 and add an equal fix-up
function for MAXSSIZ.
Reviewed by: jhibbits@
Obtained from: jhibbits@
Differential Revision: https://reviews.freebsd.org/D13753
is of limited utility outside of platform-specific code and can vary
at runtime when running as a hypervisor guest, so does not even have the
virtue of being a static identifier.
Reviewed by: jhibbits
instead of hard-coding a default. This information is passed implicitly by
the PS3 firmware and can be relied upon. Also adjust the default mode, if
somehow firmware doesn't pass one, to 1920x1080 from 720x480 since it is
2017.
MFC after: 2 weeks
- Densely number CPUs to avoid systems with CPUs with very high ID numbers
- Always have the BSP be CPU 0 to avoid remnant brokenness with non-0 BSPs
in other parts of the kernel.
- Improve parsing of the device tree CPU listings on SMT systems.
- Allow reboot via RTAS as well as OF for pSeries systems booted by FDT
without functioning Open Firmware.
Obtained from: projects/powernv
MFC after: 3 weeks
is used as the bootloader on a number of PPC64 platforms. This involves the
following pieces:
- Making the first instruction a valid kernel entry point, since kexec
ignores the ELF entry value. This requires a separate section and linker
magic to prevent the linker from filling the beginning of the section
with stubs.
- Adding an entry point at 0x60 past the first instruction for systems
lacking firmware CPU shutdown support (notably PS3).
- Linker script changes to support the above.
MFC after: 1 month
If these are not aligned, the linker has to emit a different type of
relocation that the early boot self-relocation code cannot handle, even
in principle, resulting in them being set to zero and the kernel crashing.
MFC after: 1 week
draft of a never-finalized standard (CHRP) and is irrelevant in practice
on FreeBSD since we load the kernel with loader(8) on Open Firmware
platforms anyway. Moreover, loader(8), which is directly loaded by Open
Firmware, has never had an equivalent note.
MFC after: 2 weeks
the 32-bit cookie can be sign-extended on its way out of the loader and
through Open Firmware. If sign-extended, the in-kernel check of its value
would fail on 64-bit systems, resulting in a mountroot prompt. Solve this
by telling the kernel to ignore the high-order bits.
PR: kern/224437
Submitted by: Gustavo Romero
They provide relaxed-ordered atomic access semantic. Due to the
FreeBSD memory model, the operations are syntaxical wrappers around
the volatile accesses. The volatile qualifier is used to ensure that
the access not optimized out and in turn depends on the volatile
semantic as implemented by supported compilers.
The motivation for adding the operation is to help people coming from
other systems or knowing the C11/C++ standards where atomics have
special type and require use of the special access operations. It is
still the case that FreeBSD requires plain load and stores of aligned
integer types to be atomic.
Suggested by: jhb
Reviewed by: alc, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13534
Currently Facility Unavailable is absent and once an application
tries to use or access a register from a feature disabled in the
CPU it causes a kernel panic.
A simple test-case is:
int main() { asm volatile ("tbegin.;"); }
which will use TM (Hardware Transactional Memory) feature which
is not supported by the kernel and so will trigger the following
kernel panic:
----
fatal user trap:
exception = 0xf60 (unknown)
srr0 = 0x10000890
srr1 = 0x800000000000f032
lr = 0x100004e4
curthread = 0x5f93000
pid = 1021, comm = htm
panic: unknown trap
cpuid = 40
KDB: stack backtrace:
Uptime: 3m18s
Dumping 10 MB (3 chunks)
chunk 0: 11MB (2648 pages) ... ok
chunk 1: 1MB (24 pages) ... ok
chunk 2: 1MB (2 pages)panic: IOMMU mapping error: -4
cpuid = 40
Uptime: 3m18s
----
Since Hardware Transactional Memory is not yet supported by FreeBSD, treat
this as an illegal instruction.
PR: 224350
Submitted by: Gustavo Romero <gromero_AT_ibm_DOT_com>
MFC after: 2 weeks
Without the identifier in the list booting FreeBSD results in printing the
following (from a PowerKVM boot):
cpu0: Unknown PowerPC CPU revision 0x1201, 2550.00 MHz
For now, add the same feature list as POWER8. As new capabilities are added to
support POWER9 specific features, they will be added to this.
PR: 224344
Submitted by: Breno Leitao <breno_DOT_leitao_AT_gmail_DOT_com>
Decode on Book-E:
* ESR (Exception Syndrome Register)
* MCSR (Machine Check Status Register)
On AIM:
* MSSSR (Memory Subsystem Status Register)
Makes it easier to tell at a glance the type of trap and machine check
conditions now.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.
MFC after: 2 weeks
pmap_track_page() only works with physical memory pages, which have a
constant vm_page_t address. Microoptimize pmap_track_page() to perform one
less operation under the lock.
This was done in 32-bit mode, but not duplicated when 64-bit mode was
brought in. Without this, stale mappings can be left, leading to odd
crashes when the wrong VA is checked in XX_PhysToVirt() (dpaa(4)).
This variable should be pure MI except possibly for reading it in MD
dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD
changed this in r36441 in 1998. There were many imperfections in
r36441. This commit fixes only a small one, to simplify fixing the
others 1 arch at a time. (r47678 added support for
special/early/multiple message buffer initialization which I want in
a more general form, but this was too fragile to use because hacking
on the msgbufp global corrupted it, and was only used for 5 hours in
-current...)
The Display Interface Unit (DIU) uses main memory for the framebuffer, which
is already mapped as cache coherent physical memory. Prevent mmap() from
using its own attributes which may otherwise conflict.
Devices aren't mapped within the KVA, and with the way 64-bit hashes the
addresses pte_vatopa() may not return a 0 physical address for a device.
MFC after: 1 week
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
This allows modules creating mappings to be loaded post-boot, after SMP has
started. Without this, the TLB1 mappings can become unsynchronized and lead
to kernel page faults when accessed on the alternate CPUs.
MFC after: 3 weeks
PowerPC kernels in r6 is actually metadata from loader(8) or gibberish
left in r6, which is not required to be anything under the
PAPR/ePAPR/CHRP/OF standards, by another boot loader.
Note that, as a result, systems need a new boot loader to boot PPC kernels
after this revision without ending up at a mountroot prompt. New boot
loaders are backwards compatible and can boot older kernels.
Reviewed by: jhibbits
MFC after: 2 months
just set it to a large default value (and inherit any previously existing
value), hoping it never turns over. Instead, silently allow spurious
one-shots from rollovers.
MFC after: 10 days
and such from ending on the wrong CPU on SMP systems. It would be good to
have this be more generic somehow as POWER9s appear, but PPC does not
have features bits, unfortunately.
MFC after: 3 weeks
is set and the right thing to do may be platform-dependent (it requires
firmware on PowerNV, for instance). Make it a new platform method called
platform_smp_timebase_sync().
MFC after: 3 weeks
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).
Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.
Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.
To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.
Reviewed by: kib, jhibbits
Differential Revision: https://reviews.freebsd.org/D13180
The vast majority of pmap_kextract() calls are looking for a physical memory
address, not a device address. By checking the page table first this saves
the formerly inevitable 64 (on e500mc and derivatives) iteration loop
through TLB1 in the most common cases.
Benchmarking this on the P5020 (e5500 core) yields a 300% throughput
improvement on dtsec(4) (115Mbit/s -> 460Mbit/s) measured with iperf.
Benchmarked on the P1022 (e500v2 core, 16 TLB1 entries) yields a 50%
throughput improvement on tsec(4) (~93Mbit/s -> 165Mbit/s) measured with
iperf.
MFC after: 1 week
Relnotes: Maybe (significant performance improvement)
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
There's no need to special case 32-bit AIM to short circuit processing.
Some AIM CPUs can handle 36 bit addresses, and 64-bit CPUs can run 32-bit
OSes, so this will allow us to expand for that in the future if we desire.
The interrupt map wasn't being allocated properly, preventing IRQs from being
allocated to children of the PCIe bus. Fix this by cloning the ofw_pcib_pci
code, which handles all cases -- device tree and probed.
In the future this may become a subclass of the ofw_pcib_pci driver, but as
that's not an exported class, it's cloned for now.
MFC after: 3 weeks
* Check TLB1 in all mapdev cases, in case the memattr matches an existing
mapping (doesn't need to be MAP_DEFAULT).
* Fix mapping where the starting address is not a multiple of the widest size
base. For instance, it will now properly map 0xffffef000, size 0x11000 using
2 TLB entries, basing it at 0x****f000, instead of 0x***00000.
MFC after: 2 weeks
similar to the kernel memory allocator.
This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.
A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.
Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
According to EREF rlwinm is supposed to clear the upper 32 bits of the
register of 64-bit cores. However, from experience it seems there's a bug
in the e5500 which causes the result to be duplicated in the upper bits of
the register. This causes problems when applied to stashed SRR1 accessed
to retrieve context, as the upper bits are not masked out, so a
set_mcontext() fails. This causes sigreturn() to in turn return with
EINVAL, causing make(1) to exit with error.
This bit is unused in e500mc derivatives (including e5500), so could just be
conditional on non-powerpc64, but there may be other non-Freescale cores
which do use it. This is also the same as the POW bit on Book-S, so could
be cleared unconditionally with the only penalty being a few clock cycles
for these two interrupts.
When the segment count is > 16 it spills into an 'indirect descriptor list',
which immediately follows the main table, but the indirect list is entry 15, so
needs to be skipped for the general list.
The Freescale SATA controller has many similarities to AHCI controllers, so
this driver is a heavily modified AHCI driver. Currently it seems to only
do SATA 1.0 speeds (~100-150MB/s), so there is still room for improvement.
Still to be done:
* Address erratum SATA-A-006187 -- Spread Spectrum Support (intermittent
non-recoverable transient data integrity error seen when SSC enabled).
* Linux doesn't read the log page as it hangs on the P1022. See if that's
applicable to this, and address accordingly.
* Try to determine what's holding back performance, and address it.
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D6071
We already pass -many to the assembler, and -me500 drops 64-bit instruction
handling, for some reason only breaking module building for 64-bit kernels.
Additionally, build with CTF for dtrace.
gcc complains "cast to pointer from integer of different size". phandle_t is
*always* a uint32_t, so treat it as such, not as a pointer. Fixes 64-bit build.
This brings it closer to par with GENERIC64. In the future I hope to have a
GENERIC64-E and GENERIC-E kernels as Book-E analogues to the GENERIC64/GENERIC
AIM kernels.
Rework the dTSEC and FMan drivers to be more like a full bus relationship,
so that dtsec can use bus_alloc_resource() instead of trying to handle the
offset from the dts. This required taking some code from the sparc64 ebus
driver to allow subdividing the fman region for the dTSEC devices.
This adds some support for ARM as well as 64-bit. 64-bit on PowerPC is
currently not working, and ARM support has not been completed or tested on the
FreeBSD side.
As this was imported from a Linux tree, it includes some Linux-isms
(ioread/iowrite), so compile with the LinuxKPI for now. This may change in the
future.
- allocate value for new AT_HWCAP2 auxiliary vector on all platforms.
- expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly
same way as for AT_HWCAP.
MFC after: 1 month
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D12699
HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does
not exist) on HEAD.
Disable building LINT-VIMAGE with VIMAGE being default.
This should give it a lot more exposure in the run-up to 12 to help
us evaluate whether to keep it on by default or not.
We are also hoping to get better performance testing.
The feature can be disabled using nooptions.
Requested by: many
Reviewed by: kristof, emaste, hiren
X-MFC after: never
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12639
* Book-E can have Altivec exceptions, so move it out of the AIM-only block.
* Print the right DSI trap mode (read vs write) for Book-E
While here, fix some whitespace found while reviewing other diffs.
These devices bring the configs closer to a desktop-like (GENERIC) kernel
config.
* The Freescale DIU support was added to the config in r306358.
Without keyboard support video support is nearly pointless, so add ukbd and
ums.
* The AmigaOne X5000, and P1022 devboard, both use a variant of the ds1307 RTC
* cpufreq scaling is currently supported by the p1022. More SoCs will be added
eventually.
__builtin_frame_address with a non-zero argument is unsafe and rejected by
newer gcc. Since it doesn't seem to impact the stacktrace, don't bother
with gymnastics to unwind to a different frame for starting.
PR: kern/220118
MFC after: 2 weeks
All the Book-E world is no longer e500v{1,2}. e500mc the 64-bit derivatives do
not use the DOZE/NAP bits with MSR[WE], instead using the `wait' instruction to
wait for interrupts, and SoC plane controls (via CCSR) for power management.
MFC after: 1 week
A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'. A
process ABI can set this field to point to a value holding a mask of
architecture-specific CPU feature flags. If an ABI does not wish to
supply AT_HWCAP to processes the field can be left as NULL.
The support code for AT_EHDRFLAGS was already present on all systems,
just the #define was not present. This is a step towards unifying the
AT_* constants across platforms.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12290
P5040/P5021 have the same number of LAWs as P5020. There may be a better way of
getting the count from the FDT (fsl,num-laws property on soc/corenet-law or
soc/ecm-law), but that's not supported everywhere, so we still need this check
for those other cases.
These processors may not be supported yet, but add them for completion.
POWER9 is planned for support. e300 may work (based on 603e core).
P5040/P5021 are similar to P5020, so should work as well. One addition is
needed for P5040, to support the number of LAWs, and will be a separate commit.
redundant initializations.
Hard-code base = 0, height = (approx. 1/8 of the boot-time font height)
in all cases, and remove the BIOS/MD support for setting these values.
This asks for an underline cursor sized for the boot-time font instead
of various less hard-coded but worse values. I used that think that
the x86 BIOS always gave the same values as the above hard-coding, but
on 1 of my systems it gives the wrong value of base = 1.
The remaining BIOS fields are shift_state and bell_pitch. These are now
consistently not explicitly reinitialized to 0. All sc_get_bios_value()
functions except x86's are now empty, and the only useful thing that x86
returns is shift_state. This really belongs in atkbdc, but heavier
use of the BIOS to read the more useful typematic rate has been removed
there. fb still makes much heavier use of the BIOS.
P1022 and MPC8536 include a 'jog' feature for clock control
(jog being a slower form of run mode). This is done by changing the
PLL multiplier, and cannot be done if any core is in doze or sleep mode.
--Remove special-case handling of sparc64 bus_dmamap* functions.
Replace with a more generic mechanism that allows MD busdma
implementations to generate inline mapping functions by
defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This
is currently useful for sparc64, x86, and arm64, which all
implement non-load dmamap operations as simple wrappers
around map objects which may be bus- or device-specific.
--Remove NULL-checked bus_dmamap macros. Implement the
equivalent NULL checks in the inlined x86 implementation.
For non-x86 platforms, these checks are a minor pessimization
as those platforms do not currently allow NULL maps. NULL
maps were originally allowed on arm64, which appears to have
been the motivation behind adding arm[64]-specific barriers
to bus_dma.h, but that support was removed in r299463.
--Simplify the internal interface used by the bus_dmamap_load*
variants and move it to bus_dma_internal.h
--Fix some drivers that directly include sys/bus_dma.h
despite the recommendations of bus_dma(9)
Reviewed by: kib (previous revision), marius
Differential Revision: https://reviews.freebsd.org/D10729
Without disabling interrupts it's possible for another thread to preempt
and update the registers post-read (tlb1_read_entry) or pre-write
(tlb1_write_entry), and confuse the kernel with mixed register states.
MFC after: 2 weeks
AKA Make time_t 64 bits on powerpc(32).
PowerPC currently (until now) was one of two architectures with a 32-bit time_t
on 32-bit archs (the other being i386). This is an ABI breakage, so all ports,
and all local binaries, *must* be recompiled.
Tested by: andreast, others
MFC after: Never
Relnotes: Yes
Follow up r319935 by actually committing the mpc85xx_get_platform_clock()
function. This function was created to facilitate other development, and I
thought I had committed it earlier.
Some blocks depend on the platform clock rather than the system clock.
The System clock is derived from the platform clock as one-half the
platform clock. Rewrite mpc85xx_get_system_clock() to use the new
function.
Pointy-hat to: jhibbits
struct thread.
For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.
The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.
This move will also allow several more uses shortly.
Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
from machine/proc.h, consistently on all architectures.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080
The ccr(4) driver supports use of the crypto accelerator engine on
Chelsio T6 NICs in "lookaside" mode via the opencrypto framework.
Currently, the driver supports AES-CBC, AES-CTR, AES-GCM, and AES-XTS
cipher algorithms as well as the SHA1-HMAC, SHA2-256-HMAC, SHA2-384-HMAC,
and SHA2-512-HMAC authentication algorithms. The driver also supports
chaining one of AES-CBC, AES-CTR, or AES-XTS with an authentication
algorithm for encrypt-then-authenticate operations.
Note that this driver is still under active development and testing and
may not yet be ready for production use. It does pass the tests in
tests/sys/opencrypto with the exception that the AES-GCM implementation
in the driver does not yet support requests with a zero byte payload.
To use this driver currently, the "uwire" configuration must be used
along with explicitly enabling support for lookaside crypto capabilities
in the cxgbe(4) driver. These can be done by setting the following
tunables before loading the cxgbe(4) driver:
hw.cxgbe.config_file=uwire
hw.cxgbe.cryptocaps_allowed=-1
MFC after: 1 month
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D10763
The current method only sort of works, and usually doesn't work reliably.
Also, on Book-E the return address from DEBUG exceptions is not the sentinel
addresses, so it won't exit the loop correctly.
Fix this by better handling trap frames during unwinding, and using the
common trap handler for debug traps, as the code in that segment is
identical between the two.
MFC after: 1 week
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
Many devices are clocked from the SoC's platform clock / 2. Some device nodes
include their own clock-frequency property, while others are dependent on the
SoC's bus-frequency property instead. To simplify, add a helper function to get
this clock.
I fixed this in 1997, but the fix was over-engineered and fragile and
was broken in 2003 if not before. i386 parameters were copied to 8
other arches verbatim, mostly after they stopped working on i386, and
mostly without the large comment saying how the values were chosen on
i386. powerpc has a non-verbatim copy which just changes the uncritical
parameter and seems to add a sign extension bug to it.
Just treat negative offsets as offsets if they are no more negative than
-db_offset_max (default -64K), and remove all the broken parameters.
-64K is not very negative, but it is enough for frame and stack pointer
offsets since kernel stacks are small.
The over-engineering was mainly to go more negative than -64K for the
negative offset format, without affecting printing for more than a
single address.
Addresses in the top 64K of a (full 32-bit or 64-bit) address space
are now printed less well, but there aren't many interesting ones.
For arches that have many interesting ones very near the top (e.g.,
68k has interrupt vectors there), there would be no good limit for
the negative offset format and -64K is a good as anything.
Extend the Book-E pmap to support 64-bit operation. Much of this was taken from
Juniper's Junos FreeBSD port. It uses a 3-level page table (page directory
list -- PP2D, page directory, page table), but has gaps in the page directory
list where regions will repeat, due to the design of the PP2D hash (a 20-bit gap
between the two parts of the index). In practice this may not be a problem
given the expanded address space. However, an alternative to this would be to
use a 4-level page table, like Linux, and possibly reduce the available address
space; Linux appears to use a 46-bit address space. Alternatively, a cache of
page directory pointers could be used to keep the overall design as-is, but
remove the gaps in the address space.
This includes a new kernel config for 64-bit QorIQ SoCs, based on MPC85XX, with
the following notes:
* The DPAA driver has not yet been ported to 64-bit so is not included in the
kernel config.
* This has been tested on the AmigaOne X5000, using a MD_ROOT compiled in
(total size kernel+mdroot must be under 64MB).
* This can run both 32-bit and 64-bit processes, and has even been tested to run
a 32-bit init with 64-bit children.
Many thanks to stevek and marcel for getting Juniper's FreeBSD patches open
sourced to be used here, and to stevek for reviewing, and providing some
historical contexts on quirks of the code.
Reviewed by: stevek
Obtained from: Juniper (in part)
MFC after: 2 months
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D9433
===
From Nathan Whitehorn:
Open Firmware runs in virtual mode on the Powermac G5. This runs inside the
kernel page table, which preserves all address translations made by OF before
the kernel starts; as a result, the kernel address space is a strict superset of
OF's.
Where this explodes is if OF uses an unmapped SLB entry. The SLB fault handler
runs in real mode and refers to the PCPU pointer in SPRG0, which blows up the
kernel. Having a value of SPRG0 that works for the kernel is less fatal than
preserving OF's value in this case.
===
The result of this is seemingly random panics from NULL dereferences, or hangs
immediately upon boot. By not restoring SPRG0 for Open Firmware entry the
kernel PCPU pointer is preserved and SLB faults are successful, resulting in a
stable kernel.
PR: 205458
Reported by: several (over bugzilla, lists, IRC)
Reviewed by: andreast
Tested by: many (various forms)
MFC after: 2 weeks
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Add the necessary bits to enable kernel breakpoints for Book-E. The entrypoint
for program exception is very trivial, so rather than expand it to be similar to
AIM, add it into the standard trap handler.
This wasn't blocked out as Book-E specific because it is only a minor redundancy
over AIM, which should have already called db_trap_glue() at this point. If
it's going to panic with a fatal trap anywya, it doesn't matter if it goes
through this path again.
When committing DTrace in 2012/2013 era I inadvertently broke breakpoints, by
setting EXC_DTRACE to the same value as BKPT_INST. Change EXC_DTRACE to a
different, yet logically identical, trap (tw <all>,31,31).
MFC after: 2 weeks
This is required for FDT's standard "reg-io-width" property
(similar to "reg-shift" property) found in many DTS files.
This fixes operation on Altera Arria 10 SOC Development Kit,
where standard ns8250 uart allows 4-byte access only.
Reviewed by: kan, marcel
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9785
Convert PCIe hot plug support over to asking the firmware, if any, for
permission to use the HotPlug hardware. Implement pci_request_feature
for ACPI. All other host pci connections to allowing all valid feature
requests.
Sponsored by: Netflix
with geom_flashmap(4) and teach it about MMC for slicing enhanced
user data area partitions. The FDT slicer still is the default for
CFI, NAND and SPI flash on FDT-enabled platforms.
- In addition to a device_t, also pass the name of the GEOM provider
in question to the slicers as a single device may provide more than
provider.
- Build a geom_flashmap.ko.
- Use MODULE_VERSION() so other modules can depend on geom_flashmap(4).
- Remove redundant/superfluous GEOM routines that either do nothing
or provide/just call default GEOM (slice) functionality.
- Trim/adjust includes
Submitted by: jhibbits (RouterBoard bits)
Reviewed by: jhibbits
Implement get_pcpu() for amd64/sparc64/mips/powerpc, and use it to
replace pcpu_find(curcpu) in MI code.
Reviewed by: andreast, kan, lidl
Tested by: lidl(mips, sparc64), andreast(powerpc)
Differential Revision: https://reviews.freebsd.org/D9587
The types are for the byte offset and page index in vm object. They
are similar to off_t, which is defined as 64bit MI integer. Using MI
definitions will allow to provide consistent MD values of vm
object-related maximum sizes.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The switch to get_pcpu() in MI code seems to cause hangs on MIPS.
Back out until we can get a better idea of what's happening there.
Reported by: kan, lidl