Ensure we initialize the static environment when not booting via
loader(8), and provide a static buffer if this is the case. This fixes
two issues.
First, performing the initialization ensures that kenv variables set in
the kernel's config file are honored. Previously, any new or overridden
values were ignored.
Second, providing the static buffer allows variables to be set in the
device tree's bootargs property of the chosen node. This can be set by
u-boot or by QEMU's '-append' flag. Attempting to this prior to this
change resulted in an early panic, since the static environment had no
buffer backing it.
Submitted by: syrinx (earlier version)
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D25034
In pmap_bootstrap(), we fill kernel_pmap->pm_active since it is
invariably active on all harts. However, this marks it as active even
for harts that don't exist in the system, which can cause issue when the
mask is passed to the SBI firmware via sbi_remote_sfence_vma().
Specifically, the SBI spec allows SBI_ERR_INVALID_PARAM to be returned
when an invalid hart is set in the mask.
The latest version of OpenSBI does not have this issue, but v0.6 does,
and this is triggering a recently added KASSERT in CI. Switch to only
setting bits in pm_active for harts that enter the system.
Reported by: Jenkins
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27080
VM_ALLOC_WAITOK and vm_page_unwire_noq(), have eliminated the need for
many of the #includes.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D27052
- remove setting of register value which is not used until the next value is
set
- Use the L2_SHIFT constant when setting up L2 superpages
Submitted by: Antonin Houska <ah AT melesmeles DOT cz>
Version 0.2 of the SBI specification [1] marked the existing SBI
functions as "legacy" in order to move to a newer calling convention. It
also introduced a set of replacement extensions for some of the legacy
functionality. In particular, the TIME, IPI, and RFENCE extensions
implement and extend the semantics of their legacy counterparts, while
conforming to the newer version of the spec.
Update our SBI code to use the new replacement extensions when
available, and fall back to the legacy ones. These will eventually be
dropped, when support for version 0.2 is ubiquitous.
[1] https://github.com/riscv/riscv-sbi-doc/blob/master/riscv-sbi.adoc
Submitted by: Danjel Q. <danq1222@gmail.com>
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D26953
S-mode software has write access to the SIP.SSIP bit, so instead of
making a second round-trip through the SBI we can clear it ourselves.
The SBI spec has deprecated this function for this exactly this reason.
Submitted by: Danjel Q. <danq1222@gmail.com
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D26952
The existing names were inherited from arm64, but we should prefer
RISC-V terminology. Change the prefix to SCAUSE, and further change the
names to better match the RISC-V spec and be more consistent with one
another. Also, remove two codes that are not defined for S-mode (machine
and hypervisor ecall).
While here, apply style(9) to some condition checks.
Reviewed by: kp
Discussed with: jrtc27
Differential Revision: https://reviews.freebsd.org/D26918
As was done for L3 PTEs in r362853, mask out the reserved bits when
extracting the physical address from an L2 PTE. Future versions of the
spec or custom implementations may make use of these reserved bits, in
which case the resulting physical address could be incorrect.
Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: kp, mhorne
Differential Revision: https://reviews.freebsd.org/D26607
Hiding this feature behind RB_VERBOSE is gratuitous. The tunable is enough
to limit its use to only those who explicitly request it.
Suggested by: kevans
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.
Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no
functional change. The mechanism can be disabled with
debug.fxrng_vdso_enable=0.
arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time. Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed. On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.
This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator. That change is
left for future discussion/work.
Reviewed by: kib (previous version)
Approved by: csprng (me -- only touching FXRNG here)
Differential Revision: https://reviews.freebsd.org/D22839
Create the RISC-V NOTES and LINT files. As of r366559, LINT configs are
no longer generated but checked in to the tree.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D26502
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.
Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.
Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.
Reviewed by: tsoome
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26687
It is unlikely, but possible, that an unrecognized or unsupported
relocation type is encountered while trying to load a kernel module. If
this occurs we should offer the symbol index as a hint to the user.
While here, fix some small style issues.
Reviewed by: markj, kib (amd64 part, in D26701)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Since r366355 and r366284 we panic on access faults rather than treating
them like page faults so this condition is never true.
Reviewed by: jhb (mentor), markj, mhorne
Approved by: jhb (mentor), markj, mhorne
Differential Revision: https://reviews.freebsd.org/D26686
We should never take instruction page faults when in the kernel, but by
using the standard page fault code we should get a more-informative
message about faulting on a NOFAULT page rather than branching to the
default case here and printing an "Unknown kernel exception ..."
message.
Reviewed by: jhb (mentor), markj
Approved by: jhb (mentor), markj
Differential Revision: https://reviews.freebsd.org/D26685
These names were inherited from the arm64 port and should be changed to
the RISC-V terminology.
Reviewed by: jhb (mentor), kp, markj
Approved by: jhb (mentor), kp, markj
Differential Revision: https://reviews.freebsd.org/D26671
Access faults in user mode are treated like TLB misses, which leads to an
endless loop of faults. It's less serious than the same fault in kernel mode,
because we can just terminate the process, but that's not ideal.
Treat user mode access faults as a bus error.
Suggested by: jrtc27
Reviewed by: br, jhb
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D26621
Load/store/fetch access exceptions always indicate a violation of a PMP
rule. We can't treat those as page faults, because updating the page
table and trying again will only result in exactly the same access
exception recurring. This leaves us in an endless exception loop.
We cannot recover from these exceptions, so panic instead.
Reviewed by: jhb
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D26544
Every other architecture defines this and this is required for
interrupts to work when using QEMU's PCI VirtIO devices (which all
report an interrupt line of 0) for two reasons.
Firstly, interrupt line 0 is wrong; they use one of 0x20-0x23 with the
lines being cycled across devices like normal. Moreover, RISC-V uses
INTRNG, whose IRQs are virtual as indices into its irq_map, so even if
we have the right interrupt line we still need to try and route the
interrupt in order to ultimately call into intr_map_irq and get back a
unique index into the map for the given line, otherwise we will use
whatever happens to be in irq_map[line] (which for QEMU where the line
is initialised to 0 results in using the first allocated interrupt,
namely the RTC on IRQ 11 at time of commit).
Note that pci_assign_interrupt will still do the wrong thing for INTRNG
when using a tunable, as it will bypass INTRNG entirely and use the
tunable's value as the index into irq_map, when it should instead
(indirectly) call intr_map_irq to allocate a new entry for the given
IRQ and treat the tunable as stating the physical line in use, which is
what one would expect. This, however, is a problem shared by all INTRNG
architectures, and not exclusive to RISC-V.
Reviewed by: kib
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D26564
In the spirit of the GENERIC config, we should include the drivers required to
run on most supported platforms.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D26501
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).
Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.
Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.
Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129
One problem with the bus_space_read_N() and bus_space_write_N() family of
functions is that they provide no protection against exceptions which can
occur when no physical hardware or device responds to the read or write
cycles. In such a situation, the system typically would panic due to a
kernel-mode bus error. The bus_space_peek_N() and bus_space_poke_N() family
of functions provide a mechanism to handle these exceptions gracefully
without the risk of crashing the system.
Typical example is access to PCI(e) configuration space in bus enumeration
function on badly implemented PCI(e) root complexes (RK3399 or Neoverse
N1 N1SDP and/or access to PCI(e) register when device is in deep sleep state.
This commit adds a real implementation for arm64 only. The remaining
architectures have bus_space_peek()/bus_space_poke() emulated by using
bus_space_read()/bus_space_write() (without exception handling).
MFC after: 1 month
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D25371
that can be extended, but also ensure compile-time type checking. Refactor
common code out of arch-specific implementations. Move the mpr and mps
drivers to this new API. The template type remains visible to the consumer
so that it can be allocated on the stack, but should be considered opaque.
RISC-V is currently built with -Wno-format, which is how these went
undetected. Address them now before re-enabling those warnings.
Differential Revision: https://reviews.freebsd.org/D26319
The kernel adjusts the stack by TF_SIZE and the RISC-V ABI requires
that it remain 16-byte aligned.
Reported by: CHERI, jrtc27
Reviewed by: mhorne
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26328
Currently we use a single bit to indicate whether the virtual page is
part of a superpage. To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.
The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl. MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.
For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.
Reviewed by: alc, kib
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26238
This allows privileged userspace processes to find information about the
physical page backing a given mapping. It is useful in applications
such as DPDK which perform some of their own memory management.
Reviewed by: kib, jhb (previous version)
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D26237
sbi_init() sets mimpid, we can use that value.
Reviewed by: philip (mentor), kp (mentor)
Approved by: philip (mentor), kp (mentor)
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D26092
I observed hangs post-r362977 in QEMU with -smp 2, in which one thread
would acquire write access to an rm_lock (sysctllock) and get stuck
waiting in smp_rendezvous_cpus while the other CPU was servicing a trap.
The other thread was waiting for read access to the same lock, thus
causing deadlock.
It's clear that this is just one symptom of a larger problem. The
general expectation of MI kernel code is that interrupts are enabled.
Violating this assumption will at best create some additional latency,
but otherwise might cause locking or other unforeseen issues. All other
architectures do so for some subset of trap values, but this somehow got
missed in the RISC-V port. Enable interrupts now during kernel page
faults and for all user trap types.
The code in exception.S already knows to disable interrupts while
handling the return from exception, so there are no changes required
there.
Reviewed by: jhb, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26017
- Properly set up the frame pointer
- Hang if we return from mi_startup
- Whitespace
Clearing the frame pointer marks the end of the backtrace. This fixes
"bt 0" in ddb, which previously would unwind one frame too far.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D26016
This same check is used on other architectures. Previously this would
permit a stack frame to unwind into any arbitrary kernel address
(including unmapped addresses).
Reviewed by: mhorne
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D25996
There was an additional 7 bytes of compiler-inserted padding at the
end of the structure visible via 'ptype /o' in gdb.
Reviewed by: mhorne
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D25867
RISC-V doesn't support floating-point exceptions.
RISC-V Instruction Set Manual: Volume I: User-Level ISA, 11.2 Floating-Point
Control and Status Register: "As allowed by the standard, we do not support
traps on floating-point exceptions in the base ISA, but instead require
explicit checks of the flags in software. We considered adding branches
controlled directly by the contents of the floating-point accrued exception
flags, but ultimately chose to omit these instructions to keep the ISA simple."
We still need these functions, because some applications (notably Perl) call
them, but we cannot provide a meaningful implementation.
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D25740
QEMU's RISC-V virt machine provides syscon-power and syscon-reset
devices as the means by which to shutdown and reboot. We also need to
ensure that we have attached the syscon_generic device before attaching
any syscon_power devices, and so we introduce a new riscv_syscon device
akin to aw_syscon added in r327936. Currently the SiFive test finisher
is used as the specific implementation of such a syscon device.
Reviewed by: br, brooks (mentor), jhb (mentor)
Approved by: br, brooks (mentor), jhb (mentor)
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D25725
This device was originally used as part of the goldfish virtual hardware
platform used for emulating Android on QEMU, but is now also used as the
RTC for the RISC-V virt machine in QEMU. It provides a simple 64-bit
nanosecond timer exposed via a pair of memory-mapped 32-bit registers,
although only with 1s granularity.
Reviewed by: brooks (mentor), jhb (mentor), kp
Approved by: brooks (mentor), jhb (mentor), kp
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D25717
Being able to use tmpfs without kernel modules is very useful when building
small MFS_ROOT kernels without a real file system.
Including TMPFS also matches arm/GENERIC and the MIPS std.MALTA configs.
Compiling TMPFS only adds 4 .c files so this should not make much of a
difference to NO_MODULES build times (as we do for our minimal RISC-V
images).
Reviewed By: br (earlier version for riscv), brooks, emaste
Differential Revision: https://reviews.freebsd.org/D25317
The size of the containing structure was passed instead of the size of
the array. This happened to be harmless as the extra word copied is
one we copy in the next line anyway.
Reported by: CHERI (bounds check violation)
Reviewed by: brooks, imp
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D25791
During device attachment, all interrupt sources will bind to the BSP,
as it is the only processor online. This means interrupts must be
redistributed ("shuffled") later, during SI_SUB_SMP.
For the EARLY_AP_STARTUP case, this is no longer true. SI_SUB_SMP will
execute much earlier, meaning APs will be online and available before
devices begin attachment, and there will therefore be nothing to
shuffle.
All PIC-conforming interrupt controllers will handle this early
distribution properly, except for RISC-V's PLIC. Make the necessary
tweak to the PLIC driver.
While here, convert irq_assign_cpu from a boolean_t to a bool.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D25693
The FDT may contain a short /chosen/bootargs string which we should pass
to boot_parse_cmdline. Notably, this allows the use of qemu's -append
option to pass things like -s to boot to single user mode.
Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: mhorne
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25544
This removes SCTP from in-tree kernel configuration files. Now, SCTP
can be enabled by simply loading the module, as discussed on
freebsd-net@.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25611
We cannot complete the interrupt (i.e. write to the claims/complete register
until the interrupt handler has actually run. We don't run the interrupt
handler immediately from intr_isrc_dispatch(), we only schedule it for later
execution.
If we immediately complete it (i.e. before the interrupt handler proper has
run) the interrupt may be triggered again if the interrupt source remains set.
From RISC-V Instruction Set Manual: Volume II: Priviliged Architecture, 7.4
Interrupt Gateways:
"If a level-sensitive interrupt source deasserts the interrupt after the PLIC
core accepts the request and before the interrupt is serviced, the interrupt
request remains present in the IP bit of the PLIC core and will be serviced by
a handler, which will then have to determine that the interrupt device no
longer requires service."
In other words, we may receive interrupts twice.
Avoid that by postponing the completion until after the interrupt handler has
run.
If the interrupt is handled by a filter rather than by scheduling an interrupt
thread we must also complete the interrupt, so set up a post_filter handler
(which is the same as the post_ithread handler).
Reviewed by: mhorne
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D25531
This implementation doesn't have any major deviations from the other EFI
ports. I've copied the boilerplate from arm and arm64.
I've tested this with the following boot flows:
OpenSBI (M-mode) -> u-boot (S-mode) -> loader.efi -> FreeBSD
OpenSBI (M-mode) -> u-boot (S-mode) -> boot1.efi -> loader.efi -> FreeBSD
Due to the way that u-boot handles secondary CPUs, OpenSBI >= v0.7 is required,
as the HSM extension is needed to bring them up explicitly. Because of this,
using BBL as the SBI implementation will not be possible. Additionally, there
are a few recent u-boot changes that are required as well, all of which will be
present in the upcoming v2020.07 release.
Looks good: emaste
Differential Revision: https://reviews.freebsd.org/D25135
The top 10 bits of a pte are reserved by specification[1] and are not part of
the PPN.
[1] 'Volume II: RISC-V Privileged Architectures V20190608-Priv-MSU-Ratified',
'4.4.1 Addressing and Memory Protection', page 72: "The PTE format for Sv39 is
shown in Figure 4.18. ... Bits 63–54 are reserved for future use and must be
zeroed by software for forward compatibility."
Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: kp, mhorne
Differential Revision: https://reviews.freebsd.org/D25523
A very minor micro-optimization; t0 is not clobbered between the loop top and
bottom and there appear to be no other branches to this label.
Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk>
Reviewed by: mhorne
Differential Revision: https://reviews.freebsd.org/D25524
If we panic we dump the registers for debugging. This is very useful, but it
missed several registers (ra, sp, gp and tp).
Log these as well. Especially the return address value is extremely useful.
Sponsored by: Axiado
This temporary mapping will become optional. Booting via loader(8)
means that the DTB will have already been copied into the kernel's
staging area, and is therefore covered by the early KVA mappings.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D24911
In locore, we must detect and handle different arguments passed by
loader(8) compared to what we recieve when booting directly via SBI
firmware. Currently we receive the hart ID in a0 and a pointer to the
device tree blob in a1. loader(8) provides only a pointer to its
metadata in a0.
The solution to this is to add an additional entry point, _alt_start.
This will be placed first in the .text section, so SBI firmware will
enter here, and jump to the common pagetable setup shortly after. Since
loader(8) understands our ELF kernel, it will enter at the ELF's entry
address, which points to _start. This approach leads to very little
guesswork as to which way we booted.
Fix-up initriscv() to parse the loader's metadata, continuing to use
fake_preload_metadata() in the SBI direct boot case.
Reviewed by: markj, jrtc27 (asm portion)
Differential Revision: https://reviews.freebsd.org/D24912
Currently we only call sbi_shutdown in cpu_reset, which means we reach
"Please press any key to reboot." even when RB_POWEROFF is set, and only
once the user presses a key do we then shutdown. Instead, register a
shutdown_final event handler and make an SBI shutdown call if
RB_POWEROFF is set.
Reviewed by: br, jhb (mentor), kp
Approved by: br, jhb (mentor), kp
Differential Revision: https://reviews.freebsd.org/D25183
This can happen with very large kernels (e.g. ones embedding a root
filesystem). The DTB written by OpenSBI/BBL is quite small so this is
unlikely to hit important data, but if it does this can result in very
confusing and hard-to-debug crashes. Add a KASSERT() and a verbose print
to catch this problem with debug kernels.
While this will not print any output by default if it fails (that would
depend on EARLY_PRINTF), at least the kernel now halts reliably instead
of randomly crashing.
Reviewed By: mhorne
Differential Revision: https://reviews.freebsd.org/D25153
By default OpenSBI and BBL will pass the DTB at a 2MB-aligned address.
However, by default there are no 2MB aligned regions between the SBI and
the kernel, so we have to choose a 2MB aligned region after the kernel.
OpenSBI defaults to placing the DTB 32MB after the start of the kernel but
this is not sufficient for a kernel with a large MFS embedded.
We could increase this offset to a larger number (e.g. 64/128/256) but that
imposes restrictions on the minimum RAM size.
Another solution would be to place the DTB between OpenSBI and the kernel
at 1MB alignment, but current locore.S code assumes 2MB alignment.
With this change I can now boot on QEMU with an OpenSBI configured to
store the DTB at an offset of 1MB.
See also https://github.com/riscv/opensbi/issues/169
Reviewed By: mhorne
Differential Revision: https://reviews.freebsd.org/D25151
The trampoline code used for loading gzipped a.out kernels on arm was
removed in r350436. A portion of this code allowed for DDB to find the
symbol tables when booting without loader(8), and some of this was
untouched in the removal. Remove it now.
Differential Revision: https://reviews.freebsd.org/D24950
This is in preparation for booting via loader(8). Lift these macros from arm64
so we don't need to worry about the size when inserting new elements. This
could have been done in r359673, but I didn't think I would be returning to
this function so soon.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D24910
This reapplies logical r360944 and r360946 (reverting r360955), with fixed
copystr() stand-in replacement macro. Eventually the goal is to convert
consumers and kill the macro, but for a first step it helps if the macro is
correct.
Prior commit message:
Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults. It's just
an older incarnation of the now-more-common strlcpy().
Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.
Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy (with correction from brooks@ -- thanks).
Remove N redundant MI implementations of copystr. For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.
Reviewed by: jhb (earlier version)
Discussed with: brooks (thanks!)
Differential Revision: https://reviews.freebsd.org/D24672
When protecting a superpage, we would previously fall through to the
non-superpage case and read the contents of the superpage as PTEs,
potentially modifying them and trying to look up underlying VM pages that
don't exist if they happen to look like PTEs we would care about. This led
to nginx causing an unexpected page fault in pmap_protect that panic'ed the
kernel. Instead, if we see a superpage, we are done for this range and
should continue to the next.
Reviewed by: markj, jhb (mentor)
Approved by: markj, jhb (mentor)
Differential Revision: https://reviews.freebsd.org/D24827
Unlike the other copy*() functions, it does not serve to copy from one
address space to another or protect against potential faults. It's just
an older incarnation of the now-more-common strlcpy().
Add a coccinelle script to tools/ which can be used to mechanically
convert existing instances where replacement with strlcpy is trivial.
In the two cases which matched, fuse_vfsops.c and union_vfsops.c, the
code was further refactored manually to simplify.
Replace the declaration of copystr() in systm.h with a small macro
wrapper around strlcpy.
Remove N redundant MI implementations of copystr. For MIPS, this
entailed inlining the assembler copystr into the only consumer,
copyinstr, and making the latter a leaf function.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D24672
The addition of the HSM SBI extension to OpenSBI introduces a new
breaking change: secondary harts will remain parked in the firmware,
until they are brought up explicitly via sbi_hsm_hart_start(). Add
the call to do this, sending the secondary harts to mpentry.
If the HSM extension is not present, secondary harts are assumed to be
released by the firmware, as is the case for OpenSBI =< v0.6 and BBL.
In the case that the HSM call fails we exclude the CPU, notify the
user, and allow the system to proceed with booting.
Reviewed by: markj (older version)
Differential Revision: https://reviews.freebsd.org/D24497
APs enter the kernel at the same point as the BSP, the _start routine.
They then jump to mpentry, but not before storing the kernel's physical
load address in the s9 register. Extract this calculation into its own
routine, so that APs can be instructed to enter directly from mpentry.
Differential Revision: https://reviews.freebsd.org/D24495
Now that hw.machine_arch handles soft-float vs hard-float there is no
longer a reason for this config.
Submitted by: mhorne (kern.mk hunk)
Reviewed by: imp (earlier version), kp
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24544
For userland, MACHINE_ARCH reflects the current ABI via preprocessor
directives. For the kernel, the hw.machine_arch sysctl uses the ELF
header flags of the current process to select the correct MACHINE_ARCH
value.
Reviewed by: imp, kp
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24543
pmap_bootstrap() expects the kernel's physical load address, but we have
been providing the start of physical memory. This had the nice effect of
protecting the memory used by the SBI runtime firmware, but now that we
have alternate means of achieving that, we should provide the correct
value. This will free up any memory between the SBI firmware and the
kernel for allocation.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D24156
The device tree may contain a "reserved-memory" node, whose purpose is
to communicate sections of physical memory that should not be used for
general allocations. Add the logic to parse and exclude these regions.
The particular motivation for this is protection of the SBI runtime
firmware. Currently, there is no mechanism through which the SBI
can communicate the details of its reserved memory region(s) to
a supervisor payload. There has been some discussion recently on how
this can be achieved [1], and it seems that the path going forward
will be to add an entry to the reserved-memory node.
This hasn't caused any issues for us yet, since we exclude all physical
memory below the kernel's load address from being allocated, and on all
currently supported platforms this covers the SBI firmware region. This
will change in another commit, so as a safety measure, ensure that the
lowest 2MB of memory is excluded if this region has not been reported.
[1] https://github.com/riscv/riscv-sbi-doc/pull/37
Reviewed by: markj, nick (older version)
Differential Revision: https://reviews.freebsd.org/D24155
Replace our hand-rolled functions with the generic ones provided by
kern/subr_physmem.c. This greatly simplifies the initialization of
physical memory regions and kernel globals.
Tested by: nick
Differential Revision: https://reviews.freebsd.org/D24154
The location of the device-tree blob is passed to the kernel by the
previous booting stage (i.e. BBL or OpenSBI). Currently, we leave it
untouched and mark the 1MB of memory holding it as unavailable.
Instead, do what is done by other fake_preload_metadata() routines and
copy to the DTB to KVA space. This is more in line with what loader(8)
will provide us in the future, and it allows us to reclaim the hole in
physical memory.
Reviewed by: markj, kp (earlier version)
Differential Revision: https://reviews.freebsd.org/D24152
The only way to flush the local hart's icache is with a FENCE.I (or an
equivalent SBI call); a normal FENCE is insufficient and, for the
single-hart case, unnecessary.
Reviewed by: jhb (mentor), markj
Approved by: jhb (mentor), markj
Differential Revision: https://reviews.freebsd.org/D24317
Summary:
The parentheses being in the wrong place means that, for L3 pages,
oldpte has all bits except PTE_V cleared, and so all the subsequent
checks against oldpte will fail, causing us to bail out and not retry
the faulting instruction after an SFENCE.VMA. This causes a WITNESS +
INVARIANTS kernel to fault on the "Chisel P3" (BOOM-based) DARPA SSITH
GFE SoC in pmap_init when writing to pv_table and, being a nofault
entry, subsequently panic with:
panic: vm_fault_lookup: fault on nofault entry, addr: 0xffffffc004e00000
Reviewed by: markj
Approved by: markj
Differential Revision: https://reviews.freebsd.org/D24315
This driver supports SiFive's FE310 Always-on (AON) peripheral's
Real-time clock (RTC) and Watchdog timer (WDT). AON has other
functionality that this driver could support such as the power
management unit (PMU) but that functionality hasn't been implemented.
Reviewed by: philip (mentor), kp (mentor)
Approved by: philip (mentor)
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D24170
Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.
PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837
Instead, dynamically allocate a page for the boot stack of each AP when
starting them up, like we do on x86. This shrinks the bss by
MAXCPU*KSTACK_PAGES pages, which corresponds to 4MB on arm64 and 256KB
on riscv.
Duplicate the logic used on x86 to free the bootstacks, by using a
sysinit to wait for each AP to switch to a thread before freeing its
stack.
While here, mark some static MD variables as such.
Reviewed by: kib
MFC after: 1 month
Sponsored by: Juniper Networks, Klara Inc.
Differential Revision: https://reviews.freebsd.org/D24158
When I implemented MD DYNAMIC parsing, I was originally passing a
linker_file_t so that the MD code could relocate pointers.
However, it turns out this isn't even filled in until later, so it was
always 0.
Just pass the load base (ef->address) directly, as that's really the only
thing we were interested in in the first place.
This fixes a crash on RB800 where it was trying to write to an unmapped
address when updating the GOT.
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D24105
This silences an "unused label" warning as well as fixes the attach fail
path that wasn't releasing resources.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D24004
Use __riscv_flen instead of __riscv_float_abi_soft. While the latter works for
userland (and one could argue it's more correct), it fails for the kernel. We
compile the kernel with -mabi=lp64 (eg soft float abi) to avoid floating point
instructions in the kernel. We also compile the kernel -march=rv64imafdc for
hard float kernels (eg those with options FPE), but with -march=rv64imac for
softfloat kernels (eg those with FPE). Since we do this, in the kernel (as in
userland) __riscv_flen will be defined for 'riscv64' and not for 'riscv64sf'.
This also removes the -DMACHINE_ARCH hack now that it's no longer needed.
Longer term, we should return the ABI from the sysctl hw.machine_arch like on
amd64 for i386 binaries.
Suggested by: mhorne@
Differential Revision: https://reviews.freebsd.org/D23813
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
GENERICSF is just like GENERIC, only creates a soft-float kernel. Omit it from the
universe build for now.
Reviewed by: philip
Differential Revision: https://reviews.freebsd.org/D23812
MACHINE_ARCH sets the hw.machine_arch sysctl in the kernel. In userspace
it sets MACHINE_ARCH in bmake, which bsd.cpu.mk uses to configure the
target ABI for ports.
For riscv64sf builds (i.e. soft-float) that needs to be riscv64sf, but
the sysctl didn't reflect that. It is static.
Set the define from the riscv makefile so that we correctly reflect our
actual build (i.e. riscv64 or riscv64sf), depending on what TARGET_ARCH
we were built with.
That still doesn't satisfy userspace builds (e.g. bmake), so check if
we're building with a software-floating point toolchain there. That
check doesn't work in the kernel, because it never uses floating point.
Reviewed by: philip (previous version), mhorne
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D23741
This is taken from the arm64 version, with the following simplifications:
- Our current pmap implementation uses a 3-level paging scheme
- The "mode" field has been omitted since RISC-V PTEs don't encode
typical mode attributes
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23594
Always use the kdb_thr_ctx() for db_trace_thread() as on other
architectures. Initialize pcb_ra to be the sepc from the saved
trapframe rather than the saved ra to avoid skipping a frame.
Reviewed by: mhorne, br
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23513
While here, add an extra line of information for exceptions and
interrupts and compress the per-frame line down to one line to match
other architectures.
Reviewed by: mhorne, br
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23508
While cause codes higher than 16 are reserved, the exception code
field of the register is defined to be all bits but the upper-most
bit.
Reviewed by: mhorne
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23510
While here, remove a local variable to avoid the CSR read in non-debug
kernels.
Reviewed by: mhorne
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23511
In practice this discarded all characters entered at the DDB prompt.
Reviewed by: br
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23509
This fixes continuing from debug.kdb.enter=1 after enabling the use of
compressed instructions since the compiler can emit the two byte
c.ebreak instead of the 4 byte ebreak.
Reviewed by: br
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23507
This reverts r177661. The change is no longer very useful since
out-of-tree KLDs will be built to target SMP kernels anyway. Moveover
it breaks the KBI in !SMP builds since cpuset_t's layout depends on the
value of MAXCPU, and several kernel interfaces, notably
smp_rendezvous_cpus(), take a cpuset_t as a parameter.
PR: 243711
Reviewed by: jhb, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23512
The PRCI exports tlclk as a constant fixed divisor clock, defined as 1/2
of the coreclk frequency. In older FU540 device trees (such as the one
provided by SiFive), tlclk is represented as its own entity, and is
automatically registered as a fixed-divisor-clock. Unfortunately the
upstream FU540 device tree (that we have in our tree) represents tlclk
as an output of the PRCI block, and we must register it manually. At
worst, users of the old device tree will end up with an unreferenced
duplicate of tlclk.
This fixes device attachment for the SiFive UART on newer device trees,
since it references tlclk via the PRCI.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D23406
Add two additional compat strings that can be used to identify the PRCI.
With newer device trees the PRCI has two parents, hfclk and rtcclk, so
allow the driver to attach when more than one parent is found.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D23405
The PRCI module exports three PLLs. Currently only the coreclk/corepll
is registered, so add the logic to register the DDR (memory) and GEMGX
(ethernet) clocks as well. These clocks are unused at the moment.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D23404
Rather than trying to blacklist which bits userland can't write to via
sigreturn() or setcontext(), only permit changes to whitelisted bits.
- Permit arbitrary writes to bits in the user-writable USTATUS
register that shadows SSTATUS.
- Ignore changes in write-only bits maintained by the CPU.
- Ignore the user-supplied value of the FS field used to track
floating point state and instead set it to a value matching the
actions taken by set_fpcontext().
Discussed with: mhorne
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23338
After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.
Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.
Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.
Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355
The stack pointer is swapped with the sscratch CSR just before the
jump to cpu_exception_handler_user where the first instruction swaps
it again. The two swaps together are a no-op, but the csr swap
instructions can be expensive (e.g. on Bluespec RISC-V cores csr swap
instructions force a full pipeline stall).
Reported by: jrtc27
Reviewed by: br
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23394
cpu_switch/throw() and savectx() do not save or restore any values in
these fields which mostly held non-callee-save registers.
makectx() copied these fields from kdb_frame, but they weren't used
except for PC_REGS using pcb_sepc. Change PC_REGS to use
kdb_frame->tf_sepc directly instead.
Reviewed by: br
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23395
The SD bit is defined as the MSB of the sstatus register, meaning its
position will vary depending on the CSR's length. Previously, there were
two (unused) defines for this, for the 32 and 64-bit cases, but their
definitions were swapped.
Consolidate these into one define: SSTATUS_SD, and make the definition
dependent on the value of __riscv_xlen.
Reviewed by: br
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D23402
Previously, this check was only in sys_sigreturn() which meant that
user applications could write invalid values to the register via
setcontext() or swapcontext().
Reviewed by: mhorne
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23219
arm64 and riscv were only saving and restoring floating point
registers for sendsig() and sys_sigreturn(), but not for getcontext(),
setcontext(), and swapcontext().
While here, remove an always-false check for uap being NULL from
sys_sigreturn().
Reviewed by: br, mhorne
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23218
As part of the RISC-V ABI, the gp register is expected to initialized
with the address of __global_pointer$ as early as possible. This allows
loads and stores from .sdata to be relaxed based on the value of gp. In
locore.S we do this initialization twice, once each for va and mpva.
However, in both cases the initialization is preceded by an la
instruction, which in theory could be relaxed by the linker.
Move the initialization of gp to be slightly earlier (before la
cpu_exception_handler), and add an additional gp initialization at the
very beginning of _start, before virtual memory is set up.
Reported by: jrtc27
Reviewed by: jrtc27
Differential Revision: https://reviews.freebsd.org/D23139
as these functions should do zero-extend.
Discovered by running pci_read_cap(), and by hint from James Clarke.
Reviewed by: James Clarke <jrtc27@jrtc27.com>
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D23236
This is a follow up to r356481. In locore.S, before virtual memory is
set up, we should avoid using indirect address lookups through the GOT.
Therefore we need to convert uses of the la instruction to lla, which
always generates an auipc/addi pair of instructions. This conversion was
done for the BSP case, but not the AP case, resulting in a fault
somewhere before mpva and a failure to bring APs online.
Reported by: lwhsu
Reviewed by: lwhsu, jrtc27 (accepted in a comment)
Differential Revision: https://reviews.freebsd.org/D23138
lld on RISC-V is not yet able to handle undefined weak symbols for
non-PIC code in the code model (medany/medium) used by the RISC-V
kernel.
Both GCC and clang emit an auipc / addi pair of instructions to
generate an address relative to the current PC with a 31-bit offset.
Undefined weak symbols need to have an address of 0, but the kernel
runs with PC values much greater than 2^31, so there is no way to
construct a NULL pointer as a PC-relative value. The bfd linker
rewrites the instruction pair to use lui / addi with values of 0 to
force a NULL pointer address. (There are similar cases for 'ld'
becoming auipc / ld that bfd rewrites to lui / ld with an address of
0.)
To work around this, compile the kernel with -fPIE when using lld.
This does not make the kernel position-independent, but it does
force the compiler to indirect address lookups through GOT entries
(so auipc / ld against a GOT entry to fetch the address). This
adds extra memory indirections for global symbols, so should be
disabled once lld is finally fixed.
A few 'la' instructions in locore that depend on PC-relative
addressing to load physical addresses before paging is enabled have to
use auipc / addi and not indirect via GOT entries, so change those to
use 'lla' which always uses auipc / addi for both PIC and non-PIC.
Submitted by: jrtc27
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D23064
Implement support for the UART as found on the SiFive FU540. It should
also work on, but has not been tested with, the FU310.
Reviewed by: philip
Sponsored by: Axiado
There's no point in checking for absent CPUs if we're not going to do anything
about either the present or absent case. This loop can just be removed.
Reviewed by: philip
Sponsored by: Axiado
With WITNESS enabled we see the following warning:
lock order reversal: (sleepable after non-sleepable)
1st 0xffffffd0847c7210 fu540spi0 (fu540spi0) @
/usr/home/kp/axiado/hornet-freebsd/src/sys/riscv/sifive/fu540_spi.c:297
2nd 0xffffffc00372bb30 Clock topology lock (Clock topology lock) @
/usr/home/kp/axiado/hornet-freebsd/src/sys/dev/extres/clk/clk.c:1137
stack backtrace:
#0 0xffffffc0002a579e at witness_checkorder+0xb72
#1 0xffffffc0002a5556 at witness_checkorder+0x92a
#2 0xffffffc000254c7a at _sx_slock_int+0x66
#3 0xffffffc00025537a at _sx_slock+0x8
#4 0xffffffc000123022 at clk_get_freq+0x38
#5 0xffffffc0005463e4 at __clzdi2+0x2bb8
#6 0xffffffc00014af58 at randomdev_getkey+0x76e
#7 0xffffffc0001278b0 at simplebus_add_device+0x7ee
#8 0xffffffc00027c9a8 at device_attach+0x2e6
#9 0xffffffc00027c634 at device_probe_and_attach+0x7a
#10 0xffffffc00027d76a at bus_generic_attach+0x10
#11 0xffffffc00014aab0 at randomdev_getkey+0x2c6
#12 0xffffffc00027c9a8 at device_attach+0x2e6
#13 0xffffffc00027c634 at device_probe_and_attach+0x7a
#14 0xffffffc00027d76a at bus_generic_attach+0x10
#15 0xffffffc000278bd2 at config_intrhook_oneshot+0x52
#16 0xffffffc000278b3e at config_intrhook_establish+0x146
#17 0xffffffc000278cf2 at config_intrhook_disestablish+0xfe
The clock topology lock can sleep, which means we cannot attempt to
acquire it while holding the non-sleepable mutex.
Fix that by retrieving the clock speed once, during attach and not every
time during SPI transaction setup.
Submitted by: kp
Sponsored by: Axiado
Due to clang and LLD's tendency to use a PLT for builtins, and as they
don't have full support for EABI, we sometimes have to deal with a PLT in
.ko files in a clang-built kernel.
As such, augment the in-kernel linker to support jump table processing.
As there is no particular reason to support lazy binding in kernel modules,
only implement Secure-PLT immediate binding.
As part of these changes, add elf_cpu_parse_dynamic() to the MD API of the
in-kernel linker (except on platforms that use raw object files.)
The new function will allow MD code to act on MD tags in _DYNAMIC.
Use this new function in the PowerPC MD code to ensure BSS-PLT modules using
PLT will be rejected during insertion, and to poison the runtime resolver to
ensure we get a clear panic reason if a call is made to the resolver.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22608
off the stack, initialized to default values, and then filled in with
driver-specific values, all without having to worry about the numerous
other fields in the tag. The resulting template is then passed into
busdma and the normal opaque tag object created. See the man page for
details on how to initialize a template.
Templates do not support tag filters. Filters have been broken for many
years, and only existed for an ancient make/model of hardware that had a
quirky DMA engine. Instead of breaking the ABI/API and changing the
arugment signature of bus_dma_tag_create() to remove the filter arguments,
templates allow us to ignore them, and also significantly reduce the
complexity of creating and managing tags.
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D22906
The SiFive FU540 Power Reset Clocking Interrupt block contains a PLL
that turns the input crystal (33.3MHz) into a 1-1.5GHz clock.
This clock in turn is divided by two to produce the tlclk, which is fed
into devices such as the SPI and I2C controllers.
Register a new clock device for the PRCI so that those devices can
read the correct clock through the clk framework.
Submitted by: kp
Sponsored by: Axiado
fix an assert violation introduced in r355784. Without this spinlock_exit()
may see owepreempt and switch before reducing the spinlock count. amd64
had been optimized to do a single critical enter/exit regardless of the
number of spinlocks which avoided the problem and this optimization had
not been applied elsewhere.
Reported by: emaste
Suggested by: rlibby
Discussed with: jhb, rlibby
Tested by: manu (arm64)
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
o Remove All Rights Reserved from my notices
o imp@FreeBSD.org everywhere
o regularize punctiation, eliminate date ranges
o Make sure that it's clear that I don't claim All Rights reserved by listing
All Rights Reserved on same line as other copyright holders (but not
me). Other such holders are also listed last where it's clear.
- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.
Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501
RISC-V inherited this code from arm64, so implement the fix from r354712.
See the revision for the full description.
Submitted by: kevans (arm64 version)
Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.
Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355
SBI version 0.2 introduces functions for obtaining the details of the
SBI implementation, such as version and implemntation ID. Print this
info at startup when it is available.
Reviewed by: jhb, kp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D22327
The Supervisor Binary Interface (SBI) specification v0.2 is a backwards
incompatible update to the SBI call interface for kernels running in
supervisor mode. The goal of this update was to make it easier for new
and optional functionality to be added to the SBI.
SBI functions are now called by passing an "extension ID" and a
"function ID" which are passed in a7 and a6 respectively. SBI calls
will also return an error and value in the following struct:
struct sbi_ret {
long error;
long value;
}
This version introduces several new functions under the "base"
extension. It is expected that all SBI implementations >= 0.2 will
support this base set of functions, as they implement some essential
services such as obtaining the SBI version, CPU implementation info, and
extension probing.
Existing SBI functions have been designated as "legacy". For the time
being they will remain implemented, but it is expected that in the
future their functionality will be duplicated or replaced by new SBI
extensions. Each legacy function has been assigned its own extension ID,
and for now we simply probe and assert for their existence.
Compatibility with legacy SBI implementations (such as BBL) is
maintained by checking the output of sbi_get_spec_version(). This
function is guaranteed to succeed by the new spec, but will return an
error in legacy implementations. We use this as an indicator of whether
or not we can rely on the new SBI base extensions.
For further info on the Supervisor Binary Interface, see:
https://github.com/riscv/riscv-sbi-doc/blob/master/riscv-sbi.adoc
Reviewed by: kp, jhb
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D22326
Allow for an additional argument to sbi_call which will be passed in a6.
This is required for SBI spec 0.2 support, as a6 will indicate the SBI
function ID.
While here, introduce some macros to clean up the calls.
Reviewed by: kp, jhb
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D22325
Our PLIC implementation only enables interrupts on the boot cpu.
Implement plic_bind_intr() so that they can be redistributed near the
end of boot during intr_irq_shuffle().
This also slightly modifies how enable bits are handled in an attempt to
better fit the PIC interface. plic_enable_intr()/plic_disable_intr() are
converted to manage an interrupt source's threshold value, since this
value can be used as to globally enable/disable an irq. All handing of the
per-context enable bits is moved to the new methods plic_setup_intr()
and plic_bind_intr().
Reviewed by: br
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D21928
The RISC-V PLIC (platform level interrupt controller) registers are divided up
by "context", which is purposefully left ambiguous in the PLIC spec. Currently
we assume each CPU number corresponds 1-to-1 with a context number, but that is
not correct. Most existing PLIC implementations (such as SiFive's) have
multiple contexts per-cpu. For example, a single CPU might have a context for
machine mode interrupts and a context for supervisor mode interrupts. To
complicate things further, FreeBSD renumbers the CPUs during boot, but the PLIC
driver still assumes that CPU ID equals the RISC-V hart number, meaning
interrupt enables/claims might be performed for the wrong context registers.
To fix this, we must calculate each CPU's context number during
attachment. This is done by reading the interrupt properties from the
device tree, from which a mapping from context to RISC-V hart to CPU
number can be created.
Reviewed by: br
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D21927
This option is causing boot to fail for the Hifive Unleashed and older
versions of QEMU (3.1.1). Remove it from the GENERIC config for now.
Reported by: br
MFC after: 1 week
The fill_elf_hwcap() function expects to find only cpu nodes under the
/cpus entry of the device tree. Newer versions of QEMU insert a cpu-map
node which describes the CPU topology, breaking this function. To fix
this, simply skip any non-cpu entries.
Reviewed by: markj, kp, jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22151
The lr.w instruction used to read the value from memory sign-extends
the value read from memory. GCC sign-extends the 32-bit comparison
value passed in whereas clang currently does not. As a result, if the
value being compared has the MSB set, the comparison fails for
matching 32-bit values when compiled with clang.
Use a cast to explicitly sign-extend the unsigned comparison value.
This works with both GCC and clang.
There is commentary in the RISC-V spec that suggests that GCC's
approach is more correct, but it is not clear if the commentary in the
RISC-V spec is binding.
Reviewed by: mhorne
Obtained from: Axiado
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22084
- td_kstack_pages was not being initialized.
- td_kstack is supposed to be the base address of the stack region,
not the top.
The arm ports seem to have similar problems and will be fixed next.
Reported by: Jenkins via lwhsu
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.
The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823
callers hold it.
This simplifies pmap code and removes a dependency on the object lock.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21596
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
be called on bootverbose. Do so on RISV-V too.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by: imp, kp
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D21998
the static device mappings.
While RISC-V support was added to subr_devmap.c in r298631, it was never
actually initialised in the machine dependent code.
Submitted by: Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by: br, kp
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D21975
During a CoW fault, we must check for both 4KB and 2MB mappings before
clearing PGA_WRITEABLE on the old mapping's page. Previously we were
only checking for 4KB mappings. This was missed in r344106.
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
pmap_remove_l3() may remove the last mapping of a page, in which case
it must clear PGA_WRITEABLE.
Reported by: Jenkins, via lwhsu
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
We must also check for large mappings. pmap_page_is_mapped() is
mostly used in assertions, so the problem was not very noticeable.
Reviewed by: alc
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21824
Centralize calculation of signal and ucode delivered on unhandled page
fault in new function vm_fault_trap(). MD trap_pfault() now almost
always uses the signal numbers and error codes calculated in
consistent MI way.
This introduces the protection fault compatibility sysctls to all
non-x86 architectures which did not have that bug, but apparently they
were already much more wrong in selecting delivered signals on
protection violations.
Change the delivered signal for accesses to mapped area after the
backing object was truncated. According to POSIX description for
mmap(2):
The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified
portions of the last page of an object which are beyond its
end. References within the address range starting at pa and
continuing for len bytes to whole pages following the end of an
object shall result in delivery of a SIGBUS signal.
An implementation may generate SIGBUS signals when a reference
would cause an error in the mapped object, such as out-of-space
condition.
Adjust according to the description, keeping the existing
compatibility code for SIGSEGV/SIGBUS on protection failures.
For situations where kernel cannot handle page fault due to resource
limit enforcement, SIGBUS with a new error code BUS_OBJERR is
delivered. Also, provide a new error code SEGV_PKUERR for SIGSEGV on
amd64 due to protection key access violation.
vm_fault_hold() is renamed to vm_fault(). Fixed some nits in
trap_pfault()s like mis-interpreting Mach errors as errnos. Removed
unneeded truncations of the fault addresses reported by hardware.
PR: 211924
Reviewed by: alc
Discussed with: jilles, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21566
In a few cases, the symbol lookup is missing before attempting to
perform the relocation. While the relocation types affected are
currently unused, this results in an uninitialized variable warning,
that is escalated to an error when building with clang.
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21773
Fix some style(9) violations.
This also changes the name of the machine-dependent sysctl kern.debug_kld to
debug.kld_reloc, and changes its type from int to bool. This is acceptable
since we are not currently concerned with preserving the RISC-V ABI.
Reviewed by: markj, kp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21772
Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.
Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768
The EARLY_AP_STARTUP option initializes non-boot processors
much sooner during startup. This adds support for this option
on RISC-V and enables it by default for GENERIC.
Reviewed by: jhb, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21661
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
This is re-appearance of the nop code removed from other arches in r287625.
Reviewed by: alc (as part of the larger patch), markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
DIfferential revision: https://reviews.freebsd.org/D21645
fdt_is_compatible_strict() inspects the first compatible property.
We need to inspect the following properties for 'riscv'.
ofw_bus_node_is_compatible() does a recursive search.
This patch fixes "Can't find CPU" error message when bootverbose = true.
Submitted by: Nicholas O'Brien (nickisobrien_gmail.com)
Reviewed by: philip, kp
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D21576
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
The branch from _start to mpentry has to cross a large section of data;
an offset larger than can be specified with a 12-bit branch immediate.
Fix this by converting the branch to an unconditional jump. The gcc
assembler does this conversion silently but it is not done automatically
by clang.
Reported by: Jeremy Bennett <jeremy.bennett@embecosm.com>
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21437
Many extern struct pcpu <something>__pcpu declarations were
copied/pasted in sources. The issue is that the definition is MD, but
it cannot be provided by machine/pcpu.h due to actual struct pcpu
defined in sys/pcpu.h later than the inclusion of machine/pcpu.h.
This forced the copying when other code needed direct access to
__pcpu. There is no way around it, due to machine/pcpu.h supplying
part of struct pcpu fields.
To work around the problem, add a new machine/pcpu_aux.h header, which
should fill any needed MD definitions after struct pcpu definition is
completed. This allows to remove copies of __pcpu spread around the
source. Also on x86 it makes it possible to remove work arounds like
OFFSETOF_CURTHREAD or clang specific warnings supressions.
Reported and tested by: lwhsu, bcran
Reviewed by: imp, markj (previous version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21418
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
r343275 introduced a performance optimisation to the copyin/copyout
routines by attempting to copy word-per-word rather than byte-per-byte
where possible.
This optimisation failed to account for cases where the buffer is longer
than XLEN_BYTES, but due to misalignment does not not allow for any
word-sized copies. E.g. a 9 byte buffer (with XLEN_BYTES == 8) which is
misaligned by 2 bytes. The code nevertheless did a single full-word
copy, which meant we copied too much data. This potentially clobbered
other data.
This is most easily demonstrated by a simple `sysctl -a`.
Fix it by not assuming that we'll always have at least one full-word
copy to do, but instead checking the remaining length first.
Reviewed by: markj@, mhorne@, br@ (previous version)
MFC after: 1 week
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D21100
if a demotion succeeds, then all of the 4KB page mappings within the
superpage-sized region must be valid, so there is no point in testing the
validity of the 4KB page mapping that is going to be write protected.
Deindent the nearby code.
Reviewed by: kib, markj
Tested by: pho (amd64, i386)
X-MFC after: r350004 (this change depends on arm64 dirty bit emulation)
Differential Revision: https://reviews.freebsd.org/D21027
We can't use a u_int to compute the physical address in
pmap_early_vtophys(). Our int is 32-bit, but the physical address is
64-bit. This works fine if everything lives in below 0x100000000, but as
soon as it doesn't this breaks.
MFC after: 1 week
Sponsored by: Axiado
syscallret() doesn't use error anymore. Fix a few other places to permit
removing the return value from syscallenter() entirely.
- Remove a duplicated assertion from arm's syscall().
- Use td_errno for amd64_syscall_ret_flush_l1d.
Reviewed by: kib
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D2090
pmap_ts_referenced() does not necessarily clear the access bit from
all accessed mappings of a given page. Thus, if a scan of the mappings
needs to be restarted, we should be careful to avoid double-counting
accessed mappings whose access bits were not cleared in a previous
attempt.
Reported by: alc
Reviewed by: alc
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20926
Casueword(9) on ll/sc architectures must be prepared for userspace
constantly modifying the same cache line as containing the CAS word,
and not loop infinitely. Otherwise, rogue userspace livelocks the
kernel.
To fix the issue, change casueword(9) interface to return new value 1
indicating that either comparision or store failed, instead of relying
on the oldval == *oldvalp comparison. The primitive no longer retries
the operation if it failed spuriously. Modify callers of
casueword(9), all in kern_umtx.c, to handle retries, and react to
stops and requests to terminate between retries.
On x86, despite cmpxchg should not return spurious failures, we can
take advantage of the new interface and just return PSL.ZF.
Reviewed by: andrew (arm64, previous version), markj
Tested by: pho
Reported by: https://xenbits.xen.org/xsa/advisory-295.txt
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D20772
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
vm_page_dirty() when, in fact, we are write protecting the page and the L3
entry has PTE_D set. However, pmap_protect() was always calling
vm_page_dirty() when an L2 entry has PTE_D set. Handle L2 entries the
same as L3 entries so that we won't perform unnecessary calls to
vm_page_dirty().
Simplify the loop calling vm_page_dirty() on L2 entries.
AT_HWCAP is a field in the elf auxiliary vector meant to describe
cpu-specific hardware features. For RISC-V we want to use this to
indicate the presence of any standard extensions supported by the CPU.
This allows userland applications to query the system for supported
extensions using elf_aux_info(3).
Support for an extension is indicated by the presence of its
corresponding bit in AT_HWCAP -- e.g. systems supporting the 'c'
extension (compressed instructions) will have the second bit set.
Extensions advertised through AT_HWCAP are only those that are supported
by all harts in the system.
Reviewed by: jhb, markj
Approved by: markj (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20493
The mcall_trap() dummy function is unused, and should be removed as we
are unlikely to support M-mode traps any time soon.
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20494
Some of the config options that are disabled by default seem to be only
for historical reasons. Enable those that appear to no longer be
problematic. This includes WITH_CTF, STACK, GEOM_RAID, and re-enabling
blacklisted kernel modules.
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20495
Most architectures print their total (real) and available memory during
boot. Properly initialize the realmem global and print these messages.
Also print the physical memory chunks (behind a bootverbose flag).
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20496
Add the enter and exit events, similar to what's found in
hammer_time() on amd64. We must use TSRAW as the pcpu isn't yet
initialized.
Reviewed by: markj
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20497
The gp register is intended to used by the linker as another means of
performing relaxations, and should point to the small data section (.sdata).
Currently gp is being used as the pcpu pointer within the kernel, but the more
appropriate choice for this is the tp register, which is unused.
Swap existing usage of gp with tp within the kernel, and set up gp properly
at boot with the value of __global_pointer$ for all harts.
Additionally, remove some cases of accessing tp from the PCB, as it is not
part of the per-thread state. The user's tp and gp should be tracked only
through the trapframe.
Reviewed by: markj, jhb
Approved by: markj (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19893
previously addressed in r348246.
This pmap problem also exists on arm64 and riscv. However, the original
solution developed for amd64 and i386 cannot be used on arm64 and riscv. In
particular, arm64 and riscv do not define a PG_PROMOTED flag in their level
2 PTEs. (A PG_PROMOTED flag makes no sense on arm64, where unlike x86 or
riscv we are required to break the old 4KB mappings before making the 2MB
mapping; and on riscv there are no unused bits in the PTE to define a
PG_PROMOTED flag.)
This commit implements an alternative solution that can be used on all four
architectures. Moreover, this solution has two other advantages. First, on
older AMD processors that required the Erratum 383 workaround, it is less
costly. Specifically, it avoids unnecessary calls to pmap_fill_ptp() on a
superpage demotion. Second, it enables the elimination of some calls to
pagezero() in pmap_kernel_remove_{l2,pde}().
In addition, remove a related stale comment from pmap_enter_{l2,pde}().
Reviewed by: kib, markj (an earlier version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20538
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
from SiFive, Inc.
The first core on this SoC (hart 0) is a 64-bit microcontroller.
o Pick a hart to run boot process using hart lottery.
This allows to exclude hart 0 from running the boot process.
(BBL releases hart 0 after the main harts, so it never wins the lottery).
o Renumber CPUs early on boot.
Exclude non-MMU cores. Store the original hart ID in struct pcpu. This
allows to find out the correct destination for IPIs and remote sfence
calls.
Thanks to SiFive, Inc for the board provided.
Reviewed by: markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20225
So do nothing in pmap_page_set_memattr() and don't panic.
Reviewed by: markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20209
Having IPSEC compiled into the kernel imposes a non-trivial
performance penalty on multi-threaded workloads due to IPSEC
refcounting. In my benchmarks of multi-threaded UDP
transmit (connected sockets), I've seen a roughly 20% performance
penalty when the IPSEC option is included in the kernel (16.8Mpps
vs 13.8Mpps with 32 senders on a 14 core / 28 HTT Xeon
2697v3)). This is largely due to key_addref() incrementing and
decrementing an atomic reference count on the default
policy. This cause all CPUs to stall on the same cacheline, as it
bounces between different CPUs.
Given that relatively few users use ipsec, and that it can be
loaded as a module, it seems reasonable to ask those users to
load the ipsec module so as to avoid imposing this penalty on the
GENERIC kernel. Its my hope that this will make FreeBSD look
better in "out of the box" benchmark comparisons with other
operating systems.
Many thanks to ae for fixing auto-loading of ipsec.ko when
ifconfig tries to configure ipsec, and to cy for volunteering
to ensure the the racoon ports will load the ipsec.ko module
Reviewed by: cem, cy, delphij, gnn, jhb, jpaetzel
Differential Revision: https://reviews.freebsd.org/D20163
tun(4) and tap(4) share the same general management interface and have a lot
in common. Bugs exist in tap(4) that have been fixed in tun(4), and
vice-versa. Let's reduce the maintenance requirements by merging them
together and using flags to differentiate between the three interface types
(tun, tap, vmnet).
This fixes a couple of tap(4)/vmnet(4) issues right out of the gate:
- tap devices may no longer be destroyed while they're open [0]
- VIMAGE issues already addressed in tun by kp
[0] emaste had removed an easy-panic-button in r240938 due to devdrn
blocking. A naive glance over this leads me to believe that this isn't quite
complete -- destroy_devl will only block while executing d_* functions, but
doesn't block the device from being destroyed while a process has it open.
The latter is the intent of the condvar in tun, so this is "fixed" (for
certain definitions of the word -- it wasn't really broken in tap, it just
wasn't quite ideal).
ifconfig(8) also grew the ability to map an interface name to a kld, so
that `ifconfig {tun,tap}0` can continue to autoload the correct module, and
`ifconfig vmnet0 create` will now autoload the correct module. This is a
low overhead addition.
(MFC commentary)
This may get MFC'd if many bugs in tun(4)/tap(4) are discovered after this,
and how critical they are. Changes after this are likely easily MFC'd
without taking this merge, but the merge will be easier.
I have no plans to do this MFC as of now.
Reviewed by: bcr (manpages), tuexen (testing, syzkaller/packetdrill)
Input also from: melifaro
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20044
RISC-V ISA specifies no cache management instructions so leave cache
operations in cpufunc.h as no-op for now.
Note some new hardware comes with their own memory-mapped cache
management controller.
Tested on HiFive Unleashed board with cgem(4).
Reviewed by: markj
Obtained from: arm64
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20126
In certain scenarios, it is possible for PCPU data to be
accessed before it has been initialized (e.g. during printf
if the kernel was built with the TSLOG option).
Initialize the PCPU pointer for hart 0 at the beginning of
initriscv() rather than near the end.
Reviewed by: markj
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D19726
o Fix bug in PLIC_ENABLE macro when irq >= 32.
Tested on the real hardware, which is HiFive Unleashed board.
Thanks to SiFive, Inc. for the board provided.
Reviewed by: markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19775
RISC-V timer has no dedicated DTS node and we have to get timer
frequency from cpus node.
Tested on Government Furnished Equipment (GFE) cores synthesized
on Xilinx VCU118.
Reviewed by: markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19727
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
This mirrors the arm64 implementation and is for use in the minidump
code.
Submitted by: Mitchell Horne <mhorne063@gmail.com>
Differential Revision: https://reviews.freebsd.org/D18321
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
Skylake Xeons.
See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D18893
This reduces the overhead of TLB invalidations by ensuring that we
only interrupt CPUs which are using the given pmap. Tracking is
performed in pmap_activate(), which gets called during context switches:
from cpu_throw(), if a thread is exiting or an AP is starting, or
cpu_switch() for a regular context switch.
For now, pmap_sync_icache() still must interrupt all CPUs.
Reviewed by: kib (earlier version), jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18874
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
There's no need to worry about potential backwards compatibility issues
in a brand-new architecture, so avoid stack PROT_EXEC as with arm64.
Discussed with: br
The existence of a PV entry for a mapping guarantees that the mapping
exists, so we should not need to test for that.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18866
The existing copyin(9) and copyout(9) routines on RISC-V perform only a
simple byte-by-byte copy. Improve their performance by performing
word-sized copies where possible.
Submitted by: Mitchell Horne <mhorne063@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18851
The MI kernel assumes that interrupts will not be enabled on APs until
after the first context switch. In particular, the problem was causing
occasional deadlocks during boot.
Remove an unneeded intr_disable() added in r335005.
Reviewed by: jhb (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18738
Don't bother zeroing the top-level page before freeing it. Previously,
the page was freed before being zeroed.
Reviewed by: jhb, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18720
The list will be removed with some future work.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18721
- Handle VM_PROT_EXECUTE.
- Clear PTE_D and mark the page dirty when removing write access
from a mapping.
- Atomically clear PTE_W to avoid clobbering a hardware PTE update.
Reviewed by: jhb, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18719
Otherwise prefaulted entries are not accessible from user mode and
end up triggering a fault upon access, so prefaulting has no effect.
Reviewed by: jhb, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18718