Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel. Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method. For the time being only Intel updates are supported.
Microcode update files are passed to the kernel via loader(8). The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update. Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update. Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.
Reviewed by: kib
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16370
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).
Differential Review: https://reviews.freebsd.org/D16290
Do not use vm_map_remove() to release KVA back to the system. Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables. Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().
Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.
While here, port r329171 to i386.
Reported by: alc
Reviewed by: alc, kib
X-MFC with: r336505
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16426
code never sees FPU pcb flags not consistent with the hardware state.
This is uncovered by the eager FPU switch mode.
Analyzed, reviewed and tested by: gleb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data. This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded. Previously,
it would have remained unused.
Reviewed by: kib, royger
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16330
Without this, the support for transparent superpage promotion on i386
was left disabled.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D16279
add support for explicitly requesting that pmap_enter() create a 2 or 4 MB
page mapping. (Essentially, this feature allows the machine-independent
layer to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)
Export pmap_ps_enabled() to the machine-independent layer.
Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.
Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_4mpage(), that is used to prefault 2 or 4 MB read- and/or
execute-only mappings for execve(2), mmap(2), and shmat(2).
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D16246
For example, fully construct the new PTE before entering the critical
section. This change is a stepping stone to psind == 1 support on i386.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D16188
Doing so ensures that all threads sharing the pmap have a consistent
view of the mapping. This fixes the problem described in the commit
log messages for r329254 without the overhead of an extra fault in the
common case. Once other pmap_enter() implementations are similarly
modified, the workaround added in r329254 can be removed, reducing the
overhead of CoW faults.
See also r335784 for amd64. The i386 implementation of pmap_enter()
already reused the PV entry from the old mapping.
Reviewed by: kib, markj
Tested by: pho
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D16133
This restores counters(9) operation.
Revert r336024. Improve assert of pcpu size on x86.
Reviewed by: mmacy
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16163
Due to the way rtld creates mappings for the shared objects, each dso
causes unmap of at least three guard map entries. For instance, in
the buildworld load, this change reduces the amount of pmap_remove()
calls by 1/5.
Profiled by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16148
SMP systems by extending defined(SMP) to include defined(KLD_MODULE).
This is a regression issue after r335873 .
Discussed with: mmacy@
Sponsored by: Mellanox Technologies
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
It is possible that a fictitious unmanaged userspace mapping of
superpage is created on x86, e.g. by pmap_object_init_pt(), with the
physical address outside the vm_page_array[] coverage.
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
physical address, which is readily available after sucessfull
vm_page_pa_tryrelock().
Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16085
returning NULL.
vm_fault_quick_hold_pages() can be legitimately called on userspace
mappings backed by fictitious pages created by unmanaged device and sg
pagers.
Note that other architectures pmap_extract_and_hold() might need
similar fix, but I postponed the examination.
Reported by: bde
Discussed with: alc
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16085
- inline atomics in modules on i386 and amd64 (they were always
inline on other arches)
- allow modules to opt in to inlining locks by specifying
MODULE_TIED=1 in the makefile
Reviewed by: kib
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16079
The Linux compatibility code was converting the version number (e.g.
2.6.32) in two different ways and then comparing the results.
The linux_map_osrel() function converted MAJOR.MINOR.PATCH similar to
what FreeBSD does natively. I.e. where major=v0, minor=v1, and patch=v2
v = v0 * 1000000 + v1 * 1000 + v2;
The LINUX_KERNVER() macro, on the other hand, converted the value with
bit shifts. I.e. where major=a, minor=b, and patch=c
v = (((a) << 16) + ((b) << 8) + (c))
The Linux kernel uses the later format via the KERNEL_VERSION() macro in
include/generated/uapi/linux/version.h
Fix is to use the LINUX_KERNVER() macro in linux_map_osrel() as well as
in the .trans_osrel functions.
PR: 229209
Reviewed by: emaste, cem, imp (mentor)
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D15952
Existing linuxulator platforms (i386, amd64) support legacy syscalls,
such as non-*at ones like open, but arm64 and other new platforms do
not.
Wrap these in #ifdef LINUX_LEGACY_SYSCALLS, #defined in the MD linux.h
files. We may need finer grained control in the future but this is
sufficient for now.
Reviewed by: andrew
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D15237
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.
Reviewed by: emaste, kib (older version)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15638
Give up and remove the almost useless informational message reporting
that device not available exception occured while our state tracking
indicates the current CPU has FPU context loaded for the current
thread.
It seems that this is recurring bug with some VM monitors.
Sponsored by: The FreeBSD Foundation
Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.
(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)
This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.
Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.
Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
A call to npxsave() in the exception trampolines was not relocated.
This call to a garbage address usually paniced when made, but it is only
made when the thread has used an FPU recently, and this is not the usual
case.
PR: 228755
Reviewed by: kib
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.
While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.
core2_intr(cpu, tf) ->
pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
pmc_add_sample(cpu, ring, pm, tf, inuserspace)
In the process of optimizing the tail call the tf pointer was getting
clobbered:
(kgdb) up
at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709 pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
resulting in a crash in pmc_save_kernel_callchain.
- increase pmc cpuid field from 8 to 12 bits
- add cpuid version string to initialize entry in the log
so that filter can identify which counter index an
event name maps to
- GC unused config flags
- make fixed counter assignment more robust as well as the
changes needed to be properly identified for filter
file in /sys/conf, so was unavailable in configurations that don't use
modules, and was not testable or notable in NOTES. Its normal
configuration (not using a module) is still silently deprecated in
aout(4) by not mentioning it there.
Update i386 NOTES for COMPAT_AOUT. It is not i386-only, or even very MD.
Sort its entry better.
Finish gzip configuration (but not support) for amd64. gzip is really
gzipped aout. It is currently broken even for i386 (a call to vm fails).
amd64 has always attempted to configure and test it, but it depends on
COMPAT_AOUT (as noted). The bug that it depends on unconfigured files
was not detected since it is configured as a device. All other optional
image activators are configured properly using an option.
time, especially for SMP. If configured, it turns itself on at boot
time for calibration, so is fragile even if never otherwise used.
Both types of kernel profiling were supposed to use a global spinlock
in the SMP case. If hi-res profiling is configured (but not necessarily
used), this was supposed to be optimized by only using it when
necessary, and slightly more efficiently, in asm. But it was not done
at all for mcount entry where it is necessary. This caused crashes
in the SMP case when either type of profiling was enabled. For mcount
exit, it only caused wrong times. The times were wrongest with an
i8254 timer since using that requires exclusive access to the hardware.
The i8254 timer was too slow to use here 20 years ago and is much less
usable now, but it is the default for the SMP case since TSCs weren't
invariant when SMP was new. Do the locking in all hi-res SMP cases for
simplicity.
Calibration uses special asms, and the clobber lists in these were sort
of inverted. They contained the arg and return registers which are not
clobbered, but on amd64 they didn't contain the residue of the call-used
registers which may be clobbered (%r10 and %r11). This usually caused
hangs at boot time. This usually affected even the UP case.
kernel profiling remains broken).
memmove() was broken using ALTENTRY(). ALTENTRY() is only different from
ENTRY() in the profiling case, and its use in that case was sort of
backwards. The backwardness magically turned memmove() into memcpy()
instead of completely breaking it. Only the high resolution parts of
profiling itself were broken. Use ordinary ENTRY() for memmove().
Turn bcopy() into a tail call to memmove() to reduce complications.
This gives slightly different pessimizations and profiling lossage.
The pessimizations are minimized by not using a frame pointer() for
bcopy().
Calls to profiling functions from exception trampolines were not
relocated. This caused crashes on the first exception. Fix this using
function pointers.
Addresses of exception handlers in trampolines were not relocated. This
caused unknown offsets in the profiling data. Relocate by abusing
setidt_disp as for pmc although this is slower than necessary and
requires namespace pollution. pmc seems to be missing some relocations.
Stack traces and lots of other things in debuggers need similar relocations.
Most user addresses were misclassified as unknown kernel addresses and
then ignored. Treat all unknown addresses as user. Now only user
addresses in the kernel text range are significantly misclassified (as
known kernel addresses).
The ibrs functions didn't preserve enough registers. This is the only
recent breakage on amd64. Although these functions are written in
asm, in the profiling case they call profiling functions which are
mostly for the C ABI, so they only have to save call-used registers.
They also have to save arg and return registers in some cases and
actually save them in all cases to reduce complications. They end up
saving all registers except %ecx on i386 and %r10 and %r11 on amd64.
Saving these is only needed for 1 caller on each of amd64 and i386.
Save them there. This is slightly simpler.
Remove saving %ecx in handle_ibrs_exit on i386. Both handle_ibrs_entry
and handle_ibrs_exit use %ecx, but only the latter needed to or did
save it. But saving it there doesn't work for the profiling case.
amd64 has more automatic saving of the most common scratch registers
%rax, %rcx and %rdx (its complications for %r10 are from unusual use
of %r10 by SYSCALL). Thus profiling of handle_ibrs_exit_rs() was not
broken, and I didn't simplify the saving by moving the saving of these
registers from it to the caller.
Intel now provides comprehensive tables for all performance counters
and the various valid configuration permutations as text .json files.
Libpmc has been converted to use these and hwpmc_core has been greatly
simplified by moving to passthrough of the table values.
The one gotcha is that said tables don't support pentium pro and and pentium
IV. There's very few users of hwpmc on _amd64_ kernels on new hardware. It is
unlikely that anyone is doing low level optimization on 15 year old Intel
hardware. Nonetheless, if someone feels strongly enough to populate the
corresponding tables for p4 and ppro I will reinstate the files in to the
build.
Code for the K8 counters and !x86 architectures remains unchanged.
This is a follow-up to r321483, which disabled -Wmacro-redefined for
some lib/msun tests.
If an application included both fenv.h and ieeefp.h, several macros such
as __fldcw(), __fldenv() were defined in both headers, with slightly
different arguments, leading to conflicts.
Fix this by putting all the common macros in the machine-specific
versions of ieeefp.h. Where needed, update the arguments in places
where the macros are invoked.
This also slightly reduces the differences between the amd64 and i386
versions of ieeefp.h.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D15633
pmap_is_prefaultable() and pmap_incore(), pushing the number of
shootdown IPIs back to the 3/1 kernel.
Benchmarked by: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation