SystemCMOS address space is accessible for system wide.
So install address handler in \_SB space.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33892
All supported Xen instances by FreeBSD provide a local APIC
implementation, so there's no need to replace the native local APIC
implementation anymore.
Leave just the ipi_vectored hook in order to be able to override it
with an implementation based on event channels if the underlying local
APIC is not virtualized by hardware. Note the hook cannot use ifuncs,
because at the point where ifuncs are resolved the kernel doesn't yet
know whether it will benefit from using the optimization.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D33917
Instead of using event channels or hypercalls to deal with IPIs and
NMIs.
Using a hardware virtualized APIC should be faster than using any PV
interface, since the VM exit can be avoided.
Xen exposes whether the domain is using hardware assisted x{2}APIC
emulation in a CPUID bit.
Sponsored by: Citrix Systems R&D
It has been reported that on some AWS instances VCPUOP_send_nmi
returns -38 (ENOSYS). The hypercall is only available for HVM guests
in Xen 4.7 and newer. Add a fallback to use the native NMI sending
procedure when VCPUOP_send_nmi is not available, so that the NMI is
not lost.
Reported and Tested by: avg
MFC after: 1 week
Fixes: b2802351c1 ('xen: fix dispatching of NMIs')
Sponsored by: Citrix Systems R&D
Some VM systems announce the frequency of the local APIC via the
CPUID leaf 0x40000010. Using this allows us to boot slightly
faster by avoiding the need for timer calibration.
Reviewed by: markj
Sponsored by: https://www.patreon.com/cperciva
While this CPUID leaf was originally only used by VMWare, other
hypervisors now also use it to announce the TSC frequency to guests.
This speeds up the boot process by 100 ms in EC2 and other systems,
by allowing the early calibration DELAY to be skipped.
Reviewed by: markj
Sponsored by: https://www.patreon.com/cperciva
This allows us to set tsc_is_invariant and select appropriately
fenced versions of RDTSC based on the CPU type.
Reviewed by: markj
Sponsored by: https://www.patreon.com/cperciva
The ACPI spec describes the FADT->Century field as:
The RTC CMOS RAM index to the century of data value (hundred and
thousand year decimals). If this field contains a zero, then the
RTC centenary feature is not supported. If this field has a non-zero
value, then this field contains an index into RTC RAM space that
OSPM can use to program the centenary field.
Use this field to decide whether to program the CENTURY register
of the CMOS RTC device.
Reviewed by: akumar3@isilon.com, dab, vangyzen
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D33667
MFC after: 1 week
Sponsored by: Dell EMC Isilon
The old bogus Xen versions that would deliver a GPF when writing to
the LAPIC MSR are likely retired, so it's safe to enable x2APIC
unconditionally now if available.
Tested by: avg
Reviewed by: kib
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D33877
When running as a Xen guest it's easier to use an hypercall in order
to do power management operations (power off, power cycle). Do this
for all supported guest types (HVM and PVH). Note that for HVM the
power operation could also be done using ACPI, but there's no reason
to differentiate between PVH and HVM.
While there fix the shutdown handler to properly differentiate between
power cycle and power off requests.
Reported by: Freddy DISSAUX
MFC: 1 week
Sponsored by: Citrix Systems R&D
Prior to this commit, the TSC and local APIC frequencies were calibrated
at boot time by measuring the clocks before and after a one-second sleep.
This was simple and effective, but had the disadvantage of *requiring a
one-second sleep*.
Rather than making two clock measurements (before and after sleeping) we
now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the target clock and the reference clock (for which the current
best available timecounter is used). While we do this, we keep track
of an estimate of the uncertainty in the regression slope (aka. the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.
In order to avoid the risk of aliasing resulting from the data-gathering
loop synchronizing with (a multiple of) the frequency of the reference
clock, we add some additional spinning depending upon the iteration number.
For numerical stability and simplicity of implementation, we make use of
floating-point arithmetic for the statistical calculations.
On the author's Dell laptop, this reduces the time spent in calibration
from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from
2000 ms to 2.5 ms.
Reviewed by: bde (previous version), kib
MFC after: 1 month
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D33802
- Move busdma_lock_mutex to subr_bus_dma.c.
- Move _busdma_lock_dflt to subr_bus_dma.c. This function was named a
couple of different things previously. It is not a public API but
an internal helper used in place of a NULL pointer. The prototype
is in <sys/bus_dma.h> as not all backends include
<sys/bus_dma_internal.h>.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33694
Move mostly duplicated code in various MD bus_dma backends to support
bounce pages into sys/kern/subr_busdma_bounce.c. This file is
currently #include'd into the backends rather than compiled standalone
since it requires access to internal members of opaque bus_dma
structures such as bus_dmamap_t and bus_dma_tag_t.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33684
Some AMD Geode-based systems end up using the 8254 PIT to calibrate the
TSC during late calibration, which doesn't work because that
timecounter's mask (65535) is much smaller than its frequency (1193182).
Moreover, early calibration is done against the 8254 timer anyway.
Work around the problem by simply using early calibration results if no
high-quality timecounters exist.
PR: 260868
Fixes: 22875f8879 ("x86: Implement deferred TSC calibration")
Reported and tested by: mike@sentex.net, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Reviewed by: imp, kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33730
An earlier version of this code computed the TSC frequency in kHz.
When the code was changed to compute the frequency more accurately,
the variable name was not updated.
Reviewed by: markj
Fixes: 22875f8879 x86: Implement deferred TSC calibration
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33696
It's possible that the "early" TSC calibration gave us a value which
is known to be exact; in that case, skip the later re-calibration.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33695
A recent change introduced a one-off error into a test allowing
coalescing chunks into segments. This fixes that error.
broke a check in _bus_dmamap_addseg on many architectures. This change makes it clear that it is not a particular range that is being boundary-checked, but the proposed union of the two adjacent ranges.
Reported by: se
Reviewed by: se
Fixes: c606ab59e7 vm_extern: use standard address checkers everywhere
Differential Revision: https://reviews.freebsd.org/D33715
Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33685
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.
Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.
The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).
The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.
This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.
One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.
Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.
The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451
Reported and tested by: Michael Butler <imb@protected-networks.net>
Reviewed by: jhb, kib
Fixes: 62d09b46ad ("x86: Defer LAPIC calibration until after timecounters are available")
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33669
We only attempt to gracefully handle absence of an APIC if "device
atpic" is defined in the kernel configuration.
Suggested by: kib
Reviewed by: jhb, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
When a DMA request using bounce pages completes, a swi is triggered to
schedule pending DMA requests using the just-freed bounce pages. For
a long time this bus_dma swi has been tied to a "virtual memory" swi
(swi_vm). However, all of the swi_vm implementations are the same and
consist of checking a flag (busdma_swi_pending) which is always true
and if set calling busdma_swi. I suspect this dates back to the
pre-SMPng days and that the intention was for swi_vm to serve as a
mux. However, in the current scheme there's no need for the mux.
Instead, remove swi_vm and vm_ih. Each bus_dma implementation that
uses bounce pages is responsible for creating its own swi (busdma_ih)
which it now schedules directly. This swi invokes busdma_swi directly
removing the need for busdma_swi_pending.
One consequence is that the swi now works on RISC-V which had previously
failed to invoke busdma_swi from swi_vm.
Reviewed by: imp, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33447
The header exports the following:
- Definition of struct tcb.
- Helpers to get/set the tcb for the current thread.
- TLS_TCB_SIZE (size of TCB)
- TLS_TCB_ALIGN (alignment of TCB)
- TLS_VARIANT_I or TLS_VARIANT_II
- TLS_DTV_OFFSET (bias of pointers in dtv[])
- TLS_TP_OFFSET (bias of "thread pointer" relative to TCB)
Note that TLS_TP_OFFSET does not account for if the unbiased thread
pointer points to the start of the TCB (arm and x86) or the end of the
TCB (MIPS, PowerPC, and RISC-V).
Note also that for amd64, the struct tcb does not include the unused
tcb_spare field included in the current structure in libthr. libthr
does not use this field, and the existing calls in libc and rtld that
allocate a TCB for amd64 assume it is the size of 3 Elf_Addr's (and
thus do not allocate room for tcb_spare).
A <sys/_tls_variant_i.h> header is used by architectures using
Variant I TLS which uses a common struct tcb.
Reviewed by: kib (older version of x86/tls.h), jrtc27
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33351
With various firmware files used by graphics and wireless drivers
we are exceeding the current 32 character module name (file path
in kldxref) length.
In order to overcome this issue bump it to the maximum path length
for the next version.
To be able to MFC provide backward compat support for another version
of the struct as the offsets for the second half change due to the
array size increase.
MAXMODNAME being defined to MAXPATHLEN needs param.h to be
included first. With only 7 modules (or LinuxKPI module.h) not
doing that adjust them rather than including param.h in module.h [1].
Reported by: Greg V (greg unrelenting.technology)
Sponsored by: The FreeBSD Foundation
Suggested by: imp [1]
MFC after: 10 days
Reviewed by: imp (and others to different level)
Differential Revision: https://reviews.freebsd.org/D32383
- Enable local MCEs on capable Intel CPUs. It delivers exceptions
only to the affected CPU instead of global broadcast, requiring a lot
of synchronization between CPUs. AMD always deliver MCEs locally.
- Make MCE handler process only uncorrected errors, while CMCI and
polling only corrected. It reduces synchronization problems between
them and is explicitly recommended by the documentation.
- Add minimal support for uncorrected software recoverable errors
on Intel CPUs. It allows to avoid kernel panics in case uncorrected
errors do not affect current operation, like ones found during scrub
or write. Such errors are only logged, postponing the panic until
the corrupted data will actually be needed (that may never happen).
- Reduce polling period from 1 hour to 5 minutes.
MFC after: 2 weeks
Previously it was not allowed on fast taskqueues. It was fixed in
4730a8972b. This should make no functional change, just a bit
cleaner and efficient code.
MFC after: 1 week
This ensures that LAPIC calibration is done using the correct tsc_freq
value, i.e., the one associated with the TSC timecounter. It does mean
though that TSC calibration cannot use sbinuptime() to read the
reference timecounter, as timehands are not yet set up.
Reviewed by: kib, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33209
This ensures that we have a good reference timecounter for performing
calibration.
Change lapic_setup to avoid configuring the timer when booting, and move
calibration and initial configuration to a new lapic routine,
lapic_calibrate_timer. This calibration will be initiated from
cpu_initclocks(), before an eventtimer is selected.
Reviewed by: kib, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33206
The headers were mostly identical on amd64 and i386.
No functional change intended.
Reviewed by: cperciva, mav, imp, kib, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33205
It stopped being used in 3c256f5395, when trap() was reorganized to
have separate switch statements for user and kernel traps. Remove the
two leftover references and the flag itself.
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D33253
The minidump code is written assuming that certain global state will not
change, and rightly so, since it executes from a kernel debugger
context. In order to support taking minidumps of a live system, we
should allow copies of relevant global state that is likely to change to
be passed as parameters to the minidumpsys() function.
This patch does the work of parameterizing this function, by adding a
struct minidumpstate argument. For now, this struct allows for copies of
the kernel message buffer, and the bitset that tracks which pages should
be dumped (vm_page_dump). Follow-up changes will actually make use of
these arguments.
Notably, dump_avail[] does not need a snapshot, since it is not expected
to change after system initialization.
The existing minidumpsys() definitions are renamed, and a thin MI
wrapper is added to kern_dump.c, which handles the construction of
the state struct. Thus, calling minidumpsys() remains as simple as
before.
Reviewed by: kib, markj, jhb
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31989
There is no universal way to find the TSC frequency. Newer Intel CPUs
may report it via CPUID leaves 0x15 and 0x16. Sometimes it can be
obtained from the PLATFORM_INFO MSR as well, though we never use that.
On older platforms we derive the frequency using a DELAY(1000000) call,
which uses the 8254 PIT. On some newer platforms the 8254 is apparently
non-functional, leading to bogus calibration results. On such platforms
the TSC frequency must be available from CPUID. It is also possible to
disable calibration with a tunable, in which case we try to parse the
brand string if the TSC freq is not available from CPUID.
CPUID 0x15 provides an authoritative TSC frequency value, but even that
is not always available on new Intel platforms. CPUID 0x16 provides the
specified processor base frequency, which is not the same as the TSC
frequency. Empirically, it is close enough for early boot, but too far
off for timekeeping: on a Comet Lake NUC, CPUID 0x16 yields 1600MHz but
the TSC frequency is rougly 1608MHz, leading to frequent clock stepping
when NTP is in use.
Thus we have a situation where we cannot calibrate using the PIT and
cannot obtain a precise frequency from CPUID (or MSRs). This change
seeks to address that by using the CPUID 0x16 value during early boot
and refining the calibration later once ACPI-based timecounters are
available. TSC frequency detection is thus split into two phases:
Early phase:
- On Intel platforms, query CPUID 0x15 and 0x16 and use that value
initially if available.
- Otherwise, get an estimate using the PIT, reducing the delay loop to
100ms from 1s.
- Continue to register the TSC as the CPU ticks provider early, even
though the frequency may be off. Otherwise any code executed during
boot that uses cpu_ticks() (e.g., context switching) gets tripped up
when the ticks provider changes.
Later phase:
- In SI_SUB_CLOCKS, once the timehands are initialized, load the current
TSC and timecounter (sbinuptime()) values at the beginning and end of
a 1s interval and use the timecounter frequency (typically from
kvmclock, HPET or the ACPI PM timer) to estimate the TSC frequency.
- Update the TSC timecounter, global tsc_freq and CPU ticker with the
new frequency and finally register the TSC as a timecounter.
Reviewed by: kib, jhb (previous version)
Discussed with: imp, cperciva
MFC after: 6 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32512
We use pmap_invalidate_cpu_mask() to get the set of active CPUs. This
(32-byte) set is copied by value through multiple frames until we get to
smp_targeted_tlb_shootdown(), where it is copied yet again.
Avoid this copying by having smp_targeted_tlb_shootdown() make a local
copy of the active CPUs for the pmap, and drop the cpuset parameter,
simplifying callers. Also leverage the use of the non-destructive
CPU_FOREACH_ISSET to avoid unneeded copying within
smp_targeted_tlb_shootdown().
Reviewed by: alc, kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32792
sched_throw() can no longer take a NULL thread, APs enter through
sched_ap_entry() instead. This completely removes branching in the
common case and cleans up both paths. No functional change intended.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32829
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).
Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
setup in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:
- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler
cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().
Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
Reviewed by: alfredo, kib, markj
Triage help from: andrew, markj
Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision: https://reviews.freebsd.org/D32797
Some BIOSes protect memory region they reside in by using DMAR to
prevent devices from doing any DMA transactions to that part of RAM.
AMI refers to this as "DMA Control Guarantee".
Disable the protection when address translation is enabled.
I stumbled upon this while investigation a failing coredump on a device
which has this feature enabled.
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D32591