Commit Graph

834 Commits

Author SHA1 Message Date
mmacy
ff20311f27 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
andrew
ae591a440e Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
kib
c494bb83e7 Add a name for the MSR controlling standard extended features report on AMD.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:44:18 +00:00
kib
ebb8917d48 Order the portion of the AMD-specific MSRs names definitions numerically.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:34:01 +00:00
royger
2a1fbbe9f7 xen: obtain vCPU ID from CPUID
The Xen vCPU ID can be fetched from the cpuid instead of inferring it
from the ACPI ID.

Sponsored by: Citrix Systems R&D
2018-06-26 15:00:54 +00:00
royger
9015201203 xen: limit the number of hypercall pages to 1
The interface already guarantees that the number of hypercall pages is
always going to be 1, see the comment in interface/arch-x86/cpuid.h

Sponsored by: Citrix Systems R&D
2018-06-26 14:39:27 +00:00
kib
9dcf52daee Do not access ISA timer if BIOS reports that there is no legacy
devices present.

On at least one machine where it would matter since the ISA timer is
power gated when booted in the UEFI mode, BIOS still reports that the
legacy devices are present.  That is, user still have to manually
disable TSC calibration on such machines.  Hopefully it will be more
useful in the future.

Discussed with:	Ben Widawsky <benjamin.widawsky@intel.com>
Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16004
MFC after:	1 week
2018-06-25 11:24:26 +00:00
kib
8f7ca8028a Provide a helper function acpi_get_fadt_bootflags() to fetch the FADT
x86 boot flags.

Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16004
MFC after:	1 week
2018-06-25 11:01:12 +00:00
bde
0ea71e2e9d Untangle configuration ifdefs a little. On x86, msi is optional on pci,
and also on apic in common and i386 files (except for xen it is optional
only on xenhvm), but it was not ifdefed except on apic in common and i386
files.

This is all that is left from an attempt to build a (sub-)minimal kernel
without any devices.  The isa "option" is still used without ifdefs in many
standard files even on amd64.  ISAPNP is not optional on at least i386.
ATPIC is not optional on i386 (it is used mainly for Xspuriousint).  But
pci is now supposed to be optional on x86.
2018-06-10 14:49:13 +00:00
avg
b72d6ac274 x86: reorganize code that deals with unexpected NMI-s
Expected NMI-s are those than are either generated by the software (such
as a CPU sending NMI to other CPU) or generated by the hardware after
the software configured it to do so (such as NMI-s on PMC events).

Some unexpected NMI-s can be caused by hardware failures and it is
possible to inquire the hardware about them (somewhat like MCA but much
more primitive) using an EISA mechanism.  In some cases the origin of
the NMI can remain truly unknown.

This commit should not change any functionality.  It just reorganizes
the code, so that it is easier to extend with new checks for the origin
of the NMI.  Also, it frees the code that has nothing to do with ISA
from DEV_ISA.

MFC after:	3 weeks
2018-06-07 14:46:52 +00:00
avg
e0d383ce63 expand descriptions of x86 panic_on_nmi and kdb_on_nmi sysctls
The descriptions were as terse as the variable names and they did not
explain additional conditions for knobs.

MFC after:	1 week
2018-06-07 14:23:31 +00:00
avg
db453a7a34 add support for console resuming, implement it for uart, use on x86
This change adds a new optional console method cn_resume and a kernel
console interface cnresume.  Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume.  Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.

This change fixes a problem with a system of mine failing to resume when
a serial console is used.  I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.

To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
2018-05-29 16:16:24 +00:00
avg
4dc74abe76 fix x86 UP build broken by r334204, TSC resynchronization
Reported by:	bde
MFC after:	1 week
X-MFC with:	r334204
2018-05-29 16:03:53 +00:00
avg
546f863d51 re-synchronize TSC-s on SMP systems after resume, if necessary
The TSC-s are checked and synchronized only if they were good
originally.  That is, invariant, synchronized, etc.

This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second.  The APs are in sync with each other.  Not
sure if this is a hardware quirk or a firmware bug.

This is what I see after a resume with this change:
    SMP: passed TSC synchronization test after adjustment
    acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
2018-05-25 07:33:20 +00:00
royger
f625000e52 xen/pvh: allocate dbg_stack
Or else init_secondary will hit a page fault (or write garbage
somewhere).

Sponsored by:	Citrix Systems R&D
2018-05-24 10:22:57 +00:00
kib
c1893ab1fa Fix UP build.
Reported by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 20:50:19 +00:00
jhb
31acfe0f07 Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
kib
1300dfd419 Add Intel Spec Store Bypass Disable control.
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.

Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto

Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).

SSBD bit detection and control was verified with prerelease microcode.

Security:	CVE-2018-3639
Tested by:	emaste (previous version, without updated microcode)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:08:19 +00:00
kib
3fe94097b6 Add definition for Intel Speculative Store Bypass Disable MSR bits
Security:	CVE-2018-3639
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:07:13 +00:00
kib
f49d5cb80d Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 21:36:55 +00:00
kib
d0801d183b Fix PCID+PTI pmap operations on Xen/HVM.
Install appropriate pti-aware shootdown IPI handlers, otherwise user
page tables do not get enough invalidations.  The non-pti handlers
were used so far.

Reported and tested by:	cperciva
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 20:28:59 +00:00
kib
bc1837178c Fix IBRS handling around MWAIT.
The intent was to disable IBPB and IBRS around MWAIT, and re-enable on
the sleep end.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 20:26:33 +00:00
avg
4e99d4e797 fix a problem with bad performance after wakeup caused by r333321
This change reverts a "while here" part of r333321 that moved clearing
of suspended_cpus to an earlier place.

Apparently, there can be a problem when modifying (shared) memory before
restoring proper cache attributes.  So, to be safe, move the clearing to
the old place.

Many thanks to Johannes Lundberg for bisecting the changes to that
particular commit and then bisecting the commit to the particular
change.

Reported by:	many
Debugged by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	1 week
X-MFC with:	r333321
2018-05-17 10:16:20 +00:00
avg
d8caa49667 calibrate lapic timer in native_lapic_setup
The idea is to calibrate the LAPIC timer just once and only on boot,
given that [at present] the timer constants are global and shared
between all processors.

My primary motivation is to fix a panic that can happen when dynamically
switching to lapic timer.  The panic is caused by a recursion on
et_hw_mtx when printing the calibration results to console.  See the
review for the details of the panic.

Also, the code should become slightly simpler and easier to read.  The
previous code was racy too.  Multiple processors could start calibrating
the global constants concurrently, although that seems to have been
benign.

Reviewed by:	kib, mav, jhb
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15422
2018-05-15 16:56:30 +00:00
imp
7e7e7da0f2 Put the CPU starting on one line. 2018-05-07 21:09:21 +00:00
avg
66f063557f x86 cpususpend_handler: call wbinvd after setting suspend state bits
Without a subsequent wbinvd the changes to suspended_cpus (and
resuming_cpus) can be lost at least on AMD systems that use MOESI cache
coherency protocol.  That can happen because one of APs ends up as an
Owner of the corresponding cache line(s) and the changes may never reach
the main memory before the AP is reset.

While here, move clearing of suspended_cpus a little bit earlier as the
fact of returning from savectx (with zero return value) means that the
CPU has fully restored it execution context.

Also, rework the comment that describes the need for resuming_cpus.

This change fixed suspend to RAM a previously broken AMD-based system.

Reviewed by:	kib
Discussed with:	bde
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15295
2018-05-07 12:22:25 +00:00
kib
f7b133e86a Add helper macros to hide some boring repeatable ceremonies to define
ifuncs on x86.

Also keep helpers to define 'pseudo-ifuncs' which are emulated by the
indirect jmp.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:45:59 +00:00
jkim
619bcfed64 Redo r332918 with the ACPICA API and remove debug.acpi.suspend_deep_bounce.
AcpiOsEnterSleep() was meant to implement this feature.

Reviewed by:	avg
2018-05-03 19:00:50 +00:00
royger
038f6a7b0f xen: fix formatting of xen_init_ops
No functional change

Sponsored by: Citrix Systems R&D
2018-05-02 10:20:55 +00:00
kib
fb6788ab02 Turn off IBRS on suspend.
Resume starts CPU from the init state, which clears any loaded
microcode updates.  As result, IBRS MSRs are no longer available,
until the microcode is reloaded.

I have to forcibly clear cpu_stdext_feature3, which assumes that CPUID
leaf 7 reg %ebx does not report anything except Meltdown/Spectre bugs
bits.  If future CPUs add new bits there, hw_ibrs_recalculate() and
identify_cpu1()/identify_cpu2() need to be adjusted for that.

Submitted and tested by:	Michael Danilov <mike.d.ft402@gmail.com>
PR:	227866
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15236
2018-04-30 20:18:32 +00:00
kib
b287aa2c4e Fix spelling: Appolo -> Apollo [1].
The APL31 NDA errata is APL30 public errata.  Add the reference and
provide the description [2].

Noted by:	emaste [2], rpokala [1]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 19:23:19 +00:00
kib
f9e192fcbd Handle Appolo Lake errata APL31.
If the workaround is activated, always send IPI for wake up, not rely
on the write to the monitor line.  This fixes Appolo Lake machines
early hang in sched_bind(), without requiring user to manually select
idle method.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 18:24:31 +00:00
kib
1ec6f4b094 Some style and minor code improvements for idle selection.
Use designated initializers for the idlt_tlb elements.
Remove strstr() use, add flag field to detect supported MWAIT.
Use nitems() instead of the terminating NULL entry for idle_tlb.
Move several functions into cpu_idle_* namespace.

Based on the discussion with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 18:12:40 +00:00
kib
34f7d1c691 Use CPUID leaf 0x15 to get TSC frequency when the calibration is
disabled.

Intel finally added this information, which allows us to not parse CPU
identification string looking for the nominal frequency.  The leaf is
present e.g. on Appolo Lake Atom CPUs.  It is only used if the TSC
calibration is disabled by user.

Also, report the TSC frequency in bootverbose mode always, regardless
of the way it was obtained.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-25 16:43:45 +00:00
kib
aaf44aa5e1 Make the sysctl machdep.idle also a tunable.
It is applied before it is possible for idle threads to execute on any
CPU, allowing to work around against some bugs.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-24 20:49:16 +00:00
kib
57b709f4b4 Extend ap_boot_mtx scope to also cover mca_init().
Otherwise, under bootverbose, the lapic_enable_cmc() banner 'lapicX:
CMCI unmasked' is printed by several CPUs in parallel, causing garbled
output for the LAPIC dumps.

Reported by:	royger
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:33:08 +00:00
kib
d234c145ba Ensure that cmci_monitor() is not executed in parallel, since shared
machine check banks must be only monitored by single CPU.

Noted and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:29:40 +00:00
kib
a9ebca3e14 Use IS_BSP() macro.
Noted and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:22:30 +00:00
kib
1bd517bdbd Use relaxed atomics to access the monitor line.
We must ensure that accesses occur, they do not have any other
compiler-visible effects.  Bruce found some situations where
optimization could remove an access, and provided a patch to use
volatile qualifier for the state variables.  Since volatile behaviour
there is the compiler-specific interpretation of the keyword, use
relaxed atomics instead, which gives exactly the desired semantic.

Noted by and discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-24 14:02:46 +00:00
avg
b84a44c4ca add a new ACPI suspend debugging knob, debug.acpi.suspend_deep_bounce
This sysctl allows a deeper dive into the sleep abyss comparing to
debug.acpi.suspend_bounce.  When the new sysctl is set the system will
execute the suspend sequence up to the call to AcpiEnterSleepState().
That includes saving processor contexts and parking APs.  Then, instead
of actually entering the sleep state, the BSP will call resumectx() to
emulate the wakeup.  The APs should get restarted by the sequence of
Init and Startup IPIs that BSP sends to them.

MFC after:	8 days
2018-04-24 09:42:58 +00:00
jhb
1ae4d2ff0e Fix two off-by-one errors when allocating MSI and MSI-X interrupts.
x86 enforces an (arbitray) limit on the number of available MSI and
MSI-X interrupts to simplify code (in particular, interrupt_source[]
is statically sized).  This means that an attempt to allocate an MSI
vector needs to fail if it would go beyond the limit, but the checks
for exceeding the limit had an off-by-one error.  In the case of MSI-X
which allocates interrupts one at a time this meant that IRQ 768 kept
getting handed out multiple times for msix_alloc() instead of failing
because all MSI IRQs were in use.

Tested by:	lidl
MFC after:	1 week
2018-04-18 18:45:34 +00:00
cem
ef5bec98f2 cpufreq: Remove error-prone table terminators in favor of automatic sizing
PR:		227388
Reported by:	Vladimir Machulsky <xdelta AT meta.ua>
Sponsored by:	Dell EMC Isilon
2018-04-14 03:15:05 +00:00
kib
e3089a0318 i386 4/4G split.
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.

By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap.  The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.

There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.

copyout(9) was rewritten to use vm_fault_quick_hold().  An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls.  The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline.  If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.

The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done.  The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging.  I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.

Tested by: pho
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14633
2018-04-13 20:30:49 +00:00
brooks
9d79658aab Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
royger
e51e4bdd98 x86: fix trampoline memory allocation after r332073
Add the missing breaks in the for loops, in order to exit the loop
when a suitable entry is found.

Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to
find the virtual address of the boot_trampoline and the initial page
tables.

Reported and tested by:	pho
Sponsored by:		Citrix Systems R&D
2018-04-06 16:22:14 +00:00
royger
e1f89be1d3 remove GiB/MiB macros from param.h
And instead define them in the files where they are used.

Requested by: bde
2018-04-06 11:20:06 +00:00
royger
5f1547e410 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
avg
3331ff57c2 fix i386 build with CPU_ELAN (LINT for instance) after r331878
x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set.
While here, also remove the now unneeded inclusion of isareg.h in i386
and amd64 vm_machdep.c.

Reported by:	lwhsu
MFC after:	14 days
X-MFC with:	r331878
2018-04-03 17:16:06 +00:00
avg
8ff7c82ffb fix signatures of cpu_reset_real and cpu_reset_proxy, broken in r331878
When I moved these functions from i386 and amd64 to x86 I dropped their
prototype declarations (that were correct) and left only their definitions
that became incorrect.

Reported by:	bde
MFC after:	15 days
X-MFC with:	r331878
2018-04-03 06:46:26 +00:00
avg
cbde65132d unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c
Because I didn't see any reason not too.
I've been making some changes to the code and couldn't help but notice
that the i386 and am64 code was nearly identical.

MFC after:	17 days
2018-04-02 13:45:23 +00:00