Commit Graph

922 Commits

Author SHA1 Message Date
Roger Pau Monné
d7627401ec lapic: skip setting intrcnt if lapic is not present
Instead of panicking. Legacy PVH mode doesn't provide a lapic, and
since native_lapic_intrcnt is called unconditionally this would cause
the assert to trigger. Change the assert into a continue in order to
take into account the possibility of systems without a lapic.

Reviewed by:		jhb
Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
Differential revision:	https://reviews.freebsd.org/D17015
2018-09-13 07:13:13 +00:00
Roger Pau Monné
4edbde911b xen: fix setting legacy PVH vcpu id
The recommended way to obtain the vcpu id is using the cpuid
instruction with a specific leaf value. This leaf value must be
obtained at runtime, and it's done when populating the hypercall page.

Legacy PVH however will get the hypercall page populated by the
hypervisor itself before booting, so the cpuid leaf was not actually
set, thus preventing setting the vcpu id value from cpuid.

Fix this by making sure the cpuid leaf has been probed before
attempting to set the vcpu id.

Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
2018-09-13 07:12:16 +00:00
Roger Pau Monné
4fcd5f3003 xen: limit the usage of PIRQs to a legacy PVH Dom0
That's the only mode in FreeBSD that requires the usage of PIRQs, so
there's no need to attach the PIRQ PIC when running in other modes.

Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
2018-09-13 07:11:11 +00:00
Roger Pau Monné
ddbc1b4387 xen: fix initial kenv setup for legacy PVH
When adding support for the new PVH mode the kenv handling was
switched to use a boot time allocated scratch space, however the
legacy PVH early boot code was not modified to allocate such space.

Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
2018-09-13 07:09:41 +00:00
Roger Pau Monné
c9a591b0f6 xen: remove xenpv_set_ids
The vcpu_id for legacy PVH mode can be set from the output of cpuid,
so there's no need to have a special function to set it.

Also note that xenpv_set_ids should have been executed only for PV
guests, but was executed for all guests types and vcpu_id was later
fixed up for HVM guests.

Reported by:		cperciva
Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
2018-09-13 07:08:31 +00:00
Roger Pau Monné
fae9a0cb9b xen: fix PV IPI setup
So that it's done when the vcpu_id has been set. For the BSP the
vcpu_id is set at SUB_INTR, while for the APs it's done in
init_secondary_tail that's called at SUB_SMP order FIRST.

Reported and tested by:	cperciva
Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
Differential revision:	https://reviews.freebsd.org/D17013
2018-09-13 07:07:13 +00:00
Roger Pau Monné
a515acf7bb msi: remove the check that interrupt sources have been added
When running as a specific type of Xen guest the hypervisor won't
provide any emulated IO-APICs or legacy PICs at all, thus hitting the
following assert in the MSI code:

panic: Assertion num_io_irqs > 0 failed at /usr/src/sys/x86/x86/msi.c:334
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff826ffa70
vpanic() at vpanic+0x1a3/frame 0xffffffff826ffad0
panic() at panic+0x43/frame 0xffffffff826ffb30
msi_init() at msi_init+0xed/frame 0xffffffff826ffb40
apic_setup_io() at apic_setup_io+0x72/frame 0xffffffff826ffb50
mi_startup() at mi_startup+0x118/frame 0xffffffff826ffb70
start_kernel() at start_kernel+0x10

Fix this by removing the assert in the MSI code, since it's possible
to get to the MSI initialization without having registered any other
interrupt sources.

Reviewed by:		jhb
Approved by:		re (gjb)
Sponsored by:		Citrix Systems R&D
Differential revision:	https://reviews.freebsd.org/D17001
2018-09-13 07:05:51 +00:00
John Baldwin
cdb6aa7e47 Fix build of x86 UP kernels after dynamic IRQ changes in r338360.
Reported by:	Ian FREISLICH <ian.freislich@capeaugusta.com>
Approved by:	re (gjb)
MFC after:	2 weeks
2018-08-31 18:26:37 +00:00
John Baldwin
fd036deac1 Dynamically allocate IRQ ranges on x86.
Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts.  Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254.  MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI.  Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.

Replace statically assigned ranges with dynamic ranges.  Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required.  Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed.  To minimize runtime complexity these arrays are only sized
once during bootup.  The PIC range is determined by the PICs present
in the system.  The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.

As a result, various places are updated to use dynamic limits instead
of constants.  In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump.  This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.

This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry).  Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.

If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.

While here, reserve room in intrcnts for the Hyper-V counters.

PR:		229429, 130483
Reviewed by:	kib, royger, cem
Tested by:	royger (Xen), kib (DMAR)
Approved by:	re (gjb)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16861
2018-08-28 21:09:19 +00:00
Alan Cox
49bfa624ac Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
Konstantin Belousov
60b7423434 Unify amd64 and i386 vmspace0 pmap activation.
Add pmap_activate_boot() for i386, move the invocation on APs from MD
init_secondary() to x86 init_secondary_tail().

Suggested by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16893
2018-08-25 15:21:28 +00:00
John Baldwin
62a08214bc Remove 'imen' global variable from atpic(4).
In pre-SMPng, the global 'imen' was used to track mask state of the
hardware interrupts and was aligned to the masks used by spl*().
When the atpic code was converted to using the x86 interrupt source
abstraction, the global 'imen' was preserved by having each PIC
instance point to an invididual byte in the global 'imen' to hold its
8-bit interrupt mask.  The global 'imen' is no longer used for
anything however, so rather than storing pointers in 'struct atpic',
just store the individual 8-bit mask for each PIC as a char.

While here, convert the ATPIC macro to using C99 initializers.

Reviewed by:	kib, imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16827
2018-08-21 17:13:51 +00:00
Alan Cox
83a90bffd8 Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by:	kib, markj
Discussed with:	jeff, re@
Differential Revision:	https://reviews.freebsd.org/D16825
2018-08-21 16:43:46 +00:00
Alan Cox
44d0efb215 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
John Baldwin
a800b45c18 Merge amd64 and i386 <machine/intr_machdep.h> headers.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16803
2018-08-20 12:31:39 +00:00
John Baldwin
2734fedc4e Fix a couple of comment nits. 2018-08-19 17:57:51 +00:00
John Baldwin
38a13e9002 Fix the MPTable probe code after the 4:4 changes on i386.
The MPTable probe code was using PMAP_MAP_LOW as the PA -> VA offset
when searching for the table signature but still using KERNBASE once
it had found the table.  As a result, the mpfps table pointed into a
random part of the kernel text instead of the actual MP Table.

Rather than adding more #ifdef's, use BIOS_PADDRTOVADDR from
<machine/pc/bios.h> which already uses PMAP_MAP_LOW on i386 and KERNBASE
on amd64.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16802
2018-08-19 17:36:50 +00:00
John Baldwin
a568818913 Remove some vestiges of IPI_LAZYPMAP on i386.
The support for lazy pmap invalidations on i386 was removed in r281707.
This removes the constant for the IPI and stops accounting for it when
sizing the interrupt count arrays.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16801
2018-08-19 16:14:59 +00:00
Konstantin Belousov
9e2d4791d1 Print L1D FLUSH feature.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-08-18 12:17:05 +00:00
Mark Johnston
b8abc9d8f5 Help ensure that the copy loop doesn't get converted to a memcpy() call.
Reported and reviewed by: kib
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 19:21:31 +00:00
Konstantin Belousov
8d32b46379 Add definitions related to the L1D flush operation capability and MSR.
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:19:11 +00:00
Mark Johnston
27f4c235ee Explain why we aren't using memcpy().
Reported by:	jmg
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 14:50:06 +00:00
Mark Johnston
845800e190 Don't use memcpy() in the early microcode loading code.
At some point memcpy() may be an ifunc, ifunc resolution cannot be done
until CPU identification has been performed, and CPU identification must
be done after loading any microcode updates.

X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 14:02:53 +00:00
Mark Johnston
3571aee662 Fix the !SMP x86 build.
Reported by:	Michael Butler <imb@protected-networks.net>
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 13:56:42 +00:00
Mark Johnston
97edfc1b45 Implement kernel support for early loading of Intel microcode updates.
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel.  Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method.  For the time being only Intel updates are supported.

Microcode update files are passed to the kernel via loader(8).  The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update.  Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update.  Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16370
2018-08-13 17:13:09 +00:00
Mark Johnston
fe585be529 Verify that each frame pointer lies within the thread's kstack.
Previously, this check was omitted for the first frame pointer.

Reported by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16572
2018-08-03 02:51:37 +00:00
Alan Somers
6040822c4e Make timespecadd(3) and friends public
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.

Our kernel currently defines two-argument versions of timespecadd and
timespecsub.  NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions.  Solaris also defines a three-argument version, but
only in its kernel.  This revision changes our definition to match the
common three-argument version.

Bump _FreeBSD_version due to the breaking KPI change.

Discussed with:	cem, jilles, ian, bde
Differential Revision:	https://reviews.freebsd.org/D14725
2018-07-30 15:46:40 +00:00
Konstantin Belousov
45ed991d96 On amd64, enable workarounds for several Ryzen erratas as described in
the AMD document 55449 'Revision Guide for AMD Family 17h Models
00h-0Fh Processors' rev 1.12.

The errata numbers are mentioned near each action.

It seems that newer BIOSes already include required chicken bits
settings, so the magic MSR updates are only needed when BIOS cannot be
updated.  On the other hand, MWAIT avoidance seems to be important.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-07-27 15:31:20 +00:00
Roger Pau Monné
b0663c33c2 xen: implement early init helper for PVHv2
In order to setup an initial environment and jump into the generic
hammer_time initialization function. Some of the code is shared with
PVHv1, while other code is PVHv2 specific.

This allows booting FreeBSD as a PVHv2 DomU and Dom0.

Sponsored by:	Citrix Systems R&D
2018-07-19 08:44:52 +00:00
Roger Pau Monné
07c2711fbf xen: allow very early initialization of the hypercall page
Allow the hypercall page to be initialized very early, even before
vtophys is functional. Also make the function global so it can be
called by other files.

This will be needed in order to perform the early bringup on PVHv2
guests.

Sponsored by: Citrix Systems R&D
2018-07-19 08:13:41 +00:00
Roger Pau Monné
cfa0b7b82f xen: remove direct usage of HYPERVISOR_start_info
HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM
and PVHv2 guests get this data from HVM parameters that are fetched
using a hypercall.

Instead provide a set of helper functions that should be used to fetch
this data. The helper functions have different implementations
depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest
type.

This helps to cleanup generic Xen code by removing quite a lot of
xen_pv_domain and xen_hvm_domain macro usages.

Sponsored by:	Citrix Systems R&D
2018-07-19 07:54:45 +00:00
Mark Johnston
a18e40aad4 Use the existing MSR_BIOS_SIGN on AMD.
Reported by:	kib
Sponsored by:	The FreeBSD Foundation
2018-07-13 20:56:20 +00:00
Mark Johnston
5612bb23d0 Define the MSR used to fetch the current microcode patch level on AMD.
It is defined in the AMD family 17h register reference.

MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-07-13 19:42:59 +00:00
Mark Johnston
6ac05ba486 Use C99 initializers for instances of struct apic_enumerator.
MFC after:	3 days
2018-07-13 17:42:48 +00:00
Warner Losh
52379d36a9 Create helper functions for parsing boot args.
boot_parse_arg		to parse a single arg
boot_parse_cmdline	to parse a command line string
boot_parse_args		to parse all the args in a vector
boot_howto_to_env	Convert howto bits to env vars
boot_env_to_howto	Return howto mask mased on what's set in the environment.

All these routines return an int that's the bitmask of the args
translated to RB_* flags. As a special case, the 'S' flag sets the
comconsole_speed env var. Any arg that looks like a=b will set the env
key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'.  This
should help us reduce the number of redundant copies of these routines
in the tree.  It should also give a more uniform experience between
platforms.

Also, invent a new flag RB_PROBE that's set when 'P' is parsed.  On
x86 + BIOS, this means 'probe for the keyboard, and if it's not there
set both RB_MULTIPLE and RB_SERIAL (which means show the output on
both video and serial consoles, but make serial primary).  Others it
may be some similar concept of probing, but it's loader dependent
what, exactly, it means.

These routines are suitable for /boot/loader and/or the kernel,
though they may not be suitable for the tightly hand-rolled-for-space
environments like boot2.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16205
2018-07-13 16:43:05 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Konstantin Belousov
300c34e431 Add a name for the MSR controlling standard extended features report on AMD.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:44:18 +00:00
Konstantin Belousov
fe15b8543e Order the portion of the AMD-specific MSRs names definitions numerically.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-07-05 10:34:01 +00:00
Roger Pau Monné
8518997526 xen: obtain vCPU ID from CPUID
The Xen vCPU ID can be fetched from the cpuid instead of inferring it
from the ACPI ID.

Sponsored by: Citrix Systems R&D
2018-06-26 15:00:54 +00:00
Roger Pau Monné
1ad78dd631 xen: limit the number of hypercall pages to 1
The interface already guarantees that the number of hypercall pages is
always going to be 1, see the comment in interface/arch-x86/cpuid.h

Sponsored by: Citrix Systems R&D
2018-06-26 14:39:27 +00:00
Konstantin Belousov
ce3bf75015 Do not access ISA timer if BIOS reports that there is no legacy
devices present.

On at least one machine where it would matter since the ISA timer is
power gated when booted in the UEFI mode, BIOS still reports that the
legacy devices are present.  That is, user still have to manually
disable TSC calibration on such machines.  Hopefully it will be more
useful in the future.

Discussed with:	Ben Widawsky <benjamin.widawsky@intel.com>
Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16004
MFC after:	1 week
2018-06-25 11:24:26 +00:00
Konstantin Belousov
7705dd4df0 Provide a helper function acpi_get_fadt_bootflags() to fetch the FADT
x86 boot flags.

Reviewed by:	royger
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D16004
MFC after:	1 week
2018-06-25 11:01:12 +00:00
Bruce Evans
3cd246d9a9 Untangle configuration ifdefs a little. On x86, msi is optional on pci,
and also on apic in common and i386 files (except for xen it is optional
only on xenhvm), but it was not ifdefed except on apic in common and i386
files.

This is all that is left from an attempt to build a (sub-)minimal kernel
without any devices.  The isa "option" is still used without ifdefs in many
standard files even on amd64.  ISAPNP is not optional on at least i386.
ATPIC is not optional on i386 (it is used mainly for Xspuriousint).  But
pci is now supposed to be optional on x86.
2018-06-10 14:49:13 +00:00
Andriy Gapon
0fb3a72a0d x86: reorganize code that deals with unexpected NMI-s
Expected NMI-s are those than are either generated by the software (such
as a CPU sending NMI to other CPU) or generated by the hardware after
the software configured it to do so (such as NMI-s on PMC events).

Some unexpected NMI-s can be caused by hardware failures and it is
possible to inquire the hardware about them (somewhat like MCA but much
more primitive) using an EISA mechanism.  In some cases the origin of
the NMI can remain truly unknown.

This commit should not change any functionality.  It just reorganizes
the code, so that it is easier to extend with new checks for the origin
of the NMI.  Also, it frees the code that has nothing to do with ISA
from DEV_ISA.

MFC after:	3 weeks
2018-06-07 14:46:52 +00:00
Andriy Gapon
413ed27cd7 expand descriptions of x86 panic_on_nmi and kdb_on_nmi sysctls
The descriptions were as terse as the variable names and they did not
explain additional conditions for knobs.

MFC after:	1 week
2018-06-07 14:23:31 +00:00
Andriy Gapon
ec6faf94c4 add support for console resuming, implement it for uart, use on x86
This change adds a new optional console method cn_resume and a kernel
console interface cnresume.  Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume.  Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.

This change fixes a problem with a system of mine failing to resume when
a serial console is used.  I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.

To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
2018-05-29 16:16:24 +00:00
Andriy Gapon
ba79ab8215 fix x86 UP build broken by r334204, TSC resynchronization
Reported by:	bde
MFC after:	1 week
X-MFC with:	r334204
2018-05-29 16:03:53 +00:00
Andriy Gapon
279be68bfd re-synchronize TSC-s on SMP systems after resume, if necessary
The TSC-s are checked and synchronized only if they were good
originally.  That is, invariant, synchronized, etc.

This is necessary on an AMD-based system where after a wakeup from STR I
see that BSP clock differs from AP clocks by a count that roughly
corresponds to one second.  The APs are in sync with each other.  Not
sure if this is a hardware quirk or a firmware bug.

This is what I see after a resume with this change:
    SMP: passed TSC synchronization test after adjustment
    acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15551
2018-05-25 07:33:20 +00:00
Roger Pau Monné
92849603d0 xen/pvh: allocate dbg_stack
Or else init_secondary will hit a page fault (or write garbage
somewhere).

Sponsored by:	Citrix Systems R&D
2018-05-24 10:22:57 +00:00
Konstantin Belousov
e3fab0ff2b Fix UP build.
Reported by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-05-22 20:50:19 +00:00
John Baldwin
9e2154ff1c Cleanups related to debug exceptions on x86.
- Add constants for fields in DR6 and the reserved fields in DR7.  Use
  these constants instead of magic numbers in most places that use DR6
  and DR7.
- Refer to T_TRCTRAP as "debug exception" rather than a "trace trap"
  as it is not just for trace exceptions.
- Always read DR6 for debug exceptions and only clear TF in the flags
  register for user exceptions where DR6.BS is set.
- Clear DR6 before returning from a debug exception handler as
  recommended by the SDM dating all the way back to the 386.  This
  allows debuggers to determine the cause of each exception.  For
  kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value
  to other parts of the handler (namely, user_dbreg_trap()).  For user
  traps, wait until after trapsignal to clear DR6 so that userland
  debuggers can read DR6 via PT_GETDBREGS while the thread is stopped
  in trapsignal().

Reviewed by:	kib, rgrimes
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15189
2018-05-22 00:45:00 +00:00
Konstantin Belousov
3621ba1ede Add Intel Spec Store Bypass Disable control.
Speculative Store Bypass (SSB) is a speculative execution side channel
vulnerability identified by Jann Horn of Google Project Zero (GPZ) and
Ken Johnson of the Microsoft Security Response Center (MSRC)
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528.
Updated Intel microcode introduces a MSR bit to disable SSB as a
mitigation for the vulnerability.

Introduce a sysctl hw.spec_store_bypass_disable to provide global
control over the SSBD bit, akin to the existing sysctl that controls
IBRS. The sysctl can be set to one of three values:
0: off
1: on
2: auto

Future work will enable applications to control SSBD on a per-process
basis (when it is not enabled globally).

SSBD bit detection and control was verified with prerelease microcode.

Security:	CVE-2018-3639
Tested by:	emaste (previous version, without updated microcode)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:08:19 +00:00
Konstantin Belousov
9be4bbbb21 Add definition for Intel Speculative Store Bypass Disable MSR bits
Security:	CVE-2018-3639
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-21 21:07:13 +00:00
Konstantin Belousov
ba6ce3a34b Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 21:36:55 +00:00
Konstantin Belousov
45c228cc29 Fix PCID+PTI pmap operations on Xen/HVM.
Install appropriate pti-aware shootdown IPI handlers, otherwise user
page tables do not get enough invalidations.  The non-pti handlers
were used so far.

Reported and tested by:	cperciva
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 20:28:59 +00:00
Konstantin Belousov
7c25320c69 Fix IBRS handling around MWAIT.
The intent was to disable IBPB and IBRS around MWAIT, and re-enable on
the sleep end.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-05-19 20:26:33 +00:00
Andriy Gapon
7973b47369 fix a problem with bad performance after wakeup caused by r333321
This change reverts a "while here" part of r333321 that moved clearing
of suspended_cpus to an earlier place.

Apparently, there can be a problem when modifying (shared) memory before
restoring proper cache attributes.  So, to be safe, move the clearing to
the old place.

Many thanks to Johannes Lundberg for bisecting the changes to that
particular commit and then bisecting the commit to the particular
change.

Reported by:	many
Debugged by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	1 week
X-MFC with:	r333321
2018-05-17 10:16:20 +00:00
Andriy Gapon
7c5ccd2dce calibrate lapic timer in native_lapic_setup
The idea is to calibrate the LAPIC timer just once and only on boot,
given that [at present] the timer constants are global and shared
between all processors.

My primary motivation is to fix a panic that can happen when dynamically
switching to lapic timer.  The panic is caused by a recursion on
et_hw_mtx when printing the calibration results to console.  See the
review for the details of the panic.

Also, the code should become slightly simpler and easier to read.  The
previous code was racy too.  Multiple processors could start calibrating
the global constants concurrently, although that seems to have been
benign.

Reviewed by:	kib, mav, jhb
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15422
2018-05-15 16:56:30 +00:00
Warner Losh
b425e3fba2 Put the CPU starting on one line. 2018-05-07 21:09:21 +00:00
Andriy Gapon
de15b11aaa x86 cpususpend_handler: call wbinvd after setting suspend state bits
Without a subsequent wbinvd the changes to suspended_cpus (and
resuming_cpus) can be lost at least on AMD systems that use MOESI cache
coherency protocol.  That can happen because one of APs ends up as an
Owner of the corresponding cache line(s) and the changes may never reach
the main memory before the AP is reset.

While here, move clearing of suspended_cpus a little bit earlier as the
fact of returning from savectx (with zero return value) means that the
CPU has fully restored it execution context.

Also, rework the comment that describes the need for resuming_cpus.

This change fixed suspend to RAM a previously broken AMD-based system.

Reviewed by:	kib
Discussed with:	bde
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15295
2018-05-07 12:22:25 +00:00
Konstantin Belousov
d5effb01f1 Add helper macros to hide some boring repeatable ceremonies to define
ifuncs on x86.

Also keep helpers to define 'pseudo-ifuncs' which are emulated by the
indirect jmp.

Reviewed by:	jhb (previous version, as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13838
2018-05-03 21:45:59 +00:00
Jung-uk Kim
e787342e25 Redo r332918 with the ACPICA API and remove debug.acpi.suspend_deep_bounce.
AcpiOsEnterSleep() was meant to implement this feature.

Reviewed by:	avg
2018-05-03 19:00:50 +00:00
Roger Pau Monné
9021fe72fc xen: fix formatting of xen_init_ops
No functional change

Sponsored by: Citrix Systems R&D
2018-05-02 10:20:55 +00:00
Konstantin Belousov
986c4ca387 Turn off IBRS on suspend.
Resume starts CPU from the init state, which clears any loaded
microcode updates.  As result, IBRS MSRs are no longer available,
until the microcode is reloaded.

I have to forcibly clear cpu_stdext_feature3, which assumes that CPUID
leaf 7 reg %ebx does not report anything except Meltdown/Spectre bugs
bits.  If future CPUs add new bits there, hw_ibrs_recalculate() and
identify_cpu1()/identify_cpu2() need to be adjusted for that.

Submitted and tested by:	Michael Danilov <mike.d.ft402@gmail.com>
PR:	227866
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15236
2018-04-30 20:18:32 +00:00
Konstantin Belousov
160be7cc08 Fix spelling: Appolo -> Apollo [1].
The APL31 NDA errata is APL30 public errata.  Add the reference and
provide the description [2].

Noted by:	emaste [2], rpokala [1]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 19:23:19 +00:00
Konstantin Belousov
3f3937b4ae Handle Appolo Lake errata APL31.
If the workaround is activated, always send IPI for wake up, not rely
on the write to the monitor line.  This fixes Appolo Lake machines
early hang in sched_bind(), without requiring user to manually select
idle method.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 18:24:31 +00:00
Konstantin Belousov
a5f472c579 Some style and minor code improvements for idle selection.
Use designated initializers for the idlt_tlb elements.
Remove strstr() use, add flag field to detect supported MWAIT.
Use nitems() instead of the terminating NULL entry for idle_tlb.
Move several functions into cpu_idle_* namespace.

Based on the discussion with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-26 18:12:40 +00:00
Konstantin Belousov
506a906c05 Use CPUID leaf 0x15 to get TSC frequency when the calibration is
disabled.

Intel finally added this information, which allows us to not parse CPU
identification string looking for the nominal frequency.  The leaf is
present e.g. on Appolo Lake Atom CPUs.  It is only used if the TSC
calibration is disabled by user.

Also, report the TSC frequency in bootverbose mode always, regardless
of the way it was obtained.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-25 16:43:45 +00:00
Konstantin Belousov
55ba21d4fd Make the sysctl machdep.idle also a tunable.
It is applied before it is possible for idle threads to execute on any
CPU, allowing to work around against some bugs.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-24 20:49:16 +00:00
Konstantin Belousov
bc7e39c339 Extend ap_boot_mtx scope to also cover mca_init().
Otherwise, under bootverbose, the lapic_enable_cmc() banner 'lapicX:
CMCI unmasked' is printed by several CPUs in parallel, causing garbled
output for the LAPIC dumps.

Reported by:	royger
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:33:08 +00:00
Konstantin Belousov
215e4657d5 Ensure that cmci_monitor() is not executed in parallel, since shared
machine check banks must be only monitored by single CPU.

Noted and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:29:40 +00:00
Konstantin Belousov
d9d8645c3f Use IS_BSP() macro.
Noted and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D15157
2018-04-24 20:22:30 +00:00
Konstantin Belousov
a5bd21d0fe Use relaxed atomics to access the monitor line.
We must ensure that accesses occur, they do not have any other
compiler-visible effects.  Bruce found some situations where
optimization could remove an access, and provided a patch to use
volatile qualifier for the state variables.  Since volatile behaviour
there is the compiler-specific interpretation of the keyword, use
relaxed atomics instead, which gives exactly the desired semantic.

Noted by and discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-04-24 14:02:46 +00:00
Andriy Gapon
e673a4ec4c add a new ACPI suspend debugging knob, debug.acpi.suspend_deep_bounce
This sysctl allows a deeper dive into the sleep abyss comparing to
debug.acpi.suspend_bounce.  When the new sysctl is set the system will
execute the suspend sequence up to the call to AcpiEnterSleepState().
That includes saving processor contexts and parking APs.  Then, instead
of actually entering the sleep state, the BSP will call resumectx() to
emulate the wakeup.  The APs should get restarted by the sequence of
Init and Startup IPIs that BSP sends to them.

MFC after:	8 days
2018-04-24 09:42:58 +00:00
John Baldwin
f36411145e Fix two off-by-one errors when allocating MSI and MSI-X interrupts.
x86 enforces an (arbitray) limit on the number of available MSI and
MSI-X interrupts to simplify code (in particular, interrupt_source[]
is statically sized).  This means that an attempt to allocate an MSI
vector needs to fail if it would go beyond the limit, but the checks
for exceeding the limit had an off-by-one error.  In the case of MSI-X
which allocates interrupts one at a time this meant that IRQ 768 kept
getting handed out multiple times for msix_alloc() instead of failing
because all MSI IRQs were in use.

Tested by:	lidl
MFC after:	1 week
2018-04-18 18:45:34 +00:00
Conrad Meyer
f6e61711ed cpufreq: Remove error-prone table terminators in favor of automatic sizing
PR:		227388
Reported by:	Vladimir Machulsky <xdelta AT meta.ua>
Sponsored by:	Dell EMC Isilon
2018-04-14 03:15:05 +00:00
Konstantin Belousov
d86c1f0dc1 i386 4/4G split.
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.

By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap.  The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.

There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.

copyout(9) was rewritten to use vm_fault_quick_hold().  An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls.  The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline.  If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.

The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done.  The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging.  I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.

Tested by: pho
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D14633
2018-04-13 20:30:49 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Roger Pau Monné
e0f92f5c77 x86: fix trampoline memory allocation after r332073
Add the missing breaks in the for loops, in order to exit the loop
when a suitable entry is found.

Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to
find the virtual address of the boot_trampoline and the initial page
tables.

Reported and tested by:	pho
Sponsored by:		Citrix Systems R&D
2018-04-06 16:22:14 +00:00
Roger Pau Monné
444c6d6f03 remove GiB/MiB macros from param.h
And instead define them in the files where they are used.

Requested by: bde
2018-04-06 11:20:06 +00:00
Roger Pau Monné
9dba82a442 x86: improve reservation of AP trampoline memory
So that it doesn't rely on physmap[1] containing an address below
1MiB. Instead scan the full physmap and search for a suitable address
to place the trampoline code (below 1MiB) and the initial memory pages
(below 4GiB).

Sponsored by:		Citrix Systems R&D
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D14878
2018-04-05 14:39:51 +00:00
Andriy Gapon
3da25bdb02 fix i386 build with CPU_ELAN (LINT for instance) after r331878
x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set.
While here, also remove the now unneeded inclusion of isareg.h in i386
and amd64 vm_machdep.c.

Reported by:	lwhsu
MFC after:	14 days
X-MFC with:	r331878
2018-04-03 17:16:06 +00:00
Andriy Gapon
b7b25af06a fix signatures of cpu_reset_real and cpu_reset_proxy, broken in r331878
When I moved these functions from i386 and amd64 to x86 I dropped their
prototype declarations (that were correct) and left only their definitions
that became incorrect.

Reported by:	bde
MFC after:	15 days
X-MFC with:	r331878
2018-04-03 06:46:26 +00:00
Andriy Gapon
8428d0f154 unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c
Because I didn't see any reason not too.
I've been making some changes to the code and couldn't help but notice
that the i386 and am64 code was nearly identical.

MFC after:	17 days
2018-04-02 13:45:23 +00:00
Jeff Roberson
27a3c9d710 Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms.  Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:    jhb, kib
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14838
2018-03-28 18:47:35 +00:00
John Baldwin
d41e41f9f0 Remove very old and unused signal information codes.
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0.  The FPE_*_TRAP ones were deprecated even
earlier in 1999.

PR:		226579 (exp-run)
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14637
2018-03-27 20:57:51 +00:00
Jeff Roberson
261c408744 Backout r331606 until I can identify why it does not boot on some
machines.
2018-03-27 10:20:50 +00:00
Jeff Roberson
a48de40bcc Only use CPUs in the domain the device is attached to for default
assignment.  Device drivers are able to override the default assignment
if they bind directly.  There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by:	jhb, kib
Tested by:	pho (earlier version)
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14838
2018-03-27 03:37:04 +00:00
John Baldwin
7091608617 Add a workaround to the hypervisor detection for older versions of KVM.
Originally KVM set %eax to 0 in the cpuid leaf 0x4000000 rather than
to the highest supported leaf in the hypervisor "branch".  Detect this
case and fixup the %eax value so that the hypervisor is still
detected.

Reported by:	jpaetzel
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D14810
2018-03-23 22:36:24 +00:00
Konstantin Belousov
8fbcc3343f Move the CR0.WP manipulation KPI to x86.
This should allow to avoid some #ifdefs in the common x86/ code.

Requested by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-20 20:20:49 +00:00
John Baldwin
7af5f2acfb Fix a typo.
Reviewed by:	kib
2018-03-19 17:14:56 +00:00
Ed Maste
4e78ff7068 ANSIfy sys/x86 2018-03-17 01:40:09 +00:00
Roger Pau Monné
4a6d4e7b58 at_rtc: check in ACPI FADT boot flags if the RTC is present
Or else disable the device. Note that the detection can be bypassed by
setting the hw.atrtc.enable option in the loader configuration file.
More information can be found on atrtc(4).

Sponsored by:		Citrix Systems R&D
Reviewed by:		ian
Differential revision:	https://reviews.freebsd.org/D14399
2018-03-13 09:42:33 +00:00
Ian Lepore
22b3d71e82 Give the atrtc_time_lock a unique name.
Reported by:	hps@
2018-03-12 15:26:11 +00:00
Andriy Gapon
7471a3fae8 fix r297857, do not modify CPU extension bits under virtual machines
r297857 was meant for real hardware only.

PR:		213155
Submitted by:	mainland@apeiron.net
MFC after:	1 week
2018-03-12 11:28:09 +00:00
Ian Lepore
c7053bbe54 Revert r330780, it was improperly tested and results in taking a spin
mutex before acquiring sleep mutexes.

Reported by:	kib@
2018-03-11 20:13:15 +00:00
Ian Lepore
4b502f0016 Remove MTX_NOPROFILE from atrtc_lock, it was inappropriately copy/pasted
from the i8254 driver when I created separate mutexes for each.  The i8254
driver could be the active timecounter, leading to recursion during mutex
profiling, but the atrtc driver cannot be a timecounter, so it isn't needed.
2018-03-11 19:56:07 +00:00
Ian Lepore
86051be993 Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking. 2018-03-11 19:22:58 +00:00
Ian Lepore
67e2a29216 Everywhere that multiple registers are accessed in sequence, lock/unlock
just once around the whole group of accesses.
2018-03-11 18:54:45 +00:00