freebsd-skq

Author	SHA1	Message	Date
Vladimir Kondratyev	76cefcd810	Fix amd64/i386 LINT build after r344982 Submitted by: jkim Reported by: rpokala MFC with: r344982	2019-03-11 19:46:15 +00:00
Vladimir Kondratyev	2b4ee39838	atrtc(4): install ACPI RTC/CMOS operation region handler FreeBSD base system does not provide an ACPI handler for the PC/AT RTC/CMOS device with PnP ID PNP0B00; on some HP laptops, the absence of this handler causes suspend/resume and poweroff(8) to hang or fail [1], [2]. On these laptops EC _REG method queries the RTC date/time registers via ACPI before suspending/powering off. The handler should be registered before acpi_ec driver is loaded. This change adds handler to access CMOS RTC operation region described in section 9.15 of ACPI-6.2 specification [3]. It is installed only for ACPI version of atrtc(4) so it should not affect old ACPI-less i386 systems. It is possible to disable the handler with loader tunable: debug.acpi.disabled=atrtc Informational debugging printf can be enabled by setting hw.acpi.verbose=1 in loader.conf [1] https://wiki.freebsd.org/Laptops/HP_Envy_6Z-1100 [2] https://wiki.freebsd.org/Laptops/HP_Notebook_15-af104ur [3] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf PR: 207419, 213039 Submitted by: Anthony Jenkins <Scoobi_doo@yahoo.com> Reviewed by: ian Discussed on: acpi@, 2013-2015, several threads MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19314	2019-03-10 20:19:43 +00:00
John Baldwin	2e43efd0bb	Drop "All rights reserved" from my copyright statements. Reviewed by: rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D19485	2019-03-06 22:11:45 +00:00
Konstantin Belousov	a2d95495ee	Add usermode helpers for Intel userspace protection keys feature. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18893	2019-02-20 09:56:23 +00:00
Konstantin Belousov	e7a9df16e6	Add kernel support for Intel userspace protection keys feature on Skylake Xeons. See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the RDPKRU and WRPKRU instructions. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18893	2019-02-20 09:51:13 +00:00
Bruce Evans	27c56cf357	Fix hangs in r341810 waiting for AP startup. idle_td is dereferenced without thread-locking it to make its contents invariant, and was accessed without telling the compiler that its contents is invariant. Some compilers optimized accesses to the supposedly invariant contents by moving the critical checks for changes outside of the loop that waits for changes. Fix this using atomic ops. This bug only showed up for the following configuration: a Turion2 system, amd64 kernels, compiled by gcc, and SCHED_4BSD. clang fails to do the optimization with all CFLAGS that I tried, because it doesn't fully optimize the '__asm __volatile' for cpu_spinwait() although this asm has no memory clobber. gcc only does the optimization with most CFLAGS. I mostly used -Os with all compilers. i386 works because gcc -m32 -Os only moves 1 of the 2 accesses outside of the loop. Non-Turion2 systems and SCHED_ULE worked due to different timing (when all APs start before the BP checks them outside of the loop). Reviewed by: kib	2019-02-20 02:40:38 +00:00
Konstantin Belousov	5671e0d62e	Add definition for %cr4 PKRU enable bit. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18893	2019-02-19 19:13:48 +00:00
Konstantin Belousov	eb785fab3b	Port sysctl kern.elf32.read_exec from amd64 to i386. Make it more comprehensive on i386, by not setting nx bit for any mapping, not just adding PF_X to all kernel-loaded ELF segments. This is needed for the compatibility with older i386 programs that assume that read access implies exec, e.g. old X servers with hand-rolled module loader. Reported and tested by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-02-07 02:17:34 +00:00
Konstantin Belousov	f76b5ab6cc	Fix resume on i386 PAE. It was broken before PAE/no-PAE merge, but since PAE is now the default, resume is apparently broken on all machines. The corrected issues: - the trampoline page is not mapped executable, so machine faults when paging is on; - MSR.EFER and %cr4 both should be loaded before paging is enabled, otherwise paging structures are invalid (cr4.PAE and EFER.NX). - MSR.EFER and %cr4 should be only loaded if present. I attempt to handle this by not touching the registers if the value is zero. There are some more bits still not quite correct, e.g. unconditional access to %cr4 in resumectx. Reported and debugging help by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-02-07 02:09:34 +00:00
Konstantin Belousov	ccc2d07e77	Update CPUID bits definitions and CPU identification based on changes in SDM rev. 069. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-02-04 23:57:59 +00:00
Konstantin Belousov	c3f5a36651	x86: correctly limit max memory resource address. CPU and buses can manage up to the limit reported by cpu_maxphyaddr, so set mem_rman to the value returned by cpu_getmaxphyaddr(). For the PAE mode, it was missed both when rman_res_t was increased to uintmax_t, and from the PAE merge commit. When importing smaps or dump_avail chunks into memory rman, do not blindly ignore resources which end above the limit, chomp them instead if start is below the limit. The same change was already done to i386 add_physmap_entry(). Based on the submission by: bde MFC after: 2 months	2019-02-01 20:46:47 +00:00
Roger Pau Monné	27c36a12f1	xen: introduce a new way to setup event channel upcall The main differences with the currently implemented method are: - Requires a local APIC EOI, since it doesn't bypass the local APIC as the previous method used to do. - Can be set to use different IDT vectors on each vCPU. Note that FreeBSD doesn't make use of this feature since the event channel IDT vector is reserved system wide. Note that the old method of setting the event channel upcall is not removed, and will be used as a fallback if this newly introduced method is not available. MFC after: 1 month Sponsored by: Citrix Systems R&D	2019-01-30 11:34:52 +00:00
Konstantin Belousov	9a52756044	i386: Merge PAE and non-PAE pmaps into same kernel. Effectively all i386 kernels now have two pmaps compiled in: one managing PAE pagetables, and another non-PAE. The implementation is selected at cold time depending on the CPU features. The vm_paddr_t is always 64bit now. As a result, nx bit can be used on all capable CPUs. Option PAE only affects the bus_addr_t: it is still 32bit for non-PAE configs, for drivers compatibility. Kernel layout, esp. max kernel address, low memory PDEs and max user address (same as trampoline start) are now same for PAE and for non-PAE regardless of the type of page tables used. Non-PAE kernel (when using PAE pagetables) can handle physical memory up to 24G now, larger memory requires re-tuning the KVA consumers and instead the code caps the maximum at 24G. Unfortunately, a lot of drivers do not use busdma(9) properly so by default even 4G barrier is not easy. There are two tunables added: hw.above4g_allow and hw.above24g_allow, the first one is kept enabled for now to evaluate the status on HEAD, second is only for dev use. i386 now creates three freelists if there is any memory above 4G, to allow proper bounce pages allocation. Also, VM_KMEM_SIZE_SCALE changed from 3 to 1. The PAE_TABLES kernel config option is retired. In collaboration with: pho Discussed with: emaste Reviewed by: markj MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D18894	2019-01-30 02:07:13 +00:00
Konstantin Belousov	8f0916fc11	i386/PAE busdma: allow more bounce pages. If i386 has more than 4G of memory, allow the same number of busdma bounce pages as for amd64. In fact, in this case bouncing sometimes is much heavier than on amd64. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18854	2019-01-18 13:43:11 +00:00
Konstantin Belousov	957b9bbf3c	x86 busdma: fix mis-use of bus_addr_t where vm_paddr_t is assumed. Right now bus_addr_t and vm_paddr_t are always aliased to the same underlying integer type on x86, which makes the interchange hard to detect. Shortly, i386 kernel would use uint64_t for vm_paddr_t to enable automatic use of PAE paging structures if hardware allows it, while bus_addr_t would be extended to 64bit only when PAE option is specified. Fix all places that were identified as using bus_addr_t while page address was assumed. This was performed by testing the complete PAE merging patch on machine with > 4G of RAM enabled. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D18854	2019-01-18 13:38:56 +00:00
Conrad Meyer	16068ae479	Add definitions for AMD Spectre/Meltdown CPUID information No functional change, aside from printing recognized bits in CPU identification. The bits are documented in 111006-B "Indirect Branch Control Extension"[1] and 124441 "Speculative Store Bypass Disable."[2] Notably missing (left as future work): * Integration with hw.spec_store_bypass_disable and hw_ssb_active flag, which are currently Intel-specific * Integration with hw_ibrs_active global flag, which are currently Intel-specific * SSB_NO integration in hw_ssb_recalculate() * Bhyve integration (PR 235010) [1]: https://developer.amd.com/wp-content/resources/111006-B_AMD64TechnologyIndirectBranchControlExtenstion_WP_7-18Update_FNL.pdf [2]: https://developer.amd.com/wp-content/resources/124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf PR: 235010 (related, but does not fix) MFC after: a week	2019-01-17 19:44:47 +00:00
Konstantin Belousov	62ee17d2ee	Style(9) fixes for x86/busdma_bounce.c. Remove extra parentheses. Adjust indents and lines fill. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-01-16 06:10:55 +00:00
Konstantin Belousov	e471df6670	Remove unused prototype. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-01-16 05:51:03 +00:00
Conrad Meyer	15b7da10ac	vmm(4): Take steps towards multicore bhyve AMD support vmm's CPUID emulation presented Intel topology information to the guest, but disabled AMD topology information and in some cases passed through garbage. I.e., CPUID leaves 0x8000_001[de] were passed through to the guest, but guest CPUs can migrate between host threads, so the information presented was not consistent. This could easily be observed with 'cpucontrol -i 0xfoo /dev/cpuctl0'. Slightly improve this situation by enabling the AMD topology feature flag and presenting at least the CPUID fields used by FreeBSD itself to probe topology on more modern AMD64 hardware (Family 15h+). Older stuff is probably less interesting. I have not been able to empirically confirm it is sufficient, but it should not regress anything either. Reviewed by: araujo (previous version) Relnotes: sure	2019-01-16 02:19:04 +00:00
Conrad Meyer	6b83069e05	Expose threads-per-core and physical core count information With new sysctls (to the best of our ability to detect them). Restructured smp.4 slightly for clarity (keep relevant stuff closer to the top) while documenting. Reviewed by: markj, jhibbits (ppc parts) MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18322	2019-01-04 18:31:17 +00:00
John Baldwin	a230c2f1b3	Correct variable name in two panic messages: num_msi_irq -> num_msi_irqs. MFC after: 1 week	2018-12-31 22:46:43 +00:00
Andriy Gapon	82a5a27527	add support for marking interrupt handlers as suspended The goal of this change is to fix a problem with PCI shared interrupts during suspend and resume. I have observed a couple of variations of the following scenario. Devices A and B are on the same PCI bus and share the same interrupt. Device A's driver is suspended first and the device is powered down. Device B generates an interrupt. Interrupt handlers of both drivers are called. Device A's interrupt handler accesses registers of the powered down device and gets back bogus values (I assume all 0xff). That data is interpreted as interrupt status bits, etc. So, the interrupt handler gets confused and may produce some noise or enter an infinite loop, etc. This change affects only PCI devices. The pci(4) bus driver marks a child's interrupt handler as suspended after the child's suspend method is called and before the device is powered down. This is done only for traditional PCI interrupts, because only they can be shared. At the moment the change is only for x86. Notable changes in core subsystems / interfaces: - BUS_SUSPEND_INTR and BUS_RESUME_INTR methods are added to bus interface along with convenience functions bus_suspend_intr and bus_resume_intr; - rman_set_irq_cookie and rman_get_irq_cookie functions are added to provide a way to associate an interrupt resource with an interrupt cookie; - intr_event_suspend_handler and intr_event_resume_handler functions are added to the MI interrupt handler interface. I added two new interrupt handler flags, IH_SUSP and IH_CHANGED, to implement the new intr_event functions. IH_SUSP marks a suspended interrupt handler. IH_CHANGED is used to implement a barrier that ensures that a change to the interrupt handler's state is visible to future interrupts. While there, I fixed some whitespace issues in comments and changed a couple of logically boolean variables to be bool. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15755	2018-12-17 17:11:00 +00:00
Mark Johnston	b6da2600f9	Fix the PAE kernel gcc build. The error was caused by map_ucode() casting a vm_paddr_t to a void *. Use a uintptr_t instead to match the caller. Fix some style bugs while here. Reported by: bde Reviewed by: bde MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-11 16:49:01 +00:00
Konstantin Belousov	94dd54b9a2	Free bootstacks after AP startup. Bootstacks are unused after APs executed sched_throw() in init_secondary_tail() and started executing on proper idle thread stack. Add sysinit that detects that the idle thread for each CPU was scheduled at least once, and free corresponding bootstack. Slight addition of the code (~200 bytes) is compensated by the saving, because even on typical small modern desktop CPU we leak 128K of memory otherwise (4 pages x 8 threads). Reviewed by: jhb MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18486	2018-12-11 02:54:36 +00:00
Jayachandran C.	9417fa9e3c	acpica : move SRAT/SLIT parsing to sys/dev/acpica This moves the architecture independent parts of sys/x86/acpica/srat.c to sys/dev/acpica/acpi_pxm.c, to be used later on arm64. The function declarations are moved to sys/dev/acpica/acpivar.h We also need to update sys/conf/files.{i386,amd64} to use the new file. No functional changes. Reviewed by: markj, imp Differential Revision: https://reviews.freebsd.org/D17941	2018-12-08 19:10:58 +00:00
Jayachandran C.	a3a6167448	x86/acpica/srat.c: Add API for parsing proximity tables The SLIT and SRAT ACPI tables need to be parsed on arm64 as well, on systems that use UEFI/ACPI firmware and support NUMA. To do this, we need to move most of the logic of x86/acpica/srat.c to dev/acpica and provide an API that architectures can use to parse and configure ACPI NUMA information. This commit adds the API in srat.c as a first step, without making any functional changes. We will move the common code to sys/dev/acpica as the next step. The functions added are: * int acpi_pxm_init(int ncpus, vm_paddr_t maxphys) - to allocate and initialize data structures used * void acpi_pxm_parse_tables(void) - parse SRAT/SLIT, save the cpu and memory proximity information * void acpi_pxm_set_mem_locality(void) - use the saved data to set memory locality * void acpi_pxm_set_cpu_locality(void) - use the saved data to set cpu locality * void acpi_pxm_free(void) - free data structures allocated by init On arm64, we do not have a CPU APIC ID that can be used as an index to store CPU data, so we need to use the Processor Uid. To help with this, define internal functions cpu_add, cpu_find, cpu_get_info to store and get CPU proximity information. Reviewed by: markj, jhb (previous version) Differential Revision: https://reviews.freebsd.org/D17940	2018-12-08 18:34:05 +00:00
Ben Widawsky	91890b73ad	Add definitions for Intel Speed Shift These definitions will be used by a driver to implement Hardware P-States (autonomous control of HWP, via Intel Speed Shift technology). Reviewed by: kib Approved by: emaste (mentor) Differential Revision: https://reviews.freebsd.org/D18050	2018-11-21 00:21:58 +00:00
John Baldwin	e13507f6f0	Axe MINIMUM_MSI_INT. Just allow MSI interrupts to always start at the end of the I/O APIC pins. Since existing machines already have more than 255 I/O APIC pins, IRQ 255 is no longer reliably invalid, so just remove the minimum starting value for MSI. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D17991	2018-11-16 23:39:39 +00:00
Konstantin Belousov	2343757338	Align IA32_ARCH_CAP MSR definitions and use with SDM rev. 068. SDM rev. 068 was released yesterday and it contains the description of the MSR 0x10a IA32_ARCH_CAP. This change adds symbolic definitions for all bits present in the document, and decodes them in the CPU identification lines printed on boot. But also, the document defines SSB_NO as bit 4, while FreeBSD used bit 2 to detect the need to work-around Speculative Store Bypass issue. Change code to use the bit from SDM. Similarly, the document describes bit 3 as an indicator that L1TF issue is not present, in particular, no L1D flush is needed on VMENTRY. We used RDCL_NO to avoid flushing, and again I changed the code to follow new spec from SDM. In fact my Apollo Lake machine with latest ucode shows this: IA32_ARCH_CAPS=0x19<RDCL_NO,SKIP_L1DFL_VME,SSB_NO> Reviewed by: bwidawsk Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D18006	2018-11-16 21:27:11 +00:00
John Baldwin	b6b42932db	Convert the number of MSI IRQs on x86 from a constant to a tunable. The number of MSI IRQs still defaults to 512, but it can now be changed at boot time via the machdep.num_msi_irqs tunable. Reviewed by: kib, royger (older version) Reviewed by: markj MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D17977	2018-11-15 18:37:41 +00:00
John Baldwin	c6aba52e4f	Revert r332735 and fix MSI-X to properly fail allocations when full. The off-by-one errors in 332735 weren't actual errors and were preventing the last MSI interrupt source from being used. Instead, the issue is that when all MSI interrupt sources were allocated, the loop in msix_alloc() would terminate with 'msi' still set to non-null. The only check for 'i' overflowing was in the 'msi' == NULL case, so msix_alloc() would try to reuse the last MSI interrupt source instead of failing. Fix by moving the check for all sources being in use to just after the loop. Reviewed by: kib, markj MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D17976	2018-11-14 18:45:33 +00:00
Konstantin Belousov	83813c6696	Apply fix to un-cripple max cpu id on BSP earlier. We need to know actual value for the standard extended features before ifuncs are resolved. Reported and tested by: madpilot Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-11-12 19:17:26 +00:00
John Baldwin	7f7f6f85a1	Add a custom implementation of cpu_lock_delay() for x86. Avoid using DELAY() since it can try to use spin locks on CPUs without a P-state invariant TSC. For cpu_lock_delay(), always use the TSC if it exists (even if it is not P-state invariant) to delay for a microsecond. If the TSC does not exist, read from I/O port 0x84 to delay instead. PR: 228768 Reported by: Roger Hammerstein <cheeky.m@live.com> Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D17851	2018-11-05 22:54:03 +00:00
John Baldwin	3c03efc4ab	Add a delay_tsc() static function for when DELAY() uses the TSC. This uses slightly simpler logic than the existing code by using the full 64-bit counter and thus not having to worry about counter overflow. Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D17850	2018-11-05 22:51:45 +00:00
Konstantin Belousov	6bc6a54280	Add pci_early function to detect Intel stolen memory. On some Intel devices BIOS does not properly reserve memory (called "stolen memory") for the GPU. If the stolen memory is claimed by the OS, functions that depend on stolen memory (like frame buffer compression) can't be used. A function called pci_early_quirks that is called before the virtual memory system is started was added. In Linux, this PCI early quirks function iterates through all PCI slots to check for any device that requires quirks. While this more generic solution is preferable I only ported the Intel graphics specific parts because I think my implementation would be too similar to Linux GPL'd solution after looking at the Linux code too much. The code regarding Intel graphics stolen memory was ported from Linux. In the case of Intel graphics stolen memory this pci_early_quirks will read the stolen memory base and size from north bridge registers. The values are stored in global variables that are later read by linuxkpi_gplv2. Linuxkpi stores these values in a Linux-specific structure that is read by the drm driver. Relevant linuxkpi code is here: https://github.com/FreeBSDDesktop/kms-drm/blob/drm-v4.16/linuxkpi/gplv2/src/linux_compat.c#L37 For now, only amd64 arch is supported since that is the only arch supported by the new drm drivers. I was told that Intel GPUs are always located on 0:2:0 so these values are hard coded for now. Note that the structure and early execution of the detection code is not required in its current form, but we expect that the code will be added shortly which fixes the potential BIOS bugs by reserving the stolen range in phys_avail[]. This must be done as early as possible to avoid conflicts with the potential usage of the memory in kernel. Submitted by: Johannes Lundberg <johalun0@gmail.com> Reviewed by: bwidawsk, imp MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16719 Differential revision: https://reviews.freebsd.org/D17775	2018-10-31 23:17:00 +00:00
Mark Johnston	9978bd996b	Add malloc_domainset(9) and _domainset variants to other allocator KPIs. Remove malloc_domain(9) and most other _domain KPIs added in r327900. The new functions allow the caller to specify a general NUMA domain selection policy, rather than specifically requesting an allocation from a specific domain. The latter policy tends to interact poorly with M_WAITOK, resulting in situations where a caller is blocked indefinitely because the specified domain is depleted. Most existing consumers of the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy, in which we fall back to other domains to satisfy the allocation request. This change also defines a set of DOMAINSET_FIXED() policies, which only permit allocations from the specified domain. Discussed with: gallatin, jeff Reported and tested by: pho (previous version) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17418	2018-10-30 18:26:34 +00:00
Brooks Davis	c3adaa3305	Consolidate identical ELF auxargs type definitions. All platforms except powerpc use the same values and powerpc shares a majority of them. Go ahead and declare AT_NOTELF, AT_UID, and AT_EUID in favor of the unused AT_DCACHEBSIZE, AT_ICACHEBSIZE, and AT_UCACHEBSIZE for powerpc. Reviewed by: jhb, imp Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17397	2018-10-22 22:24:32 +00:00
Mark Johnston	b61f314290	Make it possible to disable NUMA support with a tunable. This provides a chicken switch for anyone negatively impacted by enabling NUMA in the amd64 GENERIC kernel configuration. With NUMA disabled at boot-time, information about the NUMA topology is not exposed to the rest of the kernel, and all of physical memory is viewed as coming from a single domain. This method still has some performance overhead relative to disabling NUMA support at compile time. PR: 231460 Reviewed by: alc, gallatin, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17439	2018-10-22 20:13:51 +00:00
Mark Johnston	662e7fa8d9	Create some global domainsets and refactor NUMA registration. Pre-defined policies are useful when integrating the domainset(9) policy machinery into various kernel memory allocators. The refactoring will make it easier to add NUMA support for other architectures. No functional change intended. Reviewed by: alc, gallatin, jeff, kib Tested by: pho (part of a larger patch) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17416	2018-10-20 17:36:00 +00:00
Mateusz Guzik	3f102f5881	Provide string functions for use before ifuncs get resolved. The change is a no-op for architectures which don't ifunc memset, memcpy nor memmove. Convert places which need them. Xen bits by royger. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17487	2018-10-11 23:28:04 +00:00
Mark Johnston	7c179abac7	Fix an inverted test in ucode_load_ap(). This caused microcode to be updated only on the BSP if hyperthreading was disabled, typically resulting in a hang or reset. Approved by: re (kib) Sponsored by: The FreeBSD Foundation	2018-10-03 14:20:43 +00:00
Andrew Gallatin	30c5525b3c	Allow empty NUMA memory domains to support Threadripper2 The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has only 2 memory controllers enabled. This means that only 2 of the 4 NUMA domains can be populated with physical memory, and the others are empty. Add support to FreeBSD for empty NUMA domains by: - creating empty memory domains when parsing the SRAT table, rather than failing to parse the table - not running the pageout daemon threads in empty domains - adding defensive code to UMA to avoid allocating from empty domains - adding defensive code to cpuset to avoid binding to an empty domain Thanks to Jeff for suggesting this strategy. Reviewed by: alc, markj Approved by: re (gjb@) Differential Revision: https://reviews.freebsd.org/D1683	2018-10-01 14:14:21 +00:00
Konstantin Belousov	632227a739	Update x86/ifunc.h. Remove ifunc emulation. Add helper for usermode ifunc resolver definition. Update copyright years. Sponsored by: The FreeBSD Foundation Approved by: re (rgrimes) MFC after: 1 week	2018-09-30 16:57:30 +00:00
Mark Johnston	463406ac4a	Add more NUMA-specific low memory predicates. Use these predicates instead of inline references to vm_min_domains. Also add a global all_domains set, akin to all_cpus. Reviewed by: alc, jeff, kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17278	2018-09-24 19:24:17 +00:00
Konstantin Belousov	d12c446550	Convert x86 cache invalidation functions to ifuncs. This simplifies the runtime logic and reduces the number of runtime-constant branches. Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D16736	2018-09-19 19:35:02 +00:00
John Baldwin	87bdca8290	Fix a regression in r338360 when booting an x86 machine without APIC. The atpic_register_sources callback tries to avoid registering interrupt sources that would collide with an I/O APIC. However, the previous implementation was failing to register IRQs 8-15 since the slave PIC saw valid IRQs from the master and assumed an I/O APIC was present. To fix, go back to registering all 8259A interrupt sources in one loop when the master's register_sources method is invoked. PR: 231291 Approved by: re (kib) MFC after: 1 month	2018-09-17 17:18:54 +00:00
Mark Johnston	d5089b3aed	Log a message after a successful boot-time microcode update. Reviewed by: kib Approved by: re (delphij) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17135	2018-09-14 17:04:36 +00:00
Roger Pau Monné	a74cdf4e74	xen: legacy PVH fixes for the new interrupt count Register interrupts using the PIC pic_register_sources method instead of doing it in apic_setup_io. This is now required, since the internal interrupt structures are not yet setup when calling apic_setup_io. Approved by: re (gjb) Sponsored by: Citrix Systems R&D	2018-09-13 07:14:11 +00:00
Roger Pau Monné	d7627401ec	lapic: skip setting intrcnt if lapic is not present Instead of panicking. Legacy PVH mode doesn't provide a lapic, and since native_lapic_intrcnt is called unconditionally this would cause the assert to trigger. Change the assert into a continue in order to take into account the possibility of systems without a lapic. Reviewed by: jhb Approved by: re (gjb) Sponsored by: Citrix Systems R&D Differential revision: https://reviews.freebsd.org/D17015	2018-09-13 07:13:13 +00:00
Roger Pau Monné	4edbde911b	xen: fix setting legacy PVH vcpu id The recommended way to obtain the vcpu id is using the cpuid instruction with a specific leaf value. This leaf value must be obtained at runtime, and it's done when populating the hypercall page. Legacy PVH however will get the hypercall page populated by the hypervisor itself before booting, so the cpuid leaf was not actually set, thus preventing setting the vcpu id value from cpuid. Fix this by making sure the cpuid leaf has been probed before attempting to set the vcpu id. Approved by: re (gjb) Sponsored by: Citrix Systems R&D	2018-09-13 07:12:16 +00:00
Roger Pau Monné	4fcd5f3003	xen: limit the usage of PIRQs to a legacy PVH Dom0 That's the only mode in FreeBSD that requires the usage of PIRQs, so there's no need to attach the PIRQ PIC when running in other modes. Approved by: re (gjb) Sponsored by: Citrix Systems R&D	2018-09-13 07:11:11 +00:00
Roger Pau Monné	ddbc1b4387	xen: fix initial kenv setup for legacy PVH When adding support for the new PVH mode the kenv handling was switched to use a boot time allocated scratch space, however the legacy PVH early boot code was not modified to allocate such space. Approved by: re (gjb) Sponsored by: Citrix Systems R&D	2018-09-13 07:09:41 +00:00
Roger Pau Monné	c9a591b0f6	xen: remove xenpv_set_ids The vcpu_id for legacy PVH mode can be set from the output of cpuid, so there's no need to have a special function to set it. Also note that xenpv_set_ids should have been executed only for PV guests, but was executed for all guests types and vcpu_id was later fixed up for HVM guests. Reported by: cperciva Approved by: re (gjb) Sponsored by: Citrix Systems R&D	2018-09-13 07:08:31 +00:00
Roger Pau Monné	fae9a0cb9b	xen: fix PV IPI setup So that it's done when the vcpu_id has been set. For the BSP the vcpu_id is set at SUB_INTR, while for the APs it's done in init_secondary_tail that's called at SUB_SMP order FIRST. Reported and tested by: cperciva Approved by: re (gjb) Sponsored by: Citrix Systems R&D Differential revision: https://reviews.freebsd.org/D17013	2018-09-13 07:07:13 +00:00
Roger Pau Monné	a515acf7bb	msi: remove the check that interrupt sources have been added When running as a specific type of Xen guest the hypervisor won't provide any emulated IO-APICs or legacy PICs at all, thus hitting the following assert in the MSI code: panic: Assertion num_io_irqs > 0 failed at /usr/src/sys/x86/x86/msi.c:334 cpuid = 0 time = 1 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff826ffa70 vpanic() at vpanic+0x1a3/frame 0xffffffff826ffad0 panic() at panic+0x43/frame 0xffffffff826ffb30 msi_init() at msi_init+0xed/frame 0xffffffff826ffb40 apic_setup_io() at apic_setup_io+0x72/frame 0xffffffff826ffb50 mi_startup() at mi_startup+0x118/frame 0xffffffff826ffb70 start_kernel() at start_kernel+0x10 Fix this by removing the assert in the MSI code, since it's possible to get to the MSI initialization without having registered any other interrupt sources. Reviewed by: jhb Approved by: re (gjb) Sponsored by: Citrix Systems R&D Differential revision: https://reviews.freebsd.org/D17001	2018-09-13 07:05:51 +00:00
John Baldwin	cdb6aa7e47	Fix build of x86 UP kernels after dynamic IRQ changes in r338360. Reported by: Ian FREISLICH <ian.freislich@capeaugusta.com> Approved by: re (gjb) MFC after: 2 weeks	2018-08-31 18:26:37 +00:00
John Baldwin	fd036deac1	Dynamically allocate IRQ ranges on x86. Previously, x86 used static ranges of IRQ values for different types of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used IRQ values from 0 to 254. MSI interrupts used a compile-time-defined range starting at 256, and Xen event channels used a compile-time-defined range after MSI. Some recent systems have more than 255 I/O APIC interrupt pins which resulted in those IRQ values overflowing into the MSI range triggering an assertion failure. Replace statically assigned ranges with dynamic ranges. Do a single pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to determine the total number of IRQs required. Allocate the interrupt source and interrupt count arrays dynamically once this pass has completed. To minimize runtime complexity these arrays are only sized once during bootup. The PIC range is determined by the PICs present in the system. The MSI and Xen ranges continue to use a fixed size, though this does make it possible to turn the MSI range size into a tunable in the future. As a result, various places are updated to use dynamic limits instead of constants. In addition, the vmstat(8) utility has been taught to understand that some kernels may treat 'intrcnt' and 'intrnames' as pointers rather than arrays when extracting interrupt stats from a crashdump. This is determined by the presence (vs absence) of a global 'nintrcnt' symbol. This change reverts r189404 which worked around a buggy BIOS which enumerated an I/O APIC twice (using the same memory mapped address for both entries but using an IRQ base of 256 for one entry and a valid IRQ base for the second entry). Making the "base" of MSI IRQ values dynamic avoids the panic that r189404 worked around, and there may now be valid I/O APICs with an IRQ base above 256 which this workaround would incorrectly skip. 
If in the future the issue reported in PR 130483 reoccurs, we will have to add a pass over the I/O APIC entries in the MADT to detect duplicates using the memory mapped address and use some strategy to choose the "correct" one. While here, reserve room in intrcnts for the Hyper-V counters. PR: 229429, 130483 Reviewed by: kib, royger, cem Tested by: royger (Xen), kib (DMAR) Approved by: re (gjb) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16861	2018-08-28 21:09:19 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Konstantin Belousov	60b7423434	Unify amd64 and i386 vmspace0 pmap activation. Add pmap_activate_boot() for i386, move the invocation on APs from MD init_secondary() to x86 init_secondary_tail(). Suggested by: alc Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation Approved by: re (marius) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16893	2018-08-25 15:21:28 +00:00
John Baldwin	62a08214bc	Remove 'imen' global variable from atpic(4). In pre-SMPng, the global 'imen' was used to track mask state of the hardware interrupts and was aligned to the masks used by spl*(). When the atpic code was converted to using the x86 interrupt source abstraction, the global 'imen' was preserved by having each PIC instance point to an individual byte in the global 'imen' to hold its 8-bit interrupt mask. The global 'imen' is no longer used for anything however, so rather than storing pointers in 'struct atpic', just store the individual 8-bit mask for each PIC as a char. While here, convert the ATPIC macro to using C99 initializers. Reviewed by: kib, imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16827	2018-08-21 17:13:51 +00:00
Alan Cox	83a90bffd8	Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter became unused in FreeBSD 12.x as a side-effect of the NUMA-related changes.) Reviewed by: kib, markj Discussed with: jeff, re@ Differential Revision: https://reviews.freebsd.org/D16825	2018-08-21 16:43:46 +00:00
Alan Cox	44d0efb215	Eliminate kmem_alloc_contig()'s unused arena parameter. Reviewed by: hselasky, kib, markj Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D16799	2018-08-20 15:57:27 +00:00
John Baldwin	a800b45c18	Merge amd64 and i386 <machine/intr_machdep.h> headers. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16803	2018-08-20 12:31:39 +00:00
John Baldwin	2734fedc4e	Fix a couple of comment nits.	2018-08-19 17:57:51 +00:00
John Baldwin	38a13e9002	Fix the MPTable probe code after the 4:4 changes on i386. The MPTable probe code was using PMAP_MAP_LOW as the PA -> VA offset when searching for the table signature but still using KERNBASE once it had found the table. As a result, the mpfps table pointed into a random part of the kernel text instead of the actual MP Table. Rather than adding more #ifdef's, use BIOS_PADDRTOVADDR from <machine/pc/bios.h> which already uses PMAP_MAP_LOW on i386 and KERNBASE on amd64. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D16802	2018-08-19 17:36:50 +00:00
John Baldwin	a568818913	Remove some vestiges of IPI_LAZYPMAP on i386. The support for lazy pmap invalidations on i386 was removed in r281707. This removes the constant for the IPI and stops accounting for it when sizing the interrupt count arrays. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16801	2018-08-19 16:14:59 +00:00
Konstantin Belousov	9e2d4791d1	Print L1D FLUSH feature. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-08-18 12:17:05 +00:00
Mark Johnston	b8abc9d8f5	Help ensure that the copy loop doesn't get converted to a memcpy() call. Reported and reviewed by: kib X-MFC with: r337715 Sponsored by: The FreeBSD Foundation	2018-08-14 19:21:31 +00:00
Konstantin Belousov	8d32b46379	Add definitions related to the L1D flush operation capability and MSR. Sponsored by: The FreeBSD Foundation	2018-08-14 17:19:11 +00:00
Mark Johnston	27f4c235ee	Explain why we aren't using memcpy(). Reported by: jmg X-MFC with: r337715 Sponsored by: The FreeBSD Foundation	2018-08-14 14:50:06 +00:00
Mark Johnston	845800e190	Don't use memcpy() in the early microcode loading code. At some point memcpy() may be an ifunc, ifunc resolution cannot be done until CPU identification has been performed, and CPU identification must be done after loading any microcode updates. X-MFC with: r337715 Sponsored by: The FreeBSD Foundation	2018-08-14 14:02:53 +00:00
Mark Johnston	3571aee662	Fix the !SMP x86 build. Reported by: Michael Butler <imb@protected-networks.net> X-MFC with: r337715 Sponsored by: The FreeBSD Foundation	2018-08-14 13:56:42 +00:00
Mark Johnston	97edfc1b45	Implement kernel support for early loading of Intel microcode updates. Updates in the format described in section 9.11 of the Intel SDM can now be applied as one of the first steps in booting the kernel. Updates that are loaded this way are automatically re-applied upon exit from ACPI sleep states, in contrast with the existing cpucontrol(8)-based method. For the time being only Intel updates are supported. Microcode update files are passed to the kernel via loader(8). The file type must be "cpu_microcode" in order for the file to be recognized as a candidate microcode update. Updates for multiple CPU types may be concatenated together into a single file, in which case the kernel will select and apply a matching update. Memory used to store the update file will be freed back to the system once the update is applied, so this approach will not consume more memory than required. Reviewed by: kib MFC after: 6 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16370	2018-08-13 17:13:09 +00:00
Mark Johnston	fe585be529	Verify that each frame pointer lies within the thread's kstack. Previously, this check was omitted for the first frame pointer. Reported by: pho Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16572	2018-08-03 02:51:37 +00:00
Alan Somers	6040822c4e	Make timespecadd(3) and friends public The timespecadd(3) family of macros were imported from NetBSD back in r35029. However, they were initially guarded by #ifdef _KERNEL. In the meantime, we have grown at least 28 syscalls that use timespecs in some way, leading many programs both inside and outside of the base system to redefine those macros. It's better just to make the definitions public. Our kernel currently defines two-argument versions of timespecadd and timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define three-argument versions. Solaris also defines a three-argument version, but only in its kernel. This revision changes our definition to match the common three-argument version. Bump _FreeBSD_version due to the breaking KPI change. Discussed with: cem, jilles, ian, bde Differential Revision: https://reviews.freebsd.org/D14725	2018-07-30 15:46:40 +00:00
Konstantin Belousov	45ed991d96	On amd64, enable workarounds for several Ryzen errata as described in the AMD document 55449 'Revision Guide for AMD Family 17h Models 00h-0Fh Processors' rev 1.12. The errata numbers are mentioned near each action. It seems that newer BIOSes already include required chicken bits settings, so the magic MSR updates are only needed when BIOS cannot be updated. On the other hand, MWAIT avoidance seems to be important. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-07-27 15:31:20 +00:00
Roger Pau Monné	b0663c33c2	xen: implement early init helper for PVHv2 In order to setup an initial environment and jump into the generic hammer_time initialization function. Some of the code is shared with PVHv1, while other code is PVHv2 specific. This allows booting FreeBSD as a PVHv2 DomU and Dom0. Sponsored by: Citrix Systems R&D	2018-07-19 08:44:52 +00:00
Roger Pau Monné	07c2711fbf	xen: allow very early initialization of the hypercall page Allow the hypercall page to be initialized very early, even before vtophys is functional. Also make the function global so it can be called by other files. This will be needed in order to perform the early bringup on PVHv2 guests. Sponsored by: Citrix Systems R&D	2018-07-19 08:13:41 +00:00
Roger Pau Monné	cfa0b7b82f	xen: remove direct usage of HYPERVISOR_start_info HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM and PVHv2 guests get this data from HVM parameters that are fetched using a hypercall. Instead provide a set of helper functions that should be used to fetch this data. The helper functions have different implementations depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest type. This helps to cleanup generic Xen code by removing quite a lot of xen_pv_domain and xen_hvm_domain macro usages. Sponsored by: Citrix Systems R&D	2018-07-19 07:54:45 +00:00
Mark Johnston	a18e40aad4	Use the existing MSR_BIOS_SIGN on AMD. Reported by: kib Sponsored by: The FreeBSD Foundation	2018-07-13 20:56:20 +00:00
Mark Johnston	5612bb23d0	Define the MSR used to fetch the current microcode patch level on AMD. It is defined in the AMD family 17h register reference. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-07-13 19:42:59 +00:00
Mark Johnston	6ac05ba486	Use C99 initializers for instances of struct apic_enumerator. MFC after: 3 days	2018-07-13 17:42:48 +00:00
Warner Losh	52379d36a9	Create helper functions for parsing boot args. boot_parse_arg: parse a single arg; boot_parse_cmdline: parse a command line string; boot_parse_args: parse all the args in a vector; boot_howto_to_env: convert howto bits to env vars; boot_env_to_howto: return howto mask based on what's set in the environment. All these routines return an int that's the bitmask of the args translated to RB_* flags. As a special case, the 'S' flag sets the comconsole_speed env var. Any arg that looks like a=b will set the env key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'. This should help us reduce the number of redundant copies of these routines in the tree. It should also give a more uniform experience between platforms. Also, invent a new flag RB_PROBE that's set when 'P' is parsed. On x86 + BIOS, this means 'probe for the keyboard, and if it's not there set both RB_MULTIPLE and RB_SERIAL' (which means show the output on both video and serial consoles, but make serial primary). On others it may be some similar concept of probing, but it's loader dependent what, exactly, it means. These routines are suitable for /boot/loader and/or the kernel, though they may not be suitable for the tightly hand-rolled-for-space environments like boot2. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16205	2018-07-13 16:43:05 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possibly other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Konstantin Belousov	300c34e431	Add a name for the MSR controlling standard extended features report on AMD. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-05 10:44:18 +00:00
Konstantin Belousov	fe15b8543e	Order the portion of the AMD-specific MSRs names definitions numerically. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-05 10:34:01 +00:00
Roger Pau Monné	8518997526	xen: obtain vCPU ID from CPUID The Xen vCPU ID can be fetched from the cpuid instead of inferring it from the ACPI ID. Sponsored by: Citrix Systems R&D	2018-06-26 15:00:54 +00:00
Roger Pau Monné	1ad78dd631	xen: limit the number of hypercall pages to 1 The interface already guarantees that the number of hypercall pages is always going to be 1, see the comment in interface/arch-x86/cpuid.h Sponsored by: Citrix Systems R&D	2018-06-26 14:39:27 +00:00
Konstantin Belousov	ce3bf75015	Do not access ISA timer if BIOS reports that there are no legacy devices present. On at least one machine where it would matter since the ISA timer is power gated when booted in the UEFI mode, BIOS still reports that the legacy devices are present. That is, users still have to manually disable TSC calibration on such machines. Hopefully it will be more useful in the future. Discussed with: Ben Widawsky <benjamin.widawsky@intel.com> Reviewed by: royger Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16004 MFC after: 1 week	2018-06-25 11:24:26 +00:00
Konstantin Belousov	7705dd4df0	Provide a helper function acpi_get_fadt_bootflags() to fetch the FADT x86 boot flags. Reviewed by: royger Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16004 MFC after: 1 week	2018-06-25 11:01:12 +00:00
Bruce Evans	3cd246d9a9	Untangle configuration ifdefs a little. On x86, msi is optional on pci, and also on apic in common and i386 files (except for xen it is optional only on xenhvm), but it was not ifdefed except on apic in common and i386 files. This is all that is left from an attempt to build a (sub-)minimal kernel without any devices. The isa "option" is still used without ifdefs in many standard files even on amd64. ISAPNP is not optional on at least i386. ATPIC is not optional on i386 (it is used mainly for Xspuriousint). But pci is now supposed to be optional on x86.	2018-06-10 14:49:13 +00:00
Andriy Gapon	0fb3a72a0d	x86: reorganize code that deals with unexpected NMI-s Expected NMI-s are those that are either generated by the software (such as a CPU sending NMI to other CPU) or generated by the hardware after the software configured it to do so (such as NMI-s on PMC events). Some unexpected NMI-s can be caused by hardware failures and it is possible to inquire the hardware about them (somewhat like MCA but much more primitive) using an EISA mechanism. In some cases the origin of the NMI can remain truly unknown. This commit should not change any functionality. It just reorganizes the code, so that it is easier to extend with new checks for the origin of the NMI. Also, it frees the code that has nothing to do with ISA from DEV_ISA. MFC after: 3 weeks	2018-06-07 14:46:52 +00:00
Andriy Gapon	413ed27cd7	expand descriptions of x86 panic_on_nmi and kdb_on_nmi sysctls The descriptions were as terse as the variable names and they did not explain additional conditions for knobs. MFC after: 1 week	2018-06-07 14:23:31 +00:00
Andriy Gapon	ec6faf94c4	add support for console resuming, implement it for uart, use on x86 This change adds a new optional console method cn_resume and a kernel console interface cnresume. Consoles that may need to re-initialize their hardware after suspend (e.g., because firmware does not care to do it) will implement cn_resume. Note that it is called in rather early environment not unlike early boot, so the same restrictions apply. Platform specific code, for platforms that support hardware suspend, should call cnresume early after resume, before any console output is expected. This change fixes a problem with a system of mine failing to resume when a serial console is used. I found that the serial port was in a strange configuration and an attempt to write to it likely resulted in an infinite loop. To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER macro has been extended to support optional methods. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15552	2018-05-29 16:16:24 +00:00
Andriy Gapon	ba79ab8215	fix x86 UP build broken by r334204, TSC resynchronization Reported by: bde MFC after: 1 week X-MFC with: r334204	2018-05-29 16:03:53 +00:00
Andriy Gapon	279be68bfd	re-synchronize TSC-s on SMP systems after resume, if necessary The TSC-s are checked and synchronized only if they were good originally. That is, invariant, synchronized, etc. This is necessary on an AMD-based system where after a wakeup from STR I see that BSP clock differs from AP clocks by a count that roughly corresponds to one second. The APs are in sync with each other. Not sure if this is a hardware quirk or a firmware bug. This is what I see after a resume with this change: SMP: passed TSC synchronization test after adjustment acpi_timer0: restoring timecounter, ACPI-fast -> TSC-low Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15551	2018-05-25 07:33:20 +00:00
Roger Pau Monné	92849603d0	xen/pvh: allocate dbg_stack Or else init_secondary will hit a page fault (or write garbage somewhere). Sponsored by: Citrix Systems R&D	2018-05-24 10:22:57 +00:00
Konstantin Belousov	e3fab0ff2b	Fix UP build. Reported by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-05-22 20:50:19 +00:00
John Baldwin	9e2154ff1c	Cleanups related to debug exceptions on x86. - Add constants for fields in DR6 and the reserved fields in DR7. Use these constants instead of magic numbers in most places that use DR6 and DR7. - Refer to T_TRCTRAP as "debug exception" rather than a "trace trap" as it is not just for trace exceptions. - Always read DR6 for debug exceptions and only clear TF in the flags register for user exceptions where DR6.BS is set. - Clear DR6 before returning from a debug exception handler as recommended by the SDM dating all the way back to the 386. This allows debuggers to determine the cause of each exception. For kernel traps, clear DR6 in the T_TRCTRAP case and pass DR6 by value to other parts of the handler (namely, user_dbreg_trap()). For user traps, wait until after trapsignal to clear DR6 so that userland debuggers can read DR6 via PT_GETDBREGS while the thread is stopped in trapsignal(). Reviewed by: kib, rgrimes MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15189	2018-05-22 00:45:00 +00:00
Konstantin Belousov	3621ba1ede	Add Intel Spec Store Bypass Disable control. Speculative Store Bypass (SSB) is a speculative execution side channel vulnerability identified by Jann Horn of Google Project Zero (GPZ) and Ken Johnson of the Microsoft Security Response Center (MSRC) https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. Updated Intel microcode introduces a MSR bit to disable SSB as a mitigation for the vulnerability. Introduce a sysctl hw.spec_store_bypass_disable to provide global control over the SSBD bit, akin to the existing sysctl that controls IBRS. The sysctl can be set to one of three values: 0: off 1: on 2: auto Future work will enable applications to control SSBD on a per-process basis (when it is not enabled globally). SSBD bit detection and control was verified with prerelease microcode. Security: CVE-2018-3639 Tested by: emaste (previous version, without updated microcode) Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:08:19 +00:00
Konstantin Belousov	9be4bbbb21	Add definition for Intel Speculative Store Bypass Disable MSR bits Security: CVE-2018-3639 Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-21 21:07:13 +00:00
Konstantin Belousov	ba6ce3a34b	Style. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-19 21:36:55 +00:00
Konstantin Belousov	45c228cc29	Fix PCID+PTI pmap operations on Xen/HVM. Install appropriate pti-aware shootdown IPI handlers, otherwise user page tables do not get enough invalidations. The non-pti handlers were used so far. Reported and tested by: cperciva Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-19 20:28:59 +00:00
Konstantin Belousov	7c25320c69	Fix IBRS handling around MWAIT. The intent was to disable IBPB and IBRS around MWAIT, and re-enable on the sleep end. Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-05-19 20:26:33 +00:00
Andriy Gapon	7973b47369	fix a problem with bad performance after wakeup caused by r333321 This change reverts a "while here" part of r333321 that moved clearing of suspended_cpus to an earlier place. Apparently, there can be a problem when modifying (shared) memory before restoring proper cache attributes. So, to be safe, move the clearing to the old place. Many thanks to Johannes Lundberg for bisecting the changes to that particular commit and then bisecting the commit to the particular change. Reported by: many Debugged by: Johannes Lundberg <johalun0@gmail.com> MFC after: 1 week X-MFC with: r333321	2018-05-17 10:16:20 +00:00
Andriy Gapon	7c5ccd2dce	calibrate lapic timer in native_lapic_setup The idea is to calibrate the LAPIC timer just once and only on boot, given that [at present] the timer constants are global and shared between all processors. My primary motivation is to fix a panic that can happen when dynamically switching to lapic timer. The panic is caused by a recursion on et_hw_mtx when printing the calibration results to console. See the review for the details of the panic. Also, the code should become slightly simpler and easier to read. The previous code was racy too. Multiple processors could start calibrating the global constants concurrently, although that seems to have been benign. Reviewed by: kib, mav, jhb MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15422	2018-05-15 16:56:30 +00:00
Warner Losh	b425e3fba2	Put the CPU starting on one line.	2018-05-07 21:09:21 +00:00
Andriy Gapon	de15b11aaa	x86 cpususpend_handler: call wbinvd after setting suspend state bits Without a subsequent wbinvd the changes to suspended_cpus (and resuming_cpus) can be lost at least on AMD systems that use MOESI cache coherency protocol. That can happen because one of APs ends up as an Owner of the corresponding cache line(s) and the changes may never reach the main memory before the AP is reset. While here, move clearing of suspended_cpus a little bit earlier as the fact of returning from savectx (with zero return value) means that the CPU has fully restored its execution context. Also, rework the comment that describes the need for resuming_cpus. This change fixed suspend to RAM on a previously broken AMD-based system. Reviewed by: kib Discussed with: bde MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15295	2018-05-07 12:22:25 +00:00
Konstantin Belousov	d5effb01f1	Add helper macros to hide some boring repeatable ceremonies to define ifuncs on x86. Also keep helpers to define 'pseudo-ifuncs' which are emulated by the indirect jmp. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:45:59 +00:00
Jung-uk Kim	e787342e25	Redo r332918 with the ACPICA API and remove debug.acpi.suspend_deep_bounce. AcpiOsEnterSleep() was meant to implement this feature. Reviewed by: avg	2018-05-03 19:00:50 +00:00
Roger Pau Monné	9021fe72fc	xen: fix formatting of xen_init_ops No functional change Sponsored by: Citrix Systems R&D	2018-05-02 10:20:55 +00:00
Konstantin Belousov	986c4ca387	Turn off IBRS on suspend. Resume starts CPU from the init state, which clears any loaded microcode updates. As a result, IBRS MSRs are no longer available, until the microcode is reloaded. I have to forcibly clear cpu_stdext_feature3, which assumes that CPUID leaf 7 reg %ebx does not report anything except Meltdown/Spectre bugs bits. If future CPUs add new bits there, hw_ibrs_recalculate() and identify_cpu1()/identify_cpu2() need to be adjusted for that. Submitted and tested by: Michael Danilov <mike.d.ft402@gmail.com> PR: 227866 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15236	2018-04-30 20:18:32 +00:00
Konstantin Belousov	160be7cc08	Fix spelling: Appolo -> Apollo [1]. The APL31 NDA errata is APL30 public errata. Add the reference and provide the description [2]. Noted by: emaste [2], rpokala [1] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-26 19:23:19 +00:00
Konstantin Belousov	3f3937b4ae	Handle Apollo Lake errata APL31. If the workaround is activated, always send an IPI for wakeup instead of relying on the write to the monitor line. This fixes the early hang of Apollo Lake machines in sched_bind(), without requiring the user to manually select the idle method. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-26 18:24:31 +00:00
Konstantin Belousov	a5f472c579	Some style and minor code improvements for idle selection. Use designated initializers for the idlt_tlb elements. Remove strstr() use, add flag field to detect supported MWAIT. Use nitems() instead of the terminating NULL entry for idle_tlb. Move several functions into cpu_idle_* namespace. Based on the discussion with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-26 18:12:40 +00:00
Konstantin Belousov	506a906c05	Use CPUID leaf 0x15 to get TSC frequency when the calibration is disabled. Intel finally added this information, which allows us to not parse CPU identification string looking for the nominal frequency. The leaf is present e.g. on Apollo Lake Atom CPUs. It is only used if the TSC calibration is disabled by user. Also, report the TSC frequency in bootverbose mode always, regardless of the way it was obtained. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-25 16:43:45 +00:00
Konstantin Belousov	55ba21d4fd	Make the sysctl machdep.idle also a tunable. It is applied before it is possible for idle threads to execute on any CPU, which makes it possible to work around some bugs. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-24 20:49:16 +00:00
Konstantin Belousov	bc7e39c339	Extend ap_boot_mtx scope to also cover mca_init(). Otherwise, under bootverbose, the lapic_enable_cmc() banner 'lapicX: CMCI unmasked' is printed by several CPUs in parallel, causing garbled output for the LAPIC dumps. Reported by: royger Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:33:08 +00:00
Konstantin Belousov	215e4657d5	Ensure that cmci_monitor() is not executed in parallel, since shared machine check banks must be only monitored by single CPU. Noted and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:29:40 +00:00
Konstantin Belousov	d9d8645c3f	Use IS_BSP() macro. Noted and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:22:30 +00:00
Konstantin Belousov	a5bd21d0fe	Use relaxed atomics to access the monitor line. We must ensure that the accesses occur; they do not have any other compiler-visible effects. Bruce found some situations where optimization could remove an access, and provided a patch to use volatile qualifier for the state variables. Since volatile behaviour there is the compiler-specific interpretation of the keyword, use relaxed atomics instead, which gives exactly the desired semantic. Noted by and discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-24 14:02:46 +00:00
Andriy Gapon	e673a4ec4c	add a new ACPI suspend debugging knob, debug.acpi.suspend_deep_bounce This sysctl allows a deeper dive into the sleep abyss compared to debug.acpi.suspend_bounce. When the new sysctl is set the system will execute the suspend sequence up to the call to AcpiEnterSleepState(). That includes saving processor contexts and parking APs. Then, instead of actually entering the sleep state, the BSP will call resumectx() to emulate the wakeup. The APs should get restarted by the sequence of Init and Startup IPIs that BSP sends to them. MFC after: 8 days	2018-04-24 09:42:58 +00:00
John Baldwin	f36411145e	Fix two off-by-one errors when allocating MSI and MSI-X interrupts. x86 enforces an (arbitrary) limit on the number of available MSI and MSI-X interrupts to simplify code (in particular, interrupt_source[] is statically sized). This means that an attempt to allocate an MSI vector needs to fail if it would go beyond the limit, but the checks for exceeding the limit had an off-by-one error. In the case of MSI-X which allocates interrupts one at a time this meant that IRQ 768 kept getting handed out multiple times for msix_alloc() instead of failing because all MSI IRQs were in use. Tested by: lidl MFC after: 1 week	2018-04-18 18:45:34 +00:00
Conrad Meyer	f6e61711ed	cpufreq: Remove error-prone table terminators in favor of automatic sizing PR: 227388 Reported by: Vladimir Machulsky <xdelta AT meta.ua> Sponsored by: Dell EMC Isilon	2018-04-14 03:15:05 +00:00
Konstantin Belousov	d86c1f0dc1	i386 4/4G split. The change makes the user and kernel address spaces on i386 independent, giving each almost the full 4G of usable virtual addresses except for one PDE at top used for trampoline and per-CPU trampoline stacks, and system structures that must be always mapped, namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if allocated. By using 1:1 mapping for the kernel text and data, it appeared possible to eliminate assembler part of the locore.S which bootstraps initial page table and KPTmap. The code is rewritten in C and moved into the pmap_cold(). The comment in vmparam.h explains the KVA layout. There is no PCID mechanism available in protected mode, so each kernel/user switch back and forth completely flushes the TLB, except for the trampoline PTD region. The TLB invalidations for userspace become trivial, because IPI handlers switch page tables. On the other hand, context switches no longer need to reload %cr3. copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for new copyout(9) is compatibility with wiring user buffers around sysctl handlers. This explains two kinds of locks for copyout ptes and accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow path, is only tried after the 'fast path' failed, which temporarily changes mapping to the userspace and copies the data to/from small per-cpu buffer in the trampoline. If a page fault occurs during the copy, it is short-circuited by exception.s to not even reach C code. The change was motivated by the need to implement the Meltdown mitigation, but instead of KPTI the full split is done. The i386 architecture already shows the sizing problems, in particular, it is impossible to link clang and lld with debugging. I expect that the issues due to the virtual address space limits would only worsen and the split gives more liveness to the platform. Tested by: pho Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14633	2018-04-13 20:30:49 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Roger Pau Monné	e0f92f5c77	x86: fix trampoline memory allocation after r332073 Add the missing breaks in the for loops, in order to exit the loop when a suitable entry is found. Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to find the virtual address of the boot_trampoline and the initial page tables. Reported and tested by: pho Sponsored by: Citrix Systems R&D	2018-04-06 16:22:14 +00:00
Roger Pau Monné	444c6d6f03	remove GiB/MiB macros from param.h And instead define them in the files where they are used. Requested by: bde	2018-04-06 11:20:06 +00:00
Roger Pau Monné	9dba82a442	x86: improve reservation of AP trampoline memory So that it doesn't rely on physmap[1] containing an address below 1MiB. Instead scan the full physmap and search for a suitable address to place the trampoline code (below 1MiB) and the initial memory pages (below 4GiB). Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14878	2018-04-05 14:39:51 +00:00
Andriy Gapon	3da25bdb02	fix i386 build with CPU_ELAN (LINT for instance) after r331878 x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set. While here, also remove the now unneeded inclusion of isareg.h in i386 and amd64 vm_machdep.c. Reported by: lwhsu MFC after: 14 days X-MFC with: r331878	2018-04-03 17:16:06 +00:00
Andriy Gapon	b7b25af06a	fix signatures of cpu_reset_real and cpu_reset_proxy, broken in r331878 When I moved these functions from i386 and amd64 to x86 I dropped their prototype declarations (that were correct) and left only their definitions that became incorrect. Reported by: bde MFC after: 15 days X-MFC with: r331878	2018-04-03 06:46:26 +00:00
Andriy Gapon	8428d0f154	unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c Because I didn't see any reason not to. I've been making some changes to the code and couldn't help but notice that the i386 and amd64 code was nearly identical. MFC after: 17 days	2018-04-02 13:45:23 +00:00
Jeff Roberson	27a3c9d710	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
John Baldwin	d41e41f9f0	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
Jeff Roberson	261c408744	Backout r331606 until I can identify why it does not boot on some machines.	2018-03-27 10:20:50 +00:00
Jeff Roberson	a48de40bcc	Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-27 03:37:04 +00:00
John Baldwin	7091608617	Add a workaround to the hypervisor detection for older versions of KVM. Originally KVM set %eax to 0 in the cpuid leaf 0x40000000 rather than to the highest supported leaf in the hypervisor "branch". Detect this case and fix up the %eax value so that the hypervisor is still detected. Reported by: jpaetzel Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14810	2018-03-23 22:36:24 +00:00
Konstantin Belousov	8fbcc3343f	Move the CR0.WP manipulation KPI to x86. This should make it possible to avoid some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
John Baldwin	7af5f2acfb	Fix a typo. Reviewed by: kib	2018-03-19 17:14:56 +00:00
Ed Maste	4e78ff7068	ANSIfy sys/x86	2018-03-17 01:40:09 +00:00
Roger Pau Monné	4a6d4e7b58	at_rtc: check in ACPI FADT boot flags if the RTC is present Or else disable the device. Note that the detection can be bypassed by setting the hw.atrtc.enable option in the loader configuration file. More information can be found on atrtc(4). Sponsored by: Citrix Systems R&D Reviewed by: ian Differential revision: https://reviews.freebsd.org/D14399	2018-03-13 09:42:33 +00:00
Ian Lepore	22b3d71e82	Give the atrtc_time_lock a unique name. Reported by: hps@	2018-03-12 15:26:11 +00:00
Andriy Gapon	7471a3fae8	fix r297857, do not modify CPU extension bits under virtual machines r297857 was meant for real hardware only. PR: 213155 Submitted by: mainland@apeiron.net MFC after: 1 week	2018-03-12 11:28:09 +00:00
Ian Lepore	c7053bbe54	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
Ian Lepore	4b502f0016	Remove MTX_NOPROFILE from atrtc_lock, it was inappropriately copy/pasted from the i8254 driver when I created separate mutexes for each. The i8254 driver could be the active timecounter, leading to recursion during mutex profiling, but the atrtc driver cannot be a timecounter, so it isn't needed.	2018-03-11 19:56:07 +00:00
Ian Lepore	86051be993	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
Ian Lepore	67e2a29216	Everywhere that multiple registers are accessed in sequence, lock/unlock just once around the whole group of accesses.	2018-03-11 18:54:45 +00:00
Ian Lepore	8355852f85	Use separate mutexes for atrtc and i8254 locking. Change all the strange un-function-like RTC_LOCK/UNLOCK macro usage into normal function calls. Since there is no longer any need to handle register access from a debugger context, those function calls can just be regular mutex lock/unlock calls. Requested by: bde	2018-03-11 18:20:49 +00:00
Ian Lepore	14d08b45b8	Convert atrtc to the new style rtc debugging output. Remove the db show command handler which provided much the same information. Removing the possibility of accessing the hardware regs from the debugger context paves the way for simplifying the locking code in the driver.	2018-03-11 16:57:14 +00:00
Ed Maste	315fbaeca2	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
Konstantin Belousov	33099716f3	Do not return out of bound pointers from intr_lookup_source(). This hardens the code against driver and upper level bugs causing invalid indexes used, e.g. on msi release. Reported by: gallatin Reviewed by: gallatin, hselasky Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14470	2018-02-23 11:20:59 +00:00
Warner Losh	ef1fcaf0f5	Do not include float interfaces when using libsa. We don't support float in the boot loaders, so don't include interfaces for float or double in systems headers. In addition, take the unusual step of spiking double and float to prevent any more accidental seepage.	2018-02-23 04:04:25 +00:00
Mark Johnston	2fb9a51077	Don't include DMAR map entry zone items in kernel dumps. Such items may be allocated in the I/O path used by the dumper, potentially causing the dump to fail. Since there is some precedent in the DMAR driver for avoiding this problem using _NODUMP, apply this workaround to the zone as well. Reported and tested by: mmacy Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14422	2018-02-18 16:03:50 +00:00
Konstantin Belousov	fc97574bd3	Remove unused symbols. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 23:18:42 +00:00
Roger Pau Monné	c2bddfdc51	xen/pv: remove the attach of the ISA bus from the Xen PV bus There's no need to attach the ISA bus from the Xen PV one. Sponsored by: Citrix Systems R&D	2018-02-16 18:04:27 +00:00
Mateusz Guzik	b345111b2b	xen: fix smp boot after r328157 mce_stack was left unset leading to early crashes	2018-02-15 07:23:41 +00:00
Konstantin Belousov	c688c9051b	Fix build with gas. Do not use C constant suffixes. Bit values are small enough to not require typing, even though they are used for 64bit MSR writes. The added cast in hw_ibrs_recalculate() is redundant but I prefer to add it for clarity. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-13 15:30:31 +00:00
Warner Losh	62bca77843	Move __va_list and related defines to sys/sys/_types.h __va_list and related defines are identical in all the ARCH/include/_types.h files. Move them to sys/sys/_types.h Sponsored by: Netflix	2018-02-12 14:48:20 +00:00
Konstantin Belousov	b31b965e7c	Expand IBRS TLA in sysctl help lines. Requested by: bz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:54:05 +00:00
Konstantin Belousov	319117fd57	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
Konstantin Belousov	3b5319325e	Do not enable PTI when IA32_ARCH_CAP_RDCL_NO bit is set. Intel document 336996-001 claims that this will be the way to inform about Meltdown correction. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 14:25:42 +00:00
Warner Losh	d6b6639713	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PNP tables not tagged; there are bigger issues with them.	2018-01-29 00:22:30 +00:00
Alexander Motin	a5232cc4fb	Assume Always Running APIC Timer for AMD CPU families >= 0x12. Falling back to HPET may cause lock congestion on many-core systems. This change replicates Linux behavior. MFC after: 1 month	2018-01-28 18:18:03 +00:00
Konstantin Belousov	c8f9c1f3d9	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER: user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporarily. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shoot down kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporary mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
Konstantin Belousov	e65c8c1afb	Fix native_lapic_ipi_alloc(). When PTI is enabled, empty IDT slots point to rsvd_pti. Reported by: Dexuan-BSD Cui <dexuan.bsd@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 5 days	2018-01-27 11:33:21 +00:00
Pedro F. Giffuni	d821d36419	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they pretend to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Ed Maste	b3327f62f0	Enable KPTI by default on amd64 for non-AMD CPUs Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required. Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
Konstantin Belousov	bd50262f70	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text); PCPU; GDT/IDT/user LDT/task structures; IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables, then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occurred on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefit besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over all the existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode makes global tables (PG_G) meaningless and even harmful, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while the trampoline has not switched from the PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
Ian Lepore	e780324662	Remove redundant critical_enter/exit() calls. The block of code delimited by these calls is now protected by a spin mutex (obscured within the RTC_LOCK/RTC_UNLOCK macros). Reported by: bde@	2018-01-16 23:18:52 +00:00
Ian Lepore	428cdf0280	Move some code around and rename a couple variables; no functional changes. The static atrtc_set() function was called only from clock_settime(), so just move its contents entirely into clock_settime() and delete atrtc_set(). Rename the struct bcd_clocktime variables from 'ct' to 'bct'. I had originally wanted to emphasize how identical the clocktime and bcd_clocktime structs were, but things evolved to the point where the structs are not at all identical anymore, so now emphasizing the difference seems better.	2018-01-16 23:14:12 +00:00
Ian Lepore	e5ef01427c	Add static inline rtcin_locked() and rtcout_locked() functions for doing a related series of operations without doing a lock/unlock for each byte. Use them when reading and writing the entire set of time registers. The original rtcin() and writertc() functions which do lock/unlock on each byte still exist, because they are public and called by outside code.	2018-01-16 03:02:41 +00:00
Pedro F. Giffuni	74641f0bc6	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these are likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the surrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
Ian Lepore	7c63e50188	Convert the x86 RTC driver to use new validated BCD<->timespec conversions. New common routines were added to kern/subr_clock.c for converting between calendrical time expressed in BCD and struct timespec. The new functions return EINVAL on error, as expected when the clock hardware does not provide valid time. PR: 224813 Differential Revision: https://reviews.freebsd.org/D13731 (no reviewers)	2018-01-15 16:40:43 +00:00
Konstantin Belousov	e8c770a66e	Enumerate and print Intel CPU features for Speculative Execution Side Channel Mitigations. The definitions are taken from the document 336996-001. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-14 12:36:23 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Conrad Meyer	233933cb00	amd64: Add a 48-bit MAXADDR constant Some devices (e.g., ccp(4) -- to be committed) can only access the low 48 bits of physical memory. Reviewed by: markj Sponsored by: Dell EMC Isilon	2018-01-13 17:55:22 +00:00
Jeff Roberson	6f4acaf4c9	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545	2018-01-12 23:34:16 +00:00
Jeff Roberson	3f289c3fcf	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Clean up some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
Konstantin Belousov	0530a9360f	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after microcode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
Konstantin Belousov	af317aa4e5	Use the new SDM-approved way to serialize x2APIC MSR writes. SDM editions 64 and below stated that it is enough to use MFENCE or LFENCE to serialize x2APIC register writes. New edition 65 requires either full serialization instruction or MFENCE;LFENCE sequence. Use the latter, FreeBSD needs serialization to ensure that writes done before IPI request are visible to the target IPI CPU. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-03 11:23:47 +00:00
Konstantin Belousov	da457ed9d6	Add CR4.SMAP control bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-01 19:34:19 +00:00
Colin Percival	d5d7606c0c	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
Marius Strobl	15f0034553	With the advent of interrupt remapping, Intel has repurposed bit 11 (now: Interrupt_Index[15]) and assigned the previously reserved bits 55:48 (Interrupt_Index[14:0] goes into 63:49 while Destination Field used 63:56 and bit 48 now is Interrupt_Format) in the IO redirection tables (see the VT-d specification, "5.1.5.1 I/OxAPIC Programming"). Thus, when not using interrupt remapping, ensure that all previously reserved bits in the high part of the RTEs are zero instead of doing a read-modify-write for their Destination Field bits only. Otherwise, on machines based on Apollo Lake and its derivatives such as Denverton, typically some of the previously preserved bits remain set after boot when not employing interrupt remapping. The result is that INTx interrupts are not getting delivered. Note: With an AMD IOMMU, interrupt remapping apparently bypasses the IO APIC altogether. Submitted by: loos (modulo comment) Reviewed by: jhb (modulo comment)	2017-12-28 21:46:09 +00:00
Poul-Henning Kamp	8ba749fbe3	Introduce an architecture-agnostic <sys/_stdarg.h> to reduce platform divergence. Only architectures which pass arguments in registers (mips) and platforms which use really weird compilers (any?) would need to augment the contents of <sys/_stdarg.h> Convert x86, arm and arm64 architectures to use <sys/_stdarg.h>	2017-12-25 20:54:00 +00:00
Warner Losh	ed98ce5cad	Further investigation shows this shouldn't have been added at all. Remove it.	2017-12-24 17:59:48 +00:00
Warner Losh	d76103580a	Comment this out until I have time to get to the bottom of why it's failing for some people.	2017-12-24 16:36:50 +00:00
Warner Losh	7dcb3b1295	Warn when nonPNP ISA devices are attached in GENERIC that they are being removed from GENERIC in 12. Always print PNP info for ISA when it exists: it doesn't depend on ISAPNP. Add PNP ID to orm and vga to prevent us from warning about them since those devices aren't being removed from GENERIC. PNP devices will be removed from GENERIC too, but they will be automatically loaded, so need no warning. We don't warn for non-GENERIC kernels because people running them are presumed to know what they are doing. MFC After: 2 weeks	2017-12-23 22:57:14 +00:00
Konstantin Belousov	6332b14887	Add missed AVX512VL (128 and 256 bit vector length) extension identification bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-23 21:32:50 +00:00
Bruce Evans	da9fba5447	Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension. restart_cpus() worked well enough by accident. Before this set of fixes, resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset (stopped_cpus) to become empty, but since mixtures of stopped and suspended CPUs are not close to working, stopped_cpus must be empty when resuming so the wait is null -- restart_cpus just allows the other CPUs to restart and returns without waiting. Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and add further kludges to try to keep it working for the XEN case. It was only used for XEN. It waited on suspended_cpus. This works for XEN. However, for ACPI, resuming is a 2-step process. ACPI has already woken up the other CPUs and removed them from suspended_cpus. This fix records the move by putting them in a new cpuset resuming_cpus. Waiting on suspended_cpus would give the same null wait as waiting on stopped_cpus. Wait on resuming_cpus instead. Add a cpuset toresume_cpus to map the CPUs being told to resume to keep this separate from the cpuset started_cpus for mapping the CPUs being told to restart. Mixtures of stopped and suspended/resuming CPUs are still far from working. Describe new and some old cpusets in comments. Add further kludges to cpususpend_handler() to try to avoid breaking it for XEN. XEN doesn't use resumectx(), so it doesn't use the second return path for savectx(), and it goes from the suspended state directly to the restarted state, while ACPI resume goes through the resuming state. Enter the resuming state early for all cases so that resume_cpus can test for being in this state and not have to worry about the intermediate !suspended state for ACPI only. Reviewed by: kib	2017-12-21 09:17:48 +00:00
Bruce Evans	2ba6fe0009	Remove the permanent double mapping of low physical memory and replace it by a transient double mapping for the one instruction in ACPI wakeup where it is needed (and for many surrounding instructions in ACPI resume). Invalidate the TLB as soon as convenient after undoing the transient mapping. ACPI resume already has the strict ordering needed for this. This fixes the non-trapping of null pointers and other garbage pointers below NBPDR (except transiently). NBPDR is quite large (4MB, or 2MB for PAE). This fixes spurious traps at the first instruction in VM86 bioscalls. The traps are for transiently missing read permission in the first VM86 page (physical page 0) which was just written to at KERNBASE in the kernel. The mechanism is unknown (it is not simply PG_G). locore uses a similar but larger transient double mapping and needs it for 2 instructions instead of 1. Unmap the first PDE in it after the 2 instructions to detect most garbage pointers while bootstrapping. pmap_bootstrap() finishes the unmapping. Remove the avoidance of the double mapping for a recently fixed special case. ACPI resume could use this avoidance (made non-special) to avoid any problems with the transient double mapping, but no such problems are known. Update comments in locore. Many were for old versions of FreeBSD which tried to map low memory r/o except for special cases, or might have allowed access to low memory via physical offsets. Now all kernel maps are r/w, and removal of the double map disallows use of physical offsets again.	2017-12-18 13:53:22 +00:00
Pedro F. Giffuni	64de3fdd58	SPDX: use the Beerware identifier.	2017-11-30 20:33:45 +00:00
Jung-uk Kim	82f0844956	Properly skip the first CPU. It only accidentally worked because the CPU_FOREACH() loop always starts from BSP (cpu0) and the if condition is always false for APs. Reported by: cem	2017-11-30 20:21:42 +00:00
Pedro F. Giffuni	8820ecc040	SPDX: Fix some cases wrongly attributed to MIT. In the cases of BSD-style license variants without clauses, use 0BSD for the time being in lack of a better description.	2017-11-30 15:10:11 +00:00
Jung-uk Kim	e374a321fe	Add a tunable "debug.hwpstate_verify" to check P-state after changing it and turn it off by default. It is very inefficient to verify current P-state of each core, especially for CPUs with many cores. When multiple commands are requested to the same power domain before completion of pending transitions, the last command is executed according to the manual. Because requests are serialized by the caller, all cores will receive the same command for each call. Do not call sched_bind() and sched_unbind(). It is redundant because the caller does it anyway.	2017-11-30 01:40:07 +00:00
Jung-uk Kim	72b27e9773	Fix style(9).	2017-11-29 23:52:31 +00:00
Pedro F. Giffuni	ebf5747bdb	sys/x86: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.	2017-11-27 15:11:47 +00:00
Konstantin Belousov	383f241dce	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Ruslan Bukin	3b418d1b9a	Add Intel Processor Trace registers for: - CPUID - Table of Physical Addresses (ToPA). Sponsored by: DARPA, AFRL	2017-11-17 17:54:10 +00:00
Konstantin Belousov	4e421792ec	Remove i386 XBOX support. It is for a console introduced in 2001 and featuring a Pentium III processor. Even if any of them are still alive and run FreeBSD, we do not have any sign of life from their users. Meanwhile, removing dozens more #ifdefs from the i386 sources reduces the aversion to looking at the code and improves the platform vitality. Reviewed by: cem, pfg, rink (XBOX support author) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13016	2017-11-16 14:27:02 +00:00
Ruslan Bukin	b510dab312	Add Intel Processor Trace (PT) MSRs. Sponsored by: DARPA, AFRL	2017-11-12 23:13:04 +00:00
Konstantin Belousov	dc00696a27	Correct operators precedence. Also keep the calculated vm_page_alloc_contig() flags in the variable to not re-evaluate it on the loop iteration. Noted by: alc Sponsored by: The FreeBSD Foundation	2017-11-09 13:09:07 +00:00
Jeff Roberson	8d6fbbb867	Replace many instances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Michal Meloun	904d8c492f	Add AT_HWCAP2 ELF auxiliary vector. - allocate value for new AT_HWCAP2 auxiliary vector on all platforms. - expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly same way as for AT_HWCAP. MFC after: 1 month Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D12699	2017-10-21 12:05:01 +00:00
Conrad Meyer	194446f9b7	x86: Decode AMD "Extended Feature Extensions ID EBX" bits In particular, this determines CPU support for the CLZERO instruction. (No, I am not making this name up.) Sponsored by: Dell EMC Isilon	2017-09-20 18:30:37 +00:00
Conrad Meyer	c50df68a08	MCA: Expand AMD Thresholding support to cover all banks When it was added in r314636, AMD Thresholding was hardcoded to only bank 4 (Northbridge) for some reason. However, even on family 10h the MCAx_MISC register Valid/Present bits determine whether thresholding is supported on that bank. Expand thresholding support to monitor all monitorable banks. This simplifies some of the logic and makes it more consistent with our Intel CMCI support. Reviewed by: markj (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12321	2017-09-17 22:58:13 +00:00
John Baldwin	8df419f2df	Add AT_EHDRFLAGS and AT_HWCAP on amd64. x86 has two separate (but identical) list of AT_* constants and the earlier commit to add AT_HWCAP only updated the i386 list.	2017-09-14 15:34:29 +00:00
John Baldwin	c2f37b9245	Add AT_HWCAP and AT_EHDRFLAGS on all platforms. A new 'u_long sv_hwcap' field is added to 'struct sysentvec'. A process ABI can set this field to point to a value holding a mask of architecture-specific CPU feature flags. If an ABI does not wish to supply AT_HWCAP to processes the field can be left as NULL. The support code for AT_EHDRFLAGS was already present on all systems, just the #define was not present. This is a step towards unifying the AT_ constants across platforms. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12290	2017-09-14 14:26:55 +00:00
Conrad Meyer	d63edb4dc6	MCA: Rename AMD MISC bits/masks They apply to all AMD MCAi_MISC0 registers, not just MCA4 (NB). No functional change. Sponsored by: Dell EMC Isilon	2017-09-11 20:42:07 +00:00
Conrad Meyer	f739be66e6	x86 MCA: Extract CMCI support predicate into function On AMD, the MCG_CAP feature bit is reserved -- not explicitly zero. Do not use it to determine CMCI support. Reviewed by: avg, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12320	2017-09-11 20:41:25 +00:00
Konstantin Belousov	809f2d8b8b	Fix ioapic acpi id matching on PCI attach and rid calculation. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2017-09-11 18:29:09 +00:00
Conrad Meyer	e8be4e41c6	Decode new AMD SVM feature bits on family 17h Sponsored by: Dell EMC Isilon	2017-09-11 18:11:53 +00:00
Konstantin Belousov	3c700e2e4c	Enhance qpi.c to make it usable on all Core-microarchitecture Xeons. Scan all buses for CSR bus, not stopping on the first failed match. Scan all slots for function 0 on the found bus, for instance on IvyBridge the slot 0 is not decoded at all. Since the scan is quite unsafe, and access to the buses is mostly useful for developers, enable the csr buses scan with the tunable. Current qpi.c makes too many assumptions about the uncore configuration buses location and about slots occupied. Also it restricts itself only to Nehalem CPUs. It is needed on all Core-based Xeons. On the 2600 v2 (IvyBridge) machine I have access to, the CSR buses have numbers 31 (BSP socket) and 63 (second socket), and there is no functions pci0.31.0.0 or pci0.63.0.0. According to the CPU datasheet, all devices on the uncore bus occupy slots >= 8. Practically, the attach to config buses is required for the intel-pcm pcm-memory.x tool to work, for instance. Reviewed by: jhb (previous version) Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D12268	2017-09-08 19:51:03 +00:00
Konstantin Belousov	fd15fee1ed	Use IOAPIC PCI rid as the interrupt TLP source id for DMAR interrupt remapping. VT-d specification requires use of PCI rid as source id for IOAPICs enumerated by PCI bus. The values from the DMAR ACPI table should be only used when IOAPIC is not on PCI. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Hardware provided by: Intel MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12205	2017-09-08 19:45:37 +00:00
Konstantin Belousov	3fd0053a50	Add an ioapic_get_rid() function to obtain PCIe TLP requester-id for the interrupt messages from given IOAPIC, if the IOAPIC can be enumerated on PCI bus. If IOAPIC has PCI binding, match the PCI device against MADT enumerated IOAPIC. Match is done first by registers window physical address, then by IOAPIC ID as read from the APIC ID register. PCI bsf address of the matched PCI device is the rid. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Hardware provided by: Intel MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12205	2017-09-08 19:39:20 +00:00
Konstantin Belousov	1a92c8402d	Add a constant specifying the min size of the IOAPIC registers window. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-08 19:25:11 +00:00
Konstantin Belousov	6ff9ce94ce	Consistently use tabs for indent. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-08 10:39:28 +00:00
Conrad Meyer	01a20b9875	mca: Fix printf types from r323289 on i386 Reported by: Michael Butler <imb AT protected-networks.net> Sponsored by: Dell EMC Isilon	2017-09-08 01:06:35 +00:00
Conrad Meyer	092c0e867a	x86 MCA: Helpfully, print why ECC thresholding is not enabled on AMD Sponsored by: Dell EMC Isilon	2017-09-07 21:33:27 +00:00
Conrad Meyer	d848ecfb7e	x86 MCA: Enable AMD thresholding support on 17h 17h supports MCA thresholding in the same way as 16h and earlier. Supposedly a ScalableMca feature bit in CPUID 8000_0007:EBX must be set, but that was not true for earlier models, so be careful about relying on it. While here, document a missing bit in LS MCA MISC0. Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12237	2017-09-07 21:31:07 +00:00
Conrad Meyer	cd8c258198	Store AMD RAS Capabilities cpuid value and name flags Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12237	2017-09-07 21:29:51 +00:00
Conrad Meyer	2e81566368	cpufreq(4) hwpstate: Yield CPU awaiting frequency change It doesn't seem necessary to busy the CPU while waiting to transition into a different p-state. PR: 221621 (related, but does not completely address) Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12260	2017-09-07 20:20:12 +00:00
Konstantin Belousov	fd9bc183bb	Fix typos. Stop claiming that two children are created. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-06 11:47:59 +00:00
Roger Pau Monné	45ff071d6e	acpi/srat: zero the SRAT cpu array Fix from fallout introduced in r322348 that moved the cpus array to a dynamic allocation without zeroing the area. Reported by: mjg MFC with: r322348 Reviewed by: mjg Differential revision: https://reviews.freebsd.org/D12220	2017-09-04 10:08:42 +00:00
Konstantin Belousov	2624320fcc	Stop masking FSGSBASE and SMEP features under monitors. Not enabling FSGSBASE in %cr4 does not prevent reporting of the feature by the CPUID instruction (blame Int*l). As result, kernels which were run under monitors pretended that usermode cannot modify TLS base without the syscall, while libc noted right combination of capable CPU and the new kernel version, trying to use the WRFSBASE instruction. Really old hypervisors that cannot handle enablement of these features in %cr4 would require the manual configuration, by setting the loader tunable hw.cpu_stdext_disable=0x81 Reported by: lwhsu, mjoras Sponsored by: The FreeBSD Foundation MFC after: 18 days	2017-08-24 10:57:34 +00:00
Alexander Motin	ffc7e53a65	Fix off-by-one error when parsing SRAT table. Reviewed by: jhb MFC after: 1 week	2017-08-22 19:56:30 +00:00
Conrad Meyer	bb14d5643b	subr_smp: Clean up topology analysis, add additional layers Rather than repeatedly nesting loops, separate concerns with a single loop per call stack level. Use a table to drive the recursive routine. Handle missing topology layers more gracefully (infer a single unit). Analyze some additional optional layers which may be present on e.g. AMD Zen systems (groups, aka dies, per package; and cachegroups, aka CCXes, per group). Display that additional information in the boot-time topology information, when it is relevant (non-one). Reviewed by: markj@, mjoras@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12019	2017-08-22 00:10:15 +00:00
Conrad Meyer	c768afe370	hwpstate: Add support for family 17h pstate info from MSRs This information is normally available via acpi_perf, but in case it is not, add support for fetching the information via MSRs on AMD family 17h (Zen) processors. Zen uses a slightly different formula than previous generation AMD CPUs. This was inspired by, but does not fix, PR 221621. Reported by: Sean P. R. <seanpr AT swbell.net> Reviewed by: mjoras@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12082	2017-08-20 00:41:49 +00:00
Conrad Meyer	0b53ecd1d7	Discover CPU topology on multi-die AMD Zen systems The Nodes per Processor topology information determines how many bits of the APIC ID represent the Node (Zeppelin die, on Zen systems) ID. Documented in Ryzen and Epyc Processor Programming Reference (PPR). Correct topology information enables the scheduler to make better decisions on this hardware. Reviewed by: kib@ Tested by: jeff@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11801	2017-08-17 16:54:37 +00:00
Conrad Meyer	35d87c7e96	Fix unused variable warning in !SMP case Fallout from r322588. I'm not sure why !SMP is a knob we have, but, we have it. Reported by: Michael Butler <imb AT protected-networks.net> Sponsored by: Dell EMC Isilon	2017-08-17 04:37:27 +00:00
Conrad Meyer	dc6a82801d	x86: Add dynamic interrupt rebalancing Add an option to dynamically rebalance interrupts across cores (hw.intrbalance); off by default. The goal is to minimize preemption. By placing interrupt sources on distinct CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall preemption is reduced and latency is reduced. In our workflow it reduced "fighting" between two high-frequency interrupt sources. Reduced latency was proven by, e.g., SPEC2008. Submitted by: jeff@ (earlier version) Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10435	2017-08-16 18:48:53 +00:00
Roger Pau Monné	72446721e4	srat: use pmap_unmapbios To match the pmap_mapbios. Reported by: jhb MFC with: r322403	2017-08-13 14:50:38 +00:00
Ian Lepore	c82d887d47	Stop calling atrtc_set() from the xen timer clock_settime() method. That removes the only reference to atrtc_set() from outside of atrtc.c, so make it static. The xen timer driver registers as a realtime clock with 1us resolution. In the past that resulted in only the xen timer's clock_settime() getting called, so it would call atrtc_set() to set the hardware clock as well. As of r32090, the clock_settime() method of all registered realtime clocks gets called, so the xen driver no longer needs to chain-call the lower-resolution driver. Thanks to royger@ for talking me through the xen stuff, and for testing.	2017-08-11 19:02:11 +00:00
Roger Pau Monné	c642d2f5b5	acpi/srat: fix build without DMAP Use pmap_mapbios to map memory used to store the cpus array. Reported by: lwhsu X-MFC-with: r322348	2017-08-11 14:19:55 +00:00
Roger Pau Monné	3f0a9fe06c	mptable: fix i386 build failure Reported by: emaste X-MFC-with: r322347	2017-08-10 17:46:57 +00:00
Roger Pau Monné	a74bb29ada	x86: bump MAX_APIC_ID to 512 Introduce a new define to take into account the xAPIC ID limit, for systems where x2APIC is not available/reliable. Also change some of the usages of the APIC ID to use an unsigned int (which is the correct storage type to deal with x2APIC IDs as found in x2APIC MADT entries). This allows booting FreeBSD on a box with 256 CPUs and APIC IDs up to 295: FreeBSD/SMP: Multiprocessor System Detected: 256 CPUs FreeBSD/SMP: 1 package(s) x 64 core(s) x 4 hardware threads Package HW ID = 0 Core HW ID = 0 CPU0 (BSP): APIC ID: 0 CPU1 (AP/HT): APIC ID: 1 CPU2 (AP/HT): APIC ID: 2 CPU3 (AP/HT): APIC ID: 3 [...] Core HW ID = 73 CPU252 (AP): APIC ID: 292 CPU253 (AP/HT): APIC ID: 293 CPU254 (AP/HT): APIC ID: 294 CPU255 (AP/HT): APIC ID: 295 Submitted by: kib (previous version) Relnotes: yes MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11913	2017-08-10 09:16:40 +00:00
Roger Pau Monné	84525e55c1	x86: make the arrays that depend on MAX_APIC_ID dynamic So that MAX_APIC_ID can be bumped without wasting memory. Note that the usage of MAX_APIC_ID in the SRAT parsing forces the parser to allocate memory directly from the phys_avail physical memory array, which is not the best approach probably, but I haven't found any other way to allocate memory so early in boot. This memory is not returned to the system afterwards, but at least it's sized according to the maximum APIC ID found in the MADT table. Sponsored by: Citrix Systems R&D MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11912	2017-08-10 09:16:03 +00:00
Roger Pau Monné	fd1f83fb45	apic_enumerator: only set mp_ncpus and mp_maxid at probe cpus phase Populate the lapics arrays and call cpu_add/lapic_create in the setup phase instead. Also store the max APIC ID found in the newly introduced max_apic_id global variable. This is a requirement in order to make the static arrays currently using MAX_LAPIC_ID dynamic. Sponsored by: Citrix Systems R&D MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11911	2017-08-10 09:15:18 +00:00
Jung-uk Kim	b5669d0aa8	Split identify_cpu() into two functions for amd64 as we do for i386. This reduces diff between amd64 and i386. Also, it fixes a regression introduced in r322076, i.e., identify_hypervisor() failed to identify some hypervisors. This function assumes cpu_feature2 is already initialized. Reported by: dexuan Tested by: dexuan	2017-08-09 18:09:09 +00:00
Jung-uk Kim	0105034487	Detect hypervisors early. We used to set lower hz on hypervisors by default but it was broken since r273800 (and r278522, its MFC to stable/10) because identify_cpu() is called too late, i.e., after init_param1(). MFC after: 3 days	2017-08-05 06:56:46 +00:00
Mark Johnston	17b5949a31	Don't trace running threads that have interrupts disabled. In this case we shouldn't assume that the thread has a valid frame pointer. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11787	2017-07-31 17:57:54 +00:00
Ryan Libby	b1a987bb34	__pcpu: gcc -Wredundant-decls Pollution from counter.h made __pcpu visible in amd64/pmap.c. Delete the existing extern decl of __pcpu in amd64/pmap.c and avoid referring to that symbol, instead accessing the pcpu region via PCPU_SET macros. Also delete an unused extern decl of __pcpu from mp_x86.c. Reviewed by: kib Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11666	2017-07-21 17:11:36 +00:00
Ian Lepore	b524a31593	Protect access to the AT realtime clock with its own mutex. The mutex protecting access to the registered realtime clock should not be overloaded to protect access to the atrtc hardware, which might not even be the registered rtc. More importantly, the resettodr mutex needs to be eliminated to remove locking/sleeping restrictions on clock drivers, and that can't happen if MD code for amd64 depends on it. This change moves the protection into what's really being protected: access to the atrtc date and time registers. This change also adds protection when the clock is accessed from xentimer_settime(), which bypasses the resettodr locking. Differential Revision: https://reviews.freebsd.org/D11483	2017-07-12 02:42:57 +00:00
Jason A. Harmening	eb36b1d0bc	Clean up MD pollution of bus_dma.h: --Remove special-case handling of sparc64 bus_dmamap* functions. Replace with a more generic mechanism that allows MD busdma implementations to generate inline mapping functions by defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This is currently useful for sparc64, x86, and arm64, which all implement non-load dmamap operations as simple wrappers around map objects which may be bus- or device-specific. --Remove NULL-checked bus_dmamap macros. Implement the equivalent NULL checks in the inlined x86 implementation. For non-x86 platforms, these checks are a minor pessimization as those platforms do not currently allow NULL maps. NULL maps were originally allowed on arm64, which appears to have been the motivation behind adding arm[64]-specific barriers to bus_dma.h, but that support was removed in r299463. --Simplify the internal interface used by the bus_dmamap_load* variants and move it to bus_dma_internal.h --Fix some drivers that directly include sys/bus_dma.h despite the recommendations of bus_dma(9) Reviewed by: kib (previous revision), marius Differential Revision: https://reviews.freebsd.org/D10729	2017-07-01 05:35:29 +00:00
Konstantin Belousov	cf619a92d2	Fix batched unload for DMAR busdma in qi mode. Do not queue dmar_map_entries with zeroed gseq to dmar_qi_invalidate_locked(). Zero gseq stops the processing in the qi task. Do not assign possibly uninitialized on-stack gseq to map entries when requeuing them on unit tlb_flush queue. Random garbage in gseq is interpreted as too high invalidation sequence number and again stop the processing in the task. Make the sequence numbers generation completely contained in dmar_qi_invalidate_locked() and dmar_qi_emit_wait_seq(). Upper code directly passes boolean requesting emitting wait command instead of trying to provide hint to avoid it by passing NULL gseq pointer. Microoptimize the requeueing to tlb_flush queue by doing it for the whole queue. Diagnosed and tested by: Brett Gutstein <bgutstein@rice.edu> Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-19 21:48:52 +00:00
John Baldwin	fecabb72e1	Don't try to assign interrupts to a CPU on single-CPU systems. All interrupts are routed to the sole CPU in that case implicitly. This is a regression in EARLY_AP_STARTUP. Previously the 'assign_cpu' variable was only set when a multi-CPU system finished booting, so its value both meant that interrupts could be assigned and that there was more than one CPU. PR: 219882 Reported by: ota@j.email.ne.jp MFC after: 3 days	2017-06-14 13:34:09 +00:00
Konstantin Belousov	fc8929cb29	More accurately handle early EFER restoration on resume. Do not try to set LMA bit while CPU is still in legacy mode. Apparently Intel CPUs ignore non-id writes to LMA, while AMD's (over-)react with #GP. Reported and tested by: danfe Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-06-11 14:39:08 +00:00

1120 Commits