freebsd-skq

Author	SHA1	Message	Date
kib	aaf44aa5e1	Make the sysctl machdep.idle also a tunable. It is applied before it is possible for idle threads to execute on any CPU, allowing it to work around some bugs. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-24 20:49:16 +00:00
kib	57b709f4b4	Extend ap_boot_mtx scope to also cover mca_init(). Otherwise, under bootverbose, the lapic_enable_cmc() banner 'lapicX: CMCI unmasked' is printed by several CPUs in parallel, causing garbled output for the LAPIC dumps. Reported by: royger Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:33:08 +00:00
kib	d234c145ba	Ensure that cmci_monitor() is not executed in parallel, since shared machine check banks must be only monitored by single CPU. Noted and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:29:40 +00:00
kib	a9ebca3e14	Use IS_BSP() macro. Noted and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D15157	2018-04-24 20:22:30 +00:00
kib	1bd517bdbd	Use relaxed atomics to access the monitor line. We must ensure that the accesses occur even though they do not have any other compiler-visible effects. Bruce found some situations where optimization could remove an access, and provided a patch to use the volatile qualifier for the state variables. Since volatile behaviour there is a compiler-specific interpretation of the keyword, use relaxed atomics instead, which give exactly the desired semantics. Noted by and discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-04-24 14:02:46 +00:00
avg	b84a44c4ca	add a new ACPI suspend debugging knob, debug.acpi.suspend_deep_bounce This sysctl allows a deeper dive into the sleep abyss compared to debug.acpi.suspend_bounce. When the new sysctl is set the system will execute the suspend sequence up to the call to AcpiEnterSleepState(). That includes saving processor contexts and parking APs. Then, instead of actually entering the sleep state, the BSP will call resumectx() to emulate the wakeup. The APs should get restarted by the sequence of Init and Startup IPIs that BSP sends to them. MFC after: 8 days	2018-04-24 09:42:58 +00:00
jhb	1ae4d2ff0e	Fix two off-by-one errors when allocating MSI and MSI-X interrupts. x86 enforces an (arbitrary) limit on the number of available MSI and MSI-X interrupts to simplify code (in particular, interrupt_source[] is statically sized). This means that an attempt to allocate an MSI vector needs to fail if it would go beyond the limit, but the checks for exceeding the limit had an off-by-one error. In the case of MSI-X which allocates interrupts one at a time this meant that IRQ 768 kept getting handed out multiple times for msix_alloc() instead of failing because all MSI IRQs were in use. Tested by: lidl MFC after: 1 week	2018-04-18 18:45:34 +00:00
cem	ef5bec98f2	cpufreq: Remove error-prone table terminators in favor of automatic sizing PR: 227388 Reported by: Vladimir Machulsky <xdelta AT meta.ua> Sponsored by: Dell EMC Isilon	2018-04-14 03:15:05 +00:00
kib	e3089a0318	i386 4/4G split. The change makes the user and kernel address spaces on i386 independent, giving each almost the full 4G of usable virtual addresses except for one PDE at top used for trampoline and per-CPU trampoline stacks, and system structures that must be always mapped, namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if allocated. By using 1:1 mapping for the kernel text and data, it appeared possible to eliminate assembler part of the locore.S which bootstraps initial page table and KPTmap. The code is rewritten in C and moved into the pmap_cold(). The comment in vmparam.h explains the KVA layout. There is no PCID mechanism available in protected mode, so each kernel/user switch back and forth completely flushes the TLB, except for the trampoline PTD region. The TLB invalidations for userspace become trivial, because IPI handlers switch page tables. On the other hand, context switches no longer need to reload %cr3. copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for new copyout(9) is compatibility with wiring user buffers around sysctl handlers. This explains two kinds of locks for copyout ptes and accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow path, is only tried after the 'fast path' failed, which temporarily changes mapping to the userspace and copies the data to/from small per-cpu buffer in the trampoline. If a page fault occurs during the copy, it is short-circuited by exception.s to not even reach C code. The change was motivated by the need to implement the Meltdown mitigation, but instead of KPTI the full split is done. The i386 architecture already shows the sizing problems, in particular, it is impossible to link clang and lld with debugging. I expect that the issues due to the virtual address space limits would only get worse and the split gives more liveness to the platform. Tested by: pho Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14633	2018-04-13 20:30:49 +00:00
brooks	9d79658aab	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
royger	e51e4bdd98	x86: fix trampoline memory allocation after r332073 Add the missing breaks in the for loops, in order to exit the loop when a suitable entry is found. Also switch amd64 native_start_all_aps to use PHYS_TO_DMAP in order to find the virtual address of the boot_trampoline and the initial page tables. Reported and tested by: pho Sponsored by: Citrix Systems R&D	2018-04-06 16:22:14 +00:00
royger	e1f89be1d3	remove GiB/MiB macros from param.h And instead define them in the files where they are used. Requested by: bde	2018-04-06 11:20:06 +00:00
royger	5f1547e410	x86: improve reservation of AP trampoline memory So that it doesn't rely on physmap[1] containing an address below 1MiB. Instead scan the full physmap and search for a suitable address to place the trampoline code (below 1MiB) and the initial memory pages (below 4GiB). Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14878	2018-04-05 14:39:51 +00:00
avg	3331ff57c2	fix i386 build with CPU_ELAN (LINT for instance) after r331878 x86/cpu_machdep.c now needs to include elan_mmcr.h when CPU_ELAN is set. While here, also remove the now unneeded inclusion of isareg.h in i386 and amd64 vm_machdep.c. Reported by: lwhsu MFC after: 14 days X-MFC with: r331878	2018-04-03 17:16:06 +00:00
avg	8ff7c82ffb	fix signatures of cpu_reset_real and cpu_reset_proxy, broken in r331878 When I moved these functions from i386 and amd64 to x86 I dropped their prototype declarations (that were correct) and left only their definitions that became incorrect. Reported by: bde MFC after: 15 days X-MFC with: r331878	2018-04-03 06:46:26 +00:00
avg	cbde65132d	unify amd64 and i386 cpu_reset() in x86/cpu_machdep.c Because I didn't see any reason not to. I've been making some changes to the code and couldn't help but notice that the i386 and amd64 code was nearly identical. MFC after: 17 days	2018-04-02 13:45:23 +00:00
jeff	bfe01083f9	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
jhb	24fa2df20b	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
jeff	124dca372e	Backout r331606 until I can identify why it does not boot on some machines.	2018-03-27 10:20:50 +00:00
jeff	d1125a4e0d	Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-27 03:37:04 +00:00
jhb	6746bdeb71	Add a workaround to the hypervisor detection for older versions of KVM. Originally KVM set %eax to 0 in the cpuid leaf 0x4000000 rather than to the highest supported leaf in the hypervisor "branch". Detect this case and fixup the %eax value so that the hypervisor is still detected. Reported by: jpaetzel Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14810	2018-03-23 22:36:24 +00:00
kib	0a1d8bb0a4	Move the CR0.WP manipulation KPI to x86. This should allow avoiding some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
jhb	50dd7456bb	Fix a typo. Reviewed by: kib	2018-03-19 17:14:56 +00:00
emaste	f545864525	ANSIfy sys/x86	2018-03-17 01:40:09 +00:00
royger	edf2293a55	at_rtc: check in ACPI FADT boot flags if the RTC is present Or else disable the device. Note that the detection can be bypassed by setting the hw.atrtc.enable option in the loader configuration file. More information can be found on atrtc(4). Sponsored by: Citrix Systems R&D Reviewed by: ian Differential revision: https://reviews.freebsd.org/D14399	2018-03-13 09:42:33 +00:00
ian	49fbeddce5	Give the atrtc_time_lock a unique name. Reported by: hps@	2018-03-12 15:26:11 +00:00
avg	d03e4e760f	fix r297857, do not modify CPU extension bits under virtual machines r297857 was meant for real hardware only. PR: 213155 Submitted by: mainland@apeiron.net MFC after: 1 week	2018-03-12 11:28:09 +00:00
ian	de8e5f9bb1	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
ian	b00d088ba5	Remove MTX_NOPROFILE from atrtc_lock, it was inappropriately copy/pasted from the i8254 driver when I created separate mutexes for each. The i8254 driver could be the active timecounter, leading to recursion during mutex profiling, but the atrtc driver cannot be a timecounter, so it isn't needed.	2018-03-11 19:56:07 +00:00
ian	00abf0e72d	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
ian	583404330f	Everywhere that multiple registers are accessed in sequence, lock/unlock just once around the whole group of accesses.	2018-03-11 18:54:45 +00:00
ian	95221efb08	Use separate mutexes for atrtc and i8254 locking. Change all the strange un-function-like RTC_LOCK/UNLOCK macro usage into normal function calls. Since there is no longer any need to handle register access from a debugger context, those function calls can just be regular mutex lock/unlock calls. Requested by: bde	2018-03-11 18:20:49 +00:00
ian	5e5730983b	Convert atrtc to the new style rtc debugging output. Remove the db show command handler which provided much the same information. Removing the possibility of accessing the hardware regs from the debugger context paves the way for simplifying the locking code in the driver.	2018-03-11 16:57:14 +00:00
emaste	5122d84e64	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
kib	de90d5191c	Do not return out of bound pointers from intr_lookup_source(). This hardens the code against driver and upper level bugs causing invalid indexes used, e.g. on msi release. Reported by: gallatin Reviewed by: gallatin, hselasky Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14470	2018-02-23 11:20:59 +00:00
imp	f735e1eb15	Do not include float interfaces when using libsa. We don't support float in the boot loaders, so don't include interfaces for float or double in systems headers. In addition, take the unusual step of spiking double and float to prevent any more accidental seepage.	2018-02-23 04:04:25 +00:00
markj	6a8b74d6f3	Don't include DMAR map entry zone items in kernel dumps. Such items may be allocated in the I/O path used by the dumper, potentially causing the dump to fail. Since there is some precedent in the DMAR driver for avoiding this problem using _NODUMP, apply this workaround to the zone as well. Reported and tested by: mmacy Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14422	2018-02-18 16:03:50 +00:00
kib	6b4aea7c3f	Remove unused symbols. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 23:18:42 +00:00
royger	c1141359ad	xen/pv: remove the attach of the ISA bus from the Xen PV bus There's no need to attach the ISA bus from the Xen PV one. Sponsored by: Citrix Systems R&D	2018-02-16 18:04:27 +00:00
mjg	2c4feecfbb	xen: fix smp boot after r328157 mce_stack was left unset leading to early crashes	2018-02-15 07:23:41 +00:00
kib	59d970a7ef	Fix build with gas. Do not use C constant suffixes. Bit values are small enough to not require typing, even though they are used for 64bit MSR writes. The added cast in hw_ibrs_recalculate() is redundant but I prefer to add it for clarity. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-13 15:30:31 +00:00
imp	8256f7ec08	Move __va_list and related defines to sys/sys/_types.h __va_list and related defines are identical in all the ARCH/include/_types.h files. Move them to sys/sys/_types.h Sponsored by: Netflix	2018-02-12 14:48:20 +00:00
kib	637f4cf41d	Expand IBRS TLA in sysctl help lines. Requested by: bz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:54:05 +00:00
kib	01b52fdebb	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
kib	9c0b8085dc	Do not enable PTI when IA32_ARCH_CAP_RDCL_NO bit is set. Intel document 336996-001 claims that this will be the way to inform about Meltdown correction. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 14:25:42 +00:00
imp	1912ffb2e5	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PNP tables not tagged, there are bigger issues with them.	2018-01-29 00:22:30 +00:00
mav	2c987d9c6a	Assume Always Running APIC Timer for AMD CPU families >= 0x12. Fallback to HPET may cause lock congestion on many-core systems. This change replicates Linux behavior. MFC after: 1 month	2018-01-28 18:18:03 +00:00
kib	545e25ea75	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporarily. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(9) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporary mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
kib	6f0656b43b	Fix native_lapic_ipi_alloc(). When PTI is enabled, empty IDT slots point to rsvd_pti. Reported by: Dexuan-BSD Cui <dexuan.bsd@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 5 days	2018-01-27 11:33:21 +00:00
pfg	f0c6025eb6	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they pretend to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
pfg	ced875130d	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
emaste	1cf1c6c06d	Enable KPTI by default on amd64 for non-AMD CPUs Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required. Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
kib	c35d24e497	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text), PCPU, GDT/IDT/user LDT/task structures, and IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables, then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occurred on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefit besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over all the existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode makes global tables (PG_G) meaningless and even harmful, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline has not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
ian	f0c14bee67	Remove redundant critical_enter/exit() calls. The block of code delimited by these calls is now protected by a spin mutex (obscured within the RTC_LOCK/RTC_UNLOCK macros). Reported by: bde@	2018-01-16 23:18:52 +00:00
ian	2fcaa5e746	Move some code around and rename a couple variables; no functional changes. The static atrtc_set() function was called only from clock_settime(), so just move its contents entirely into clock_settime() and delete atrtc_set(). Rename the struct bcd_clocktime variables from 'ct' to 'bct'. I had originally wanted to emphasize how identical the clocktime and bcd_clocktime structs were, but things evolved to the point where the structs are not at all identical anymore, so now emphasizing the difference seems better.	2018-01-16 23:14:12 +00:00
ian	6ac58f6094	Add static inline rtcin_locked() and rtcout_locked() functions for doing a related series of operations without doing a lock/unlock for each byte. Use them when reading and writing the entire set of time registers. The original rtcin() and writertc() functions which do lock/unlock on each byte still exist, because they are public and called by outside code.	2018-01-16 03:02:41 +00:00
pfg	a7c6776f59	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these are likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the surrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
ian	3f5e0fe8f4	Convert the x86 RTC driver to use new validated BCD<->timespec conversions. New common routines were added to kern/subr_clock.c for converting between calendrical time expressed in BCD and struct timespec. The new functions return EINVAL on error, as expected when the clock hardware does not provide valid time. PR: 224813 Differential Revision: https://reviews.freebsd.org/D13731 (no reviewers)	2018-01-15 16:40:43 +00:00
kib	dcd37bb111	Enumerate and print Intel CPU features for Speculative Execution Side Channel Mitigations. The definitions are taken from the document 336996-001. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-14 12:36:23 +00:00
jeff	cc3d6a3370	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
cem	d1b1083a47	amd64: Add a 48-bit MAXADDR constant Some devices (e.g., ccp(4) -- to be committed) can only access the low 48 bits of physical memory. Reviewed by: markj Sponsored by: Dell EMC Isilon	2018-01-13 17:55:22 +00:00
jeff	bc9177f3a2	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545	2018-01-12 23:34:16 +00:00
jeff	94c7af8ca2	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Clean up some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
kib	dc8d51112c	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after microcode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
kib	b6ddae99a2	Use the new SDM-approved way to serialize x2APIC MSR writes. SDM editions 64 and below stated that it is enough to use MFENCE or LFENCE to serialize x2APIC register writes. New edition 65 requires either full serialization instruction or MFENCE;LFENCE sequence. Use the latter, FreeBSD needs serialization to ensure that writes done before IPI request are visible to the target IPI CPU. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-03 11:23:47 +00:00
kib	241446fb2b	Add CR4.SMAP control bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-01 19:34:19 +00:00
cperciva	55fe5887ff	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
marius	f823b0ab84	With the advent of interrupt remapping, Intel has repurposed bit 11 (now: Interrupt_Index[15]) and assigned the previously reserved bits 55:48 (Interrupt_Index[14:0] goes into 63:49 while Destination Field used 63:56 and bit 48 now is Interrupt_Format) in the IO redirection tables (see the VT-d specification, "5.1.5.1 I/OxAPIC Programming"). Thus, when not using interrupt remapping, ensure that all previously reserved bits in the high part of the RTEs are zero instead of doing a read-modify-write for their Destination Field bits only. Otherwise, on machines based on Apollo Lake and its derivatives such as Denverton, typically some of the previously preserved bits remain set after boot when not employing interrupt remapping. The result is that INTx interrupts are not getting delivered. Note: With an AMD IOMMU, interrupt remapping apparently bypasses the IO APIC altogether. Submitted by: loos (modulo comment) Reviewed by: jhb (modulo comment)	2017-12-28 21:46:09 +00:00
phk	1642f8ba74	Introduce an architecture-agnostic <sys/_stdarg.h> to reduce platform divergence. Only architectures which pass arguments in registers (mips) and platforms which use really weird compilers (any?) would need to augment the contents of <sys/_stdarg.h> Convert x86, arm and arm64 architectures to use <sys/_stdarg.h>	2017-12-25 20:54:00 +00:00
imp	e65dafd72c	Further investigation shows this shouldn't have been added at all. Remove it.	2017-12-24 17:59:48 +00:00
imp	b002fd76bf	Comment this out until I have time to get to the bottom of why it's failing for some people.	2017-12-24 16:36:50 +00:00
imp	9afedc5ef7	Warn when nonPNP ISA devices are attached in GENERIC that they are being removed from GENERIC in 12. Always print PNP info for ISA when it exists: it doesn't depend on ISAPNP. Add PNP ID to orm and vga to prevent us from warning about them since those devices aren't being removed from GENERIC. PNP devices will be removed from GENERIC too, but they will be automatically loaded, so need no warning. We don't warn for non-GENERIC kernels because people running them are presumed to know what they are doing. MFC After: 2 weeks	2017-12-23 22:57:14 +00:00
kib	6e13a02f21	Add missed AVX512VL (128 and 256 bit vector length) extension identification bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-23 21:32:50 +00:00
bde	cf8a25e82e	Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension. restart_cpus() worked well enough by accident. Before this set of fixes, resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset (stopped_cpus) to become empty, but since mixtures of stopped and suspended CPUs are not close to working, stopped_cpus must be empty when resuming so the wait is null -- restart_cpus just allows the other CPUs to restart and returns without waiting. Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and add further kludges to try to keep it working for the XEN case. It was only used for XEN. It waited on suspended_cpus. This works for XEN. However, for ACPI, resuming is a 2-step process. ACPI has already woken up the other CPUs and removed them from suspended_cpus. This fix records the move by putting them in a new cpuset resuming_cpus. Waiting on suspended_cpus would give the same null wait as waiting on stopped_cpus. Wait on resuming_cpus instead. Add a cpuset toresume_cpus to map the CPUs being told to resume to keep this separate from the cpuset started_cpus for mapping the CPUs being told to restart. Mixtures of stopped and suspended/resuming CPUs are still far from working. Describe new and some old cpusets in comments. Add further kludges to cpususpend_handler() to try to avoid breaking it for XEN. XEN doesn't use resumectx(), so it doesn't use the second return path for savectx(), and it goes from the suspended state directly to the restarted state, while ACPI resume goes through the resuming state. Enter the resuming state early for all cases so that resume_cpus can test for being in this state and not have to worry about the intermediate !suspended state for ACPI only. Reviewed by: kib	2017-12-21 09:17:48 +00:00
bde	994bacdf8f	Remove the permanent double mapping of low physical memory and replace it by a transient double mapping for the one instruction in ACPI wakeup where it is needed (and for many surrounding instructions in ACPI resume). Invalidate the TLB as soon as convenient after undoing the transient mapping. ACPI resume already has the strict ordering needed for this. This fixes the non-trapping of null pointers and other garbage pointers below NBPDR (except transiently). NBPDR is quite large (4MB, or 2MB for PAE). This fixes spurious traps at the first instruction in VM86 bioscalls. The traps are for transiently missing read permission in the first VM86 page (physical page 0) which was just written to at KERNBASE in the kernel. The mechanism is unknown (it is not simply PG_G). locore uses a similar but larger transient double mapping and needs it for 2 instructions instead of 1. Unmap the first PDE in it after the 2 instructions to detect most garbage pointers while bootstrapping. pmap_bootstrap() finishes the unmapping. Remove the avoidance of the double mapping for a recently fixed special case. ACPI resume could use this avoidance (made non-special) to avoid any problems with the transient double mapping, but no such problems are known. Update comments in locore. Many were for old versions of FreeBSD which tried to map low memory r/o except for special cases, or might have allowed access to low memory via physical offsets. Now all kernel maps are r/w, and removal of the double map disallows use of physical offsets again.	2017-12-18 13:53:22 +00:00
pfg	b0f7aa75d4	SPDX: use the Beerware identifier.	2017-11-30 20:33:45 +00:00
jkim	f9c37771cd	Properly skip the first CPU. It only accidentally worked because the CPU_FOREACH() loop always starts from BSP (cpu0) and the if condition is always false for APs. Reported by: cem	2017-11-30 20:21:42 +00:00
pfg	f1206865bb	SPDX: Fix some cases wrongly attributed to MIT. In the cases of BSD-style license variants without clauses, use 0BSD for the time being in lack of a better description.	2017-11-30 15:10:11 +00:00
jkim	c1509f7c95	Add a tunable "debug.hwpstate_verify" to check P-state after changing it and turn it off by default. It is very inefficient to verify current P-state of each core, especially for CPUs with many cores. When multiple commands are requested to the same power domain before completion of pending transitions, the last command is executed according to the manual. Because requests are serialized by the caller, all cores will receive the same command for each call. Do not call sched_bind() and sched_unbind(). It is redundant because the caller does it anyway.	2017-11-30 01:40:07 +00:00
jkim	ce7b988218	Fix style(9).	2017-11-29 23:52:31 +00:00
pfg	921a5b4874	sys/x86: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.	2017-11-27 15:11:47 +00:00
kib	873f304292	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
pfg	4736ccfd9c	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
pfg	9da7bdde06	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
br	849e978e17	Add Intel Processor Trace registers for: - CPUID - Table of Physical Addresses (ToPA). Sponsored by: DARPA, AFRL	2017-11-17 17:54:10 +00:00
kib	b53cf0d5b7	Remove i386 XBOX support. It is for console presented at 2001 and featuring Pentium III processor. Even if any of them are still alive and run FreeBSD, we do not have any sign of life from their users. While removing another dozens of #ifdefs from the i386 sources reduces the aversion from looking at the code and improves the platform vitality. Reviewed by: cem, pfg, rink (XBOX support author) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13016	2017-11-16 14:27:02 +00:00
br	ba3ffebf37	Add Intel Processor Trace (PT) MSRs. Sponsored by: DARPA, AFRL	2017-11-12 23:13:04 +00:00
kib	e6015dcc20	Correct operators precedence. Also keep the calculated vm_page_alloc_contig() flags in the variable to not re-evaluate it on the loop iteration. Noted by: alc Sponsored by: The FreeBSD Foundation	2017-11-09 13:09:07 +00:00
jeff	3c355d849c	Replace many instances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
mmel	be76b77ce7	Add AT_HWCAP2 ELF auxiliary vector. - allocate value for new AT_HWCAP2 auxiliary vector on all platforms. - expand 'struct sysentvec' by new 'u_long *sv_hwcap2', in exactly same way as for AT_HWCAP. MFC after: 1 month Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D12699	2017-10-21 12:05:01 +00:00
cem	b6dd74d746	x86: Decode AMD "Extended Feature Extensions ID EBX" bits In particular, this determines CPU support for the CLZERO instruction. (No, I am not making this name up.) Sponsored by: Dell EMC Isilon	2017-09-20 18:30:37 +00:00
cem	9805bb901e	MCA: Expand AMD Thresholding support to cover all banks When it was added in r314636, AMD Thresholding was hardcoded to only bank 4 (Northbridge) for some reason. However, even on family 10h the MCAx_MISC register Valid/Present bits determine whether thresholding is supported on that bank. Expand thresholding support to monitor all monitorable banks. This simplifies some of the logic and makes it more consistent with our Intel CMCI support. Reviewed by: markj (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12321	2017-09-17 22:58:13 +00:00
jhb	37aff5708d	Add AT_EHDRFLAGS and AT_HWCAP on amd64. x86 has two separate (but identical) list of AT_* constants and the earlier commit to add AT_HWCAP only updated the i386 list.	2017-09-14 15:34:29 +00:00
jhb	e5ea82a50d	Add AT_HWCAP and AT_EHDRFLAGS on all platforms. A new 'u_long sv_hwcap' field is added to 'struct sysentvec'. A process ABI can set this field to point to a value holding a mask of architecture-specific CPU feature flags. If an ABI does not wish to supply AT_HWCAP to processes the field can be left as NULL. The support code for AT_EHDRFLAGS was already present on all systems, just the #define was not present. This is a step towards unifying the AT_ constants across platforms. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12290	2017-09-14 14:26:55 +00:00
cem	94488dae4e	MCA: Rename AMD MISC bits/masks They apply to all AMD MCAi_MISC0 registers, not just MCA4 (NB). No functional change. Sponsored by: Dell EMC Isilon	2017-09-11 20:42:07 +00:00
cem	8fed2c5f64	x86 MCA: Extract CMCI support predicate into function On AMD, the MCG_CAP feature bit is reserved -- not explicitly zero. Do not use it to determine CMCI support. Reviewed by: avg, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12320	2017-09-11 20:41:25 +00:00
kib	fa64065f8a	Fix ioapic acpi id matching on PCI attach and rid calculation. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2017-09-11 18:29:09 +00:00
cem	42d7ded221	Decode new AMD SVM feature bits on family 17h Sponsored by: Dell EMC Isilon	2017-09-11 18:11:53 +00:00
kib	c0e21dbab2	Enhance qpi.c to make it usable on all Core-microarchitecture Xeons. Scan all buses for CSR bus, not stopping on the first failed match. Scan all slots for function 0 on the found bus, for instance on IvyBridge the slot 0 is not decoded at all. Since the scan is quite unsafe, and access to the buses is mostly useful for developers, enable the csr buses scan with the tunable. Current qpi.c makes too many assumptions about the uncore configuration buses location and about slots occupied. Also it restricts itself only to Nehalem CPUs. It is needed on all Core-based Xeons. On the 2600 v2 (IvyBridge) machine I have access to, the CSR buses have numbers 31 (BSP socket) and 63 (second socket), and there is no functions pci0.31.0.0 or pci0.63.0.0. According to the CPU datasheet, all devices on the uncore bus occupy slots >= 8. Practically, the attach to config buses is required for the intel-pcm pcm-memory.x tool to work, for instance. Reviewed by: jhb (previous version) Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D12268	2017-09-08 19:51:03 +00:00
kib	9308b4796c	Use IOAPIC PCI rid as the interrupt TLP source id for DMAR interrupt remapping. VT-d specification requires use of PCI rid as source id for IOAPICs enumerated by PCI bus. The values from the DMAR ACPI table should be only used when IOAPIC is not on PCI. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Hardware provided by: Intel MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12205	2017-09-08 19:45:37 +00:00
kib	3065523b3b	Add an ioapic_get_rid() function to obtain PCIe TLP requester-id for the interrupt messages from given IOAPIC, if the IOAPIC can be enumerated on PCI bus. If IOAPIC has PCI binding, match the PCI device against MADT enumerated IOAPIC. Match is done first by registers window physical address, then by IOAPIC ID as read from the APIC ID register. PCI bsf address of the matched PCI device is the rid. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Hardware provided by: Intel MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12205	2017-09-08 19:39:20 +00:00
kib	e30d105e00	Add a constant specifying the min size of the IOAPIC registers window. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-08 19:25:11 +00:00
kib	7bdcfaffa4	Consistently use tabs for indent. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-08 10:39:28 +00:00
cem	a11a4e755f	mca: Fix printf types from r323289 on i386 Reported by: Michael Butler <imb AT protected-networks.net> Sponsored by: Dell EMC Isilon	2017-09-08 01:06:35 +00:00
cem	3a36ac9472	x86 MCA: Helpfully, print why ECC thresholding is not enabled on AMD Sponsored by: Dell EMC Isilon	2017-09-07 21:33:27 +00:00
cem	a8e0ad37ff	x86 MCA: Enable AMD thresholding support on 17h 17h supports MCA thresholding in the same way as 16h and earlier. Supposedly a ScalableMca feature bit in CPUID 8000_0007:EBX must be set, but that was not true for earlier models, so be careful about relying on it. While here, document a missing bit in LS MCA MISC0. Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12237	2017-09-07 21:31:07 +00:00
cem	7b788c7348	Store AMD RAS Capabilities cpuid value and name flags Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12237	2017-09-07 21:29:51 +00:00
cem	38abf30cd7	cpufreq(4) hwpstate: Yield CPU awaiting frequency change It doesn't seem necessary to busy the CPU while waiting to transition into a different p-state. PR: 221621 (related, but does not completely address) Reviewed by: truckman Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12260	2017-09-07 20:20:12 +00:00
kib	b2f0b570ad	Fix typos. Stop claiming that two children are created. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-06 11:47:59 +00:00
royger	d97f829cdb	acpi/srat: zero the SRAT cpu array Fix from fallout introduced in r322348 that moved the cpus array to a dynamic allocation without zeroing the area. Reported by: mjg MFC with: r322348 Reviewed by: mjg Differential revision: https://reviews.freebsd.org/D12220	2017-09-04 10:08:42 +00:00
kib	9180786e2c	Stop masking FSGSBASE and SMEP features under monitors. Not enabling FSGSBASE in %cr4 does not prevent reporting of the feature by the CPUID instruction (blame Int*l). As result, kernels which were run under monitors pretended that usermode cannot modify TLS base without the syscall, while libc noted right combination of capable CPU and the new kernel version, trying to use the WRFSBASE instruction. Really old hypervisors that cannot handle enablement of these features in %cr4 would require the manual configuration, by setting the loader tunable hw.cpu_stdext_disable=0x81 Reported by: lwhsu, mjoras Sponsored by: The FreeBSD Foundation MFC after: 18 days	2017-08-24 10:57:34 +00:00
mav	c5b7650b9a	Fix off-by-one error when parsing SRAT table. Reviewed by: jhb MFC after: 1 week	2017-08-22 19:56:30 +00:00
cem	7f37053028	subr_smp: Clean up topology analysis, add additional layers Rather than repeatedly nesting loops, separate concerns with a single loop per call stack level. Use a table to drive the recursive routine. Handle missing topology layers more gracefully (infer a single unit). Analyze some additional optional layers which may be present on e.g. AMD Zen systems (groups, aka dies, per package; and cachegroups, aka CCXes, per group). Display that additional information in the boot-time topology information, when it is relevant (non-one). Reviewed by: markj@, mjoras@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12019	2017-08-22 00:10:15 +00:00
cem	379d4cc48c	hwpstate: Add support for family 17h pstate info from MSRs This information is normally available via acpi_perf, but in case it is not, add support for fetching the information via MSRs on AMD family 17h (Zen) processors. Zen uses a slightly different formula than previous generation AMD CPUs. This was inspired by, but does not fix, PR 221621. Reported by: Sean P. R. <seanpr AT swbell.net> Reviewed by: mjoras@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12082	2017-08-20 00:41:49 +00:00
cem	36c3959687	Discover CPU topology on multi-die AMD Zen systems The Nodes per Processor topology information determines how many bits of the APIC ID represent the Node (Zeppelin die, on Zen systems) ID. Documented in Ryzen and Epyc Processor Programming Reference (PPR). Correct topology information enables the scheduler to make better decisions on this hardware. Reviewed by: kib@ Tested by: jeff@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11801	2017-08-17 16:54:37 +00:00
cem	b7cee9bec2	Fix unused variable warning in !SMP case Fallout from r322588. I'm not sure why !SMP is a knob we have, but, we have it. Reported by: Michael Butler <imb AT protected-networks.net> Sponsored by: Dell EMC Isilon	2017-08-17 04:37:27 +00:00
cem	5a79b729aa	x86: Add dynamic interrupt rebalancing Add an option to dynamically rebalance interrupts across cores (hw.intrbalance); off by default. The goal is to minimize preemption. By placing interrupt sources on distinct CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall preemption is reduced and latency is reduced. In our workflow it reduced "fighting" between two high-frequency interrupt sources. Reduced latency was proven by, e.g., SPEC2008. Submitted by: jeff@ (earlier version) Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10435	2017-08-16 18:48:53 +00:00
royger	19aa079287	srat: use pmap_unmapbios To match the pmap_mapbios. Reported by: jhb MFC with: r322403	2017-08-13 14:50:38 +00:00
ian	7ba2756fac	Stop calling atrtc_set() from the xen timer clock_settime() method. That removes the only reference to atrtc_set() from outside of atrtc.c, so make it static. The xen timer driver registers as a realtime clock with 1us resolution. In the past that resulted in only the xen timer's clock_settime() getting called, so it would call atrtc_set() to set the hardware clock as well. As of r32090, the clock_settime() method of all registered realtime clocks gets called, so the xen driver no longer needs to chain-call the lower-resolution driver. Thanks to royger@ for talking me through the xen stuff, and for testing.	2017-08-11 19:02:11 +00:00
royger	17b1663c87	acpi/srat: fix build without DMAP Use pmap_mapbios to map memory used to store the cpus array. Reported by: lwhsu X-MFC-with: r322348	2017-08-11 14:19:55 +00:00
royger	89ad37fb84	mptable: fix i386 build failure Reported by: emaste X-MFC-with: r322347	2017-08-10 17:46:57 +00:00
royger	55936e520f	x86: bump MAX_APIC_ID to 512 Introduce a new define to take into account the xAPIC ID limit, for systems where x2APIC is not available/reliable. Also change some of the usages of the APIC ID to use an unsigned int (which is the correct storage type to deal with x2APIC IDs as found in x2APIC MADT entries). This allows booting FreeBSD on a box with 256 CPUs and APIC IDs up to 295: FreeBSD/SMP: Multiprocessor System Detected: 256 CPUs FreeBSD/SMP: 1 package(s) x 64 core(s) x 4 hardware threads Package HW ID = 0 Core HW ID = 0 CPU0 (BSP): APIC ID: 0 CPU1 (AP/HT): APIC ID: 1 CPU2 (AP/HT): APIC ID: 2 CPU3 (AP/HT): APIC ID: 3 [...] Core HW ID = 73 CPU252 (AP): APIC ID: 292 CPU253 (AP/HT): APIC ID: 293 CPU254 (AP/HT): APIC ID: 294 CPU255 (AP/HT): APIC ID: 295 Submitted by: kib (previous version) Relnotes: yes MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11913	2017-08-10 09:16:40 +00:00
royger	e45f85c53f	x86: make the arrays that depend on MAX_APIC_ID dynamic So that MAX_APIC_ID can be bumped without wasting memory. Note that the usage of MAX_APIC_ID in the SRAT parsing forces the parser to allocate memory directly from the phys_avail physical memory array, which is not the best approach probably, but I haven't found any other way to allocate memory so early in boot. This memory is not returned to the system afterwards, but at least it's sized according to the maximum APIC ID found in the MADT table. Sponsored by: Citrix Systems R&D MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11912	2017-08-10 09:16:03 +00:00
royger	345fb32684	apic_enumerator: only set mp_ncpus and mp_maxid at probe cpus phase Populate the lapics arrays and call cpu_add/lapic_create in the setup phase instead. Also store the max APIC ID found in the newly introduced max_apic_id global variable. This is a requirement in order to make the static arrays currently using MAX_LAPIC_ID dynamic. Sponsored by: Citrix Systems R&D MFC after: 1 month Reviewed by: kib Differential revision: https://reviews.freebsd.org/D11911	2017-08-10 09:15:18 +00:00
jkim	d871b34dbc	Split identify_cpu() into two functions for amd64 as we do for i386. This reduces diff between amd64 and i386. Also, it fixes a regression introduced in r322076, i.e., identify_hypervisor() failed to identify some hypervisors. This function assumes cpu_feature2 is already initialized. Reported by: dexuan Tested by: dexuan	2017-08-09 18:09:09 +00:00
jkim	0ac013123c	Detect hypervisors early. We used to set lower hz on hypervisors by default but it was broken since r273800 (and r278522, its MFC to stable/10) because identify_cpu() is called too late, i.e., after init_param1(). MFC after: 3 days	2017-08-05 06:56:46 +00:00
markj	da9101cade	Don't trace running threads that have interrupts disabled. In this case we shouldn't assume that the thread has a valid frame pointer. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D11787	2017-07-31 17:57:54 +00:00
rlibby	dfe1112fa8	__pcpu: gcc -Wredundant-decls Pollution from counter.h made __pcpu visible in amd64/pmap.c. Delete the existing extern decl of __pcpu in amd64/pmap.c and avoid referring to that symbol, instead accessing the pcpu region via PCPU_SET macros. Also delete an unused extern decl of __pcpu from mp_x86.c. Reviewed by: kib Approved by: markj (mentor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11666	2017-07-21 17:11:36 +00:00
ian	242599409b	Protect access to the AT realtime clock with its own mutex. The mutex protecting access to the registered realtime clock should not be overloaded to protect access to the atrtc hardware, which might not even be the registered rtc. More importantly, the resettodr mutex needs to be eliminated to remove locking/sleeping restrictions on clock drivers, and that can't happen if MD code for amd64 depends on it. This change moves the protection into what's really being protected: access to the atrtc date and time registers. This change also adds protection when the clock is accessed from xentimer_settime(), which bypasses the resettodr locking. Differential Revision: https://reviews.freebsd.org/D11483	2017-07-12 02:42:57 +00:00
jah	d1caaa9300	Clean up MD pollution of bus_dma.h: --Remove special-case handling of sparc64 bus_dmamap* functions. Replace with a more generic mechanism that allows MD busdma implementations to generate inline mapping functions by defining WANT_INLINE_DMAMAP in <machine/bus_dma.h>. This is currently useful for sparc64, x86, and arm64, which all implement non-load dmamap operations as simple wrappers around map objects which may be bus- or device-specific. --Remove NULL-checked bus_dmamap macros. Implement the equivalent NULL checks in the inlined x86 implementation. For non-x86 platforms, these checks are a minor pessimization as those platforms do not currently allow NULL maps. NULL maps were originally allowed on arm64, which appears to have been the motivation behind adding arm[64]-specific barriers to bus_dma.h, but that support was removed in r299463. --Simplify the internal interface used by the bus_dmamap_load* variants and move it to bus_dma_internal.h --Fix some drivers that directly include sys/bus_dma.h despite the recommendations of bus_dma(9) Reviewed by: kib (previous revision), marius Differential Revision: https://reviews.freebsd.org/D10729	2017-07-01 05:35:29 +00:00
kib	ece4b18df4	Fix batched unload for DMAR busdma in qi mode. Do not queue dmar_map_entries with zeroed gseq to dmar_qi_invalidate_locked(). Zero gseq stops the processing in the qi task. Do not assign possibly uninitialized on-stack gseq to map entries when requeuing them on unit tlb_flush queue. Random garbage in gseq is interpreted as too high invalidation sequence number and again stop the processing in the task. Make the sequence numbers generation completely contained in dmar_qi_invalidate_locked() and dmar_qi_emit_wait_seq(). Upper code directly passes boolean requesting emitting wait command instead of trying to provide hint to avoid it by passing NULL gseq pointer. Microoptimize the requeueing to tlb_flush queue by doing it for the whole queue. Diagnosed and tested by: Brett Gutstein <bgutstein@rice.edu> Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-19 21:48:52 +00:00
jhb	7d61b9ae69	Don't try to assign interrupts to a CPU on single-CPU systems. All interrupts are routed to the sole CPU in that case implicitly. This is a regression in EARLY_AP_STARTUP. Previously the 'assign_cpu' variable was only set when a multi-CPU system finished booting, so its value both meant that interrupts could be assigned and that there was more than one CPU. PR: 219882 Reported by: ota@j.email.ne.jp MFC after: 3 days	2017-06-14 13:34:09 +00:00
kib	a17a213fbb	More accurately handle early EFER restoration on resume. Do not try to set LMA bit while CPU is still in legacy mode. Apparently Intel CPUs ignore non-id writes to LMA, while AMD's (over-)react with #GP. Reported and tested by: danfe Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-06-11 14:39:08 +00:00
araujo	f12ef95704	Allow sysctl kern.vm_guest to return bhyve when running under bhyve. Submitted by: Sean Fagan <sef@ixsystems.com> Reviewed by: grehan MFH: 4 weeks. Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D11090	2017-06-08 04:02:14 +00:00
avg	fc233047aa	fix indentation MFC after: 4 days	2017-05-30 13:53:03 +00:00
jhb	330705f598	Remove constants and comments for unimplemented entries in the default LDT. These entries will never be added to the default LDT in the future.	2017-05-24 18:54:21 +00:00
jhb	5387dbf595	Remove the BSD/OS 2.1 system call gate LDT entry. An extra copy of the system call gate was added to the default LDT back in 1996 (r18513 / r18514). However, the ability to run BSD/OS 2.1 i386 binaries under FreeBSD's native ABI is most likely no longer needed. Discussed with: kib	2017-05-23 22:34:18 +00:00
hselasky	107bf62085	Avoid use of contiguous memory allocations in busdma when possible. This patch improves the boundary checks in busdma to allow more cases using the regular page based kernel memory allocator. Especially in the case of having a non-zero boundary in the parent DMA tag. For example AMD64 based platforms set the PCI DMA tag boundary to PCI_DMA_BOUNDARY, 4GB, which before this patch caused contiguous memory allocations to be preferred when allocating more than PAGE_SIZE bytes. Even if the required alignment was less than PAGE_SIZE bytes. This patch also fixes the nsegments check for using kmem_alloc_attr() when the maximum segment size is less than PAGE_SIZE bytes. Updated some comments describing the code in question. Differential Revision: https://reviews.freebsd.org/D10645 Reviewed by: kib, jhb, gallatin, scottl MFC after: 1 week Sponsored by: Mellanox Technologies	2017-05-16 14:21:37 +00:00
kib	542fd28222	Ensure that resume path on amd64 only accesses page tables for normal operation after processor is configured to allow all required features. In particular, NX must be enabled in EFER, otherwise load of page table element with nx bit set causes reserved bit page fault. Since malloc uses direct mapping for small allocations, in particular for the suspension pcbs, and DMAP is nx after r316767, this commit tripped fault on resume path. Restore complete state of EFER while wakeup code is still executing with custom page table, before calling resumectx, instead of trying to guess which features might be needed before resumectx restored EFER on its own. Bisected and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-05-15 20:52:43 +00:00
cem	2f88ac47eb	x86 MCA: Fix a deadlock in MCA exception processing In exceptional circumstances, an MCA exception will trigger when the freelist is exhausted. In such a case, no error will be logged on the list and 'mca_count' will not be incremented. Prior to this patch, all CPUs that received the exception would spin forever. With this change, the CPU that detects the error but finds the freelist empty will proceed to panic the machine, ending the deadlock. A follow-up to r260457. Reported by: Ryan Libby <rlibby at gmail.com> Reviewed by: jhb@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10536	2017-04-28 18:25:10 +00:00
jhb	63ea8e794e	Remove the LSOL26CALLS_SEL constant. It is no longer used after SVR4/i386 ABI support was removed. Reported by: kib	2017-04-25 23:19:27 +00:00
glebius	21ead51d79	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
glebius	5763443023	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
kib	1120d44b7f	Correct calculation of the entry->free_down in the invariants-checking code. Reported by: maxim Found by: PVS studio scan Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-04-14 15:16:41 +00:00
pkelsey	33064e92a2	Corrected misspelled versions of rendezvous. The MFC will include a compat definition of smp_no_rendevous_barrier() that calls smp_no_rendezvous_barrier(). Reviewed by: gnn, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10313	2017-04-09 02:00:03 +00:00
avg	800d1c86ba	use msr 0xc001100c to discover multi-node AMD processors This is applicable only to the older processors that do not have the AMD Topology extension. Opteron 6100-series "Magny-Cours" processors had multiple nodes within a package and didn't have the Topology extension. Without this change FreeBSD would assume that those processors have a single L3 cache shared by all cores while, in fact, each node has its own L3 cache. Many thanks to Freddie Cash <fjwcash@gmail.com> for providing valuable hardware information. MFC after: 2 weeks	2017-04-08 14:16:42 +00:00
avg	7a52acd8b3	revert r315959 because it causes build problems The change introduced a dependency between genassym.c and header files generated from .m files, but that dependency is not specified in the make files. Also, the change could be not as useful as I thought it was. Reported by: dchagin, Manfred Antar <null@pozo.com>, and many others	2017-03-27 12:34:29 +00:00
avg	f03ea73bb5	update comment describing topo_probe_amd() MFC after: 2 weeks MFC with: r316017	2017-03-27 11:04:57 +00:00
avg	3caafc04f1	add SMT detection for newer AMD processors The change seems to be more in the nomenclature than in the way the topology is advertised by the hardware. Tested by: truckman (earlier version of the change) MFC after: 2 weeks	2017-03-27 09:45:27 +00:00
kib	f3a84dabad	Timeout DMAR commands. Implement timeouts for register-based DMAR commands. Tunable/sysctl hw.dmar.timeout specifies the timeout in nanoseconds, set it to zero to allow infinite wait. Default is 1ms. Runtime modification of the sysctl is not safe, it is allowed for debugging. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-03-27 07:06:45 +00:00

900 Commits