freebsd-skq

Author	SHA1	Message	Date
emaste	5122d84e64	Correct pseudo misspelling in sys/ comments; contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
kib	de90d5191c	Do not return out of bound pointers from intr_lookup_source(). This hardens the code against driver and upper level bugs causing invalid indexes used, e.g. on msi release. Reported by: gallatin Reviewed by: gallatin, hselasky Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14470	2018-02-23 11:20:59 +00:00
imp	f735e1eb15	Do not include float interfaces when using libsa. We don't support float in the boot loaders, so don't include interfaces for float or double in systems headers. In addition, take the unusual step of spiking double and float to prevent any more accidental seepage.	2018-02-23 04:04:25 +00:00
markj	6a8b74d6f3	Don't include DMAR map entry zone items in kernel dumps. Such items may be allocated in the I/O path used by the dumper, potentially causing the dump to fail. Since there is some precedent in the DMAR driver for avoiding this problem using _NODUMP, apply this workaround to the zone as well. Reported and tested by: mmacy Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14422	2018-02-18 16:03:50 +00:00
kib	6b4aea7c3f	Remove unused symbols. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 23:18:42 +00:00
royger	c1141359ad	xen/pv: remove the attach of the ISA bus from the Xen PV bus. There's no need to attach the ISA bus from the Xen PV one. Sponsored by: Citrix Systems R&D	2018-02-16 18:04:27 +00:00
mjg	2c4feecfbb	xen: fix smp boot after r328157. mce_stack was left unset, leading to early crashes.	2018-02-15 07:23:41 +00:00
kib	59d970a7ef	Fix build with gas. Do not use C constant suffixes. Bit values are small enough to not require typing, even though they are used for 64bit MSR writes. The added cast in hw_ibrs_recalculate() is redundant but I prefer to add it for clarity. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-13 15:30:31 +00:00
imp	8256f7ec08	Move __va_list and related defines to sys/sys/_types.h. __va_list and related defines are identical in all the ARCH/include/_types.h files. Move them to sys/sys/_types.h. Sponsored by: Netflix	2018-02-12 14:48:20 +00:00
kib	637f4cf41d	Expand IBRS TLA in sysctl help lines. Requested by: bz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:54:05 +00:00
kib	01b52fdebb	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
kib	9c0b8085dc	Do not enable PTI when IA32_ARCH_CAP_RDCL_NO bit is set. Intel document 336996-001 claims that this will be the way to inform about Meltdown correction. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 14:25:42 +00:00
imp	1912ffb2e5	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PNP tables are not tagged; there are bigger issues with them.	2018-01-29 00:22:30 +00:00
mav	2c987d9c6a	Assume Always Running APIC Timer for AMD CPU families >= 0x12. Fallback to HPET may cause lock congestion on many-core systems. This change replicates Linux behavior. MFC after: 1 month	2018-01-28 18:18:03 +00:00
kib	545e25ea75	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporarily. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporary mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
kib	6f0656b43b	Fix native_lapic_ipi_alloc(). When PTI is enabled, empty IDT slots point to rsvd_pti. Reported by: Dexuan-BSD Cui <dexuan.bsd@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 5 days	2018-01-27 11:33:21 +00:00
pfg	f0c6025eb6	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they are meant to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
pfg	ced875130d	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
emaste	1cf1c6c06d	Enable KPTI by default on amd64 for non-AMD CPUs. Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: "We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required." Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
kib	c35d24e497	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text); PCPU; GDT/IDT/user LDT/task structures; IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables, then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occurred on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefit besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode makes global tables (PG_G) meaningless and even harmful, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline has not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
ian	f0c14bee67	Remove redundant critical_enter/exit() calls. The block of code delimited by these calls is now protected by a spin mutex (obscured within the RTC_LOCK/RTC_UNLOCK macros). Reported by: bde@	2018-01-16 23:18:52 +00:00
ian	2fcaa5e746	Move some code around and rename a couple variables; no functional changes. The static atrtc_set() function was called only from clock_settime(), so just move its contents entirely into clock_settime() and delete atrtc_set(). Rename the struct bcd_clocktime variables from 'ct' to 'bct'. I had originally wanted to emphasize how identical the clocktime and bcd_clocktime structs were, but things evolved to the point where the structs are not at all identical anymore, so now emphasizing the difference seems better.	2018-01-16 23:14:12 +00:00
ian	6ac58f6094	Add static inline rtcin_locked() and rtcout_locked() functions for doing a related series of operations without doing a lock/unlock for each byte. Use them when reading and writing the entire set of time registers. The original rtcin() and writertc() functions which do lock/unlock on each byte still exist, because they are public and called by outside code.	2018-01-16 03:02:41 +00:00
pfg	a7c6776f59	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these are likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the surrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
ian	3f5e0fe8f4	Convert the x86 RTC driver to use new validated BCD<->timespec conversions. New common routines were added to kern/subr_clock.c for converting between calendrical time expressed in BCD and struct timespec. The new functions return EINVAL on error, as expected when the clock hardware does not provide valid time. PR: 224813 Differential Revision: https://reviews.freebsd.org/D13731 (no reviewers)	2018-01-15 16:40:43 +00:00
kib	dcd37bb111	Enumerate and print Intel CPU features for Speculative Execution Side Channel Mitigations. The definitions are taken from the document 336996-001. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-14 12:36:23 +00:00
jeff	cc3d6a3370	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
cem	d1b1083a47	amd64: Add a 48-bit MAXADDR constant Some devices (e.g., ccp(4) -- to be committed) can only access the low 48 bits of physical memory. Reviewed by: markj Sponsored by: Dell EMC Isilon	2018-01-13 17:55:22 +00:00
jeff	bc9177f3a2	Add support for NUMA domains to bus dma tags. This causes all memory allocated with a tag to come from the specified domain if it meets the other constraints provided by the tag. Automatically create a tag at the root of each bus specifying the domain local to that bus if available. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13545	2018-01-12 23:34:16 +00:00
jeff	94c7af8ca2	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403	2018-01-12 22:48:23 +00:00
kib	dc8d51112c	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after microcode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
kib	b6ddae99a2	Use the new SDM-approved way to serialize x2APIC MSR writes. SDM editions 64 and below stated that it is enough to use MFENCE or LFENCE to serialize x2APIC register writes. New edition 65 requires either full serialization instruction or MFENCE;LFENCE sequence. Use the latter; FreeBSD needs serialization to ensure that writes done before IPI request are visible to the target IPI CPU. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-03 11:23:47 +00:00
kib	241446fb2b	Add CR4.SMAP control bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-01 19:34:19 +00:00
cperciva	55fe5887ff	Use the TSLOG framework to record entry/exit timestamps for DELAY and _vprintf; these functions are called in many places and can contribute meaningfully to the total time spent booting.	2017-12-31 09:24:41 +00:00
marius	f823b0ab84	With the advent of interrupt remapping, Intel has repurposed bit 11 (now: Interrupt_Index[15]) and assigned the previously reserved bits 55:48 (Interrupt_Index[14:0] goes into 63:49 while Destination Field used 63:56 and bit 48 now is Interrupt_Format) in the IO redirection tables (see the VT-d specification, "5.1.5.1 I/OxAPIC Programming"). Thus, when not using interrupt remapping, ensure that all previously reserved bits in the high part of the RTEs are zero instead of doing a read-modify-write for their Destination Field bits only. Otherwise, on machines based on Apollo Lake and its derivatives such as Denverton, typically some of the previously preserved bits remain set after boot when not employing interrupt remapping. The result is that INTx interrupts are not getting delivered. Note: With an AMD IOMMU, interrupt remapping apparently bypasses the IO APIC altogether. Submitted by: loos (modulo comment) Reviewed by: jhb (modulo comment)	2017-12-28 21:46:09 +00:00
phk	1642f8ba74	Introduce an architecture-agnostic <sys/_stdarg.h> to reduce platform divergence. Only architectures which pass arguments in registers (mips) and platforms which use really weird compilers (any?) would need to augment the contents of <sys/_stdarg.h>. Convert x86, arm and arm64 architectures to use <sys/_stdarg.h>.	2017-12-25 20:54:00 +00:00
imp	e65dafd72c	Further investigation shows this shouldn't have been added at all. Remove it.	2017-12-24 17:59:48 +00:00
imp	b002fd76bf	Comment this out until I have time to get to the bottom of why it's failing for some people.	2017-12-24 16:36:50 +00:00
imp	9afedc5ef7	Warn when nonPNP ISA devices are attached in GENERIC that they are being removed from GENERIC in 12. Always print PNP info for ISA when it exists: it doesn't depend on ISAPNP. Add PNP ID to orm and vga to prevent us from warning about them since those devices aren't being removed from GENERIC. PNP devices will be removed from GENERIC too, but they will be automatically loaded, so need no warning. We don't warn for non-GENERIC kernels because people running them are presumed to know what they are doing. MFC After: 2 weeks	2017-12-23 22:57:14 +00:00
kib	6e13a02f21	Add missed AVX512VL (128 and 256 bit vector length) extension identification bit. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-23 21:32:50 +00:00
bde	cf8a25e82e	Use resume_cpus() instead of restart_cpus() to resume from ACPI suspension. restart_cpus() worked well enough by accident. Before this set of fixes, resume_cpus() used the same cpuset (started_cpus, meaning CPUs directed to restart) as restart_cpus(). resume_cpus() waited for the wrong cpuset (stopped_cpus) to become empty, but since mixtures of stopped and suspended CPUs are not close to working, stopped_cpus must be empty when resuming so the wait is null -- restart_cpus just allows the other CPUs to restart and returns without waiting. Fix resume_cpus() to wait on a non-wrong cpuset for the ACPI case, and add further kludges to try to keep it working for the XEN case. It was only used for XEN. It waited on suspended_cpus. This works for XEN. However, for ACPI, resuming is a 2-step process. ACPI has already woken up the other CPUs and removed them from suspended_cpus. This fix records the move by putting them in a new cpuset resuming_cpus. Waiting on suspended_cpus would give the same null wait as waiting on stopped_cpus. Wait on resuming_cpus instead. Add a cpuset toresume_cpus to map the CPUs being told to resume to keep this separate from the cpuset started_cpus for mapping the CPUs being told to restart. Mixtures of stopped and suspended/resuming CPUs are still far from working. Describe new and some old cpusets in comments. Add further kludges to cpususpend_handler() to try to avoid breaking it for XEN. XEN doesn't use resumectx(), so it doesn't use the second return path for savectx(), and it goes from the suspended state directly to the restarted state, while ACPI resume goes through the resuming state. Enter the resuming state early for all cases so that resume_cpus can test for being in this state and not have to worry about the intermediate !suspended state for ACPI only. Reviewed by: kib	2017-12-21 09:17:48 +00:00
bde	994bacdf8f	Remove the permanent double mapping of low physical memory and replace it by a transient double mapping for the one instruction in ACPI wakeup where it is needed (and for many surrounding instructions in ACPI resume). Invalidate the TLB as soon as convenient after undoing the transient mapping. ACPI resume already has the strict ordering needed for this. This fixes the non-trapping of null pointers and other garbage pointers below NBPDR (except transiently). NBPDR is quite large (4MB, or 2MB for PAE). This fixes spurious traps at the first instruction in VM86 bioscalls. The traps are for transiently missing read permission in the first VM86 page (physical page 0) which was just written to at KERNBASE in the kernel. The mechanism is unknown (it is not simply PG_G). locore uses a similar but larger transient double mapping and needs it for 2 instructions instead of 1. Unmap the first PDE in it after the 2 instructions to detect most garbage pointers while bootstrapping. pmap_bootstrap() finishes the unmapping. Remove the avoidance of the double mapping for a recently fixed special case. ACPI resume could use this avoidance (made non-special) to avoid any problems with the transient double mapping, but no such problems are known. Update comments in locore. Many were for old versions of FreeBSD which tried to map low memory r/o except for special cases, or might have allowed access to low memory via physical offsets. Now all kernel maps are r/w, and removal of the double map disallows use of physical offsets again.	2017-12-18 13:53:22 +00:00
pfg	b0f7aa75d4	SPDX: use the Beerware identifier.	2017-11-30 20:33:45 +00:00
jkim	f9c37771cd	Properly skip the first CPU. It only accidentally worked because the CPU_FOREACH() loop always starts from BSP (cpu0) and the if condition is always false for APs. Reported by: cem	2017-11-30 20:21:42 +00:00
pfg	f1206865bb	SPDX: Fix some cases wrongly attributed to MIT. In the cases of BSD-style license variants without clauses, use 0BSD for the time being in lack of a better description.	2017-11-30 15:10:11 +00:00
jkim	c1509f7c95	Add a tunable "debug.hwpstate_verify" to check P-state after changing it and turn it off by default. It is very inefficient to verify current P-state of each core, especially for CPUs with many cores. When multiple commands are requested to the same power domain before completion of pending transitions, the last command is executed according to the manual. Because requests are serialized by the caller, all cores will receive the same command for each call. Do not call sched_bind() and sched_unbind(). It is redundant because the caller does it anyway.	2017-11-30 01:40:07 +00:00
jkim	ce7b988218	Fix style(9).	2017-11-29 23:52:31 +00:00
pfg	921a5b4874	sys/x86: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual, error-prone task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.	2017-11-27 15:11:47 +00:00
kib	873f304292	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
pfg	4736ccfd9c	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
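The intr_lookup_source() hardening in de90d5191c comes down to a range check before indexing the interrupt source table, so a bad index yields NULL rather than an out-of-bounds pointer. A minimal userspace sketch of that pattern, assuming a small fixed-size table; the names `demo_intsrc`, `demo_register_source()` and `demo_lookup_source()` are hypothetical and are not the kernel's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed table size for the sketch. */
#define DEMO_NUM_IO_INTS 8

struct demo_intsrc {
	int	vector;
};

static struct demo_intsrc *demo_intr_table[DEMO_NUM_IO_INTS];

/* Register a source; returns 0, or -1 if its vector is out of range. */
static int
demo_register_source(struct demo_intsrc *isrc)
{
	assert(isrc != NULL);
	if (isrc->vector < 0 || isrc->vector >= DEMO_NUM_IO_INTS)
		return (-1);
	demo_intr_table[isrc->vector] = isrc;
	return (0);
}

/*
 * Return the source for 'vector', or NULL when the index is out of
 * range, instead of computing a pointer past the end of the table.
 */
static struct demo_intsrc *
demo_lookup_source(int vector)
{
	if (vector < 0 || vector >= DEMO_NUM_IO_INTS)
		return (NULL);
	return (demo_intr_table[vector]);
}
```

A caller that passes a stale or garbage index (the commit mentions MSI release as one source) then gets NULL and can fail cleanly instead of dereferencing memory past the array.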
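The mallocarray(9) change in a7c6776f59 replaces open-coded `malloc(n * size)` with a call that can detect a product that does not fit in size_t. The sketch below shows one common way to do that check, the fast-path test used by OpenBSD's reallocarray(3); it is a userspace illustration with hypothetical names (`demo_mallocarray()`, `demo_mul_would_overflow()`), it reports overflow as NULL with errno set to ENOMEM, and it is not the FreeBSD kernel implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * If both factors are below 2^(bits in size_t / 2), the product cannot
 * overflow, so the division is only needed for large operands.
 */
#define DEMO_MUL_NO_OVERFLOW ((size_t)1 << (sizeof(size_t) * 4))

/* Nonzero if nmemb * size does not fit in a size_t. */
static int
demo_mul_would_overflow(size_t nmemb, size_t size)
{
	return ((nmemb >= DEMO_MUL_NO_OVERFLOW ||
	    size >= DEMO_MUL_NO_OVERFLOW) &&
	    nmemb > 0 && SIZE_MAX / nmemb < size);
}

/* Allocate nmemb * size bytes, or fail with ENOMEM on overflow. */
static void *
demo_mallocarray(size_t nmemb, size_t size)
{
	if (demo_mul_would_overflow(nmemb, size)) {
		errno = ENOMEM;
		return (NULL);
	}
	return (malloc(nmemb * size));
}
```

Keeping both operands unsigned, as f0c6025eb6 argues, is what makes this check meaningful: a negative count converted to size_t becomes a huge value and is rejected rather than silently wrapping.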
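The PCID scheme in 545e25ea75 pairs each kernel-mode PCID with a user-mode PCID by setting bit 11. The arithmetic can be sketched as below, assuming kernel PCIDs are allocated below 2048 so the pairing stays 1:1; the macro and function names are hypothetical. Bit 63 of a value loaded into CR3 (with CR4.PCIDE set) tells the CPU not to invalidate that PCID's TLB entries, which is the bit the commit calls CR3_PCID_SAVE:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CR3_PCID_MASK	0xfffULL	/* CR3[11:0] hold the PCID */
#define DEMO_PCID_USER_BIT	(1ULL << 11)	/* marks the user-mode twin */
#define DEMO_CR3_NO_FLUSH	(1ULL << 63)	/* keep this PCID's TLB entries */

/* Map a kernel-mode PCID (assumed < 2048) to its user-mode twin. */
static uint64_t
demo_user_pcid(uint64_t kern_pcid)
{
	assert(kern_pcid < DEMO_PCID_USER_BIT);
	return (kern_pcid | DEMO_PCID_USER_BIT);
}

/*
 * Build a CR3 value from a page-aligned top-level page table address and
 * a PCID.  With keep_tlb set, loading it does not flush that PCID's TLB
 * entries (assuming CR4.PCIDE is enabled).
 */
static uint64_t
demo_make_cr3(uint64_t pml4_pa, uint64_t pcid, int keep_tlb)
{
	uint64_t cr3;

	cr3 = (pml4_pa & ~DEMO_CR3_PCID_MASK) | (pcid & DEMO_CR3_PCID_MASK);
	if (keep_tlb)
		cr3 |= DEMO_CR3_NO_FLUSH;
	return (cr3);
}
```

For example, kernel PCID 5 pairs with user PCID 0x805, so kernel and user translations for the same address space never share TLB tags.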
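The RTC conversion in 3f5e0fe8f4 rejects register contents that are not valid BCD and returns EINVAL. A minimal sketch of such a check for one byte-wide field; the helper names and the per-field range check are assumptions for illustration, not the kern/subr_clock.c code:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Decode one packed-BCD byte; returns EINVAL if either nibble exceeds 9. */
static int
demo_bcd_to_bin(uint8_t bcd, int *out)
{
	if ((bcd & 0x0f) > 9 || (bcd >> 4) > 9)
		return (EINVAL);
	*out = (bcd >> 4) * 10 + (bcd & 0x0f);
	return (0);
}

/* Decode a BCD field and also require min <= value <= max. */
static int
demo_bcd_field(uint8_t bcd, int min, int max, int *out)
{
	int v;

	if (demo_bcd_to_bin(bcd, &v) != 0 || v < min || v > max)
		return (EINVAL);
	*out = v;
	return (0);
}
```

With this, 0x59 decodes to 59, while 0x5a (bad low nibble) or a minutes value of 0x60 is reported as EINVAL instead of producing a bogus time.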
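The I/O APIC fix in f823b0ab84 changes how the high 32 bits of a redirection table entry (RTE bits 63:32) are written when interrupt remapping is not used. A sketch of the difference, assuming the compatibility-format layout in which the Destination Field is RTE bits 63:56 (bits 31:24 of the high dword), so RTE bit 48 (Interrupt_Format) sits at bit 16 of that dword; the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * High dword of an RTE (RTE bits 63:32).  In compatibility format the
 * Destination Field is RTE bits 63:56, i.e. bits 31:24 here.
 */
#define DEMO_RTE_DEST_SHIFT	24
#define DEMO_RTE_DEST_MASK	0xff000000U

/* Old approach: update only the destination, keep whatever else is set. */
static uint32_t
demo_rte_high_rmw(uint32_t old_high, uint8_t apic_id)
{
	return ((old_high & ~DEMO_RTE_DEST_MASK) |
	    ((uint32_t)apic_id << DEMO_RTE_DEST_SHIFT));
}

/* New approach without remapping: all other high bits are written as 0. */
static uint32_t
demo_rte_high_noremap(uint8_t apic_id)
{
	return ((uint32_t)apic_id << DEMO_RTE_DEST_SHIFT);
}
```

If firmware had left, say, bit 48 set, the read-modify-write version would carry it forward, whereas writing the whole dword clears every formerly reserved bit.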


767 Commits