freebsd-skq

Author	SHA1	Message	Date
Jonathan T. Looney	e24e568336	Make the TCP blackbox code committed in r331347 be an optional feature controlled by the TCP_BLACKBOX option. Enable this as part of amd64 GENERIC. For now, leave it disabled on other platforms. Sponsored by: Netflix, Inc.	2018-03-24 12:48:10 +00:00
Ed Maste	f8268d4d97	Remove redundant cast from Linuxulator SYSINITs	2018-03-23 20:32:54 +00:00
Ed Maste	ad448975e6	Fixup return style(9) in amd64 linux*_sysvec.c Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:28:04 +00:00
Ed Maste	c0aa0e2c27	Sort headers in MD Linuxulator files Bring #includes closer to style(9) and reduce differences between the (three) MD versions of linux_machdep.c and linux_sysvec.c. Sponsored by: Turing Robotic Industries Inc.	2018-03-23 17:16:36 +00:00
Konstantin Belousov	54f30ad961	Fixes for ptrace(PT_GETXSTATE_INFO) related to the padding in struct ptrace_xstate_info). struct ptrace_xstate_info has 64bit member but ends up with 32bit one. As result, on amd64 there is a 32bit padding at the end, but not on i386. We must clear the padding before doing the copyout. For compat32 case, we must copyout the structure which does not have the padding at the end. The later fixes 32bit gdb display of the YMM registers when running on amd64 kernel. Reported by: Vlad Tsyrklevich Reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation admbugs: 765 MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14794	2018-03-22 20:44:27 +00:00
Kyle Evans	ad456dd9fa	Re-work efidev ordering to fix efirt preloaded by loader on amd64 On amd64, efi_enter calls fpu_kern_enter(). This may not be called until fpuinitstate has been invoked, resulting in a kernel panic with efirt_load="YES" in loader.conf(5). Move fpuinitstate a little earlier in SI_SUB_DRIVERS so that we can squeeze efirt between it and efirtc at SI_SUB_DRIVERS, SI_ORDER_ANY. efidev must be after efirt and doesn't really need to be at SI_SUB_DEVFS, so drop it at SI_SUB_DRIVER, SI_ORDER_ANY. The not immediately obvious dependency of fpuinitstate by efirt has been noted in both places. Discussed with: kib, andrew Reported by: Jakob Alvermark <jakob@alvermark.net> X-MFC-With: r330868	2018-03-22 18:24:00 +00:00
Ed Maste	1ac2776bbb	Share Linux errno table with libsysdecode Requested by: jhb Reviewed by: jhb Sponsored by: Turing Robotic Industries Inc.	2018-03-22 12:58:49 +00:00
Konstantin Belousov	8fbcc3343f	Move the CR0.WP manipulation KPI to x86. This should allow to avoid some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
Ed Maste	b7d779b3e5	Make linuxulator fn declaration match definition I accidentally swapped 'linux_fixup_elf' to 'linux_elf_fixup' in amd64's declaration (only), while bringing this change over from git and encountering a conflict.	2018-03-20 19:28:52 +00:00
Ed Maste	fc2a8776a2	Rename assym.s to assym.inc assym is only to be included by other .s files, and should never actually be assembled by itself. Reviewed by: imp, bdrewery (earlier) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D14180	2018-03-20 17:58:51 +00:00
Konstantin Belousov	9cffc92c62	Disable write protection around patching of XSAVE instruction in the context switch code. Some BIOSes give control to the OS with CR0.WP already set, making the kernel text read-only before cpu_startup(). Reported by: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:47:29 +00:00
Konstantin Belousov	2337dc6430	Provide KPI for handling of rw/ro kernel text. This is a pure syntax patch to create an interface to enable and later restore write access to the kernel text and other read-only mapped regions. It is in line with e.g. vm_fault_disable_pagefaults() by allowing the nesting. Discussed with: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:43:50 +00:00
Ed Maste	dc85846736	Rename linuxulator functions with linux_ prefix It's preferable to have a consistent prefix. This also reduces differences between the three linux*_sysvec.c files. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 21:26:32 +00:00
Ed Maste	9bec2ea66e	linux*_sysvec.c: rationalize whitespace and comments There's a fair amount of duplication between MD linuxulator files. Make indentation and comments consistent between the three versions of linux_sysvec.c to reduce diffs when comparing them. Sponsored by: Turing Robotic Industries Inc.	2018-03-19 15:11:10 +00:00
Ed Maste	6e481f83f7	Share a single bsd-linux errno table across MD consumers Three copies of the linuxulator linux_sysvec.c contained identical BSD to Linux errno translation tables, and future work to support other architectures will also use the same table. Move the table to a common file to be used by all. Make it 'const int' to place it in .rodata. (Some existing Linux architectures use MD errno values, but x86 and Arm share the generic set.) This change should introduce no functional change; a followup will add missing errno values. MFC after: 3 weeks Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14665	2018-03-16 14:46:38 +00:00
Ed Maste	7b194b3d3b	Remove stray ; at end of linux_vdso_deinstall()	2018-03-14 13:20:36 +00:00
Kyle Evans	63ee68c220	EFIRT: SetVirtualAddressMap with 1:1 mapping after exiting boot services This fixes a problem encountered on the Lenovo Thinkpad X220/Yoga 11e where runtime services would try to inexplicably jump to other parts of memory where it shouldn't be when attempting to enumerate EFI vars, causing a panic. The virtual mapping is enabled by default and can be disabled by setting efi_disable_vmap in loader.conf(5). Reviewed by: kib (earlier version) MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14677	2018-03-13 17:10:52 +00:00
Ed Maste	a95659f75f	Use C99 boolean type for translate_osrel Migrate to modern types before creating MD Linuxolator bits for new architectures. Reviewed by: cem Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14676	2018-03-13 16:40:29 +00:00
Ed Maste	4ba257591b	Apply some style(9) to Linuxulator linux_sysvec.c comments	2018-03-13 00:40:05 +00:00
Ian Lepore	c7053bbe54	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
Ian Lepore	86051be993	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
Tycho Nightingale	490768e24a	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548	2018-03-07 18:03:22 +00:00
Jonathan T. Looney	beb2406556	amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page. There are several components to this change: * Add a variable to indicate the start of the R/W portion of the initial memory. * Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assume that the BSP and all APs had the same support or lack of support for the NX bit.) * Set the RW and NX bits correctly for the kernel text, data, and BSS (subject to some caveats below). * Ensure DDB can write to memory when necessary (such as to set a breakpoint). * Ensure GDB can write to memory when necessary (such as to set a breakpoint). For this purpose, add new MD functions gdb_begin_write() and gdb_end_write() which the GDB support code can call before and after writing to memory. This change is not comprehensive: * It doesn't do anything to protect modules. * It doesn't do anything for kernel memory allocated after the kernel starts running. * In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page. Reviewed by: jhb, kib Discussed with: emaste MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14282	2018-03-06 14:28:37 +00:00
Jonathan T. Looney	99e159dcf6	We shouldn't need to execute code in the recursive page table mappings; therefore, it should be safe to set the NX bit on the PML4E for the recursive page table mappings. According to the Intel docs, the effect of the NX bit should propogate to any page reached through a PML4E which has the NX bit set. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14333	2018-03-05 15:12:35 +00:00
Jonathan T. Looney	66ce8430aa	Prior to r329071, pmap_bootstrap() used pmap_kmem_choose() to round the first available virtual address to a 2MB boundary. After r329071, create_pagetables() rounds firstaddr up to a 2MB boundary. This ensures the kernel is mapped in super-pages, which is the point of the logic in pmap_kmem_choose(). Therefore, it is no longer necessary for pmap_bootstrap() to round up to the 2MB boundary again. As pmap_bootstrap() was the only user of pmap_kmem_choose(), we can delete pmap_kmem_choose(). Reviewed by: kib MFC after: 2 weeks X-MFC-with: r329071 Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14355	2018-03-05 15:10:17 +00:00
Anish Gupta	9363000dfe	Move the new AMD-Vi IVHD [ACPI_IVRS_HARDWARE_NEW]definitions added in r329360 in contrib ACPI to local files till ACPI code adds new definitions reported by jkim. Rename ACPI_IVRS_HARDWARE_NEW to ACPI_IVRS_HARDWARE_EFRSUP, since new definitions add Extended Feature Register support. Use IvrsType to distinguish three types of IVHD - 0x10(legacy), 0x11 and 0x40(with EFR). IVHD 0x40 is also called mixed type since it supports HID device entries. Fix 2 coverity bugs reported by cem. Reported by:jkim, cem Approved by:grehan Differential Revision://reviews.freebsd.org/D14501	2018-03-05 02:28:25 +00:00
Konstantin Belousov	8c8ee2ee1c	Unify bulk free operations in several pmaps. Submitted by: Yoshihiro Ota Reviewed by: markj MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13485	2018-03-04 20:53:20 +00:00
Andriy Gapon	ae0eceab56	db_nextframe/amd64: catch up with r328083 to recognize fast_syscall_common Since that change the system call stack traces look like this: ... sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0028e13ac0 amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0028e13bf0 fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0028e13bf0 So, db_nextframe() stopped recognizing the system call frame. This commit should fix that. Reviewed by: kib MFC after: 4 days	2018-03-03 15:10:37 +00:00
Ravi Pokala	24f93aa05f	imcsmb(4): Intel integrated Memory Controller (iMC) SMBus controller driver imcsmb(4) provides smbus(4) support for the SMBus controller functionality in the integrated Memory Controllers (iMCs) embedded in Intel Sandybridge- Xeon, Ivybridge-Xeon, Haswell-Xeon, and Broadwell-Xeon CPUs. Each CPU implements one or more iMCs, depending on the number of cores; each iMC implements two SMBus controllers (iMC-SMBs). * IMPORTANT NOTE * Because motherboard firmware or the BMC might try to use the iMC-SMBs for monitoring DIMM temperatures and/or managing an NVDIMM, the driver might need to temporarily disable those functions, or take a hardware interlock, before using the iMC-SMBs. Details on how to do this may vary from board to board, and the procedure may be proprietary. It is strongly suggested that anyone wishing to use this driver contact their motherboard vendor, and modify the driver as described in the manual page and in the driver itself. (For what it's worth, the driver as-is has been tested on various SuperMicro motherboards.) Reviewed by: avg, jhb MFC after: 1 week Relnotes: yes Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D14447 Discussed with: avg, ian, jhb Tested by: allanjude (previous version), Panasas	2018-03-03 01:53:51 +00:00
Ed Maste	023b850b62	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files still waiting on permission from others are listed in review D14210. Approved by: dchagin, rdivacky, sos MFC after: 1 week MFC with: r329370 Sponsored by: The FreeBSD Foundation	2018-03-01 13:52:18 +00:00
John Baldwin	5f8754c077	Add a new variant of the GLA2GPA ioctl for use by the debug server. Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify the guest. In particular, it does not inject any faults or modify PTEs in the guest when performing an address space translation. This is used by bhyve's debug server to read and write memory for the remote debugger. Reviewed by: grehan MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D14075	2018-02-26 19:19:05 +00:00
Patrick Kelsey	18a7530938	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048	2018-02-26 03:03:41 +00:00
Jung-uk Kim	0ef8c0cb57	Partially revert r197863 to reduce diff against i386. When I wrote the patch, I wanted to remove SYSINIT() usage from amd64 code. There is no reason to keep the divergence any more because iwasaki merged most amd64 suspend/resume code to i386 with r235622. Note this also fixed an enge case reported by royger. [1] Suggested by: jhb Reviewed by: royger Tested by: royger [1] MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14400 [1]	2018-02-24 01:24:57 +00:00
Conrad Meyer	849ce31a82	Remove unused error return from API that cannot fail No implementation of fpu_kern_enter() can fail, and it was causing needless error checking boilerplate and confusion. Change the return code to void to match reality. (This trivial change took nine days to land because of the commit hook on sys/dev/random. Please consider removing the hook or otherwise lowering the bar -- secteam never seems to have free time to review patches.) Reported by: Lachlan McIlroy <Lachlan.McIlroy AT isilon.com> Reviewed by: delphij Approved by: secteam (delphij) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14380	2018-02-23 20:15:19 +00:00
Ed Maste	716cfaab96	Use linux types for linux-specific syscalls Sponsored by: Turing Robotic Industries Inc. Differential Revision: https://reviews.freebsd.org/D14065	2018-02-23 19:09:27 +00:00
Ed Maste	315fbaeca2	Correct pseudo misspelling in sys/ comments contrib code and #define in intel_ata.h unchanged.	2018-02-23 18:15:50 +00:00
Ed Maste	a0409b6f36	Remove accidental vim droppings Reported by: cy	2018-02-22 03:37:01 +00:00
Ed Maste	eae594f7d5	Correct proper nouns in the Linuxulator - Capitalize Linux - Spell FreeBSD out in full - Address some style(9) on changed lines Sponsored by: Turing Robotic Industries Inc.	2018-02-22 02:24:17 +00:00
John Baldwin	4f8666989a	Add two new ioctls to bhyve for batch register fetch/store operations. These are a convenience for bhyve's debug server to use a single ioctl for 'g' and 'G' rather than a loop of individual get/set ioctl requests. Reviewed by: grehan MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D14074	2018-02-22 00:39:25 +00:00
Konstantin Belousov	2c0f13aa59	vm_wait() rework. Make vm_wait() take the vm_object argument which specifies the domain set to wait for the min condition pass. If there is no object associated with the wait, use curthread' policy domainset. The mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by the new helper vm_wait_doms(), which directly takes the bitmask of the domains to wait for passing min condition. Eliminate pagedaemon_wait(). vm_domain_clear() handles the same operations. Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls are enough. Eliminate several control state variables from vm_domain, unneeded after the vm_wait() conversion. Scetched and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation, Mellanox Technologies Differential revision: https://reviews.freebsd.org/D14384	2018-02-20 10:13:13 +00:00
Ed Maste	0ba1b36553	Rationalize license text on Linuxolator files Many licenses on Linuxolator files contained small variations from the standard FreeBSD license text. To avoid license proliferation switch to the standard 2-clause FreeBSD license for those files where I have permission from each of the listed copyright holders. Additional files waiting on permission from others are listed in review D14210. Approved by: kan, marcel, sos, rdivacky MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-02-16 15:00:14 +00:00
Konstantin Belousov	13cad9af82	Use local symbol for offset. Small global symbols confuse ddb which matches them against small unrelated displacements and makes the disassembly ugly. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 13:32:46 +00:00
Andriy Gapon	7b394c1066	move vintr_intercept_enabled under INVARIANTS The function is not used outside of INVARIANTS since r328622. MFC after: 1 week	2018-02-16 07:02:14 +00:00
Anish Gupta	0b37d3d90e	This change fixes duplicate detection of same IOMMU/AMD-Vi device for Ryzen with EFR support. IVRS can have entry of type legacy and non-legacy present at same time for same AMD-Vi device. ivhd driver will ignore legacy if new IVHD type is present as specified in AMD-Vi specification. Earlier both of IVHD entries used and two ivhd devices were created. Add support for new IVHD type 0x11 and 0x40 in ACPI. Create new struct of type acpi_ivrs_hardware_new for these new type of IVHDs. Legacy type 0x10 will continue to use acpi_ivrs_hardware. Reviewed by: avg Approved by: grehan Differential Revision:https://reviews.freebsd.org/D13160	2018-02-16 05:17:00 +00:00
Jung-uk Kim	ea4fe1da62	Change size of padding to reflect reality. No functional change. Discussed with: kib	2018-02-15 20:42:38 +00:00
Conrad Meyer	5bd0149714	x86 pmap: Make memory mapped via pmap_qenter() non-executable The idea is, the pmap_qenter() API is now defined to not produce executable mappings. If you need executable mappings, use another API. Add pg_nx flag in pmap_qenter on x86 to make kernel pages non-executable. Other architectures that support execute-specific permissons on page table entries should subsequently be updated to match. Submitted by: Darrick Lew <darrick.freebsd AT gmail.com> Reviewed by: markj Discussed with: alc, jhb, kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14062	2018-02-14 23:35:47 +00:00
Ed Maste	83ab0f9f33	amd64/pmap: Move Foundation copyright to the 2-clause section Sponsored by: The FreeBSD Foundation	2018-02-13 19:19:26 +00:00
Hans Petter Selasky	33ec1ccbae	Import the mthca kernel side infiniband driver from Linux 4.9 and fix compilation under FreeBSD. The mthca driver was temporarily removed as part of the Linux 4.9 RoCE/infinband upgrade. Top commit in Linux source tree: 69973b830859bc6529a7a0468ba0d80ee5117826 Sponsored by: Mellanox Technologies	2018-02-13 17:04:34 +00:00
Jeff Roberson	e958ad4cf3	Make v_wire_count a per-cpu counter(9) counter. This eliminates a significant source of cache line contention from vm_page_alloc(). Use accessors and vm_page_unwire_noq() so that the mechanism can be easily changed in the future. Reviewed by: markj Discussed with: kib, glebius Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14273	2018-02-12 22:53:00 +00:00
Jonathan T. Looney	48fca66157	Mark the pages used for the initial page-table entries as wired. This makes them consistent with the way other page-table pages are allocated. It also provides the rest of the VM system a good clue that these pages are used. Reviewed by: alc, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14269	2018-02-12 17:27:50 +00:00
Warner Losh	982e7bdafc	We don't support gcc < 4.2.1, so varargs.h now is just #error always. Unifdef for versions prior to 4.2.1 and remove now-unused header files. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14323	2018-02-12 14:48:14 +00:00
Tycho Nightingale	58a6aaf7ec	Provide further mitigation against CVE-2017-5715 by flushing the return stack buffer (RSB) upon returning from the guest. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/x86/kvm?id=117cc7a908c83697b0b737d15ae1eb5943afe35b Reviewed by: grehan Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14272	2018-02-12 14:45:27 +00:00
Jonathan T. Looney	31ba4c7b5b	On bootup, the amd64 pmap initialization code creates page-table mappings for the pages used for the kernel and some initial allocations used for the page table. It maps the kernel and the blocks used for these initial allocations using 2MB pages. However, if the kernel does not end on a 2MB boundary, it still maps the last portion using a 2MB page, but reports that the unused 4K blocks within this 2MB allocation are free physical blocks. This means that these same physical blocks could also be mapped elsewhere - for example, into a user process. Given the proximity to the kernel text and data area, it seems wise to avoid allowing someone to write data to physical blocks also mapped into these virtual addresses. (Note that this isn't a security vulnerability: the direct map makes most/all memory on the system mapped into kernel space. And, nothing in the kernel should be trying to access these pages, as the virtual addresses are unused. It simply seems wise to avoid reusing these physical blocks while they are mapped to virtual addresses so close to the kernel text and data area.) Consequently, let's reserve the physical blocks covered by the page-table mappings for these initial allocations. Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14268	2018-02-09 17:46:33 +00:00
Mark Johnston	ab7c09f121	Use vm_page_unwire_noq() instead of directly modifying page wire counts. No functional change intended. Reviewed by: alc, kib (previous revision) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14266	2018-02-08 19:28:51 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Ed Maste	8a3b44cfc2	Additional linuxolator whitespace cleanup, missed in r328890	2018-02-05 18:39:06 +00:00
Ed Maste	132f90c660	Linuxolator whitespace cleanup A version of each of the MD files by necessity exists for each CPU architecture supported by the Linuxolator. Clean these up so that new architectures do not inherit whitespace issues. Clean up shared Linuxolator files while here. Sponsored by: Turing Robotic Industries Inc.	2018-02-05 17:29:12 +00:00
Konstantin Belousov	f7f14d9dea	When switching IBRS on, also enable STIBP (Single Thread Indirect Branch Predictors) mitigation. DOcument 336996-001 promises that CPUs which implement IBRS but not STIBP silently ignore setting of the bit instead of trapping. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-31 16:56:02 +00:00
Konstantin Belousov	319117fd57	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
Andriy Gapon	6a8b7aa424	vmm/svm: post LAPIC interrupts using event injection, not virtual interrupts The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR fields of VMCB to inject a virtual interrupt into a guest VM. This method has many advantages over the direct event injection as it offloads all decisions of whether and when the interrupt can be delivered to the guest. But with a purely software emulated vAPIC the advantage is also a problem. The problem is that the hypervisor does not have any precise control over when the interrupt is actually delivered to the guest (or a notification about that). Because of that the hypervisor cannot update the interrupt vector in IRR and ISR in the same way as real hardware would. The hypervisor becomes aware that the interrupt is being serviced only upon the first VMEXIT after the interrupt is delivered. This creates a window between the actual interrupt delivery and the update of IRR and ISR. That means that IRR and ISR might not be correctly set up to the point of the end-of-interrupt signal. The described deviation has been observed to cause an interrupt loss in the following scenario. vCPU0 posts an inter-processor interrupt to vCPU1. The interrupt is injected as a virtual interrupt by the hypervisor. The interrupt is delivered to a guest and an interrupt handler is invoked. The handler performs a requested action and acknowledges the request by modifying a global variable. So far, there is no VMEXIT and the hypervisor is unaware of the events. Then, vCPU0 notices the acknowledgment and sends another IPI with the same vector. The IPI gets collapsed into the previous IPI in the IRR of vCPU1. Only after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the second IPI has never been sent. The scenario is impossible on the real hardware because IRR and ISR are updated just before the interrupt handler gets started. I saw several possibilities of fixing the problem. One is to intercept the virtual interrupt delivery to update IRR and ISR at the right moment. The other is to deliver the LAPIC interrupts using the event injection, same as legacy interrupts. I opted to use the latter approach for several reasons. It's equivalent to what VMM/Intel does (in !VMX case). It appears to be what VirtualBox and KVM do. The code is already there (to support legacy interrupts). Another possibility was to use a special intermediate state for a vector after it is injected using a virtual interrupt and before it is known whether it was accepted or is still pending. That approach was implemented in https://reviews.freebsd.org/D13828 That method is more complex and does not have any clear advantage. Please see sections 15.20 and 15.21.4 of "AMD64 Architecture Programmer's Manual Volume 2: System Programming" (publication 24593, revision 3.29) for comparison between event injection and virtual interrupt injection. PR: 215972 Reported by: ajschot@hotmail.com, grehan Tested by: anish, grehan, Nils Beyer <nbe@renzel.net> Reviewed by: anish, grehan MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D13780	2018-01-31 11:14:26 +00:00
John Baldwin	05d56d83b6	Ensure 'name' is not NULL before passing to strcmp(). This avoids a nested page fault when obtaining a stack trace in DDB if the address from the first frame does not resolve to a known symbol. MFC after: 1 week Sponsored by: Chelsio Communications	2018-01-30 23:29:27 +00:00
Bryan Drewery	595109196a	Don't use an .OBJDIR for 'make sysent'. Reported by: emaste, jhb Sponsored by: Dell EMC	2018-01-29 19:14:15 +00:00
Warner Losh	d6b6639713	Add ISA PNP tables to ISA drivers. Fix a few incidental comments. ACPI ISA PBP tables not tagged, there's bigger issues with them.	2018-01-29 00:22:30 +00:00
Konstantin Belousov	c8f9c1f3d9	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions, cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporary. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporal mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
Edward Tomasz Napierala	28f3d8b2c2	Add SPDX identifiers to linux_ptrace.c and cfumass.c. MFC after: 2 weeks	2018-01-24 17:04:01 +00:00
Ed Maste	7eb2159f6a	Use BSD-2-Clause-FreeBSD license on linux_support.s These files previously had a 3-clause license and 'THE REGENTS' text. Switch to standard 2-clause text with kib's approval, and add the SPDX tag. Approved by: kib	2018-01-23 20:35:43 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Konstantin Belousov	c398c14664	Use correct symbol name in r328202. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 18:05:14 +00:00
Konstantin Belousov	3a5e472e17	Use predefined symbol for the CR3.PCID mask. Sponsored by: The FreeBSD Foundation MFC after: 11 days	2018-01-20 17:46:09 +00:00
Roger Pau Monné	50a53194f6	xen: fix IDT setup after PTI On amd64 the IDT handler was not set correctly when using PTI. While there also fix the selectors to SEL_KPL. Obtained from: kib MFC with: r328083	2018-01-20 14:59:37 +00:00
Konstantin Belousov	b4dfc9d7ad	PTI: Trap if we returned to userspace with kernel (full) page table still active. Map userspace portion of VA in the PTI kernel-mode page table as non-executable. This way, if we ever miss reloading ucr3 into %cr3 on the return to usermode, the process traps instead of executing in potentially vulnerable setup. Catch the condition of such trap and verify user-mode %cr3, which is saved by page fault handler. I peek this trick in some article about Linux implementation. Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 12 days DIfferential revision: https://reviews.freebsd.org/D13956	2018-01-19 22:10:29 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
Ed Maste	b3327f62f0	Enable KPTI by default on amd64 for non-AMD CPUs Kernel Page Table Isolation (KPTI) was introduced in r328083 as a mitigation for the 'Meltdown' vulnerability. AMD CPUs are not affected, per https://www.amd.com/en/corporate/speculative-execution: We believe AMD processors are not susceptible due to our use of privilege level protections within paging architecture and no mitigation is required. Thus default KPTI to off for AMD CPUs, and to on for others. This may be refined later as we obtain more specific information on the sets of CPUs that are and are not affected. Submitted by: Mitchell Horne Reviewed by: cem Relnotes: Yes Security: CVE-2017-5754 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D13971	2018-01-19 15:42:34 +00:00
John Baldwin	68fd3b0ef5	Use a dedicated per-CPU stack for machine check exceptions. Similar to NMIs, machine check exceptions can fire at any time and are not masked by IF. This means that machine checks can fire when the kstack is too deep to hold a trap frame, or at critical sections in trap handlers when a user %gs is used with a kernel %cs. Use the same strategy used for NMIs of using a dedicated per-CPU stack configured in IST 3. Store the CPU's pcpu pointer at the stop of the stack so that the machine check handler can reliably find the proper value for %gs (also borrowed from NMIs). This should also fix a similar issue with PTI with a MC# occurring while the CPU is executing on the trampoline stack. While here, bypass trap() entirely and just call mca_intr(). This avoids a bogus call to kdb_reenter() (there's no reason to try to reenter kdb if a MC# is raised). Reviewed by: kib Tested by: avg (on AMD without PTI) Differential Revision: https://reviews.freebsd.org/D13962	2018-01-18 23:50:21 +00:00
John Baldwin	f36b1fe0bd	Remove two no-longer-used labels from the NMI interrupt handler. Reviewed by: kib	2018-01-18 22:13:53 +00:00
John Baldwin	7f513d17b2	Adjust branch target in NMI handler for the !PTI case. In the !PTI case the NMI handler jumped past the instructions that set %rdi to point to the current PCB, but the target instructions assumed %rdi were set. Reviewed by: kib Tested by: pho	2018-01-18 20:12:12 +00:00
Konstantin Belousov	3705dda7e4	Move the kernphys declaration to machine/md_var.h. Apparently machinde/cpu.h is supposed to contain MD implementations of MI interfaces. Also, remove kernphys declaration from machdep.c, since it is already provided by md_var.h. Requested and reviewed by: bde MFC after: 13 days	2018-01-18 15:15:35 +00:00
Konstantin Belousov	ac97ccbab5	Fix compilation with gcc. etext is already declared in machine/cpu.h, move kernphys declaration there too. Based on the patch by: bde MFC after: 13 days	2018-01-18 11:21:03 +00:00
Konstantin Belousov	406bc0da95	Fix compilation with gas. Submitted by: bde MFC after: 13 days	2018-01-18 11:19:58 +00:00
Konstantin Belousov	0d6c61ec30	Remove the 'last' argument from the pmap_pti_free_page(). It is in fact unused. Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-01-18 11:01:41 +00:00
John Baldwin	65eefbe422	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
Mark Johnston	9cb26f73ea	Annotate a couple of changes from r328083. Reviewed by: kib X-MFC with: r328083	2018-01-17 21:52:12 +00:00
Konstantin Belousov	bd50262f70	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text) PCPU GDT/IDT/user LDT/task structures IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
Konstantin Belousov	94b011c4bc	Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as static. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-17 11:21:03 +00:00
Pedro F. Giffuni	74641f0bc6	x86: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these ire likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the sorrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:08:22 +00:00
Tycho Nightingale	91fe5fe7e7	Provide some mitigation against CVE-2017-5715 by clearing registers upon returning from the guest which aren't immediately clobbered by the host. This eradicates any remaining guest contents limiting their usefulness in an exploit gadget. This was inspired by this linux commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b6c02f38315b720c593c6079364855d276886aa Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2018-01-15 18:37:03 +00:00
Konstantin Belousov	5f7b9ff2e3	Add STAC and CLAC instructions wrappers. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-14 12:39:50 +00:00
Jeff Roberson	b6715dab8f	Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA. Sponsored by: Netflix, Dell/EMC Isilon Discussed with: jhb	2018-01-14 03:36:03 +00:00
Jeff Roberson	ab3185d15e	Implement NUMA support in uma(9) and malloc(9). Allocations from specific domains can be done by the _domain() API variants. UMA also supports a first-touch policy via the NUMA zone flag. The slab layer is now segregated by VM domains and is precise. It handles iteration for round-robin directly. The per-cpu cache layer remains a mix of domains according to where memory is allocated and freed. Well behaved clients can achieve perfect locality with no performance penalty. The direct domain allocation functions have to visit the slab layer and so require per-zone locks which come at some expense. Reviewed by: Attilio (a slightly older version) Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2018-01-12 23:25:05 +00:00
Konstantin Belousov	c751f90c0c	Fix grammar. Submitted by: alc MFC after: 3 days	2018-01-11 16:50:03 +00:00
Konstantin Belousov	6da5c56ae5	Remove redundand CLD instructions. We already clear %RFLAGS.DF on the kernel entry due to the compiler's ABI requirements. Suggested by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-11 13:22:13 +00:00
Konstantin Belousov	4975c202ac	Do not clear %RFLAGS.DF on fast syscall entry. Hardware already did it for us due to the mask loaded into the MSR_SF_MASK msr register. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-11 12:54:33 +00:00
Konstantin Belousov	0f7c159f6b	Move the hardware setup for fast syscalls into a common function. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-11 12:40:43 +00:00
Konstantin Belousov	4275e16fa9	Rename COMMON_TSS_RSP0 to TSS_RSP0. The symbol is just an offset in the hardware TSS structure, it is not limited to the common_tss instance. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:28:08 +00:00
Konstantin Belousov	3ee6e65875	Update comment explaining the check, to reality. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-01-11 12:07:24 +00:00
Conrad Meyer	e6fcf7898d	x86: Document purpose of _safe variants of {rd,wr}msr() Sponsored by: Dell EMC Isilon	2018-01-10 22:41:00 +00:00
Andriy Gapon	091da2dfa5	vmm/svm: contigmalloc of the whole svm_softc is excessive This is a followup to r307903. struct svm_softc takes more than 200 kilobytes while what we really need is 3 contiguous pages for I/O permission map and 2 contiguous pages for MSR permission map. Other physically mapped structures have a size of a single page, so a proper alignment is sufficient for their correct mapping. Thus, only the permission maps are allocated with contigmalloc now, the softc is allocated with a regular malloc. Additionally, this commit adds a check that malloc returns memory with the expected page alignment and that contigmalloc does not fail. Unfortunately, at present svm_vminit() is expected to always succeed and there is no way to report an error. So, a contigmalloc failure leads to a panic. We should probably fix this. MFC after: 2 weeks	2018-01-09 14:22:18 +00:00
Konstantin Belousov	0530a9360f	Make it possible to re-evaluate cpu_features. Add cpuctl(4) ioctl CPUCTL_EVAL_CPU_FEATURES which forces re-read of cpu_features, cpu_features2, cpu_stdext_features, and std_stdext_features2. The intent is to allow the kernel to see the changes in the CPU features after micocode update. Of course, the update is not atomic across variables and not synchronized with readers. See the man page warning as well. Reviewed by: imp (previous version), jilles Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13770	2018-01-05 21:06:19 +00:00
Andriy Gapon	5f3c7d6580	Fix a couple of comments in AMD Virtual Machine Control Block structure MFC after: 1 week	2018-01-05 19:15:24 +00:00
Konstantin Belousov	84874cc151	Avoid re-check of usermode condition. It does not change anything in the behavior of trap_pfault(), while eliminating obfuscation of jumping to the code which checks for the condition reversed of the goto cause. Also avoid force initialize the rv variable, since it is now only accessed after storing vm_fault() return value. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13725	2018-01-01 20:47:03 +00:00
Konstantin Belousov	1865d6b851	Remove MP SAFE marks and stray register name in comments. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-12-31 17:07:59 +00:00
Colin Percival	31a55efdc5	Use the TSLOG framework to record entry/exit timestamps for hammer_time. The entry must be logged "manually" using TSRAW rather than TSENTER since PCPU data structures have not yet been initialized and thus curthread cannot be accessed; &thread0 is what will become curthread later in hammer_time. Other MD initialization code should be similarly instrumented in order to gain visibility into the time spent before entering mi_startup; this will require some care and testing from people with access to such hardware.	2017-12-31 09:22:07 +00:00
Eitan Adler	caa7e52f3f	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
Tycho Nightingale	9e33a61693	Recognize a pending virtual interrupt while emulating the halt instruction. Reviewed by: grehan, rgrimes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13573	2017-12-21 18:30:11 +00:00
Konstantin Belousov	30d4f9e888	Add atomic_load(9) and atomic_store(9) operations. They provide relaxed-ordered atomic access semantic. Due to the FreeBSD memory model, the operations are syntaxical wrappers around the volatile accesses. The volatile qualifier is used to ensure that the access not optimized out and in turn depends on the volatile semantic as implemented by supported compilers. The motivation for adding the operation is to help people coming from other systems or knowing the C11/C++ standards where atomics have special type and require use of the special access operations. It is still the case that FreeBSD requires plain load and stores of aligned integer types to be atomic. Suggested by: jhb Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 09:59:20 +00:00
Mark Johnston	5bab623438	Pass the trap frame to fasttrap hooks. The DTrace fasttrap entry points expect a struct reg containing the register values of the calling thread. Perform the conversion in fasttrap rather than in the trap handler: this reduces the number of ifdefs and avoids wasting stack space for traps that don't involve DTrace. MFC after: 2 weeks	2017-12-11 19:21:39 +00:00
Bruce Evans	fb3cc1c37d	Move instantiation of msgbufp from 9 MD files to subr_prf.c. This variable should be pure MI except possibly for reading it in MD dump routines. Its initialization was pure MD in 4.4BSD, but FreeBSD changed this in r36441 in 1998. There were many imperfections in r36441. This commit fixes only a small one, to simplify fixing the others 1 arch at a time. (r47678 added support for special/early/multiple message buffer initialization which I want in a more general form, but this was too fragile to use because hacking on the msgbufp global corrupted it, and was only used for 5 hours in -current...)	2017-12-07 07:55:38 +00:00
Andriy Gapon	a7437a3e9d	amd-vi: set iommu msi configuration using pci_enable_msi method This is better than directly changing PCI configuration space of the device because it makes the PCI bus aware of the configuration. Also, the change allows to drop a bunch of code that duplicated pci_enable_msi() functionality. I wonder if it's possible to further simplify the code by using pci_alloc_msi().	2017-12-04 17:10:52 +00:00
Andriy Gapon	df92c28d6a	vmm/amd: add ivhd device with a higher order ivhd should attach after the root PCI bus and, thus, after the ACPI Host-PCI bridge off which the bus hangs. This is because ivhd changes PCI configuration of a PCI IOMMU device that is located on the root bus. If the bus attaches after ivhd it clears the MSI portion of the configuration. As a result IOMMU event interrupts would never be delivered. For regular ACPI devices the order is calculated as ACPI_DEV_BASE_ORDER + level * 10 where level is a depth of the device in the ACPI namespace. I expect the depth of the Host-PCI bridge to be two or three, so ACPI_DEV_BASE_ORDER + 10 * 10 should be a sufficiently safe order for ivhd. This should fix the setup of the AMD-Vi event interrupt when vmm is preloaded (as opposed to kldload-ed).	2017-12-04 17:08:03 +00:00
Andriy Gapon	8f09494d1e	amd-vi: clear event interrupt and overflow bits upon handling the interrupt This ensures that we can receive further event interrupts. See the description of the bits in the specification for MMIO Offset 2020h IOMMU Status Register. The bits are defined as set-by-hardware write-1-to-clear, same as all the bits in the status register. Discussed with: anish	2017-12-04 17:02:53 +00:00
Scott Long	c15269ccb8	It's time to retire AHC_REG_PRETTY_PRINT and AHD_REG_PRETTY_PRINT from the standard kernels. They are still available as custom compile options.	2017-11-29 23:41:49 +00:00
Brooks Davis	5cd667e65f	Disable vim syntax highlighting. Vim's default pick doesn't understand that ';' is a comment character and the result looks horrible. Reviewed by: emaste	2017-11-28 18:23:17 +00:00
Konstantin Belousov	dde5602786	Fix index calculation for the page table pages for efirt 1:1 map. Stop issuing pre-assigned number to enumerate all page table pages, the assignment is incorrect. Instead automatically calculate the next unused index. This index in fact does not serve any purpose except to be unique to satisfy vm_page_grab() interface, we do not look up the page by the index later. Reported and tested by: emaste Reviewed by: andrew Sponsored by: The FreeBSD Foundation MFC after: 2 weeks PR: 223906 Differential revision: https://reviews.freebsd.org/D13273	2017-11-28 09:34:43 +00:00
Fedor Uporov	cd76ee1ee3	Remap ENOATTR to ENODATA in the linuxulator. In the linux ENOADATA is frequently #defined as ENOATTR. The change is required for an xattrs support implementation. MFC after: 1 week Discussed with: netchild Approved by: pfg Differential Revision: https://reviews.freebsd.org/D13221	2017-11-27 17:03:11 +00:00
Pedro F. Giffuni	c49761dd57	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:03:07 +00:00
Ed Schouten	ee13ffbe03	Use TO_PTR() to convert integers to pointers. For FreeBSD/arm64's cloudabi32 support, I'm going to need a TO_PTR() in this place. Also use it for all of the other source files, so that the difference remains as minimal as possible. MFC after: 2 weeks	2017-11-26 14:45:56 +00:00
Hans Petter Selasky	8a53e1340f	Merge ^/head r326132 through r326161.	2017-11-24 12:13:27 +00:00
Andriy Gapon	a3bbbd5e40	amd-vi: a small whitespace cleanup Reviewed by: anish	2017-11-24 11:37:41 +00:00
Andriy Gapon	685c54fc6a	amd-vi: use correct type for pci_rid, start_dev_rid, end_dev_rid sysctls Previously, the values could look confusing because of unrelated bits from adjacent memory. Reviewed by: anish	2017-11-24 11:36:35 +00:00
Andriy Gapon	eb6c9c128c	amd-vi: small improvements to event printing Ensure that an opening bracket always has a matching closing one. Ensure that there is always a new-line at the end of a report line. Also, add a space before the printed event flag. Reviewed by: anish	2017-11-24 11:35:43 +00:00
Andriy Gapon	dee38cdc2a	amd-vi: print some additional details for INVALID_DEVICE_REQUEST event Namely, the type of the hardware event and whether the transaction was a translation request. Reviewed by: anish	2017-11-24 11:34:46 +00:00
Andriy Gapon	53d580f984	amd-vi: fix up r326152, the new width requires a wider type This is my brain-o from extending the width at the last moment.	2017-11-24 11:25:06 +00:00
Andriy Gapon	5a041f2183	amd-vi: fix and extend definition of Command and Event Status Register (0x2020) The defined bits are the lower bits, not the higher ones. Also, the specification has been extended to define bits 0:18 and they all could potentially be interesting to us, so extend the width of the field accordingly. Reviewed by: anish	2017-11-24 11:20:10 +00:00
Andriy Gapon	8523ad24ba	vmm/amd: improve iteration over IVHD (type 10h) entries in IVRS table Many 8-byte entries have zero at byte 4, so the second 4-byte part is skipped as a 4-byte padding entry. But not all 8-byte entries have that property and they get misinterpreted. A real example: 48 00 00 00 ff 01 00 01 This an 8-byte ACPI_IVRS_TYPE_SPECIAL entry for IOAPIC with ID 255 (bogus). It is reported as: ivhd0: Unknown dev entry:0xff Fortunately, it was completely harmless. Also, bail out early if we encounter an entry of a variable length type. We do not have proper handling for those yet. Reviewed by: anish	2017-11-24 11:10:36 +00:00
Ed Schouten	814629dd64	Don't let cpu_set_syscall_retval() clobber exec_setregs(). Upon successful completion, the execve() system call invokes exec_setregs() to initialize the registers of the initial thread of the newly executed process. What is weird is that when execve() returns, it still goes through the normal system call return path, clobbering the registers with the system call's return value (td->td_retval). Though this doesn't seem to be problematic for x86 most of the times (as the value of eax/rax doesn't matter upon startup), this can be pretty frustrating for architectures where function argument and return registers overlap (e.g., ARM). On these systems, exec_setregs() also needs to initialize td_retval. Even worse are architectures where cpu_set_syscall_retval() sets registers to values not derived from td_retval. On these architectures, there is no way cpu_set_syscall_retval() can set registers to the way it wants them to be upon the start of execution. To get rid of this madness, let sys_execve() return EJUSTRETURN. This will cause cpu_set_syscall_retval() to leave registers intact. This makes process execution easier to understand. It also eliminates the difference between execution of the initial process and successive ones. The initial call to sys_execve() is not performed through a system call context. Reviewed by: kib, jhibbits Differential Revision: https://reviews.freebsd.org/D13180	2017-11-24 07:35:08 +00:00
Hans Petter Selasky	82725ba9bf	Merge ^/head r325999 through r326131.	2017-11-23 14:28:14 +00:00
Konstantin Belousov	383f241dce	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Hans Petter Selasky	937d37fc6c	Merge ^/head r325842 through r325998.	2017-11-19 12:36:03 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Hans Petter Selasky	55b1c6e7e4	Merge ^/head r325663 through r325841.	2017-11-15 11:28:11 +00:00
Hans Petter Selasky	8dee9a7a44	Remove no longer supported mthca driver. Sponsored by: Mellanox Technologies	2017-11-13 10:59:38 +00:00
Mateusz Guzik	ca0227933e	amd64: stop nesting preemption counter in spinlock_enter Discussed with: jhb	2017-11-12 03:13:01 +00:00
Jeff Roberson	8d6fbbb867	Replace manyinstances of VM_WAIT with blocking page allocation flags similar to the kernel memory allocator. This simplifies NUMA allocation because the domain will be known at wait time and races between failure and sleeping are eliminated. This also reduces boilerplate code and simplifies callers. A wait primitive is supplied for uma zones for similar reasons. This eliminates some non-specific VM_WAIT calls in favor of more explicit sleeps that may be satisfied without new pages. Reviewed by: alc, kib, markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon	2017-11-08 02:39:37 +00:00
Konstantin Belousov	b535ed2898	Zero the structure instead of the pointer to it. Reported by: Don Morris <Don.Morris@dell.com> MFC after: 4 days	2017-11-05 20:03:57 +00:00
Konstantin Belousov	5b9a3721e6	x86: Do not emit unused TD_TID symbols. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-04 10:51:52 +00:00
Konstantin Belousov	ad4e4ae591	Restore an optimization that was temporary disabled by r324665. In reclaim_pv_chunk(), rotate the pv chunks list so that next invocations of the reclaim do not scan the same pv chunks that could not be freed. Only do the rotation when there is no parallel scan, tracked by active_reclaims counter. To rotate, move all chunks that are before current iteration marker, after another marker that is inserted at the list tail on start of the reclaim. Reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-01 18:06:44 +00:00
Konstantin Belousov	aa788cc387	Consistently ensure that we do not load MXCSR with reserved bits set. Some callers of fpusetregs()/npxsetregs(), most importantly set_fpcontext(), clear reserved bits. But some did not. Do the clearing in fpusetregs() and remove now redundand operation from set_fpcontext(). Reported by: Maxime Villard <max@m00nbsd.net> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-11-01 10:32:44 +00:00
Peter Grehan	9d210a4a18	Emulate the "OR reg, r/m" instruction (opcode 0BH). This is needed for the HDA emulation with FreeBSD guests. Reviewed by: marcelo MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12832	2017-11-01 03:26:53 +00:00
Tijl Coosemans	f236378b54	Set the return address for stack entry points to zero. Stack unwinders treat zero as a stop condition. The value on the stack can be non-zero because thread stacks may be arbitrary memory provided via pthread_attr_setstack(3) or may be recycled from previous threads. Reference: https://lists.freebsd.org/pipermail/freebsd-current/2017-August/066855.html https://lists.freebsd.org/pipermail/freebsd-current/2017-October/067254.html Discussed with: kib MFC after: 1 week	2017-10-31 11:51:34 +00:00
Ian Lepore	5deb1573e8	Improve the performance of the hpet timer in bhyve guests by making the timer frequency a power of two. This changes the frequency from 10 to 16.7 MHz (2 ^ 24 HZ). Using a power of two avoids roundoff errors when doing arithmetic in sbintime_t units. Testing shows this can fix erratic ntpd behavior in guests using the hpet timer (which is the default for multicore guests). Reported by: bsam@	2017-10-29 20:50:03 +00:00
Eitan Adler	a2aef24aa3	Update several more URLs - Primarily http -> https - Primarily FreeBSD project URLs	2017-10-29 08:17:03 +00:00
John Baldwin	6db55a0f3a	Rework pass through changes in r305485 to be safer. Specifically, devices that do not support PCI-e FLR and were not gracefully shutdown by the guest OS could continue to issue DMA requests after the VM was terminated. The changes in r305485 meant that those DMA requests were completed against the host's memory which could result in random memory corruption. Instead, leave ppt devices that are not attached to a VM disabled in the IOMMU and only restore the devices to the host domain if the ppt(4) driver is detached from a device. As an added safety belt, disable busmastering for a pass-through device when before adding it to the host domain during ppt(4) detach. PR: 222937 Tested by: Harry Schmalzbauer <freebsd@omnilan.de> Reviewed by: grehan MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12661	2017-10-27 14:57:14 +00:00
Mark Johnston	5fca1d90c1	Fix the VM_NRESERVLEVEL == 0 build. Add VM_NRESERVLEVEL guards in the pmaps that implement transparent superpage promotion using reservations. Reviewed by: alc, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12764	2017-10-23 15:34:05 +00:00
Mateusz Guzik	9e68989764	Make the sleepq chain hash size configurable per-arch and bump on amd64. While here cache-align chains. This shortens longest found chain during poudriere -j 80 from 32 to 16. Pushing this higher up will probably require allocation on boot.	2017-10-22 20:43:50 +00:00
Bjoern A. Zeeb	8e94025b41	With r181803 on 2008-08-17 23:27:27Z the first VIMAGE commit went into HEAD. Enable VIMAGE in GENERIC kernels and some others (where GENERIC does not exist) on HEAD. Disable building LINT-VIMAGE with VIMAGE being default. This should give it a lot more exposure in the run-up to 12 to help us evaluate whether to keep it on by default or not. We are also hoping to get better performance testing. The feature can be disabled using nooptions. Requested by: many Reviewed by: kristof, emaste, hiren X-MFC after: never Relnotes: yes Differential Revision: https://reviews.freebsd.org/D12639	2017-10-20 21:40:59 +00:00
Mateusz Guzik	e66167764a	amd64: plug missed dt_lock in cpu_fork	2017-10-20 18:58:11 +00:00
Mateusz Guzik	a5db8ade37	amd64: __exclusive_cache_line pv_chunks_mutex and pv_list_locks Note that pv_list_locks is an array and currently it fits 2 locks per line. Resizing it and/or putting more locks in different lines requires several tests. MFC after: 1 week	2017-10-20 03:38:58 +00:00
Mateusz Guzik	d95498d44f	amd64: avoid acquiring dt lock if possible (which is the common case) Discussed with: kib MFC after: 1 week	2017-10-20 03:30:02 +00:00
Mark Johnston	46fcd1af63	Move kernel dump offset tracking into MI code. All of the kernel dump implementations keep track of the current offset ("dumplo") within the dump device. However, except for textdumps, they all write the dump sequentially, so we can reduce code duplication by having the MI code keep track of the current offset. The new dump_append() API can be used to write at the current offset. This is needed to implement support for kernel dump compression in the MI kernel dump code. Also simplify dump_encrypted_write() somewhat: use dump_write() instead of duplicating its bounds checks, and get rid of the redundant offset tracking. Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D11722	2017-10-18 15:38:05 +00:00

1 2 3 4 5 ...

7738 Commits