freebsd-skq

Author	SHA1	Message	Date
Antoine Brodin	147d12a7d3	vmmdev: return EFAULT when trying to read beyond VM system memory max address Currently, when using dd(1) to take a VM memory image, the capture never ends, reading zeroes when it's beyond VM system memory max address. Return EFAULT when trying to read beyond VM system memory max address. Reviewed by: imp, grehan, anish Approved by: grehan Differential Revision: https://reviews.freebsd.org/D15156	2018-05-15 17:20:58 +00:00
John Baldwin	0b3e6e4c50	Make the common interrupt entry point labels local labels. Kernel debuggers depend on symbol names to find stack frames with a trapframe rather than a normal stack frame. The labels used for the shared interrupt entry point for the PTI and non-PTI cases did not match the existing patterns confusing debuggers. Add the '.L' prefix to mark these symbols as local so they are not visible in the symbol table. Reviewed by: kib MFC after: 1 week Sponsored by: Chelsio Communications	2018-05-14 17:27:53 +00:00
Tycho Nightingale	27275f8a52	Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be excluded from PTI. Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15100	2018-04-27 12:44:20 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Tycho Nightingale	6ac73777ea	Add SDT probes to vmexit on Intel. Submitted by: domagoj.stolfa_gmail.com Reviewed by: grehan, tychon Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14656	2018-04-13 17:23:05 +00:00
Rodney W. Grimes	01d822d33b	Add the ability to control the CPU topology of created VMs from userland without the need to use sysctls, it allows the old sysctls to continue to function, but deprecates them at FreeBSD_version 1200060 (Relnotes for deprecate). The command line of bhyve is maintained in a backwards compatible way. The API of libvmmapi is maintained in a backwards compatible way. The sysctl's are maintained in a backwards compatible way. Added command option looks like: bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n] The optional parts can be specified in any order, but only a single integer invokes the backwards compatible parse. [,maxcpus=n] is hidden by #ifdef until kernel support is added, though the api is put in place. bhyvectl --get-cpu-topology option added. Reviewed by: grehan (maintainer, earlier version), Reviewed by: bcr (manpages) Approved by: bde (mentor), phk (mentor) Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd) MFC after: 1 week Relnotes: Y Differential Revision: https://reviews.freebsd.org/D9930	2018-04-08 19:24:49 +00:00
John Baldwin	fc276d92ae	Add a way to temporarily suspend and resume virtual CPUs. This is used as part of implementing run control in bhyve's debug server. The hypervisor now maintains a set of "debugged" CPUs. Attempting to run a debugged CPU will fail to execute any guest instructions and will instead report a VM_EXITCODE_DEBUG exit to the userland hypervisor. Virtual CPUs are placed into the debugged state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl). Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl). The debug server suspends virtual CPUs when it wishes them to stop executing in the guest (for example, when a debugger attaches to the server). The debug server can choose to resume only a subset of CPUs (for example, when single stepping) or it can choose to resume all CPUs. The debug server must explicitly mark a CPU as resumed via vm_resume_cpu() before the virtual CPU will successfully execute any guest instructions. Reviewed by: avg, grehan Tested on: Intel (jhb), AMD (avg) Differential Revision: https://reviews.freebsd.org/D14466	2018-04-06 22:03:43 +00:00
Roger Pau Monné	9dba82a442	x86: improve reservation of AP trampoline memory So that it doesn't rely on physmap[1] containing an address below 1MiB. Instead scan the full physmap and search for a suitable address to place the trampoline code (below 1MiB) and the initial memory pages (below 4GiB). Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14878	2018-04-05 14:39:51 +00:00
Jeff Roberson	27a3c9d710	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
Jeff Roberson	261c408744	Backout r331606 until I can identify why it does not boot on some machines.	2018-03-27 10:20:50 +00:00
Jeff Roberson	a48de40bcc	Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho (earlier version) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-27 03:37:04 +00:00
Konstantin Belousov	8fbcc3343f	Move the CR0.WP manipulation KPI to x86. This should allow to avoid some #ifdefs in the common x86/ code. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-20 20:20:49 +00:00
Konstantin Belousov	2337dc6430	Provide KPI for handling of rw/ro kernel text. This is a pure syntax patch to create an interface to enable and later restore write access to the kernel text and other read-only mapped regions. It is in line with e.g. vm_fault_disable_pagefaults() by allowing the nesting. Discussed with: Peter Lei <peter.lei@ieee.org> Reviewed by: jtl Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14768	2018-03-20 17:43:50 +00:00
Ian Lepore	c7053bbe54	Revert r330780, it was improperly tested and results in taking a spin mutex before acquiring sleep mutexes. Reported by: kib@	2018-03-11 20:13:15 +00:00
Ian Lepore	86051be993	Eliminate atrtc_time_lock, and use atrtc_lock for efirtc locking.	2018-03-11 19:22:58 +00:00
Tycho Nightingale	490768e24a	Fix a lock recursion introduced in r327065. Reported by: kmacy Reviewed by: grehan, jhb Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14548	2018-03-07 18:03:22 +00:00
Jonathan T. Looney	beb2406556	amd64: Protect the kernel text, data, and BSS by setting the RW/NX bits correctly for the data contained on each memory page. There are several components to this change: * Add a variable to indicate the start of the R/W portion of the initial memory. * Stop detecting NX bit support for each AP. Instead, use the value from the BSP and, if supported, activate the feature on the other APs just before loading the correct page table. (Functionally, we already assume that the BSP and all APs had the same support or lack of support for the NX bit.) * Set the RW and NX bits correctly for the kernel text, data, and BSS (subject to some caveats below). * Ensure DDB can write to memory when necessary (such as to set a breakpoint). * Ensure GDB can write to memory when necessary (such as to set a breakpoint). For this purpose, add new MD functions gdb_begin_write() and gdb_end_write() which the GDB support code can call before and after writing to memory. This change is not comprehensive: * It doesn't do anything to protect modules. * It doesn't do anything for kernel memory allocated after the kernel starts running. * In order to avoid excessive memory inefficiency, it may let multiple types of data share a 2M page, and assigns the most permissions needed for data on that page. Reviewed by: jhb, kib Discussed with: emaste MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14282	2018-03-06 14:28:37 +00:00
John Baldwin	5f8754c077	Add a new variant of the GLA2GPA ioctl for use by the debug server. Unlike the existing GLA2GPA ioctl, GLA2GPA_NOFAULT does not modify the guest. In particular, it does not inject any faults or modify PTEs in the guest when performing an address space translation. This is used by bhyve's debug server to read and write memory for the remote debugger. Reviewed by: grehan MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D14075	2018-02-26 19:19:05 +00:00
Conrad Meyer	849ce31a82	Remove unused error return from API that cannot fail No implementation of fpu_kern_enter() can fail, and it was causing needless error checking boilerplate and confusion. Change the return code to void to match reality. (This trivial change took nine days to land because of the commit hook on sys/dev/random. Please consider removing the hook or otherwise lowering the bar -- secteam never seems to have free time to review patches.) Reported by: Lachlan McIlroy <Lachlan.McIlroy AT isilon.com> Reviewed by: delphij Approved by: secteam (delphij) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14380	2018-02-23 20:15:19 +00:00
John Baldwin	4f8666989a	Add two new ioctls to bhyve for batch register fetch/store operations. These are a convenience for bhyve's debug server to use a single ioctl for 'g' and 'G' rather than a loop of individual get/set ioctl requests. Reviewed by: grehan MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D14074	2018-02-22 00:39:25 +00:00
Konstantin Belousov	13cad9af82	Use local symbol for offset. Small global symbols confuse ddb which matches them against small unrelated displacements and makes the disassembly ugly. Reported by: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-02-16 13:32:46 +00:00
Jung-uk Kim	ea4fe1da62	Change size of padding to reflect reality. No functional change. Discussed with: kib	2018-02-15 20:42:38 +00:00
Warner Losh	982e7bdafc	We don't support gcc < 4.2.1, so varargs.h now is just #error always. Unifdef for versions prior to 4.2.1 and remove now-unused header files. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14323	2018-02-12 14:48:14 +00:00
Konstantin Belousov	319117fd57	IBRS support, AKA Spectre hardware mitigation. It is coded according to the Intel document 336996-001, reading of the patches posted on lkml, and some additional consultations with Intel. For existing processors, you need a microcode update which adds IBRS CPU features, and to manually enable it by setting the tunable/sysctl hw.ibrs_disable to 0. Current status can be checked in sysctl hw.ibrs_active. The mitigation might be inactive if the CPU feature is not patched in, or if CPU reports that IBRS use is not required, by IA32_ARCH_CAP_IBRS_ALL bit. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14029	2018-01-31 14:36:27 +00:00
Konstantin Belousov	c8f9c1f3d9	Use PCID to optimize PTI. Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled. I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. User-mode part of TLB is flushed on the pmap activations as well. Similarly, IPI TLB shootdowns must handle both kernel and user address spaces for each address. Note that machines which implement PCID but do not have INVPCID instructions, cause the usual complications in the IPI handlers, due to the need to switch to the target PCID temporary. This is racy, but because for PCID/no-INVPCID we disable the interrupts in pmap_activate_sw(), IPI handler cannot see inconsistent state of CPU PCID vs PCPU pmap/kcr3/ucr3 pointers. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB. I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporal mapping for the kernel buffer into user mode PCID and use trampoline for copy. Reviewed by: markj (previous version) Tested by: pho Discussed with: alc (some aspects) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D13985	2018-01-27 11:49:37 +00:00
Konstantin Belousov	b4dfc9d7ad	PTI: Trap if we returned to userspace with kernel (full) page table still active. Map userspace portion of VA in the PTI kernel-mode page table as non-executable. This way, if we ever miss reloading ucr3 into %cr3 on the return to usermode, the process traps instead of executing in a potentially vulnerable setup. Catch the condition of such trap and verify user-mode %cr3, which is saved by page fault handler. I picked up this trick from an article about the Linux implementation. Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation MFC after: 12 days Differential revision: https://reviews.freebsd.org/D13956	2018-01-19 22:10:29 +00:00
Nathan Whitehorn	9a8196ce19	Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be evaluated at runtime to determine if the architecture has a direct map; if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or 1, the compiler can remove the conditional logic. As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had similar things but spelled differently. 32-bit MIPS has a partial direct-map that maps poorly to this concept and is unchanged. Reviewed by: kib Suggestions from: marius, alc, kib Runtime tested on: amd64, powerpc64, powerpc, mips64	2018-01-19 17:46:31 +00:00
John Baldwin	68fd3b0ef5	Use a dedicated per-CPU stack for machine check exceptions. Similar to NMIs, machine check exceptions can fire at any time and are not masked by IF. This means that machine checks can fire when the kstack is too deep to hold a trap frame, or at critical sections in trap handlers when a user %gs is used with a kernel %cs. Use the same strategy used for NMIs of using a dedicated per-CPU stack configured in IST 3. Store the CPU's pcpu pointer at the top of the stack so that the machine check handler can reliably find the proper value for %gs (also borrowed from NMIs). This should also fix a similar issue with PTI with a MC# occurring while the CPU is executing on the trampoline stack. While here, bypass trap() entirely and just call mca_intr(). This avoids a bogus call to kdb_reenter() (there's no reason to try to reenter kdb if a MC# is raised). Reviewed by: kib Tested by: avg (on AMD without PTI) Differential Revision: https://reviews.freebsd.org/D13962	2018-01-18 23:50:21 +00:00
Konstantin Belousov	3705dda7e4	Move the kernphys declaration to machine/md_var.h. Apparently machine/cpu.h is supposed to contain MD implementations of MI interfaces. Also, remove kernphys declaration from machdep.c, since it is already provided by md_var.h. Requested and reviewed by: bde MFC after: 13 days	2018-01-18 15:15:35 +00:00
Konstantin Belousov	ac97ccbab5	Fix compilation with gcc. etext is already declared in machine/cpu.h, move kernphys declaration there too. Based on the patch by: bde MFC after: 13 days	2018-01-18 11:21:03 +00:00
Konstantin Belousov	406bc0da95	Fix compilation with gas. Submitted by: bde MFC after: 13 days	2018-01-18 11:19:58 +00:00
John Baldwin	65eefbe422	Save and restore guest debug registers. Currently most of the debug registers are not saved and restored during VM transitions allowing guest and host debug register values to leak into the opposite context. One result is that hardware watchpoints do not work reliably within a guest under VT-x. Due to differences in SVM and VT-x, slightly different approaches are used. For VT-x: - Enable debug register save/restore for VM entry/exit in the VMCS for DR7 and MSR_DEBUGCTL. - Explicitly save DR0-3,6 of the guest. - Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from %rflags for the host. Note that because DR6 is "software" managed and not stored in the VMCS a kernel debugger which single steps through VM entry could corrupt the guest DR6 (since a single step trap taken after loading the guest DR6 could alter the DR6 register). To avoid this, explicitly disable single-stepping via the trace flag before loading the guest DR6. A determined debugger could still defeat this by setting a breakpoint after the guest DR6 was loaded and then single-stepping. For SVM: - Enable debug register caching in the VMCB for DR6/DR7. - Explicitly save DR0-3 of the guest. - Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM saves the guest DR6 in the VMCB, the race with single-stepping described for VT-x does not exist. For both platforms, expose all of the guest DRx values via --get-drX and --set-drX flags to bhyvectl. Discussed with: avg, grehan Tested by: avg (SVM), myself (VT-x) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D13229	2018-01-17 23:11:25 +00:00
Konstantin Belousov	bd50262f70	PTI for amd64. The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability. PTI is turned off by default for now, enable with the loader tunable vm.pmap.pti=1. The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped: kernel text (but not modules text) PCPU GDT/IDT/user LDT/task structures IST stacks for NMI and doublefault handlers. Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables, then the hardware trap frame is copied to the normal kstack, and execution continues. IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed. On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occurred on the kstack, see the comment in doret_iret detection code in trap(). Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity. PDPE pages for the kernel half of the user page tables are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386. The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit. MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline has not switched from PTI stack is fatal. The fix is pending. Reviewed by: markj (partially) Tested by: pho (previous version) Discussed with: jeff, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-01-17 11:44:21 +00:00
Konstantin Belousov	94b011c4bc	Amd64 user_ldt_deref() is not used outside sys_machdep.c. Mark it as static. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-17 11:21:03 +00:00
Konstantin Belousov	5f7b9ff2e3	Add STAC and CLAC instructions wrappers. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13838	2018-01-14 12:39:50 +00:00
Konstantin Belousov	0f7c159f6b	Move the hardware setup for fast syscalls into a common function. Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-01-11 12:40:43 +00:00
Konstantin Belousov	30d4f9e888	Add atomic_load(9) and atomic_store(9) operations. They provide relaxed-ordered atomic access semantic. Due to the FreeBSD memory model, the operations are syntaxical wrappers around the volatile accesses. The volatile qualifier is used to ensure that the access not optimized out and in turn depends on the volatile semantic as implemented by supported compilers. The motivation for adding the operation is to help people coming from other systems or knowing the C11/C++ standards where atomics have special type and require use of the special access operations. It is still the case that FreeBSD requires plain load and stores of aligned integer types to be atomic. Suggested by: jhb Reviewed by: alc, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D13534	2017-12-19 09:59:20 +00:00
Pedro F. Giffuni	c49761dd57	sys/amd64: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.	2017-11-27 15:03:07 +00:00
Konstantin Belousov	383f241dce	Remove lint support from system headers and MD x86 headers. Reviewed by: dim, jhb Discussed with: imp Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D13156	2017-11-23 11:40:16 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Pedro F. Giffuni	df57947f08	spdx: initial adoption of licensing ID tags. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Initially, only tag files that use BSD 4-Clause "Original" license. RelNotes: yes Differential Revision: https://reviews.freebsd.org/D13133	2017-11-18 14:26:50 +00:00
Mateusz Guzik	9e68989764	Make the sleepq chain hash size configurable per-arch and bump on amd64. While here cache-align chains. This shortens longest found chain during poudriere -j 80 from 32 to 16. Pushing this higher up will probably require allocation on boot.	2017-10-22 20:43:50 +00:00
Konstantin Belousov	ca1f624517	Fix the pv_chunks pc_lru tailq handling in reclaim_pv_chunk(). For processing, reclaim_pv_chunk() removes the pv_chunk from the lru list, which makes pc_lru linkage invalid. Then the pmap lock is released, which allows for other thread to free the last pv entry allocated from the chunk and call free_pv_chunk(), which tries to modify the invalid linkage. Similarly, the chunk is inserted into the private tailq new_tail temporary. Again, free_pv_chunk() might be run and corrupt the linkage for the new_tail after the pmap lock is dropped. This is a consequence of r299788 elimination of pvh_global_lock, which allowed for reclaim to run in parallel with other pmap calls which free pv chunks. As a fix, do not remove the chunk from pc_lru queue, use a marker to remember the position in the queue iteration. We can safely operate on the chunks after the chunk's pmap is locked, we fetched the chunk after the marker, and we checked that chunk pmap is same as we have locked, because chunk removal from pc_lru requires both pv_chunk_mutex and the pmap mutex owned. Note that the fix lost an optimization which was present in the previous algorithm. Namely, new_tail requeueing rotated the pv chunks list so that reclaim didn't scan the same pv chunks that couldn't be freed (because they contained a wired and/or superpage mapping) on every invocation. An additional change is planned which would improve this. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-16 15:16:24 +00:00
Konstantin Belousov	a1fc6a8c49	On amd64, mark the set_user_ldt() function as static. On i386, the function is used from the context switch code and needs to be accessible externally. Amd64 MD context switch does not lock an LDT spinlock and inlines switching in assembly. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-10-05 11:50:01 +00:00
Konstantin Belousov	9674d76346	Hide kernel stuff from userspace. Sponsored by: Mellanox Technologies	2017-10-02 08:37:43 +00:00
Andrew Turner	0e73a61997	To prepare for adding EFI runtime services support on arm64 move the machine independent parts of the existing code to a new file that can be shared between amd64 and arm64. Reviewed by: kib (previous version), imp Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D12434	2017-10-01 19:52:47 +00:00
Conrad Meyer	2744a0b69b	Drop CACHE_LINE_SIZE to 64 bytes on x86 The actual cache line size has always been 64 bytes. The 128 number arose as an optimization for Core 2 era Intel processors. By default (configurable in BIOS), these CPUs would prefetch adjacent cache lines unintelligently. Newer CPUs prefetch more intelligently. The latest Core 2 era CPU was introduced in September 2008 (Xeon 7400 series, "Dunnington"). If you are still using one of these CPUs, especially in a multi-socket configuration, consider locating the "adjacent cache line prefetch" option in BIOS and disabling it. Reported by: mjg Reviewed by: np Discussed with: jhb Sponsored by: Dell EMC Isilon	2017-08-28 22:28:41 +00:00
Konstantin Belousov	3e902b3d76	Make WRFSBASE and WRGSBASE instructions functional. Right now, we enable the CR4.FSGSBASE bit on CPUs which support the facility (Ivy and later), to allow usermode to read fs and gs bases without syscalls. This bit also controls the write access to bases from userspace, but WRFSBASE and WRGSBASE instructions currently cannot be used, because return path from both exceptions or interrupts overrides bases with the values from pcb. Supporting the instructions is useful because this means that usermode can implement green-threads completely in userspace without issuing syscalls to change all of the machine context. Support is implemented by saving the fs base and user gs base when PCB_FULL_IRET flag is set. The flag is set on the context switch, which potentially causes clobber of the bases due to activation of another context, and when explicit modification of the user context by a syscall or exception handler is performed. In particular, the patch moves setting of the flag before syscalls change context. The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on entry from userspace can be considered a bug fixes on its own. Reviewed by: jhb (previous version) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks Differential revision: https://reviews.freebsd.org/D12023	2017-08-21 17:38:02 +00:00
Conrad Meyer	dc6a82801d	x86: Add dynamic interrupt rebalancing Add an option to dynamically rebalance interrupts across cores (hw.intrbalance); off by default. The goal is to minimize preemption. By placing interrupt sources on distinct CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall preemption is reduced and latency is reduced. In our workflow it reduced "fighting" between two high-frequency interrupt sources. Reduced latency was proven by, e.g., SPEC2008. Submitted by: jeff@ (earlier version) Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10435	2017-08-16 18:48:53 +00:00
Ruslan Bukin	2164af29a0	Add support for Intel Software Guard Extensions (Intel SGX). Intel SGX allows to manage isolated compartments "Enclaves" in user VA space. Enclaves memory is part of processor reserved memory (PRM) and always encrypted. This allows to protect user application code and data from upper privilege levels including OS kernel. This includes SGX driver and optional linux ioctl compatibility layer. Intel SGX SDK for FreeBSD is also available. Note this requires support from hardware (available since late Intel Skylake CPUs). Many thanks to Robert Watson for support and Konstantin Belousov for code review. Project wiki: https://wiki.freebsd.org/Intel_SGX. Reviewed by: kib Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11113	2017-08-16 10:38:06 +00:00

1942 Commits