freebsd-dev

Author	SHA1	Message	Date
John Baldwin	8d44440ad9	If the boot-time memory test is enabled, output a dot ('.') for each GB of RAM tested so people watching the console can see that the machine is making progress and not hung. PR: 196650 Submitted by: Ravi Pokala <rpokala@panasas.com> Suggestions from: Eric van Gyzen <eric@vangyzen.net> MFC after: 2 weeks	2015-01-25 20:16:45 +00:00
Dag-Erling Smørgrav	d4ff1726f9	Remove ISA NICs. Anyone still using these on amd64 can build their own kernel.	2015-01-25 12:02:38 +00:00
Neel Natu	e09ff17129	Add macro to identify AVIC capability (advanced virtual interrupt controller) in AMD processors. Submitted by: Dmitry Luhtionov (dmitryluhtionov@gmail.com)	2015-01-24 00:35:49 +00:00
Neel Natu	7534635359	MOVS instruction emulation. These instructions are emitted by 'bus_space_read_region()' when accessing MMIO regions. Since MOVS can be used with a repeat prefix start decoding the REPZ and REPNZ prefixes. Also start decoding the segment override prefix since MOVS allows overriding the source operand segment register. Tested by: tychon MFC after: 1 week	2015-01-19 06:53:31 +00:00
Neel Natu	d087a39935	Simplify instruction restart logic in bhyve. Keep track of the next instruction to be executed by the vcpu as 'nextrip'. As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should start execution. Also, instruction restart happens implicitly via 'vm_inject_exception()' or explicitly via 'vm_restart_instruction()'. The APIs behave identically in both kernel and userspace contexts. The main beneficiary is the instruction emulation code that executes in both contexts. bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length' as readonly: - Restarting an instruction is now done by calling 'vm_restart_instruction()' as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout()) - Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch()) Differential Revision: https://reviews.freebsd.org/D1526 Reviewed by: grehan MFC after: 2 weeks	2015-01-18 03:08:30 +00:00
Navdeep Parhar	ca7fe84a61	Plug cxgbe(4) back into !powerpc && !arm builds, instead of building it on amd64 only.	2015-01-16 01:39:24 +00:00
Roger Pau Monné	ca49b3342d	loader: implement multiboot support for Xen Dom0 Implement a subset of the multiboot specification in order to boot Xen and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot implementation is tailored to boot Xen and FreeBSD Dom0, and it will most surely fail to boot any other multiboot compliant kernel. In order to detect and boot the Xen microkernel, two new file formats are added to the bootloader, multiboot and multiboot_obj. Multiboot support must be tested before regular ELF support, since Xen is a multiboot kernel that also uses ELF. After a multiboot kernel is detected, all the other loaded kernels/modules are parsed by the multiboot_obj format. The layout of the loaded objects in memory is the following: first the Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to long mode by itself), after that the FreeBSD kernel is loaded as a RAW file (Xen will parse and load it using its internal ELF loader), and finally the metadata and the modules are loaded using the native FreeBSD way. After everything is loaded we jump into Xen's entry point using a small trampoline. The order of the multiboot modules passed to Xen is the following: the first module is the RAW FreeBSD kernel, and the second module is the metadata and the FreeBSD modules. Since Xen will relocate the memory position of the second multiboot module (the one that contains the metadata and native FreeBSD modules), we need to stash the original modulep address inside of the metadata itself in order to recalculate its position once booted. This also means the metadata must come before the loaded modules, so after loading the FreeBSD kernel a portion of memory is reserved in order to place the metadata before booting. 
In order to tell the loader to boot Xen and then the FreeBSD kernel the following has to be added to the /boot/loader.conf file: xen_cmdline="dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga" xen_kernel="/boot/xen" The first argument contains the command line that will be passed to the Xen kernel, while the second argument is the path to the Xen kernel itself. This can also be done manually from the loader command line, by for example typing the following set of commands: OK unload OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga OK load kernel OK load zfs OK load if_tap OK load ... OK boot Sponsored by: Citrix Systems R&D Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D517 For the Forth bits: Submitted by: Julien Grall <julien.grall AT citrix.com>	2015-01-15 16:27:20 +00:00
Warner Losh	af8cf71035	New MINIMAL kernel config. The goal with this configuration is to only compile in those options in GENERIC that cannot be loaded as modules. ufs is still included because many of its options aren't present in the kernel module. There's some other exceptions documented in the file. This is part of some work to get more things automatically loading in the hopes of obsoleting GENERIC one day.	2015-01-15 00:42:06 +00:00
Neel Natu	07820b4b4c	Fix typo (missing comma). MFC after: 3 days	2015-01-14 07:18:51 +00:00
Neel Natu	c9c75df48c	'struct vm_exception' was intended to be used only as the collateral for the VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping track of pending exceptions for a vcpu. This in turn causes confusion because some fields in 'struct vm_exception' like 'vcpuid' make sense only in the ioctl context. It also makes it harder to add or remove structure fields. Fix this by using 'struct vm_exception' only to communicate information from userspace to vmm.ko when injecting an exception. Also, add a field 'restart_instruction' to 'struct vm_exception'. This field is set to '1' for exceptions where the faulting instruction is restarted after the exception is handled. MFC after: 1 week	2015-01-13 22:00:47 +00:00
Konstantin Belousov	18cc2ff047	Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem does not access kernel mappings directly. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 08:58:07 +00:00
Konstantin Belousov	b5243bd4f7	Revert r276600: PHYS_TO_DMAP_RAW() and DMAP_TO_PHYS_RAW() are no longer used, remove them. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 07:50:55 +00:00
Konstantin Belousov	22bb3201ac	Fix several issues with /dev/mem and /dev/kmem devices on amd64. For /dev/mem, when the requested physical address is not accessible by the direct map, do a temporary remapping with the caching attribute 'uncached'. Limit the accessible addresses by MAXPHYADDR, since the architecture disallows writing non-zero into reserved bits of ptes (or setting garbage into NX). For /dev/kmem, only access existing kernel mappings for the direct map region. For all other addresses, obtain a physical address of the mapping and fall back to the /dev/mem mechanism. This ensures that /dev/kmem i/o does not fault even if the accessed region is changed in parallel, by using either the direct map or a temporary mapping. For both devices, operate on one page per iteration. Do not return an error if any bytes were moved; return the (partial) byte count to userspace instead. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 07:48:22 +00:00
Konstantin Belousov	b1752aa0ea	For x86, read MAXPHYADDR, defined in SDM vol 3 4.1.4 Enumeration of Paging Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-12 07:36:25 +00:00
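The field extraction described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper names `maxphyaddr_from_eax` and `phys_addr_in_range` are hypothetical, and the EAX value is passed in as a plain integer rather than read with the CPUID instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract MAXPHYADDR (the physical address width in
 * bits) from the EAX value returned by CPUID leaf 0x80000008, bits 7:0.
 */
static inline unsigned
maxphyaddr_from_eax(uint32_t eax)
{
	return (eax & 0xff);
}

/*
 * Hypothetical helper: report whether a physical address fits within
 * 'maxphyaddr' bits. Widths of 64 or more accept every 64-bit address.
 */
static inline bool
phys_addr_in_range(uint64_t pa, unsigned maxphyaddr)
{
	if (maxphyaddr >= 64)
		return (true);
	return (pa < (UINT64_C(1) << maxphyaddr));
}
```

For example, an EAX value of 0x3027 yields a width of 39 bits, so addresses at or above 1 << 39 would be rejected by the range check.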
Mark Johnston	bdb9ab0dd9	Factor out duplicated code from dumpsys() on each architecture into generic code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly identical and simply redefine a number of constants and helper subroutines; a generic implementation will make it easier to implement features around kernel core dumps. This change does not alter any minidump code and should have no functional impact. PR: 193873 Differential Revision: https://reviews.freebsd.org/D904 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Reviewed by: jhibbits (earlier version) Sponsored by: EMC / Isilon Storage Division	2015-01-07 01:01:39 +00:00
Neel Natu	2ce1242309	Clear blocking due to STI or MOV SS in the hypervisor when an instruction is emulated or when the vcpu incurs an exception. This matches the CPU behavior. Remove special case code in HLT processing that was clearing the interrupt shadow. This is now redundant because the interrupt shadow is always cleared when the vcpu is resumed after an instruction is emulated. Reported by: David Reed (david.reed@tidalscale.com) MFC after: 2 weeks	2015-01-06 19:04:02 +00:00
John Baldwin	3e32dff52c	Remove "New" label from NFSCL/NFSD now that they are the only NFS client/server. While here, remove duplicate NFSCL from sys/conf/NOTES. Approved by: rmacklem	2015-01-06 16:15:57 +00:00
John Baldwin	92597e064b	On some Intel CPUs with a P-state but not C-state invariant TSC the TSC may also halt in C2 and not just C3 (it seems that in some cases the BIOS advertises its C3 state as a C2 state in _CST). Just play it safe and disable both C2 and C3 states if a user forces the use of the TSC as the timecounter on such CPUs. PR: 192316 Differential Revision: https://reviews.freebsd.org/D1441 No objection from: jkim MFC after: 1 week	2015-01-05 20:44:44 +00:00
Konstantin Belousov	a40e51e355	For /dev/mem and /dev/kmem accesses, avoid asserting that addresses are within the direct map. We want to return an error instead of panicking. PR: 194995 Sponsored by: The FreeBSD Foundation	2015-01-03 01:28:58 +00:00
Scott Long	a614ff4d01	Fix a missed comment from r276526.	2015-01-02 15:46:54 +00:00
Konstantin Belousov	91a82f9585	Callers of pmap_kextract() cannot distinguish between failure and physical address zero. Assume that the lowest page is always mapped by direct map. This restores access to the page at zero through /dev/mem after r263475. Reported and tested by: neel Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-02 01:05:08 +00:00
Konstantin Belousov	ae7c85e9b0	Actually remove GIANT_REQUIRED, declared but not done in r263475. Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-02 01:00:38 +00:00
Dmitry Chagin	7b15ee61fc	Regen after r276508, r276509.	2015-01-01 18:43:31 +00:00
Dmitry Chagin	0161038329	Correct an argument status of wait4 syscall for Linuxulator. MFC after: 1 week	2015-01-01 18:37:03 +00:00
Navdeep Parhar	183dc9860a	Temporarily unplug cxgbe(4) from !amd64 builds.	2014-12-31 20:34:12 +00:00
Alan Cox	d866a563d4	The physical memory allocator supports the use of distinct free lists for managing pages from different address ranges. Generally speaking, this feature is used to increase the likelihood that physical pages are available that can meet special DMA requirements or can be accessed through a limited-coverage direct mapping (e.g., MIPS). However, prior to this change, the configuration of the free lists was static, i.e., it was determined at compile time. Consequently, free lists could be created for address ranges that held no actual pages, for example, on 32-bit MIPS-based systems with 512 MB or less of physical memory. This change makes the creation of the free lists dynamic, i.e., it is based on the available physical memory at boot time. On 64-bit x86-based systems with 64 GB or more of physical memory, create free lists for managing pages with physical addresses below 4 GB. This change is to address reported problems with initializing devices that require the allocation of physical pages below 4 GB on some systems with 128 GB or more of physical memory. PR: 185727 Differential Revision: https://reviews.freebsd.org/D1274 Reviewed by: jhb, kib MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-12-31 00:54:38 +00:00
Neel Natu	cd86d3634b	Initialize all fields of 'struct vm_exception exception' before passing it to vm_inject_exception(). This fixes the issue that 'exception.cpuid' is uninitialized when calling 'vm_inject_exception()'. However, in practice this change is a no-op because vm_inject_exception() does not use 'exception.cpuid' for anything. Reported by: Coverity Scan CID: 1261297 MFC after: 3 days	2014-12-30 23:38:31 +00:00
Neel Natu	0dafa5cd4b	Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko. The new RTC emulation supports all interrupt modes: periodic, update ended and alarm. It is also capable of maintaining the date/time and NVRAM contents across virtual machine reset. Also, the date/time fields can now be modified by the guest. Since bhyve now emulates both the PIT and the RTC there is no need for "Legacy Replacement Routing" in the HPET so get rid of it. The RTC device state can be inspected via bhyvectl as follows: bhyvectl --vm=vm --get-rtc-time bhyvectl --vm=vm --set-rtc-time=<unix_time_secs> bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value> Reviewed by: tychon Discussed with: grehan Differential Revision: https://reviews.freebsd.org/D1385 MFC after: 2 weeks	2014-12-30 22:19:34 +00:00
Neel Natu	95474bc26a	Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on an AMD/SVM host. MFC after: 1 week	2014-12-30 02:44:33 +00:00
Neel Natu	1a5934ef8e	Implement "special mask mode" in vatpic. OpenBSD guests always enable "special mask mode" during boot. As a result of r275952 this is flagged as an error and the guest cannot boot. Reviewed by: grehan Differential Revision: https://reviews.freebsd.org/D1384 MFC after: 1 week	2014-12-28 00:53:52 +00:00
Konstantin Belousov	4cc6942f37	Change the way the lcall $7,$0 is reflected to usermode. Instead of setting up a call gate, which must be 64 bit, put a code segment descriptor into ldt slot 0. This way, the syscall shim does not temporarily switch to a 64bit trampoline, and does not create a window where signal delivery interrupts 64 bit mode (signal handler cannot return). The cost is the shim running with a non-zero based segment in %cs, which requires the vfork() handling to make more assumptions. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-27 23:19:08 +00:00
Poul-Henning Kamp	91b050b27b	Use compiled in default keymaps which are available both in syscons and vt.	2014-12-25 17:50:04 +00:00
Mark Johnston	cafe874475	Restore the trap type argument to the DTrace trap hook, removed in r268600. It's redundant at the moment since it can be obtained from the trapframe on the architectures where DTrace is supported, but this won't be the case with ARM.	2014-12-23 15:38:19 +00:00
Neel Natu	b053814333	Allow ktr(4) tracing of all guest exceptions via the tunable "hw.vmm.trace_guest_exceptions". To enable this feature set the tunable to "1" before loading vmm.ko. Tracing the guest exceptions can be useful when debugging guest triple faults. Note that there is a performance impact when exception tracing is enabled since every exception will now trigger a VM-exit. Also, handle machine check exceptions that happen during guest execution by vectoring to the host's machine check handler via "int $18". Discussed with: grehan MFC after: 2 weeks	2014-12-23 02:14:49 +00:00
Neel Natu	d66bcddce4	Emulate writes to the IA32_MISC_ENABLE MSR. PR: 196093 Reported by: db Tested by: db Discussed with: grehan MFC after: 1 week	2014-12-20 19:47:51 +00:00
Neel Natu	ac721e53ec	Various 8259 device model improvements: - implement 8259 "polled" mode. - set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization. - report error if guest tries to enable the "special mask" mode. Differential Revision: https://reviews.freebsd.org/D1328 Reviewed by: tychon Reported by: grehan Tested by: grehan MFC after: 1 week	2014-12-20 04:57:45 +00:00
Neel Natu	e64c5af3f8	Fix 8259 IRQ priority resolver. Initialize the 8259 such that IRQ7 is the lowest priority. Reviewed by: tychon Differential Revision: https://reviews.freebsd.org/D1322 MFC after: 1 week	2014-12-17 03:04:43 +00:00
Konstantin Belousov	2d45c2d52d	The iret instruction may generate #np and #ss faults, besides #gp. When returning to usermode, the handlers for those exceptions are also executed with the wrong gs base. Handle all three possible faults in the same way, checking for iret fault, and performing full iret. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2014-12-16 18:28:33 +00:00
Neel Natu	09eced2549	For level triggered interrupts clear the PIC IRR bit when the interrupt pin is deasserted. Prior to this change each assertion on a level triggered irq pin resulted in two interrupts being delivered to the CPU. Differential Revision: https://reviews.freebsd.org/D1310 Reviewed by: tychon MFC after: 1 week	2014-12-16 06:33:57 +00:00
George V. Neville-Neil	bd19924f6b	This configuration file removes several debugging options, including WITNESS and INVARIANTS checking, which are known to have significant performance impact on running systems. When benchmarking new features this kernel should be used instead of the standard GENERIC. This kernel configuration should never appear outside of the HEAD of the FreeBSD tree.	2014-12-02 19:55:43 +00:00
Ed Maste	294246bb7d	Revert r274772: it is not valid on MIPS Reported by: sbruno	2014-11-25 03:50:31 +00:00
Peter Grehan	526c8885fd	Change the lower bound for guest vmspace allocation to 0 instead of using the VM_MIN_ADDRESS constant. HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in bhyve VM startup failing. Guest memory is always assumed to start at 0 so use the absolute value instead. Reported by: Shawn Webb, lattera at gmail com Reviewed by: neel, grehan Obtained from: Oliver Pinter via HardenedBSD `23bd719ce1` MFC after: 1 week	2014-11-23 23:07:21 +00:00
John Baldwin	180e57e5c7	Improve support for XSAVE with debuggers. - Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed to match what Linux does in that 1) it dumps the entire XSAVE area including the fxsave state, and 2) it stashes a copy of the current xsave mask in the unused padding between the fxsave state and the xstate header at the same location used by Linux. - Teach readelf() to recognize NT_X86_XSTATE notes. - Change PT_GET/SETXSTATE to take the entire XSAVE state instead of only the extra portion. This avoids having to always make two ptrace() calls to get or set the full XSAVE state. - Add a PT_GET_XSTATE_INFO which returns the length of the current XSTATE save area (so the size of the buffer needed for PT_GETXSTATE) and the current XSAVE mask (%xcr0). Differential Revision: https://reviews.freebsd.org/D1193 Reviewed by: kib MFC after: 2 weeks	2014-11-21 20:53:17 +00:00
Ed Maste	688fd61ae8	Use canonical __PIC__ flag It is automatically set when -fPIC is passed to the compiler. Reviewed by: dim, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D1179	2014-11-21 02:05:48 +00:00
Alan Cox	271f0f1219	Enable the use of VM_PHYSSEG_SPARSE on amd64 and i386, making it the default on i386 PAE. Previously, VM_PHYSSEG_SPARSE could not be used on amd64 and i386 because vm_page_startup() would not create vm_page structures for the kernel page table pages allocated during pmap_bootstrap() but those vm_page structures are needed when the kernel attempts to promote the corresponding kernel virtual addresses to superpage mappings. To address this problem, a new public function, vm_phys_add_seg(), is introduced and vm_phys_init() is updated to reflect the creation of vm_phys_seg structures by calls to vm_phys_add_seg(). Discussed with: Svatopluk Kraus MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-11-15 23:40:44 +00:00
Konstantin Belousov	eb81bf559c	Fix END()s for fueword and fueword64, match the name in END() with entry. Submitted by: Jeroen Hofstee <jeroen@myspectrum.nl> MFC after: 1 week	2014-11-15 21:25:17 +00:00
Scott Long	18e3d9f521	Extend earlier addition of stack frames to most of support.S. This makes stack traces in KDB, HWPMC, and DTrace much more reliable and useful. Reviewed by: kan, kib Obtained from: Netflix, Inc. MFC after: 5 days	2014-11-13 22:11:44 +00:00
Ed Maste	96699e86a3	Add workaround for vt efifb's early use of PHYS_TO_DMAP In vt_efifb_init the framebuffer's physaddr is passed to PHYS_TO_DMAP before the DMAP is setup. The result is not actually accessed until after the mapping is setup, though. Loosen the assertion in PHYS_TO_DMAP for now, to allow use when dmaplimit == 0. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D1142	2014-11-11 14:59:46 +00:00
Alexander V. Chernikov	603eaf792b	Remove faith(4) and faithd(8) from base. It looks like the industry has chosen a different (and more traditional) stateless/stateful NAT64 as the translation mechanism. The last non-trivial commits to both faith(4) and faithd(8) happened more than 12 years ago, so I assume it is time to drop RFC3142 in FreeBSD. No objections from: net@	2014-11-09 21:33:01 +00:00
Gleb Smirnoff	2c59cd89c8	Remove unused includes. Reviewed by: kib	2014-11-09 19:58:30 +00:00
Konstantin Belousov	2818ac81d4	MFi386 r253328: Create a proper stack frame for amd64 version of bcopy(). Note that this also makes the stack properly aligned in the function, despite it is not strictly needed. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-08 11:56:26 +00:00
George V. Neville-Neil	169760a49b	Add support for netmap in GENERIC by default.	2014-11-05 06:22:37 +00:00
Bryan Venteicher	217eb1256d	Add VirtIO console to the x86 NOTES and files Requested by: jhb	2014-11-03 22:37:10 +00:00
John Baldwin	824fc46089	MFamd64: Add support for extended FPU states on i386. This includes support for AVX on i386. - Similar to amd64, move the FPU save area out of the PCB and instead store saved FPU state in a variable-sized buffer after the PCB on the stack. - To support the variable PCB location, alter the locore code to only use the bottom-most page of proc0stack for init386(). init386() returns the correct stack pointer to locore which adjusts the stack for thread0 before calling mi_startup(). - Don't bother setting cr3 in thread0's pcb in locore before calling init386(). It wasn't used (init386() overwrote it at the end) and it doesn't work with the variable-sized FPU save area. - Remove the new-bus attachment from npx. This was only ever useful for external co-processors using IRQ13, but those have not been supported for several years. npxinit() is now called much earlier during boot (init386()) similar to amd64. - Implement PT_{GET,SET}XSTATE and I386_GET_XFPUSTATE. - npxsave() is now only called from context switch contexts so it can use XSAVEOPT. Differential Revision: https://reviews.freebsd.org/D1058 Reviewed by: kib Tested on: FreeBSD/i386 VM under bhyve on Intel i5-2520	2014-11-02 22:58:30 +00:00
John Baldwin	01e1933dcc	Rework virtual machine hypervisor detection. - Move the existing code to x86/x86/identcpu.c since it is x86-specific. - If the CPUID2_HV flag is set, assume a hypervisor is present and query the 0x40000000 leaf to determine the hypervisor vendor ID. Export the vendor ID and the highest supported hypervisor CPUID leaf via hv_vendor[] and hv_high variables, respectively. The hv_vendor[] array is also exported via the hw.hv_vendor sysctl. - Merge the VMWare detection code from tsc.c into the new probe in identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in the TSC code to identify VMWare. Differential Revision: https://reviews.freebsd.org/D1010 Reviewed by: delphij, jkim, neel	2014-10-28 19:17:44 +00:00
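The probe described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in identcpu.c: the names `SK_CPUID2_HV`, `hv_present` and `hv_vendor_from_regs` are hypothetical, the register values are passed in instead of being read with CPUID, and the byte order assumes a little-endian host such as x86.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypervisor-present flag: bit 31 of ECX from CPUID leaf 1. */
#define SK_CPUID2_HV	0x80000000u

/* Hypothetical helper: is the hypervisor-present bit set in leaf 1 ECX? */
static inline bool
hv_present(uint32_t leaf1_ecx)
{
	return ((leaf1_ecx & SK_CPUID2_HV) != 0);
}

/*
 * Hypothetical helper: assemble the 12-byte vendor ID that leaf 0x40000000
 * returns in EBX, ECX and EDX (in that order) into a NUL-terminated string.
 */
static void
hv_vendor_from_regs(uint32_t ebx, uint32_t ecx, uint32_t edx, char vendor[13])
{
	memcpy(vendor + 0, &ebx, sizeof(ebx));
	memcpy(vendor + 4, &ecx, sizeof(ecx));
	memcpy(vendor + 8, &edx, sizeof(edx));
	vendor[12] = '\0';
}
```

In this scheme the EAX value of the same leaf carries the highest supported hypervisor leaf, which corresponds to the hv_high variable mentioned in the commit.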
Konstantin Belousov	0a2c94b86e	Replace some calls to fuword() by fueword() with proper error checking. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 3 weeks	2014-10-28 15:28:20 +00:00
Konstantin Belousov	4f3dc90023	Add fueword(9) and casueword(9) functions. They are like fuword(9) and casuword(9), but do not mix value read and indication of fault. I know (or remember) enough assembly to handle x86 and powerpc. For arm, mips and sparc64, implement fueword() and casueword() as wrappers around fuword() and casuword(), which means that the functions cannot distinguish between -1 and fault. On architectures where fueword() and casueword() are native, implement fuword() and casuword() using fueword() and casueword(), to reduce assembly code duplication. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks (ia64 needs treating)	2014-10-28 15:22:13 +00:00
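The difference between the two calling conventions can be shown with a userspace stand-in. This is only a sketch: `fetch_mixed` and `fetch_split` are hypothetical names, and a NULL pointer models the fault that the real functions detect when reading user memory.

```c
#include <stddef.h>

/*
 * fuword()-style stand-in: the value and the fault indication share the
 * return value, so a stored -1 looks exactly like a fault.
 */
static long
fetch_mixed(const long *p)
{
	if (p == NULL)
		return (-1);	/* "fault" */
	return (*p);
}

/*
 * fueword()-style stand-in: the return value is only a status (0 or -1),
 * and the value read is delivered through 'val'.
 */
static int
fetch_split(const long *p, long *val)
{
	if (p == NULL)
		return (-1);	/* "fault"; *val is left untouched */
	*val = *p;
	return (0);
}
```

With a stored value of -1, `fetch_mixed` returns the same result as it does for a fault, while `fetch_split` reports success and hands back -1 separately.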
Marcelo Araujo	d018cd6f5e	Reported by: Coverity CID: 1249760 Reviewed by: neel Approved by: neel Sponsored by: QNAP Systems Inc.	2014-10-28 07:19:02 +00:00
Peter Grehan	f1be09bd95	Remove bhyve SVM feature printf's now that they are available in the general CPU feature detection code. Reviewed by: neel	2014-10-27 22:20:51 +00:00
Neel Natu	f0c8263e55	Change the type of the first argument to the I/O emulation handlers to 'struct vm *'. Previously it used to be a 'void *' but there is no reason to hide the actual type from the handler. Discussed with: tychon MFC after: 1 week	2014-10-26 19:03:06 +00:00
Alan Cox	d6e53ebe5e	By the time that pmap_init() runs, vm_phys_segs[] has been initialized. Obtaining the end of memory address from vm_phys_segs[] is a little easier than obtaining it from phys_avail[]. Discussed with: Svatopluk Kraus	2014-10-26 17:56:47 +00:00
Neel Natu	160ef77abf	Move the ACPI PM timer emulation into vmm.ko. This reduces variability during timer calibration by keeping the emulation "close" to the guest. Additionally having all timer emulations in the kernel will ease the transition to a per-VM clock source (as opposed to using the host's uptime to keep track of time). Discussed with: grehan	2014-10-26 04:44:28 +00:00
Neel Natu	31b117bec9	Don't pass the 'error' return from an I/O port handler directly to vm_run(). Most I/O port handlers return -1 to signal an error. If this value is returned without modification to vm_run() then it leads to incorrect behavior because '-1' is interpreted as ERESTART at the system call level. Fix this by always returning EIO to signal an error from an I/O port handler. MFC after: 1 week	2014-10-26 03:03:41 +00:00
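The error translation described in the commit above can be sketched as follows. This is an illustration, not bhyve's code: the handler type, the example handlers and `run_ioport_handler` are hypothetical names, and `SK_ERESTART` is a local stand-in for the kernel-internal ERESTART value of -1 mentioned in the message.

```c
#include <errno.h>

/* Stand-in for the kernel-internal ERESTART value described in the commit. */
#define SK_ERESTART	(-1)

/* Hypothetical I/O port handler signature: returns 0 on success. */
typedef int (*ioport_handler_t)(int port, int in, unsigned *val);

/* Example handler that fails the way most handlers do: by returning -1. */
static int
failing_handler(int port, int in, unsigned *val)
{
	(void)port; (void)in; (void)val;
	return (-1);
}

/* Example handler that succeeds and supplies a value on reads. */
static int
ok_handler(int port, int in, unsigned *val)
{
	(void)port;
	if (in)
		*val = 0xff;
	return (0);
}

/*
 * Call the handler and never pass its raw error upward: any non-zero
 * result becomes EIO, so -1 cannot be mistaken for SK_ERESTART.
 */
static int
run_ioport_handler(ioport_handler_t h, int port, int in, unsigned *val)
{
	int error = h(port, in, val);

	return (error != 0 ? EIO : 0);
}
```

The point of the design is that the caller only ever sees 0 or EIO, so the meaning of a handler's failure no longer depends on which small negative number it happened to pick.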
John Baldwin	7d313e7bdb	Add COMPAT_FREEBSD9 and COMPAT_FREEBSD10 options to wrap code that provides compatibility for FreeBSD 9.x and 10.x binaries. Enable these options in kernel configs that enable other COMPAT_FREEBSD<n> options.	2014-10-24 19:58:24 +00:00
Roger Pau Monné	927dc0e02a	amd64: make uiomove_fromphys functional for pages not mapped by the DMAP Place the code introduced in r268660 into a separate function that can be called from uiomove_fromphys. Instead of pre-allocating two KVA pages use vmem_alloc to allocate them on demand when needed. This prevents blocking if a page fault is taken while physical addresses from outside the DMAP are used, since the lock is now removed. Also introduce a safety catch in PHYS_TO_DMAP and DMAP_TO_PHYS. Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D947 amd64/amd64/pmap.c: - Factor out the code to deal with non DMAP addresses from pmap_copy_pages and place it in pmap_map_io_transient. - Change the code to use vmem_alloc instead of a set of pre-allocated pages. - Use pmap_qenter and don't pin the thread if there can be page faults. amd64/amd64/uio_machdep.c: - Use pmap_map_io_transient in order to correctly deal with physical addresses not covered by the DMAP. amd64/include/pmap.h: - Add the prototypes for the new functions. amd64/include/vmparam.h: - Add safety catches to make sure PHYS_TO_DMAP and DMAP_TO_PHYS are only used with addresses covered by the DMAP.	2014-10-24 09:48:58 +00:00
Roger Pau Monné	bf7313e3b7	xen: implement the privcmd user-space device This device is only attached to privileged domains, and allows the toolstack to interact with Xen. The two functions of the privcmd interface are to allow the execution of hypercalls from user-space, and the mapping of foreign domain memory. Sponsored by: Citrix Systems R&D i386/include/xen/hypercall.h: amd64/include/xen/hypercall.h: - Introduce a function to make generic hypercalls into Xen. xen/interface/xen.h: xen/interface/memory.h: - Import the new hypercall XENMEM_add_to_physmap_range used by auto-translated guests to map memory from foreign domains. dev/xen/privcmd/privcmd.c: - This device has the following functions: - Allow user-space applications to make hypercalls into Xen. - Allow user-space applications to map memory from foreign domains, this is accomplished using the newly introduced hypercall (XENMEM_add_to_physmap_range). xen/privcmd.h: - Public ioctl interface for the privcmd device. x86/xen/hvm.c: - Remove declaration of hypercall_page, now it's declared in hypercall.h. conf/files: - Add the privcmd device to the build process.	2014-10-22 17:07:20 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both statically and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, since there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
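The "logical OR where binary OR was expected" item above is easy to reproduce in isolation. The sketch below uses hypothetical flag names and values (`SK_FLAG_RD`, `SK_FLAG_MPSAFE`) as stand-ins for real SYSCTL access flags; it only shows why the two operators give different results.

```c
/* Hypothetical access flags; the values are illustrative stand-ins. */
#define SK_FLAG_RD	0x80000000u
#define SK_FLAG_MPSAFE	0x00040000u

/* The bug: '||' collapses both flags into the truth value 1. */
static unsigned
access_logical_or(void)
{
	return (SK_FLAG_RD || SK_FLAG_MPSAFE);
}

/* The fix: '|' combines the bits of both flags. */
static unsigned
access_bitwise_or(void)
{
	return (SK_FLAG_RD | SK_FLAG_MPSAFE);
}
```

Because the logical form still compiles and yields a valid integer, a compile-time assertion on the flag argument, as the commit describes, is a reasonable way to catch this class of mistake.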
Neel Natu	a78dc03254	Merge projects/bhyve_svm into HEAD. After this change bhyve supports AMD processors with the SVM/AMD-V hardware extensions. More details available here: https://lists.freebsd.org/pipermail/freebsd-virtualization/2014-October/002905.html Submitted by: Anish Gupta (akgupt3@gmail.com) Tested by: Benjamin Perrault (ben.perrault@gmail.com) Tested by: Willem Jan Withagen (wjw@digiware.nl)	2014-10-21 07:10:43 +00:00
Neel Natu	a5045426db	Fix a race in pmap_emulate_accessed_dirty() that could trigger an EPT misconfiguration VM-exit. An EPT misconfiguration is triggered when the processor encounters a PTE that is writable but not readable (WR=10). On processors that require A/D bit emulation PG_M and PG_A map to EPT_PG_WRITE and EPT_PG_READ respectively. If the PTE is updated as in the following code snippet: pte |= PG_M; pte |= PG_A; then it is possible for another processor to observe the PTE after the PG_M (aka EPT_PG_WRITE) bit is set but before PG_A (aka EPT_PG_READ) bit is set. This will trigger an EPT misconfiguration VM-exit on the other processor. Reported by: rodrigc Reviewed by: grehan MFC after: 3 days	2014-10-21 01:06:58 +00:00
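One way to close the window described above is to set both bits in a single atomic operation, so no observer can see the write bit without the read bit. The sketch below shows that idea with C11 atomics; it is not necessarily how the commit fixed the race, and the names `SK_EPT_READ`, `SK_EPT_WRITE`, `pte_set_accessed_dirty` and `pte_misconfigured` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the emulated A and M bits. */
#define SK_EPT_READ	UINT64_C(0x1)	/* plays the role of PG_A */
#define SK_EPT_WRITE	UINT64_C(0x2)	/* plays the role of PG_M */

/* Set both bits in one atomic read-modify-write. */
static void
pte_set_accessed_dirty(_Atomic uint64_t *pte)
{
	atomic_fetch_or(pte, SK_EPT_READ | SK_EPT_WRITE);
}

/* A PTE that is writable but not readable is the misconfigured state. */
static bool
pte_misconfigured(uint64_t pte)
{
	return ((pte & SK_EPT_WRITE) != 0 && (pte & SK_EPT_READ) == 0);
}
```

An alternative with the same effect on observers would be to set the read bit first and the write bit second, since a readable but not yet writable entry is a valid state.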
Neel Natu	e011dc962c	Merge from projects/bhyve_svm all the changes outside vmm.ko or bhyve utilities: Add support for AMD's nested page tables in pmap.c: - Provide the correct bit mask for various bit fields in a PTE (e.g. valid bit) for a pmap of type PT_RVI. - Add a function 'pmap_type_guest(pmap)' that returns TRUE if the pmap is of type PT_EPT or PT_RVI. Add CPU_SET_ATOMIC_ACQ(num, cpuset): This is used when activating a vcpu in the nested pmap. Using the 'acquire' variant guarantees that the load of the 'pm_eptgen' will happen only after the vcpu is activated in 'pm_active'. Add defines for various AMD-specific MSRs. Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-10-20 18:09:33 +00:00
Neel Natu	e1a172e1c2	IFC @r273214	2014-10-20 02:57:30 +00:00
Neel Natu	867b59607c	IFC @r273206	2014-10-19 23:05:18 +00:00
Neel Natu	592cd7d3be	Don't advertise the "OS visible workarounds" feature in cpuid.80000001H:ECX. bhyve doesn't emulate the MSRs needed to support this feature at this time. Don't expose any model-specific RAS and performance monitoring features in cpuid leaf 80000007H. Emulate a few more MSRs for AMD: TSEG base address, TSEG address mask and BIOS signature and P-state related MSRs. This eliminates all the unimplemented MSRs accessed by Linux/x86_64 kernels 2.6.32, 3.10.0 and 3.17.0.	2014-10-19 21:38:58 +00:00
Neel Natu	65d5111ac1	Don't advertise support for the NodeID MSR since bhyve doesn't emulate it.	2014-10-18 05:39:32 +00:00
Warner Losh	b82e2e94e2	Fix build to not bogusly always rebuild vmm.ko. Rename vmx_assym.s to vmx_assym.h to reflect that file's actual use and update vmx_support.S's include to match. Add vmx_assym.h to the SRCS so that it gets properly added to the dependency list. Add vmx_support.S to SRCS as well, so it gets built and needs fewer special-case goo. Remove now-redundant special-case goo. Finally, vmx_genassym.o doesn't need to depend on a hand expanded ${_ILINKS} explicitly, that's all taken care of by beforedepend. With these items fixed, we no longer build vmm.ko every single time through the modules on a KERNFAST build. Sponsored by: Netflix	2014-10-17 13:20:49 +00:00
Neel Natu	2688a818a3	Don't advertise the Instruction Based Sampling feature because it requires emulating a large number of MSRs. Ignore writes to a couple more AMD-specific MSRs and return 0 on read. This further reduces the unimplemented MSRs accessed by a Linux guest on boot.	2014-10-17 06:23:04 +00:00
Neel Natu	02904c45ab	Hide extended PerfCtr MSRs on AMD processors by clearing bits 23, 24 and 28 in CPUID.80000001H:ECX. Handle accesses to PerfCtrX and PerfEvtSelX MSRs by ignoring writes and returning 0 on reads. This further reduces the number of unimplemented MSRs hit by a Linux guest during boot.	2014-10-17 03:04:38 +00:00
Neel Natu	b1cf7bb5e4	Use the correct fault type (VM_PROT_EXECUTE) for an instruction fetch.	2014-10-16 18:16:31 +00:00
Neel Natu	5a1f0b36b1	Fix topology enumeration issues exposed by AMD Bulldozer Family 15h processor. Initialize CPUID.80000008H:ECX[7:0] with the number of logical processors in the package. This fixes a panic during early boot in NetBSD 7.0 BETA. Clear the Topology Extension feature bit from CPUID.80000001H:ECX since we don't emulate leaves 0x8000001D and 0x8000001E. This fixes a divide by zero panic in early boot in Centos 6.4. Tested on an "AMD Opteron 6320" courtesy of Ben Perrault. Reviewed by: grehan	2014-10-16 18:13:10 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_getenv()/kern_setenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Neel Natu	06053618cb	Actually hide the SVM capability by clearing CPUID.80000001H:ECX[bit 3] after it has been initialized by cpuid_count(). Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-10-15 04:29:03 +00:00
Neel Natu	d63e02ea96	Emulate "POP r/m". This is needed to boot OpenBSD/i386 MP kernel in bhyve. Reported by: grehan MFC after: 1 week	2014-10-14 21:02:33 +00:00
Neel Natu	f37dbf579d	Remove extraneous comments.	2014-10-11 04:57:17 +00:00
Neel Natu	8fe9436d4c	Get rid of unused headers. Restrict scope of malloc types M_SVM and M_SVM_VLAPIC by making them static. Replace ERR() with KASSERT(). style(9) cleanup.	2014-10-11 04:41:21 +00:00
Neel Natu	3d492b65bc	Get rid of unused forward declaration of 'struct svm_softc'.	2014-10-11 03:21:33 +00:00
Neel Natu	92337d968c	style(9) fixes. Get rid of unused headers.	2014-10-11 03:19:26 +00:00
Neel Natu	882a1f1942	Use a consistent style for messages emitted when the module is loaded.	2014-10-11 03:09:34 +00:00
Neel Natu	ed6aacb51f	IFC @r272887	2014-10-10 23:52:56 +00:00
Neel Natu	faba66190e	Fix bhyvectl so it works correctly on AMD/SVM hosts. Also, add command line options to display some key VMCB fields. The set of valid options that can be passed to bhyvectl now depends on the processor type. AMD-specific options are identified by a "--vmcb" or "--avic" in the option name. Intel-specific options are identified by a "--vmcs" in the option name. Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-10-10 21:48:59 +00:00
Neel Natu	5295c3e61d	Support Intel-specific MSRs that are accessed when booting up a linux in bhyve: - MSR_PLATFORM_INFO - MSR_TURBO_RATIO_LIMITx - MSR_RAPL_POWER_UNIT Reviewed by: grehan MFC after: 1 week	2014-10-09 19:13:33 +00:00
Mark Johnston	5eaae1411f	Pass up the error status of minidumpsys() to its callers. PR: 193761 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC / Isilon Storage Division	2014-10-08 20:25:21 +00:00
Konstantin Belousov	07a92f34d6	Add an argument to the x86 pmap_invalidate_cache_range() to request forced invalidation of the cache range regardless of the presence of self-snoop feature. Some recent Intel GPUs in some modes are not coherent, and dirty lines in CPU cache must be flushed before the pages are transferred to GPU domain. Reviewed by: alc (previous version) Tested by: pho (amd64) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-10-08 16:48:03 +00:00
Neel Natu	65145c7f50	Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT'. The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to be present anyways. Discussed with: grehan	2014-10-06 20:48:01 +00:00
Neel Natu	107af8f2ed	IFC @r272481	2014-10-05 01:28:21 +00:00
Neel Natu	d72978ecd7	Get rid of code that dealt with the hardware not being able to save/restore the PAT MSR on guest exit/entry. This workaround was done for a beta release of VMware Fusion 5 but is no longer needed in later versions. All Intel CPUs since Nehalem have supported saving and restoring MSR_PAT in the VM exit and entry controls. Discussed with: grehan	2014-10-02 05:32:29 +00:00
Roger Pau Monné	44e06d158a	msi: add Xen MSI implementation This patch adds support for MSI interrupts when running on Xen. Apart from adding the Xen related code needed in order to register MSI interrupts this patch also makes the msi_init function a hook in init_ops, so different MSI implementations can have different initialization functions. Sponsored by: Citrix Systems R&D xen/interface/physdev.h: - Add the MAP_PIRQ_TYPE_MULTI_MSI to map multi-vector MSI to the Xen public interface. x86/include/init.h: - Add a hook for setting custom msi_init methods. amd64/amd64/machdep.c: i386/i386/machdep.c: - Set the default msi_init hook to point to the native MSI initialization method. x86/xen/pv.c: - Set the Xen MSI init hook when running as a Xen guest. x86/x86/local_apic.c: - Call the msi_init hook instead of directly calling msi_init. xen/xen_intr.h: x86/xen/xen_intr.c: - Introduce support for registering/releasing MSI interrupts with Xen. - The MSI interrupts will use the same PIC as the IO APIC interrupts. xen/xen_msi.h: x86/xen/xen_msi.c: - Introduce a Xen MSI implementation. x86/xen/xen_nexus.c: - Overwrite the default MSI hooks in the Xen Nexus to use the Xen MSI implementation. x86/xen/xen_pci.c: - Introduce a Xen specific PCI bus that inherits from the ACPI PCI bus and overwrites the native MSI methods. - This is needed because when running under Xen the MSI messages used to configure MSI interrupts on PCI devices are written by Xen itself. dev/acpica/acpi_pci.c: - Lower the quality of the ACPI PCI bus so the newly introduced Xen PCI bus can take over when needed. conf/files.i386: conf/files.amd64: - Add the newly created files to the build process.	2014-09-30 16:46:45 +00:00
Neel Natu	970388bf8d	IFC @r272185	2014-09-27 22:15:50 +00:00
Neel Natu	30571674ce	Simplify register state save and restore across a VMRUN: - Host registers are now stored on the stack instead of a per-cpu host context. - Host %FS and %GS selectors are not saved and restored across VMRUN. - Restoring the %FS/%GS selectors was futile anyways since that only updates the low 32 bits of base address in the hidden descriptor state. - GS.base is properly updated via the MSR_GSBASE on return from svm_launch(). - FS.base is not used while inside the kernel so it can be safely ignored. - Add function prologue/epilogue so svm_launch() can be traced with Dtrace's FBT entry/exit probes. They also serve to save/restore the host %rbp across VMRUN. Reviewed by: grehan Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-27 02:04:58 +00:00
Peter Grehan	a48c333805	Allow the PIC's IMR register to be read before ICW initialisation. As of git submit e179f6914152eca9, the Linux kernel does a simple probe of the PIC by writing a pattern to the IMR and then reading it back, prior to the init sequence of ICW words. The bhyve PIC emulation wasn't allowing the IMR to be read until the ICW sequence was complete. This limitation isn't required so relax the test. With this change, Linux kernels 3.15-rc2 and later won't hang on boot when calibrating the local APIC. Reviewed by: tychon MFC after: 3 days	2014-09-27 01:15:24 +00:00
Roger Pau Monné	c98a2727cc	ddb: allow specifying the exact address of the symtab and strtab When the FreeBSD kernel is loaded from Xen the symtab and strtab are not loaded the same way as the native boot loader. This patch adds three new global variables to ddb that can be used to specify the exact position and size of those tables, so they can be directly used as parameters to db_add_symbol_table. A new helper is introduced, so callers that used to set ksym_start and ksym_end can use this helper to set the new variables. It also adds support for loading them from the Xen PVH port, that was previously missing those tables. Sponsored by: Citrix Systems R&D Reviewed by: kib ddb/db_main.c: - Add three new global variables: ksymtab, kstrtab, ksymtab_size that can be used to specify the position and size of the symtab and strtab. - Use those new variables in db_init in order to call db_add_symbol_table. - Move the logic in db_init to db_fetch_ksymtab in order to set ksymtab, kstrtab, ksymtab_size from ksym_start and ksym_end. ddb/ddb.h: - Add prototype for db_fetch_ksymtab. - Declare the extern variables ksymtab, kstrtab and ksymtab_size. x86/xen/pv.c: - Add support for finding the symtab and strtab when booted as a Xen PVH guest. Since Xen loads the symtab and strtab as NetBSD expects to find them we have to adapt and use the same method. amd64/amd64/machdep.c: arm/arm/machdep.c: i386/i386/machdep.c: mips/mips/machdep.c: pc98/pc98/machdep.c: powerpc/aim/machdep.c: powerpc/booke/machdep.c: sparc64/sparc64/machdep.c: - Use the newly introduced db_fetch_ksymtab in order to set ksymtab, kstrtab and ksymtab_size.	2014-09-25 08:28:10 +00:00
Bjoern A. Zeeb	14f2533c56	As per [1] Intel only supports this driver on 64bit platforms. For now restrict it to amd64. Other architectures might be re-added later once tested. Remove the drivers from the global NOTES and files files and move them to the amd64 specifics. Remove the drivers from the i386 modules build and only leave the amd64 version. Rather than depending on "inet" depend on "pci" and make sure that ixl(4) and ixlv(4) can be compiled independently [2]. This also allows the drivers to build properly on IPv4-only or IPv6-only kernels. PR: 193824 [2] Reviewed by: eric.joyner intel.com MFC after: 3 days References: [1] http://lists.freebsd.org/pipermail/svn-src-all/2014-August/090470.html	2014-09-23 08:33:03 +00:00
Neel Natu	af198d882a	Allow more VMCB fields to be cached: - CR2 - CR0, CR3, CR4 and EFER - GDT/IDT base/limit fields - CS/DS/ES/SS selector/base/limit/attrib fields The caching can be further restricted via the tunable 'hw.vmm.svm.vmcb_clean'. Restructure the code such that the fields above are only modified in a single place. This makes it easy to invalidate the VMCB cache when any of these fields is modified.	2014-09-21 23:42:54 +00:00
Neel Natu	4eea1566cb	Get rid of unused stat VMM_HLT_IGNORED.	2014-09-21 18:52:56 +00:00
Konstantin Belousov	060cd4d500	Update and clarify comments. Remove the useless counter for an impossible situation that has nevertheless been seen in the wild (on buggy hypervisors). In collaboration with: bde MFC after: 1 week	2014-09-21 09:06:50 +00:00
Neel Natu	ba28c094bb	The memory type bits (PAT, PCD, PWT) associated with a nested PTE or PDE are identical to the traditional x86 page tables.	2014-09-21 06:36:17 +00:00
Neel Natu	8f02c5e456	IFC r271888. Restructure MSR emulation so it is all done in processor-specific code.	2014-09-20 21:46:31 +00:00
Neel Natu	b6cf6c8ca6	IFC @r271887	2014-09-20 06:27:37 +00:00
Neel Natu	9d8d8e3ee7	Add some more KTR events to help debugging.	2014-09-20 05:13:03 +00:00
Neel Natu	cb44ea41cb	MSR_KGSBASE is no longer saved and restored from the guest MSR save area. This behavior was changed in r271888 so update the comment block to reflect this. MSR_KGSBASE is accessible from the guest without triggering a VM-exit. The permission bitmap for MSR_KGSBASE is modified by vmx_msr_guest_init() so get rid of redundant code in vmx_vminit().	2014-09-20 05:12:34 +00:00
Neel Natu	c3498942a5	Restructure the MSR handling so it is entirely handled by processor-specific code. There are only a handful of MSRs common between the two so there isn't too much duplicate functionality. The VT-x code has the following types of MSRs: - MSRs that are unconditionally saved/restored on every guest/host context switch (e.g., MSR_GSBASE). - MSRs that are restored to guest values on entry to vmx_run() and saved before returning. This is an optimization for MSRs that are not used in host kernel context (e.g., MSR_KGSBASE). - MSRs that are emulated and every access by the guest causes a trap into the hypervisor (e.g., MSR_IA32_MISC_ENABLE). Reviewed by: grehan	2014-09-20 02:35:21 +00:00
Konstantin Belousov	6dfc9e44fa	- Use NULL instead of 0 for fpcurthread. - Note the quirk with the interrupt enabled state of the dna handler. - Use just panic() instead of printf() and panic(). Print tid instead of pid, the fpu state is per-thread. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-09-18 09:13:20 +00:00
Bjoern A. Zeeb	581980cff8	Re-gen after r271743 implementing most of timer_{create,settime,gettime,getoverrun,delete}. MFC after: 3 days Sponsored by: DARPA, AFRL	2014-09-18 08:40:00 +00:00
Bjoern A. Zeeb	0a041f3b47	Implement most of timer_{create,settime,gettime,getoverrun,delete} for amd64/linux32. Fix the entirely bogus (untested) version from r161310 for i386/linux using the same shared code in compat/linux. It is unclear to me if we could support more clock mappings but the current set allows me to successfully run commercial 32bit linux software under linuxolator on amd64. Reviewed by: jhb Differential Revision: D784 MFC after: 3 days Sponsored by: DARPA, AFRL	2014-09-18 08:36:45 +00:00
Konstantin Belousov	490356e5b7	Presence of any VM_PROT bits in the permission argument on x86 implies that the entry is readable and valid. Reported by: markj Submitted by: alc Tested by: pho (previous version), markj MFC after: 3 days	2014-09-17 18:49:57 +00:00
Neel Natu	4e27d36d38	IFC @r271694	2014-09-17 18:46:51 +00:00
Neel Natu	6b844b87e2	Rework vNMI injection. Keep track of NMI blocking by enabling the IRET intercept on a successful vNMI injection. The NMI blocking condition is cleared when the handler executes an IRET and traps back into the hypervisor. Don't inject NMI if the processor is in an interrupt shadow to preserve the atomic nature of "STI;HLT". Take advantage of this and artificially set the interrupt shadow to prevent NMI injection when restarting the "iret". Reviewed by: Anish Gupta (akgupt3@gmail.com), grehan	2014-09-17 00:30:25 +00:00
Neel Natu	5fb3bc71f8	Minor cleanup. Get rid of unused 'svm_feature' from the softc. Get rid of the redundant 'vcpu_cnt' checks in svm.c. There is a similar check in vmm.c against 'vm->active_cpus' before the AMD-specific code is called. Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-09-16 04:01:55 +00:00
Neel Natu	79ad53fba3	Use V_IRQ, V_INTR_VECTOR and V_TPR to offload APIC interrupt delivery to the processor. Briefly, the hypervisor sets V_INTR_VECTOR to the APIC vector and sets V_IRQ to 1 to indicate a pending interrupt. The hardware then takes care of injecting this vector when the guest is able to receive it. Legacy PIC interrupts are still delivered via the event injection mechanism. This is because the vector injected by the PIC must reflect the state of its pins at the time the CPU is ready to accept the interrupt. Accesses to the TPR via %CR8 are handled entirely in hardware. This requires that the emulated TPR must be synced to V_TPR after a #VMEXIT. The guest can also modify the TPR via the memory mapped APIC. This requires that the V_TPR must be synced with the emulated TPR before a VMRUN. Reviewed by: Anish Gupta (akgupt3@gmail.com)	2014-09-16 03:31:40 +00:00
Neel Natu	bbadcde418	Set the 'vmexit->inst_length' field properly depending on the type of the VM-exit and ultimately on whether nRIP is valid. This allows us to update the %rip after the emulation is finished so any exceptions triggered during the emulation will point to the right instruction. Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability is available. The effective segment field in EXITINFO1 is not valid without this capability. Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging. Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2 fields in bhyve(8). Reviewed by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan	2014-09-14 04:39:04 +00:00
Neel Natu	74accc3170	Bug fixes. - Don't enable the HLT intercept by default. It will be enabled by bhyve(8) if required. Prior to this change HLT exiting was always enabled making the "-H" option to bhyve(8) meaningless. - Recognize a VM exit triggered by a non-maskable interrupt. Prior to this change the exit would be punted to userspace and the virtual machine would terminate.	2014-09-13 23:48:43 +00:00
Neel Natu	fa7caa91cb	style(9): insert an empty line if the function has no local variables Pointed out by: grehan	2014-09-13 22:45:04 +00:00
Neel Natu	c2a875f970	AMD processors that have the SVM decode assist capability will store the instruction bytes in the VMCB on a nested page fault. This is useful because it saves having to walk the guest page tables to fetch the instruction. vie_init() now takes two additional parameters 'inst_bytes' and 'inst_len' that map directly to 'vie->inst[]' and 'vie->num_valid'. The instruction emulation handler skips calling 'vmm_fetch_instruction()' if 'vie->num_valid' is non-zero. The use of this capability can be turned off by setting the sysctl/tunable 'hw.vmm.svm.disable_npf_assist' to '1'. Reviewed by: Anish Gupta (akgupt3@gmail.com) Discussed with: grehan	2014-09-13 22:16:40 +00:00
John Baldwin	7d8312cc92	Add a sysctl to export the EFI memory map along with a handler in the sysctl(8) binary to format it. Reviewed by: emaste MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D771	2014-09-13 03:10:02 +00:00
Neel Natu	d181963296	Optimize the common case of injecting an interrupt into a vcpu after a HLT by explicitly moving it out of the interrupt shadow. The hypervisor is done "executing" the HLT and by definition this moves the vcpu out of the 1-instruction interrupt shadow. Prior to this change the interrupt would be held pending because the VMCS guest-interruptibility-state would indicate that "blocking by STI" was in effect. This resulted in an unnecessary round trip into the guest before the pending interrupt could be injected. Reviewed by: grehan	2014-09-12 06:15:20 +00:00
Neel Natu	442a04ca83	style(9): indent the switch, don't indent the case, indent case body one tab.	2014-09-11 06:17:56 +00:00
Neel Natu	e441104d63	Repurpose the V_IRQ interrupt injection to implement VMX-style interrupt window exiting. This simply involves setting V_IRQ and enabling the VINTR intercept. This instructs the CPU to trap back into the hypervisor as soon as an interrupt can be injected into the guest. The pending interrupt is then injected via the traditional event injection mechanism. Rework vcpu interrupt injection so that Linux guests now idle with host cpu utilization close to 0%. Reviewed by: Anish Gupta (earlier version) Discussed with: grehan	2014-09-11 02:37:02 +00:00
John Baldwin	de2b02fc74	MFamd64: Use initializecpu() to set various model-specific registers on AP startup and AP resume (it was already used for BSP startup and BSP resume). - Split code to do one-time probing of cache properties out of initializecpu() and into initializecpucache(). This is called once on the BSP during boot. - Move enable_sse() into initializecpu(). - Call initializecpu() for AP startup instead of enable_sse() and manually frobbing MSR_EFER to enable PG_NX. - Call initializecpu() when an AP resumes. In theory this will now properly re-enable PG_NX in MSR_EFER when resuming a PAE kernel on APs.	2014-09-10 21:37:47 +00:00
Neel Natu	238b6cb761	Allow intercepts and irq fields to be cached by the VMCB. Provide APIs svm_enable_intercept()/svm_disable_intercept() to add/delete VMCB intercepts. These APIs ensure that the VMCB state cache is invalidated when intercepts are modified. Each intercept is identified as a (index,bitmask) tuple. For example, the VINTR intercept is identified as (VMCB_CTRL1_INTCPT,VMCB_INTCPT_VINTR). The first 20 bytes in control area that are used to enable intercepts are represented as 'uint32_t intercept[5]' in 'struct vmcb_ctrl'. Modify svm_setcap() and svm_getcap() to use the new APIs. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-10 03:13:40 +00:00
Neel Natu	e5397c9fdd	Move the VMCB initialization into svm.c in preparation for changes to the interrupt injection logic. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-10 02:35:19 +00:00
Neel Natu	840b1a2760	Move the event injection function into svm.c and add KTR logging for every event injection. This is in preparation for changes to SVM guest interrupt injection. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-10 02:20:32 +00:00
Neel Natu	2591ee3e80	Remove a bogus check that flagged an error if the guest %rip was zero. An AP begins execution with %rip set to 0 after a startup IPI. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-10 01:46:22 +00:00
Neel Natu	5e467bd098	Make the KTR tracepoints uniform and ensure that every VM-exit is logged. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-10 01:37:32 +00:00
Neel Natu	a2901ce7ad	Allow guest read access to MSR_EFER without hypervisor intervention. Dirty the VMCB_CACHE_CR state cache when MSR_EFER is modified.	2014-09-10 01:10:53 +00:00
Neel Natu	501f03eba2	Remove gratuitous forward declarations. Remove tabs on empty lines.	2014-09-09 23:39:43 +00:00
Neel Natu	a268481428	Do proper ASID management for guest vcpus. Prior to this change an ASID was hard allocated to a guest and shared by all its vcpus. This meant that the number of VMs that could be created was limited to the number of ASIDs supported by the CPU. It was also inefficient because it forced a TLB flush on every VMRUN. With this change the number of guests that can be created is independent of the number of available ASIDs. Also, the TLB is flushed only when a new ASID is allocated. Discussed with: grehan Reviewed by: Anish Gupta (akgupt3@gmail.com)	2014-09-06 19:02:52 +00:00
John Baldwin	b1d735ba4c	Create a separate structure for per-CPU state saved across suspend and resume that is a superset of a pcb. Move the FPU state out of the pcb and into this new structure. As part of this, move the FPU resume code on amd64 into a C function. This allows resumectx() to still operate only on a pcb and more closely mirrors the i386 code. Reviewed by: kib (earlier version)	2014-09-06 15:23:28 +00:00
Neel Natu	3865879753	Merge svm_set_vmcb() and svm_init_vmcb() into a single function that is called just once when a vcpu is initialized. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-05 03:33:16 +00:00
Pedro F. Giffuni	11db54f172	Apply known workarounds for modern MacBooks. The legacy USB circuit tends to give trouble on MacBook. While the original report covered MacBook, extend the fix preemptively for the newer MacBookPro too. PR: 191693 Reviewed by: emaste MFC after: 5 days	2014-09-05 01:06:45 +00:00
Mark Johnston	a58b4afa9f	Add mrsas(4) to GENERIC for i386 and amd64. Approved by: ambrisko, kadesai MFC after: 3 days	2014-09-04 21:06:33 +00:00
John Baldwin	33a50f1b0f	Merge the amd64 and i386 identcpu.c into a single x86 implementation. This brings the structured extended features mask and VT-x reporting to i386 and Intel cache and TLB info (under bootverbose) to amd64.	2014-09-04 14:26:25 +00:00
Neel Natu	bf4993ba2a	Remove unused header file. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-04 06:07:32 +00:00
Neel Natu	fea6bd5cd3	Consolidate the code to restore the host TSS after a #VMEXIT into a single function restore_host_tss(). Don't bother to restore MSR_KGSBASE after a #VMEXIT since it is not used in the kernel. It will be restored on return to userspace. Discussed with: Anish Gupta (akgupt3@gmail.com)	2014-09-04 06:00:18 +00:00
John Baldwin	2b793beefd	Remove trailing whitespace.	2014-09-04 01:56:15 +00:00
John Baldwin	7fb40488d6	- Move prototypes for various functions out of C files and into <machine/md_var.h>. - Move some CPU-related variables out of i386/i386/identcpu.c to initcpu.c to match amd64. - Move the declaration of has_f00f_hack out of identcpu.c to machdep.c. - Remove a misleading comment from i386/i386/initcpu.c (locore zeros the BSS before it calls identify_cpu()) and remove explicit zero assignments to reduce the diff with amd64.	2014-09-04 01:46:06 +00:00
Neel Natu	246e7a2b64	IFC @r269962 Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-09-02 04:22:42 +00:00
Alan Cox	ad2e88a14f	Update a comment to reflect the changes in r213408. MFC after: 5 days	2014-09-02 04:11:20 +00:00
Neel Natu	4c98655ece	The "SUB" instruction used in getcc() actually does 'x -= y' so use the proper constraint for 'x'. The "+r" constraint indicates that 'x' is an input and output register operand. While here generate code for different variants of getcc() using a macro GETCC(sz) where 'sz' indicates the operand size. Update the status bits in %rflags when emulating AND and OR opcodes. Reviewed by: grehan	2014-08-30 19:59:42 +00:00
Pedro F. Giffuni	ec4a0b4408	Minor space/tab cleanups. Most of them were ripped from the GSoC 2104 SMAP + kpatch project. This is only a cosmetic change. Taken from: Oliver Pinter (op@) MFC after: 5 days	2014-08-30 15:41:07 +00:00
John Baldwin	89871cdeb6	- Add a new structure type for the ACPI 3.0 SMAP entry that includes the optional attributes field. - Add a 'machdep.smap' sysctl that exports the SMAP table of the running system as an array of the ACPI 3.0 structure. (On older systems, the attributes are given a value of zero.) Note that the sysctl only exports the SMAP table if it is available via the metadata passed from the loader to the kernel. If an SMAP is not available, an empty array is returned. - Add a format handler for the ACPI 3.0 SMAP structure to the sysctl(8) binary to format the SMAP structures in a readable format similar to the format found in boot messages. MFC after: 2 weeks	2014-08-29 21:25:47 +00:00
Peter Grehan	fc3dde9099	Implement the 0x2B SUB instruction, and the OR variant of 0x81. Found with local APIC accesses from bitrig/amd64 bsd.rd, 07/15-snap. Reviewed by: neel MFC after: 3 days	2014-08-27 00:53:56 +00:00
Neel Natu	48e8c2137a	An exception is allowed to be injected even if the vcpu is in an interrupt shadow, so move the check for pending exception before bailing out due to an interrupt shadow. Change return type of 'vmcb_eventinject()' to a void and convert all error returns into KASSERTs. Fix VMCB_EXITINTINFO_EC(x) and VMCB_EXITINTINFO_TYPE(x) to do the shift before masking the result. Reviewed by: Anish Gupta (akgupt3@gmail.com)	2014-08-25 00:58:20 +00:00
Peter Grehan	7f21538b6e	Change __inline style to be consistent with FreeBSD usage, and also fix gcc build (on STABLE, when MFCd). PR: 192880 Reviewed by: neel Reported by: ngie MFC after: 1 day	2014-08-24 02:07:34 +00:00
Neel Natu	8bd3845d3c	Add "hw.vmm.topology.threads_per_core" and "hw.vmm.topology.cores_per_package" tunables to modify the default cpu topology advertised by bhyve. Also add a tunable "hw.vmm.topology.cpuid_leaf_b" to disable the CPUID leaf 0xb. This is intended for testing guest behavior when it falls back on using CPUID leaf 0x4 to deduce CPU topology. The default behavior is to advertise each vcpu as a core in a separate socket.	2014-08-24 01:10:06 +00:00
Neel Natu	534dc967d7	Fix a bug in the emulation of CPUID leaf 0x4 where bhyve was claiming that the vcpu had no caches at all. This causes problems when executing applications in the guest compiled with the Intel compiler. Submitted by: Mark Hill (mark.hill@tidalscale.com)	2014-08-23 22:44:31 +00:00
Neel Natu	7a244722d1	Return the spurious interrupt vector (IRQ7 or IRQ15) if the atpic cannot find any unmasked pin with an interrupt asserted. Reviewed by: tychon CR: https://reviews.freebsd.org/D669 MFC after: 1 week	2014-08-23 21:16:26 +00:00
John Baldwin	669eac89c5	Fix build of si(4) and enable it in LINT on amd64 and i386.	2014-08-20 16:07:17 +00:00
John Baldwin	64d6de263b	Bump MAXCPU on amd64 from 64 to 256. In practice APIC only permits 255 CPUs (IDs 0 through 254). Getting above that limit requires x2APIC. MFC after: 1 month	2014-08-20 16:06:24 +00:00
Konstantin Belousov	3165194c6b	Increase max number of physical segments on amd64 to 63. Eventually, the vmd_segs of the struct vm_domain should become bitset instead of long, to allow arbitrary compile-time selected maximum. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:07:08 +00:00
Alan Cox	ada1ae623e	There exists a possible sequence of page table page allocation failures starting with a superpage demotion by pmap_enter() that could result in a PV list lock being held when pmap_enter() is just about to return KERN_RESOURCE_SHORTAGE. Consequently, the KASSERT that no PV list locks are held needs to be replaced with a conditional unlock. Discussed with: kib X-MFC with: r269728 Sponsored by: EMC / Isilon Storage Division	2014-08-18 20:28:08 +00:00
Gavin Atkinson	67e3b91b31	Update i386/NOTES and amd64/NOTES files to contain the complete list of firmwares for iwn(4) and sort them. MFC after: 1 week	2014-08-14 18:29:55 +00:00
Neel Natu	4eec602102	Reword comment to match the interrupt mode names from the MPtable spec. Reviewed by: tychon	2014-08-14 18:03:38 +00:00
Neel Natu	477867a0e5	Use the max guest memory address when creating its iommu domain. Also, assert that the GPA being mapped in the domain is less than its maxaddr. Reviewed by: grehan Pointed out by: Anish Gupta (akgupt3@gmail.com)	2014-08-14 05:00:45 +00:00
Alan Cox	4d33fe39e4	Update the text of a KASSERT() to reflect the changes in r269728.	2014-08-09 17:13:02 +00:00
Konstantin Belousov	39ffa8c138	Change pmap_enter(9) interface to take flags parameter and superpage mapping size (currently unused). The flags include the fault access bits, wired flag as PMAP_ENTER_WIRED, and a new flag PMAP_ENTER_NOSLEEP to indicate that pmap should not sleep. For powerpc aim both 32 and 64 bit, fix implementation to ensure that the requested mapping is created when PMAP_ENTER_NOSLEEP is not specified, in particular, wait for the available memory required to proceed. In collaboration with: alc Tested by: nwhitehorn (ppc aim32 and booke) Sponsored by: The FreeBSD Foundation and EMC / Isilon Storage Division MFC after: 2 weeks	2014-08-08 17:12:03 +00:00
Neel Natu	12a6eb99a1	Support PCI extended config space in bhyve. Add the ACPI MCFG table to advertise the extended config memory window. Introduce a new flag MEM_F_IMMUTABLE for memory ranges that cannot be deleted or moved in the guest's address space. The PCI extended config space is an example of an immutable memory range. Add emulation for the "movzw" instruction. This instruction is used by FreeBSD to read a 16-bit extended config space register. CR: https://phabric.freebsd.org/D505 Reviewed by: jhb, grehan Requested by: tychon	2014-08-08 03:49:01 +00:00
Gleb Smirnoff	c8d2ffd6a7	Merge all MD sf_buf allocators into one MI, residing in kern/subr_sfbuf.c The MD allocators were very common, however there were some minor differences. These differences were all consolidated in the MI allocator, under ifdefs. The defines from machine/vmparam.h turn on features required for a particular machine. For details look in the comment in sys/sf_buf.h. As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have machine/sf_buf.h, which is usually quite small. Tested by: glebius (i386), tuexen (arm32), kevlo (arm32) Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-08-05 09:44:10 +00:00
Alan Cox	a695d9b25b	Retire pmap_change_wiring(). We have never used it to wire virtual pages. We continue to use pmap_enter() for that. For unwiring virtual pages, we now use pmap_unwire(), which unwires a range of virtual addresses instead of a single virtual page. Sponsored by: EMC / Isilon Storage Division	2014-08-03 20:40:51 +00:00
John Baldwin	06fc6db948	- Output a summary of optional VT-x features in dmesg similar to CPU features. If bootverbose is enabled, a detailed list is provided; otherwise, a single-line summary is displayed. - Add read-only sysctls for optional VT-x capabilities used by bhyve under a new hw.vmm.vmx.cap node. Move a few existing sysctls that indicate the presence of optional capabilities under this node. CR: https://phabric.freebsd.org/D498 Reviewed by: grehan, neel MFC after: 1 week	2014-07-30 00:00:12 +00:00
Neel Natu	f008d1571d	If a vcpu has issued a HLT instruction with interrupts disabled then it sleeps forever in vm_handle_hlt(). This is usually not an issue as long as one of the other vcpus properly resets or powers off the virtual machine. However, if the bhyve(8) process is killed with a signal the halted vcpu cannot be woken up because its sleep cannot be interrupted. Fix this by waking up periodically and returning from vm_handle_hlt() if TDF_ASTPENDING is set. Reported by: Leon Dang Sponsored by: Nahanni Systems	2014-07-26 02:53:51 +00:00
Neel Natu	1edccd0f30	Don't return -1 from the push emulation handler. Negative return values are interpreted specially on return from sys_ioctl() and may cause undesirable side-effects like restarting the system call.	2014-07-26 02:51:46 +00:00
Neel Natu	830be8acb4	Fix a couple of issues in the PUSH emulation: It is not possible to PUSH a 32-bit operand on the stack in 64-bit mode. The default operand size for PUSH is 64-bits and the operand size override prefix changes that to 16-bits. vm_copy_setup() can return '1' if it encounters a fault when walking the guest page tables. This is a guest issue and is now handled properly by resuming the guest to handle the fault.	2014-07-24 23:01:53 +00:00
Marius Strobl	c615e6a8bf	Copying pages via temporary mappings in the !DMAP case of pmap_copy_pages() involves updating the corresponding page tables followed by accesses to the pages in question. This sequence is subject to the situation exactly described in the "AMD64 Architecture Programmer's Manual Volume 2: System Programming" rev. 3.23, "7.3.1 Special Coherency Considerations" [1, p. 171 f.]. Therefore, issuing the INVLPG right after modifying the PTE bits is crucial (see also r269050). For the amd64 PMAP code, the order of instructions was already correct. The above fact still is worth documenting, though. 1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf Reviewed by: alc Sponsored by: Bally Wulff Games & Entertainment GmbH	2014-07-24 10:12:22 +00:00
Neel Natu	d37f2adb38	Fix fault injection in bhyve. The faulting instruction needs to be restarted when the exception handler is done handling the fault. bhyve now does this correctly by setting 'vmexit[vcpu].inst_length' to zero so the %rip is not advanced. A minor complication is that the fault injection APIs are used by instruction emulation code that is shared by vmm.ko and bhyve. Thus the argument that refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to be loosely typed as a 'void *'.	2014-07-24 01:38:11 +00:00
Roger Pau Monné	f5417a03e3	don't set CR4 PSE bit on amd64 Setting PSE together with PAE or in long mode just makes the PSE bit completely ignored, so don't set it. Sponsored by: Citrix Systems R&D Reviewed by: kib	2014-07-23 15:53:29 +00:00
Neel Natu	d665d229ce	Emulate instructions emitted by OpenBSD/i386 version 5.5: - CMP REG, r/m - MOV AX/EAX/RAX, moffset - MOV moffset, AX/EAX/RAX - PUSH r/m	2014-07-23 04:28:51 +00:00
Ed Maste	b47228854f	Don't pass null kmdp to preload_search_info On Xen PVH guests kmdp == NULL. Submitted by: royger MFC after: 3 days Sponsored by: The FreeBSD Foundation	2014-07-22 13:58:33 +00:00
Mark Johnston	26cf239814	Fix the build when DTrace isn't enabled. Reported by: stefanf X-MFC-With: r268600	2014-07-20 18:44:56 +00:00
Neel Natu	019008ebf5	Fix build without INVARIANTS defined by getting rid of unused variable 'exc'. Reported by: adrian, stefanf	2014-07-20 16:34:35 +00:00
Neel Natu	091d453222	Handle nested exceptions in bhyve. A nested exception condition arises when a second exception is triggered while delivering the first exception. Most nested exceptions can be handled serially but some are converted into a double fault. If an exception is generated during delivery of a double fault then the virtual machine shuts down as a result of a triple fault. vm_exit_intinfo() is used to record that a VM-exit happened while an event was being delivered through the IDT. If an exception is triggered while handling the VM-exit it will be treated like a nested exception. vm_entry_intinfo() is used by processor-specific code to get the event to be injected into the guest on the next VM-entry. This function is responsible for deciding the disposition of nested exceptions.	2014-07-19 20:59:08 +00:00
Mark Johnston	5a5f9d21dd	Use a C wrapper for trap() instead of checking and calling the DTrace trap hook in assembly. Suggested by: kib Reviewed by: kib (original version) X-MFC-With: r268600	2014-07-19 02:27:31 +00:00
Neel Natu	3d5444c864	Add emulation for legacy x86 task switching mechanism. FreeBSD/i386 uses task switching to handle double fault exceptions and this change enables that to work. Reported by: glebius	2014-07-16 21:26:26 +00:00
Neel Natu	f7a9f1784f	Add support for operand size and address size override prefixes in bhyve's instruction emulation [1]. Fix bug in emulation of opcode 0x8A where the destination is a legacy high byte register and the guest vcpu is in 32-bit mode. Prior to this change instead of modifying %ah, %bh, %ch or %dh the emulation would end up modifying %spl, %bpl, %sil or %dil instead. Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value during instruction decoding. Fix bug in verify_gla() where the linear address computed after decoding the instruction was not being truncated to the effective address size [2]. Tested by: Leon Dang [1] Reported by: Peter Grehan [2] Sponsored by: Nahanni Systems	2014-07-15 17:37:17 +00:00
Konstantin Belousov	5e351014e0	Make amd64 pmap_copy_pages() functional for pages not mapped by DMAP. Requested and reviewed by: royger Tested by: pho, royger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-15 09:30:43 +00:00
Mark Johnston	291624fdf6	Invoke the DTrace trap handler before calling trap() on amd64. This matches the upstream implementation and helps ensure that a trap induced by tracing fbt::trap:entry is handled without recursively generating another trap. This makes it possible to run most (but not all) of the DTrace tests under common/safety/ without triggering a kernel panic. Submitted by: Anton Rang <anton.rang@isilon.com> (original version) Phabric: D95	2014-07-14 04:38:17 +00:00
Neel Natu	3ada6e07ac	Use the correct offset when converting a logical address (segment:offset) to a linear address.	2014-07-11 01:23:38 +00:00
Konstantin Belousov	fd815c0b8d	For safety, ensure that any consumer of the set_regs() and ptrace_set_pc() use the correct return to userspace using iret. The signal return, PT_CONTINUE (which in fact uses signal return path) set the pcb flag already. The setcontext(2) enforces iret return when %rip is incorrect. Due to this, the change is redundant, but is made to ensure that no path which modifies context, forgets to set PCB_FULL_IRET. Inspired by: CVE-2014-4699 Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-09 21:39:40 +00:00
Neel Natu	b301b9e28f	Accurately identify the vcpu's operating mode as 64-bit, compatibility, protected or real.	2014-07-08 21:48:57 +00:00
Neel Natu	3527963b26	Invalidate guest TLB mappings as a side-effect of its CR3 being updated. This is a pre-requisite for task switch emulation since the CR3 is loaded from the new TSS.	2014-07-08 20:51:03 +00:00
Konstantin Belousov	a5244bac2e	Correct si_code for the SIGBUS signal generated by the alignment trap. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-08 08:05:42 +00:00
Alan Cox	09132ba6ac	Introduce pmap_unwire(). It will replace pmap_change_wiring(). There are several reasons for this change: pmap_change_wiring() has never (in my memory) been used to set the wired attribute on a virtual page. We have always used pmap_enter() to do that. Moreover, it is not really safe to use pmap_change_wiring() to set the wired attribute on a virtual page. The description of pmap_change_wiring() says that it assumes the existence of a mapping in the pmap. However, non-wired mappings may be reclaimed by the pmap at any time. (See pmap_collect().) Many implementations of pmap_change_wiring() will crash if the mapping does not exist. pmap_unwire() accepts a range of virtual addresses, whereas pmap_change_wiring() acts upon a single virtual page. Since we are typically unwiring a range of virtual addresses, pmap_unwire() will be more efficient. Moreover, pmap_unwire() allows us to unwire superpage mappings. Previously, we were forced to demote the superpage mapping, because pmap_change_wiring() only allowed us to express the unwiring of a single base page mapping at a time. This added to the overhead of unwiring for large ranges of addresses, including the implicit unwiring that occurs at process termination. Implementations for arm and powerpc will follow. Discussed with: jeff, marcel Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-07-06 17:42:38 +00:00
Ed Maste	018147eef9	Prefer vt(4) for UEFI boot The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism for the startup routine to set the preferred console. This change is ugly because console init happens very early in the boot, making a cleaner interface difficult. This change is intended only to facilitate the sc(4) / vt(4) transition, and can be reverted once vt(4) is the default.	2014-07-02 13:24:21 +00:00
Ed Maste	ccbb7b5e19	Add vt(4) devices and options to NOTES Reviewed by: marius (earlier version)	2014-07-01 00:22:54 +00:00
Ed Maste	30dbb3eabc	Add vt(4) to GENERIC and retire the separate VT config vt(4) and sc(4) can now coexist in the same kernel. To choose the vt driver, set the loader tunable kern.vty=vt .	2014-06-30 16:18:38 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Tycho Nightingale	896d1f7723	Add support for emulating the move instruction: "mov r/m8, imm8". Reviewed by: neel	2014-06-26 17:15:41 +00:00
Peter Grehan	cf1d80d88c	Expose the amount of resident and wired memory from the guest's vmspace. This is different than the amount shown for the process e.g. by /usr/bin/top - that is the mappings faulted in by the mmap'd region of guest memory. The values can be fetched with bhyvectl # bhyvectl --get-stats --vm=myvm ... Resident memory 413749248 Wired memory 0 ... vmm_stat.[ch] - Modify the counter code in bhyve to allow direct setting of a counter as opposed to incrementing, and providing a callback to fetch a counter's value. Reviewed by: neel	2014-06-25 22:13:35 +00:00
Konstantin Belousov	633034fe0e	Add FPU_KERN_KTHR flag to fpu_kern_enter(9), which avoids saving FPU context into memory for the kernel threads which called fpu_kern_thread(9). This allows the fpu_kern_enter() callers to not check for is_fpu_kern_thread() to get the optimization. Apply the flag to padlock(4) and aesni(4). In aesni_cipher_process(), do not leak FPU context state on error. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-23 07:37:54 +00:00
Dmitry Chagin	2dedc1281a	Revert r266925 as it can lead to instant panic at fexecve(): To allow to run the interpreter itself add a new ELF branding type. Pointed out by: kib, mjg	2014-06-17 05:29:18 +00:00
Tycho Nightingale	a026dc3fcb	Bring an overly enthusiastic KASSERT inline with the Intel SDM. Reviewed by: neel	2014-06-16 22:59:18 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Roger Pau Monné	ef409ede7b	amd64/i386: introduce APIC hooks for different APIC implementations. This is needed for Xen PV(H) guests, since there's no hardware lapic available on this kind of domains. This commit should not change functionality. Sponsored by: Citrix Systems R&D Reviewed by: jhb Approved by: gibbs amd64/include/cpu.h: amd64/amd64/mp_machdep.c: i386/include/cpu.h: i386/i386/mp_machdep.c: - Remove lapic_ipi_vectored hook from cpu_ops, since it's now implemented in the lapic hooks. amd64/amd64/mp_machdep.c: i386/i386/mp_machdep.c: - Use lapic_ipi_vectored directly, since it's now an inline function that will call the appropriate hook. x86/x86/local_apic.c: - Prefix bare metal public lapic functions with native_ and mark them as static. - Define default implementation of apic_ops. x86/include/apicvar.h: - Declare the apic_ops structure and create inline functions to access the hooks, so the change is transparent to existing users of the lapic_ functions. x86/xen/hvm.c: - Switch to use the new apic_ops.	2014-06-16 08:43:03 +00:00
Neel Natu	4e98fc9011	Disable global interrupts early so all the software state maintained by bhyve is sampled "atomically". Any interrupts after this point will be held pending by the CPU until the guest starts executing and will immediately trigger a #VMEXIT. Reviewed by: Anish Gupta (akgupt3@gmail.com)	2014-06-11 17:48:07 +00:00
Tycho Nightingale	5ebc578ba6	Replace enum forward declarations with complete definitions. Reviewed by: neel	2014-06-10 18:46:00 +00:00
Neel Natu	404874659f	Add helper functions to populate VM exit information for rendezvous and astpending exits. This is to reduce code duplication between VT-x and SVM implementations.	2014-06-10 16:45:58 +00:00
Neel Natu	0494cb1bcb	Turn on interrupt window exiting unconditionally when an ExtINT is being injected into the guest. This allows the hypervisor to inject another ExtINT or APIC vector as soon as the guest is able to process interrupts. This change is not to address any correctness issue but to guarantee that any pending APIC vector that was preempted by the ExtINT will be injected as soon as possible. Prior to this change such pending interrupts could be delayed until the next VM exit.	2014-06-10 01:38:02 +00:00
Peter Grehan	3787148758	Temporary fix for guest idle detection. Handle ExtINT injection for SVM. The HPET emulation will inject a legacy interrupt at startup, and if this isn't handled, will result in the HLT-exit code assuming there are outstanding ExtINTs and return without sleeping. svm_inj_interrupts() needs more changes to bring it up to date with the VT-x version: these are forthcoming. Reviewed by: neel	2014-06-09 21:02:48 +00:00
Neel Natu	051f2bd19d	Add reserved bit checking when doing %CR8 emulation and inject #GP if required. Pointed out by: grehan Reviewed by: tychon	2014-06-09 20:51:08 +00:00
Peter Grehan	1cc0e0eedb	Allow the TSC MSR to be accessed directly from the guest.	2014-06-07 23:08:06 +00:00
Peter Grehan	dc6610d553	Set the guest PAT MSR in the VMCB to power-on defaults. Linux guests accept the values in this register, while *BSD guests reprogram it. Default values of zero correspond to PAT_UNCACHEABLE, resulting in glacial performance. Thanks to Willem Jan Withagen for first reporting this and helping out with the investigation.	2014-06-07 23:05:12 +00:00
Neel Natu	5fcf252f41	Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained by vmm.ko. This allows the virtual machine to be restarted without having to destroy it first. Reviewed by: grehan	2014-06-07 21:36:52 +00:00
Alan Cox	dd05fa1945	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-07 17:12:26 +00:00
Tycho Nightingale	594db0024e	Support guest accesses to %cr8. Reviewed by: neel	2014-06-06 18:23:49 +00:00
Warner Losh	3f1afabf09	Restore comments accidentally removed. MFC after: 3 days	2014-06-06 04:08:55 +00:00
Peter Grehan	0df5b8cb8c	ins/outs support for SVM. Modelled on the Intel VT-x code. Remove CR2 save/restore - the guest restore/save is done in hardware, and there is no need to save/restore the host version (same as VT-x). Submitted by: neel (SVM segment descriptor 'P' bit code) Reviewed by: neel	2014-06-06 02:55:18 +00:00
Peter Grehan	72a458ccff	Allow the guest's CR2 value to be read/written. This is required for page-fault injection.	2014-06-05 06:29:18 +00:00
Peter Grehan	8c1da7e67b	Use API call when VM is detected as suspended. This fixes the (harmless) error message on exit: vmexit_suspend: invalid reason 217645057 Reviewed by: neel, Anish Gupta (akgupt3@gmail.com)	2014-06-03 22:26:46 +00:00
Peter Grehan	eee8190aab	Bring (almost) up-to-date with HEAD. - use the new virtual APIC page - update to current bhyve APIs Tested by Anish with multiple FreeBSD SMP VMs on a Phenom, and verified by myself with light FreeBSD VM testing on a Sempron 3850 APU. The issues reported with Linux guests are very likely to still be here, but this sync eliminates the skew between the project branch and CURRENT, and should help to determine the causes. Some follow-on commits will fix minor cosmetic issues. Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-06-03 06:56:54 +00:00
Peter Grehan	6cec9cad76	MFC @ r266724 An SVM update will follow this.	2014-06-03 02:34:21 +00:00
Neel Natu	95ebc360ef	Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing it implicitly in vmm.ko. Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and "--get-suspended-cpus" options. This is in preparation for being able to reset virtual machine state without having to destroy and recreate it.	2014-05-31 23:37:34 +00:00
Dmitry Chagin	5f56da1891	To allow to run the interpreter itself add a new ELF branding type. Allow Linux ABI to run ELF interpreter. MFC after: 3 days	2014-05-31 15:01:51 +00:00
Tycho Nightingale	11669a681c	If VMX isn't enabled so long as the lock bit isn't set yet in MSR IA32_FEATURE_CONTROL it still can be. Approved by: grehan (co-mentor)	2014-05-30 23:37:31 +00:00
Neel Natu	92754b1199	Remove bogus check for kmem_malloc() failure even though M_WAITOK is set. Requested by: jkim	2014-05-30 20:58:32 +00:00
Neel Natu	8e351f8a3c	Allocate a zeroed LDT. Failing to do this might result in the LDT appearing to run out of free descriptors because of random junk in the descriptor's 'sd_type' field. http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html Reviewed by: kib MFC after: 2 weeks	2014-05-30 18:59:37 +00:00
Konstantin Belousov	64e9726555	When usermode loaded non-default segment selector into the %gs, correctly prepare KGSBASE msr to restore the user descriptor base on the last swapgs during return to usermode. Reported and tested by: peterj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-29 16:18:31 +00:00
Mark Johnston	f2789bd5c7	Commit the rest of the changes that were intended to be part of r266826. X-MFC-with: r266826	2014-05-29 01:42:22 +00:00
John Baldwin	44a68c4e40	- Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the guest for which the rules regarding xsetbv emulation are known. In particular future extensions like AVX-512 have interdependencies among feature bits that could allow a guest to trigger a GP# in the host with the current approach of allowing anything the host supports. - Add proper checking of Intel MPX and AVX-512 XSAVE features in the xsetbv emulation and allow these features to be exposed to the guest if they are enabled in the host. - Expose a subset of known-safe features from leaf 0 of the structured extended features to guests if they are supported on the host including RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM. Aside from AVX-512, these features are all new instructions available for use in ring 3 with no additional hypervisor changes needed. Reviewed by: neel	2014-05-27 19:04:38 +00:00
Neel Natu	65ffa035a7	Add segment protection and limits violation checks in vie_calculate_gla() for 32-bit x86 guests. Tested using ins/outs executed in a FreeBSD/i386 guest.	2014-05-27 04:26:22 +00:00
Neel Natu	ae0780bbf1	Remove restriction on insb/insw/insl emulation. These instructions are properly emulated.	2014-05-25 02:05:23 +00:00
Neel Natu	5382c19d81	Do the linear address calculation for the ins/outs emulation using a new API function 'vie_calculate_gla()'. While the current implementation is simplistic it forms the basis of doing segmentation checks if the guest is in 32-bit protected mode.	2014-05-25 00:57:24 +00:00
Neel Natu	da11f4aa1d	Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out of the guest linear address space. These APIs in turn use a new ioctl 'VM_GLA2GPA' to convert the guest linear address to guest physical. Use the new copyin/copyout APIs when emulating ins/outs instruction in bhyve(8).	2014-05-24 23:12:30 +00:00
Neel Natu	e813a87350	Consolidate all the information needed by the guest page table walker into 'struct vm_guest_paging'. Check for canonical addressing in vmm_gla2gpa() and inject a protection fault into the guest if a violation is detected. If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to point to the root of the page tables.	2014-05-24 20:26:57 +00:00
Neel Natu	37a723a5b3	When injecting a page fault into the guest also update the guest's %cr2 to indicate the faulting linear address. If the guest PML4 entry has the PG_PS bit set then inject a page fault into the guest with the PGEX_RSV bit set in the error_code. Get rid of redundant checks for the PG_RW violations when walking the page tables.	2014-05-24 19:13:25 +00:00
Neel Natu	a7424861fb	Check for alignment check violation when processing in/out string instructions.	2014-05-23 19:59:14 +00:00
Neel Natu	d17b5104a9	Add emulation of the "outsb" instruction. NetBSD guests use this to write to the UART FIFO. The emulation is constrained in a number of ways: 64-bit only, doesn't check for all exception conditions, limited to i/o ports emulated in userspace. Some of these constraints will be relaxed in followup commits. Requested by: grehan Reviewed by: tychon (partially and a much earlier version)	2014-05-23 05:15:17 +00:00
Neel Natu	c5e423dd2e	A Centos 6.4 guest will write 0xff to the 8259 mask register before beginning the proper ICWx initialization sequence. It assumes, probably correctly, that the boot firmware has done the 8259 initialization. Since grub-bhyve does not initialize the 8259 this write to the mask register takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0). Fix this by initializing 'error' at the start of the function.	2014-05-23 05:04:50 +00:00
John Baldwin	0eb7ae8d0a	Don't permit users to request a subset of the AVX512 or MPX xsave masks. These masks are documented in the Intel Architecture Instruction Set Extensions Programming Reference (March 2014). Reviewed by: kib MFC after: 1 month	2014-05-22 18:22:02 +00:00
Neel Natu	ba6f5e23cc	Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in the VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's segment descriptors without having to exit the critical section.	2014-05-22 17:22:37 +00:00
Justin Hibbits	81e3caaf77	imgact_binmisc builds for all supported architectures, so enable it for all. Any bugs in execution will be dealt with as they crop up. MFC after: 3 weeks Relnotes: Yes	2014-05-22 05:04:40 +00:00
Neel Natu	fd949af642	Inject page fault into the guest if the page table walker detects an invalid translation for the guest linear address.	2014-05-22 03:14:54 +00:00
Neel Natu	f888763dd8	Add PG_RW check when translating a guest linear to guest physical address. Set the accessed and dirty bits in the page table entry. If it fails then restart the page table walk from the beginning. This might happen if another vcpu modifies the page tables simultaneously. Reviewed by: alc, kib	2014-05-20 20:30:28 +00:00
John Baldwin	674b6d6e0d	Add support for decoding the AMD SVM instructions.	2014-05-19 18:07:37 +00:00
Neel Natu	e4c8a13d61	Add PG_U (user/supervisor) checks when translating a guest linear address to a guest physical address. PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now checked only in non-terminal paging entries. Ignore the upper 32-bits of the CR3 for PAE paging.	2014-05-19 03:50:07 +00:00
Peter Grehan	897bb47e7b	Make the vmx asm code dtrace-fbt-friendly by - inserting frame enter/leave sequences - restructuring the vmx_enter_guest routine so that it subsumes the vm_exit_guest block, which was the #vmexit RIP and not a callable routine. Reviewed by: neel MFC after: 3 weeks	2014-05-18 03:50:17 +00:00
John Baldwin	8b3949c344	Add support for decoding rdrand and rdseed.	2014-05-17 21:10:03 +00:00
John Baldwin	355d8a2f91	Add definitions for more structured extended features as well as XSAVE Extended Features for AVX512 and MPX (Memory Protection Extensions). Obtained from: Intel's Instruction Set Extensions Programming Reference (March 2014)	2014-05-16 17:45:09 +00:00
John Baldwin	b3e9732a76	Implement a PCI interrupt router to route PCI legacy INTx interrupts to the legacy 8259A PICs. - Implement an ICH-compatible PCI interrupt router on the lpc device with 8 steerable pins configured via config space access to byte-wide registers at 0x60-63 and 0x68-6b. - For each configured PCI INTx interrupt, route it to both an I/O APIC pin and a PCI interrupt router pin. When a PCI INTx interrupt is asserted, ensure that both pins are asserted. - Provide an initial routing of PCI interrupt router (PIRQ) pins to 8259A pins (ISA IRQs) and initialize the interrupt line config register for the corresponding PCI function with the ISA IRQ as this matches existing hardware. - Add a global _PIC method for OSPM to select the desired interrupt routing configuration. - Update the _PRT methods for PCI bridges to provide both APIC and legacy PRT tables and return the appropriate table based on the configured routing configuration. Note that if the lpc device is not configured, no routing information is provided. - When the lpc device is enabled, provide ACPI PCI link devices corresponding to each PIRQ pin. - Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A pins via the ELCR. - Mark the power management SCI as level triggered. - Don't hardcode the number of elements in Packages in the source for the DSDT. iasl(8) will fill in the actual number of elements, and this makes it simpler to generate a Package with a variable number of elements. Reviewed by: tycho	2014-05-15 14:16:55 +00:00
Neel Natu	f3db4c53e6	Increase the TSS limit by one byte. The processor requires an additional byte with all bits set to 1 beyond the I/O permission bitmap. Prior to this change accessing I/O ports [0xFFF8-0xFFFF] would trigger a #GP fault even though the I/O bitmap allowed access to those ports. For more details see section "I/O Permission Bit Map" in the Intel SDM, Vol 1. Reviewed by: kib	2014-05-14 22:24:09 +00:00
Neel Natu	055fc2cb5e	Virtual machine halt detection is turned on by default. Allow it to be disabled via the tunable 'hw.vmm.halt_detection'.	2014-05-05 16:19:24 +00:00
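
Note on f3db4c53e6 (TSS limit): the arithmetic behind that change can be sketched as follows. This is only an illustration, not code from the FreeBSD tree; the function name, the macro, and the "bytes covered by the limit" parameter are hypothetical. It assumes a full bitmap of 65536 bits and models the processor as reading two consecutive bitmap bytes for each port access, which is why one extra byte must lie inside the TSS limit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One permission bit per I/O port: 65536 ports -> 8192 bytes. */
#define IOPM_BYTES	(65536 / 8)

/*
 * Hypothetical helper: returns true if the two bitmap bytes read for
 * 'port' both fall inside the region covered by the TSS limit.
 * 'bytes_within_limit' counts bytes from the start of the bitmap.
 */
static bool
iopm_read_within_limit(uint16_t port, size_t bytes_within_limit)
{
	size_t first = port / 8;	/* byte holding the port's bit */

	/* The byte after 'first' is read as well, so it must be covered. */
	return (first + 1 < bytes_within_limit);
}
```

With only IOPM_BYTES covered, ports 0xFFF8-0xFFFF fail this check because their bit lives in the last bitmap byte; covering IOPM_BYTES + 1 bytes makes them pass, which matches the behaviour described in the commit message.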

7145 Commits