freebsd-skq

Author	SHA1	Message	Date
rstone	4df7085933	Re-write bhyve's I/O MMU handling in terms of PCI RIDs Reviewed by: neel Sponsored by: Sandvine Inc	2014-04-01 14:54:43 +00:00
kib	120b1857f9	Clear the kernel grab of the FPU state on fork. The pcb_save pointer is already correctly reset to the FPU user save area, only PCB_KERNFPU flag might leak from old thread state into the new state. For creation of the user-mode thread, the change is nop since corresponding syscall code does not use FPU. On the other hand, creation of a kernel thread forks from a thread selected arbitrary from proc0, which might use FPU. Reported and tested by: Chris Torek <torek@torek.net> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-29 11:56:33 +00:00
kib	a6e5d8b248	Several fixes for the PCID implementation: - When clearing a bit for a cpuid in pmap->pm_save, ensure that the cpuid is not set in pm_active. The pm_save indicates which CPUs may have cached translations for given PCID, which implies that a CPU executing with the given pmap active have the translations cached. [1] - In smp_masked_invltlb(), pass pmap to smp_targeted_tlb_shootdown(). [1] - In invlrng_handler(), check for the special values of pcid (0 and -1) and do corresponding global or total invalidations before checking for performing PCID-specific range invalidation with INVPCID_ADDR. [2] - In invltlb_pcid_handler(), do not read %cr3 unless needed. [2] - Do minor style tweaks. [2] Submitted by: Henrik Gulbrandsen <henrik@gulbra.net> [1] Other parts sponsored by: The FreeBSD Foundation [2] Tested by: Henrik Gulbrandsen, pho MFC after: 1 week	2014-03-28 16:07:27 +00:00
emaste	d2c99117cd	Update EFI framebuffer handoff from loader Sponsored by: The FreeBSD Foundation	2014-03-27 19:43:38 +00:00
emaste	4a841fdff4	amd64: Parse the EFI memory map if present With this change (and loader.efi from the projects/uefi branch) we can now boot under qemu using the OVMF UEFI firmware image with the limitation that a serial console is required. (This is largely r246337 from the projects/uefi branch.) Sponsored by: The FreeBSD Foundation	2014-03-27 18:23:02 +00:00
neel	3e49998fdf	Add an ioctl to suspend a virtual machine (VM_SUSPEND). The ioctl can be called from any context i.e., it is not required to be called from a vcpu thread. The ioctl simply sets a state variable 'vm->suspend' to '1' and returns. The vcpus inspect 'vm->suspend' in the run loop and if it is set to '1' the vcpu breaks out of the loop with a reason of 'VM_EXITCODE_SUSPENDED'. The suspend handler waits until all 'vm->active_cpus' have transitioned to 'vm->suspended_cpus' before returning to userspace. Discussed with: grehan	2014-03-26 23:34:27 +00:00
imp	bd031ca10c	Rather than require a makeoptions DEBUG to get debug correct, add it in kern.mk, but only if we're using clang. While this option is supported by both clang and gcc, in the future there may be changes to clang which change the defaults that require a tweak to build our kernel such that other tools in our tree will work. Set a good example by forcing -gdwarf-2 only for clang builds, and only if the user hasn't specified another dwarf level already. Update UPDATING to reflect the changed state of affairs. This also keeps us from having to update all the ARM kernels to add this, and also keeps us from in the future having to update all the MIPS kernels and is one less place the user will have to know to do something special for clang and one less thing developers will need to do when moving an architecture to clang. Reviewed by: ian@ MFC after: 1 week	2014-03-25 22:08:31 +00:00
tychon	58699bc5fc	Move the atpit device model from userspace into vmm.ko for better precision and lower latency. Approved by: grehan (co-mentor)	2014-03-25 19:20:34 +00:00
bdrewery	6fcf6199a4	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
kib	7390415c58	Add change forgotten in r263475. Make dmaplimit accessible outside amd64/pmap.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 17:17:19 +00:00
kib	24c4e4a548	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, for accesses to direct map region should check for the limit by which direct map is instantiated. Second, for accesses to the kernel map, success returned from the kernacc(9) does not guarantee that consequent attempt to read or write to the checked address succeed, since other thread might invalidate the address meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return error when fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicing. The trap handler would then see a page fault from access, and recover in normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
imp	9f008568e7	Remove vestiges of knowing the ISA bus, which we gave up on around 20 years ago. Remove redunant copy of isaregs.h.	2014-03-19 21:03:04 +00:00
markj	b41aca8e4d	Only invoke fasttrap hooks for traps from user mode, and ensure that they're called with interrupts enabled. Calling fasttrap_pid_probe() with interrupts disabled can lead to deadlock if fasttrap writes to the process' address space. Reviewed by: rpaulo MFC after: 3 weeks	2014-03-19 01:27:56 +00:00
imp	ea27b8b541	In kernel config files, it is supposed to be 'options<space><tab>' not 'options<tab><tab>', per long standing (but recently not so strictly enforced) convention.	2014-03-18 14:41:18 +00:00
neel	3818d66305	When a vcpu is deactivated it must also unblock any rendezvous that may be blocked on it. This is done by issuing a wakeup after clearing the 'vcpuid' from 'active_cpus'. Also, use CPU_CLR_ATOMIC() to guarantee visibility of the updated 'active_cpus' across all host cpus.	2014-03-18 02:49:28 +00:00
neel	9e498dc116	Notify vcpus participating in the rendezvous of the pending event to ensure that they execute the rendezvous function as soon as possible.	2014-03-17 23:30:38 +00:00
imp	47e104941c	Align all comments in config files on same column. This consistency helps when bits and pieces of GENERIC from i386 or amd64 are cut and pasted into other architecture's config files (which in the case of ARM had gotten rather akimbo).	2014-03-16 15:22:52 +00:00
rwatson	33fdc14c0c	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
tychon	5460439295	Fix a race wherein the source of an interrupt vector is wrongly attributed if an ExtINT arrives during interrupt injection. Also, fix a spurious interrupt if the PIC tries to raise an interrupt before the outstanding one is accepted. Finally, improve the PIC interrupt latency when another interrupt is raised immediately after the outstanding one is accepted by creating a vmexit rather than waiting for one to occur by happenstance. Approved by: neel (co-mentor)	2014-03-15 23:09:34 +00:00
rwatson	e78b9db504	Revert a small portion of r263198 left over from local testing: don't enable PCB groups and RSS by default [yet].	2014-03-15 00:59:23 +00:00
rwatson	f411704afc	Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)	2014-03-15 00:57:50 +00:00
glebius	80e85e32a5	Remove AppleTalk support. AppleTalk was a network transport protocol for Apple Macintosh devices in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was a legacy protocol and primary networking protocol is TCP/IP. The last Mac OS X release to support AppleTalk happened in 2009. The same year routing equipment vendors (namely Cisco) end their support. Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 06:29:43 +00:00
glebius	d494babace	Remove IPX support. IPX was a network transport protocol in Novell's NetWare network operating system from late 80s and then 90s. The NetWare itself switched to TCP/IP as default transport in 1998. Later, in this century the Novell Open Enterprise Server became successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011. Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 02:58:48 +00:00
imp	bf13b5b908	Delete stray clause 3 (Advertising clause) and renumber while i'm here. Approved by: alc@	2014-03-11 23:41:35 +00:00
tychon	9affb68b8d	Don't try to return a vector to a caller that only cares if a vector is pending or not. Approved by: neel (co-mentor)	2014-03-11 22:12:12 +00:00
imp	c4c8568cd0	Remove clause 3 (the advertising clause), per the regent's letter.	2014-03-11 17:20:50 +00:00
tychon	25c8b61cfd	Replace the userspace atpic stub with a more functional vmm.ko model. New ioctls VM_ISA_ASSERT_IRQ, VM_ISA_DEASSERT_IRQ and VM_ISA_PULSE_IRQ can be used to manipulate the pic, and optionally the ioapic, pin state. Reviewed by: jhb, neel Approved by: neel (co-mentor)	2014-03-11 16:56:00 +00:00
royger	446e208ee2	xen: add a hook to perform AP startup AP startup on PVH follows the PV method, so we need to add a hook in order to diverge from bare metal. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Add hook for start_all_aps on native (using native_start_all_aps defined in mp_machdep). amd64/amd64/mp_machdep.c: - Make some variables global because they will also be used by the Xen PVH AP startup code. - Use the start_all_aps hook to start APs. - Rename start_all_aps to native_start_all_aps. amd64/include/smp.h: - Add declaration for native_start_all_aps. x86/include/init.h: - Declare start_all_aps hook in init_ops. x86/xen/pv.c: - Pick external declarations from mp_machdep. - Introduce Xen PV code to start APs on PVH. - Set start_all_aps init hook to use the Xen PVH implementation.	2014-03-11 10:27:57 +00:00
royger	419270d8a7	xen: add hook for AP bootstrap memory reservation This hook will only be implemented for bare metal, Xen doesn't require any bootstrap code since APs are started in long mode with paging enabled. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Set mp_bootaddress hook for bare metal. x86/include/init.h: - Define mp_bootaddress in init_ops.	2014-03-11 10:26:16 +00:00
royger	6b1be12234	xen: use the same hypercall mechanism for XEN and XENHVM Currently XEN (PV) and XENHVM (PVHVM) ports use different ways to issue hypercalls, unify this by filling the hypercall_page under HVM also. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/include/xen/hypercall.h: - Unify Xen hypercall code by always using the PV way. i386/i386/locore.s: - Define hypercall_page on i386 XENHVM. x86/xen/hvm.c: - Fill hypercall_page on XENHVM kernels using the HVM method (only when running as an HVM guest).	2014-03-11 10:24:13 +00:00
royger	891131cb52	xen: implement hook to fetch and parse e820 memory map e820 memory map is fetched using a hypercall under Xen PVH, so add a hook to init_ops in oder to diverge from bare metal and implement a Xen variant. Approved by: gibbs Sponsored by: Citrix Systems R&D x86/include/init.h: - Add a parse_memmap hook to init_ops, that will be called to fetch and parse the memory map. amd64/amd64/machdep.c: - Decouple the fetch and the parse of the memmap, so the parse function can be shared with Xen code. - Move code around in order to implement the parse_memmap hook. amd64/include/pc/bios.h: - Declare bios_add_smap_entries (implemented in machdep.c). x86/xen/pv.c: - Implement fetching of e820 memmap when running as a PVH guest by using the XENMEM_memory_map hypercall.	2014-03-11 10:23:03 +00:00
royger	467e743960	xen: implement an early timer for Xen PVH When running as a PVH guest, there's no emulated i8254, so we need to use the Xen PV timer as the early source for DELAY. This change allows for different implementations of the early DELAY function and implements a Xen variant for it. Approved by: gibbs Sponsored by: Citrix Systems R&D dev/xen/timer/timer.c: dev/xen/timer/timer.h: - Implement Xen early delay functions using the PV timer and declare them. x86/include/init.h: - Add hooks for early clock source initialization and early delay functions. i386/i386/machdep.c: pc98/pc98/machdep.c: amd64/amd64/machdep.c: - Set early delay hooks to use the i8254 on bare metal. - Use clock_init (that will in turn make use of init_ops) to initialize the early clock source. amd64/include/clock.h: i386/include/clock.h: - Declare i8254_delay and clock_init. i386/xen/clock.c: - Rename DELAY to i8254_delay. x86/isa/clock.c: - Introduce clock_init that will take care of initializing the early clock by making use of the init_ops hooks. - Move non ISA related delay functions to the newly introduced delay file. x86/x86/delay.c: - Add moved delay related functions. - Implement generic DELAY function that will use the init_ops hooks. x86/xen/pv.c: - Set PVH hooks for the early delay related functions in init_ops. conf/files.amd64: conf/files.i386: conf/files.pc98: - Add delay.c to the kernel build.	2014-03-11 10:20:42 +00:00
royger	4df602a6bf	amd64: introduce hook for custom preload metadata parsers Add hooks to amd64 in order to have diverging implementations, since on Xen PV the metadata is passed to the kernel in a different form. Approbed by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Define init_ops for native. - Put native code inside of native_parse_preload_data hook. - Call the parse_preload_data in order to fill the metadata info. x86/include/init.h: - Declare the init_ops struct. x86/xen/pv.c: - Declare xen_init_ops that contains the Xen PV implementation of init_ops. - Implement the parse_preload_data for Xen PVH, the info is fetched from HYPERVISOR_start_info->cmd_line as provided by Xen.	2014-03-11 10:15:25 +00:00
royger	5dd05db7ff	xen: add PV/PVH kernel entry point Add the PV/PVH entry point and the low level functions for PVH early initialization. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/genassym.c: - Add __FreeBSD_version define to assym.s so it can be used for the Xen notes. amd64/amd64/locore.S: - Make bootstack global so it can be used from Xen kernel entry point. amd64/amd64/xen-locore.S: - Add Xen notes to the kernel. - Add the Xen PV entry point, that is going to call hammer_time_xen. amd64/include/asmacros.h: - Add ELFNOTE macros. i386/xen/xen_machdep.c: - Define HYPERVISOR_start_info for the XEN i386 PV port, which is going to be used in some shared code between PV and PVH. x86/xen/hvm.c: - Define HYPERVISOR_start_info for the PVH port. x86/xen/pv.c: - Introduce hammer_time_xen which is going to perform early setup for Xen PVH: - Setup shared Xen variables start_info, shared_info and xen_store. - Set guest type. - Create initial page tables as FreeBSD expects to find them. - Call into native init function (hammer_time). xen/xen-os.h: - Declare HYPERVISOR_start_info. conf/files.amd64: - Add amd64/amd64/locore.S and x86/xen/pv.c to the list of files.	2014-03-11 10:07:01 +00:00
royger	27026f4f2a	amd64/i386: switch IPI handlers to C code. Move asm IPIs handlers to C code, so both Xen and native IPI handlers share the same code. Reviewed by: jhb Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/apic_vector.S: i386/i386/apic_vector.s: - Remove asm coded IPI handlers and instead call the newly introduced C variants. amd64/amd64/mp_machdep.c: i386/i386/mp_machdep.c: - Add C coded clones to the asm IPI handlers (moved from x86/xen/hvm.c). i386/include/smp.h: amd64/include/smp.h: - Add prototypes for the C IPI handlers. x86/xen/hvm.c: - Move the C IPI handlers to mp_machdep and call those in the Xen IPI handlers. i386/xen/mp_machdep.c: - Add dummy IPI handlers to the i386 Xen PV port (this port doesn't support SMP).	2014-03-11 10:03:29 +00:00
emaste	2ab45c505b	Disable amd64 TLB Context ID (pcid) by default for now There are a number of reports of userspace application crashes that are "solved" by setting vm.pmap.pcid_enabled=0, including Java and the x11/mate-terminal port (PR ports/184362). I originally planned to disable this only in stable/10 (in r262753), but it has been pointed out that additional crash reports on HEAD are not likely to provide new insight into the problem. The feature can easily be enabled for testing.	2014-03-05 01:34:10 +00:00
jkim	9b4d3b43ca	Move fpusave() wrapper for suspend hander to sys/amd64/amd64/fpu.c. Inspired by: jhb	2014-03-04 21:35:57 +00:00
jkim	e6f1aee9e4	Revert accidentally committed changes in 262748.	2014-03-04 20:16:00 +00:00
jkim	74dcbf5843	Properly save and restore CR0. MFC after: 3 days	2014-03-04 20:07:36 +00:00
jkim	373cea9476	Remove dead code since r230426, fix a comment, and tidy up. Reported by: jhb MFC after: 3 days	2014-03-04 19:41:16 +00:00
neel	4e6374765e	Fix a race between VMRUN() and vcpu_notify_event() due to 'vcpu->hostcpu' being updated outside of the vcpu_lock(). The race is benign and could potentially result in a missed notification about a pending interrupt to a vcpu. The interrupt would not be lost but rather delayed until the next VM exit. The vcpu's hostcpu is now updated concurrently with the vcpu state change. When the vcpu transitions to the RUNNING state the hostcpu is set to 'curcpu'. It is set to 'NOCPU' in all other cases. Reviewed by: grehan	2014-03-01 03:17:58 +00:00
jhb	4905c0f870	Correct VMware capitalization. Submitted by: joeld	2014-02-28 21:33:40 +00:00
jhb	4ce54b93eb	Workaround an apparent bug in VMWare Fusion's nested VT support where it triggers a VM exit with the exit reason of an external interrupt but without a valid interrupt set in the exit interrupt information. Tested by: Michael Dexter Reviewed by: neel MFC after: 1 week	2014-02-28 19:07:55 +00:00
neel	e01c440dae	Queue pending exceptions in the 'struct vcpu' instead of directly updating the processor-specific VMCS or VMCB. The pending exception will be delivered right before entering the guest. The order of event injection into the guest is: - hardware exception - NMI - maskable interrupt In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt window-exiting and inject it as soon as possible after the hardware exception is injected. Also since interrupts are inherently asynchronous, injecting them after the hardware exception should not affect correctness from the guest perspective. Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict it to only deliver x86 hardware exceptions. This new ioctl is now used to inject a protection fault when the guest accesses an unimplemented MSR. Discussed with: grehan, jhb Reviewed by: jhb	2014-02-26 00:52:05 +00:00
grehan	e0e0829e5e	MFC @ r259635 This brings in the "-w" option from bhyve to ignore unknown MSRs. It will make debugging Linux guests a bit easier. Suggested by: Willem Jan Withagen (wjw at digiware nl)	2014-02-25 06:29:56 +00:00
alc	ebe945ff9f	When the kernel is running in a virtual machine, it cannot rely upon the processor family to determine if the workaround for AMD Family 10h Erratum 383 should be enabled. To enable virtual machine migration among a heterogeneous collection of physical machines, the hypervisor may have been configured to report an older processor family with a reduced feature set. Effectively, the reported processor family and its features are like a "least common denominator" for the collection of machines. Therefore, when the kernel is running in a virtual machine, instead of relying upon the processor family, we now test for features that prove that the underlying processor is not affected by the erratum. (The features that we test for are unlikely to ever be emulated in software on an affected physical processor.) PR: 186061 Tested by: Simon Matter Discussed with: jhb, neel MFC after: 2 weeks	2014-02-22 18:53:42 +00:00
neel	3e0732cf3e	Add support for x2APIC virtualization assist in Intel VT-x. The vlapic.ops handler 'enable_x2apic_mode' is called when the vlapic mode is switched to x2APIC. The VT-x implementation of this handler turns off the APIC-access virtualization and enables the x2APIC virtualization in the VMCS. The x2APIC virtualization is done by allowing guest read access to a subset of MSRs in the x2APIC range. In non-root operation the processor will satisfy an 'rdmsr' access to these MSRs by reading from the virtual APIC page instead. The guest is also given write access to TPR, EOI and SELF_IPI MSRs which get special treatment in non-root operation. This is documented in the Intel SDM section titled "Virtualizing MSR-Based APIC Accesses". Enforce that APIC-write and APIC-access VM-exits are handled only if APIC-access virtualization is enabled. The one exception to this is SELF_IPI virtualization which may result in an APIC-write VM-exit.	2014-02-21 06:03:54 +00:00
neel	4626d164b8	Simplify APIC mode switching from MMIO to x2APIC. In part this is done to simplify the implementation of the x2APIC virtualization assist in VT-x. Prior to this change the vlapic allowed the guest to change its mode from xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked when the virtual machine is created. This is not very constraining because operating systems already have to deal with BIOS setting up the APIC in x2APIC mode at boot. Fix a bug in the CPUID emulation where the x2APIC capability was leaking from the host to the guest. Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore MSR accesses to the vlapic when it is in xAPIC mode. The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8) can be used to change the mode to x2APIC instead. Discussed with: grehan@	2014-02-20 01:48:25 +00:00
jhb	521737384d	A first pass at adding support for injecting hardware exceptions for emulated instructions. - Add helper routines to inject interrupt information for a hardware exception from the VM exit callback routines. - Use the new routines to inject GP and UD exceptions for invalid operations when emulating the xsetbv instruction. - Don't directly manipulate the entry interrupt info when a user event is injected. Instead, store the event info in the vmx state and only apply it during a VM entry if a hardware exception or NMI is not already pending. - While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of routines. Reviewed by: neel	2014-02-18 03:07:36 +00:00
neel	f9781635be	Handle writes to the SELF_IPI MSR by the guest when the vlapic is configured in x2apic mode. Reads to this MSR are currently ignored but should cause a general proctection exception to be injected into the vcpu. All accesses to the corresponding offset in xAPIC mode are ignored. Also, do not panic the host if there is mismatch between the trigger mode programmed in the TMR and the actual interrupt being delivered. Instead the anomaly is logged to aid debugging and to prevent a misbehaving guest from panicking the host.	2014-02-17 23:07:16 +00:00
neel	19c649a6fb	Use spinlocks to lock accesses to the vioapic. This is necessary because if the vlapic is configured in x2apic mode the vioapic_process_eoi() function is called inside the critical section established by vm_run().	2014-02-17 22:57:51 +00:00
dim	a8b6bed223	Upgrade our copy of llvm/clang to 3.4 release. This version supports all of the features in the current working draft of the upcoming C++ standard, provisionally named C++1y. The code generator's performance is greatly increased, and the loop auto-vectorizer is now enabled at -Os and -O2 in addition to -O3. The PowerPC backend has made several major improvements to code generation quality and compile time, and the X86, SPARC, ARM32, Aarch64 and SystemZ backends have all seen major feature work. Release notes for llvm and clang can be found here: <http://llvm.org/releases/3.4/docs/ReleaseNotes.html> <http://llvm.org/releases/3.4/tools/clang/docs/ReleaseNotes.html> MFC after: 1 month	2014-02-16 19:44:07 +00:00
brueffer	4c9c4234e2	Retire the nve(4) driver; nfe(4) has been the default driver for NVIDIA nForce MCP adapters for a long time. Yays: jhb, remko, yongari Nays: none on the current and stable lists	2014-02-16 12:22:43 +00:00
avg	84460fd88d	provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64 Reviewed by: jhb MFC after: 10 days X-MFC note: consider thirdparty modules depending on these symbols Sponsored by: HybridCluster	2014-02-14 15:18:37 +00:00
jhb	6e6e271c34	Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge I/O windows, the default is to preserve the firmware-assigned resources. PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture defines a PCI_RES_BUS resource type. - Add a helper API to create top-level PCI bus resource managers for each PCI domain/segment. Host-PCI bridge drivers use this API to allocate bus numbers from their associated domain. - Change the PCI bus and CardBus drivers to allocate a bus resource for their bus number from the parent PCI bridge device. - Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the full range of bus numbers from secbus to subbus from their parent bridge. The drivers also always program their primary bus register. The bridge drivers also support growing their bus range by extending the bus resource and updating subbus to match the larger range. - Add support for managing PCI bus resources to the Host-PCI bridge drivers used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib). - Define a PCI_RES_BUS resource type for amd64 and i386. Reviewed by: imp MFC after: 1 month	2014-02-12 04:30:37 +00:00
jhb	22ac1edc7b	Don't waste a page of KVA for the boot-time memory test on x86. For amd64, reuse the first page of the crashdumpmap as CMAP1/CADDR1. For i386, remove CMAP1/CADDR1 entirely and reuse CMAP3/CADDR3 for the memory test. Reviewed by: alc, peter MFC after: 2 weeks	2014-02-11 22:02:40 +00:00
jhb	0cdfc370bf	Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and XSAVE-enabled features like AVX. - Store a per-cpu guest xcr0 register. When switching to the guest FPU state, switch to the guest xcr0 value. Note that the guest FPU state is saved and restored using the host's xcr0 value and xcr0 is saved/restored "inside" of saving/restoring the guest FPU state. - Handle VM exits for the xsetbv instruction by updating the guest xcr0. - Expose the XSAVE feature to the guest only if the host has enabled XSAVE, and only advertise XSAVE features enabled by the host to the guest. This ensures that the guest will only adjust FPU state that is a subset of the guest FPU state saved and restored by the host. Reviewed by: grehan	2014-02-08 16:37:54 +00:00
neel	5b2777fc37	Add a counter to differentiate between VM-exits due to nested paging faults and instruction emulation faults.	2014-02-08 06:22:09 +00:00
neel	d79f5b61e6	Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI). If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the CPU when the VM-exit is completed. No more NMIs will be recognized until the execution of an "iret". Prior to this change the NMI handler was dispatched via a software interrupt with interrupts enabled. This meant that an interrupt could be recognized by the processor before the NMI handler completed its execution. The "iret" issued by the interrupt handler would then cause the "blocking by NMI" to be cleared prematurely. This is now fixed by handling the NMI with interrupts disabled in addition to "blocking by NMI" already established by the VM-exit.	2014-02-08 05:04:34 +00:00
jhb	f4e46bef98	Add support for FreeBSD/i386 guests under bhyve. - Similar to the hack for bootinfo32.c in userboot, define _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot. This allows userboot to load 32-bit kernels and modules. - Copy the SMAP generation code out of bootinfo64.c and into its own file so it can be shared with bootinfo32.c to pass an SMAP to the i386 kernel. - Use uint32_t instead of u_long when aligning module metadata in bootinfo32.c in userboot, as otherwise the metadata used 64-bit alignment which corrupted the layout. - Populate the basemem and extmem members of the bootinfo struct passed to 32-bit kernels. - Fix the 32-bit stack in userboot to start at the top of the stack instead of the bottom so that there is room to grow before the kernel switches to its own stack. - Push a fake return address onto the 32-bit stack in addition to the arguments normally passed to exec() in the loader. This return address is needed to convince recover_bootinfo() in the 32-bit locore code that it is being invoked from a "new" boot block. - Add a routine to libvmmapi to setup a 32-bit flat mode register state including a GDT and TSS that is able to start the i386 kernel and update bhyveload to use it when booting an i386 kernel. - Use the guest register state to determine the CPU's current instruction mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long mode) in the instruction emulation code. Update the gla2gpa() routine used when fetching instructions to handle flat mode, 32-bit paging, and PAE paging in addition to long mode paging. Don't look for a REX prefix when the CPU is in 32-bit mode, and use the detected mode to enable the existing 32-bit mode code when decoding the mod r/m byte. Reviewed by: grehan, neel MFC after: 1 month	2014-02-05 04:39:03 +00:00
tychon	c9758b2e1c	Add support for emulating the byte move and zero extend instructions: "mov r/m8, r32" and "mov r/m8, r64". Approved by: neel (co-mentor)	2014-02-05 02:01:08 +00:00
grehan	9b01253af1	Changes to the SVM code to bring it up to r259205 - Convert VMM_CTR to VCPU_CTR KTR macros - Special handling of halt, save rflags for VMM layer to emulate halt for vcpu(sleep to be awakened by interrupt or stop it) - Cleanup of RVI exit handling code Submitted by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan	2014-02-04 07:13:56 +00:00
grehan	db98868c2f	MFC @ r259205 in preparation for some SVM updates. (for real this time)	2014-02-04 06:59:08 +00:00
grehan	f115890f37	Roll back botched partial MFC :(	2014-02-04 05:03:14 +00:00
neel	8f40ca632d	Avoid doing unnecessary nested TLB invalidations. Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu and per-hostcpu. In the degenerate case where 'N' vcpus were sharing a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations. Since an 'invept' invalidates mappings for all VPIDs the first 'invept' is sufficient. Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'. If it is known that an 'invept' is going to be done before entering the guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED counts the number of 'invvpid' invalidations that were avoided because they were subsumed by an 'invept'. Discussed with: grehan	2014-02-04 02:45:08 +00:00
grehan	0ee348bd50	MFC @ r259205 in preparation for some SVM updates.	2014-02-04 02:41:54 +00:00
jhb	3f6ca218c6	Enhance the support for PCI legacy INTx interrupts and enable them in the virtio backends. - Add a new ioctl to export the count of pins on the I/O APIC from vmm to the hypervisor. - Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for ISA interrupts. - Populate the MP Table with I/O interrupt entries for any PCI INTx interrupts. - Create a _PRT table under the PCI root bridge in ACPI to route any PCI INTx interrupts appropriately. - Track which INTx interrupts are in use per-slot so that functions that share a slot attempt to distribute their INTx interrupts across the four available pins. - Implicitly mask INTx interrupts if either MSI or MSI-X is enabled and when the INTx DIS bit is set in a function's PCI command register. Either assert or deassert the associated I/O APIC pin when the state of one of those conditions changes. - Add INTx support to the virtio backends. - Always advertise the MSI capability in the virtio backends. Submitted by: neel (7) Reviewed by: neel MFC after: 2 weeks	2014-01-29 14:56:48 +00:00
jhb	0fa3283d5a	Add support for 'clac' and 'stac' to DDB's disassembler on amd64.	2014-01-27 18:53:18 +00:00
neel	872f585846	Support level triggered interrupts with VT-x virtual interrupt delivery. The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector. If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger an EOI-induced VM-exit when it is doing EOI virtualization. The EOI-induced VM-exit results in the EOI being forwarded to the vioapic so that level triggered interrupts can be properly handled. Tested by: Anish Gupta (akgupt3@gmail.com)	2014-01-25 20:58:05 +00:00
grehan	e0134ab4ff	Change RWX to XWR in comments to match intent and bit patterns in discussion of valid EPT pte protections. Discussed with: neel MFC after: 3 days	2014-01-25 06:58:41 +00:00
jhb	b2533ec507	Move <machine/apicvar.h> to <x86/apicvar.h>.	2014-01-23 20:10:22 +00:00
neel	232e34343e	Set "Interrupt Window Exiting" in the case where there is a vector to be injected into the vcpu but the VM-entry interruption information field already has the valid bit set. Pointed out by: David Reed (david.reed@tidalscale.com)	2014-01-23 06:06:50 +00:00
neel	e971318156	Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler via a software interrupt. This is safe to do because the logical processor is already cognizant of the NMI and further NMIs are blocked until the host's NMI handler executes "iret".	2014-01-22 04:03:11 +00:00
neel	daa658a5dd	There is no need to initialize the IOMMU if no passthru devices have been configured for bhyve to use. Suggested by: grehan@	2014-01-21 03:01:34 +00:00
emaste	13c3304ef6	Add VT kernel configuration to ease testing of vt(9), aka Newcons	2014-01-19 18:46:38 +00:00
neel	e74998b34d	Some processor's don't allow NMI injection if the STI_BLOCKING bit is set in the Guest Interruptibility-state field. However, there isn't any way to figure out which processors have this requirement. So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING are all clear. If any of these bits are set then enable "NMI window exiting" and inject the NMI in the VM-exit handler.	2014-01-18 21:47:12 +00:00
bryanv	31dc7c36ca	Add very simple virtio_random(4) driver to harvest entropy from host Reviewed by: markm (random bits only)	2014-01-18 06:14:38 +00:00
neel	4dec904890	If the guest exits due to a fault while it is executing IRET then restore the state of "Virtual NMI blocking" in the guest's interruptibility-state field before resuming the guest.	2014-01-18 02:20:10 +00:00
neel	77ef1cf997	If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit in the Guest Interruptibility-state VMCS field. If we fail to do this then a subsequent VM-entry will fail because it is an error to inject an NMI into the guest while "NMI Blocking" is turned on. This is described in "Checks on Guest Non-Register State" in the Intel SDM. Submitted by: David Reed (david.reed@tidalscale.com)	2014-01-17 04:21:39 +00:00
neel	0bd53a85fb	Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous can be initiated in the context of a vcpu thread or from the bhyve(8) control process. The first use of this functionality is to update the vlapic trigger-mode register when the IOAPIC pin configuration is changed. Prior to this change we would update the TMR in the virtual-APIC page at the time of interrupt delivery. But this doesn't work with Posted Interrupts because there is no way to program the EOI_exit_bitmap[] in the VMCS of the target at the time of interrupt delivery. Discussed with: grehan@	2014-01-14 01:55:58 +00:00
gavin	6265cc52b1	Remove spaces from boot messages when we print the CPU ID/Family/Stepping to match the rest of the CPU identification lines, and once again fit into 80 columns in the usual case.	2014-01-11 22:41:10 +00:00
neel	ec09639132	Enable "Posted Interrupt Processing" if supported by the CPU. This lets us inject interrupts into the guest without causing a VM-exit. This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir" to "0". The following sysctls provide information about this feature: - hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled) - hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification) Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.	2014-01-11 04:22:00 +00:00
neel	41814122f1	Enable the "Acknowledge Interrupt on VM exit" VM-exit control. This control is needed to enable "Posted Interrupts" and is present in all the Intel VT-x implementations supported by bhyve so enable it as the default. With this VM-exit control enabled the processor will acknowledge the APIC and store the vector number in the "VM-Exit Interruption Information" field. We now call the interrupt handler "by hand" through the IDT entry associated with the vector.	2014-01-11 03:14:05 +00:00
neel	00a86f71de	Don't expose 'vmm_ipinum' as a global.	2014-01-09 03:25:54 +00:00
neel	ab2de99290	Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by hardware. It is possible to turn this feature off and fall back to software emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0. We now start handling two new types of VM-exits: APIC-access: This is a fault-like VM-exit and is triggered when the APIC register access is not accelerated (e.g. apic timer CCR). In response to this we do emulate the instruction that triggered the APIC-access exit. APIC-write: This is a trap-like VM-exit which does not require any instruction emulation but it does require the hypervisor to emulate the access to the specified register (e.g. icrlo register). Introduce 'vlapic_ops' which are function pointers to vector the various vlapic operations into processor-dependent code. The 'Virtual Interrupt Delivery' feature installs 'ops' for setting the IRR bits in the virtual APIC page and to return whether any interrupts are pending for this vcpu. Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.	2014-01-07 21:04:49 +00:00
neel	23ea3a1c59	Fix a bug introduced in r260167 related to VM-exit tracing. Keep a copy of the 'rip' and the 'exit_reason' and use that when calling vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can be changed by 'vmx_exit_process()' and can lead to very misleading traces.	2014-01-07 18:53:14 +00:00
neel	b47601c298	Allow vlapic_set_intr_ready() to return a value that indicates whether or not the vcpu should be kicked to process a pending interrupt. This will be useful in the implementation of the Posted Interrupt APICv feature. Change the return value of 'vlapic_pending_intr()' to indicate whether or not an interrupt is available to be delivered to the vcpu depending on the value of the PPR. Add KTR tracepoints to debug guest IPI delivery.	2014-01-07 00:38:22 +00:00
neel	35066c7a68	Split the VMCS setup between 'vmcs_init()' that does initialization and 'vmx_vminit()' that does customization. This makes it easier to turn on optional features (e.g. APICv) without having to keep adding new parameters to 'vmcs_set_defaults()'. Reviewed by: grehan@	2014-01-06 23:16:39 +00:00
schweikh	916cce9b6e	Correct a grammo in a comment; remove white space at EOL.	2014-01-06 17:23:22 +00:00
neel	96b61bcd13	Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'. Pointed out by: rmh@	2014-01-03 19:29:33 +00:00
neel	141215c4ae	Fix a bug in the HPET emulation where a timer interrupt could be lost when the guest disables the HPET. The HPET timer interrupt is triggered from the callout handler associated with the timer. It is possible for the callout handler to be delayed before it gets a chance to execute. If the guest disables the HPET during this window then the handler never gets a chance to execute and the timer interrupt is lost. This is now fixed by injecting a timer interrupt into the guest if the callout time is detected to be in the past when the HPET is disabled.	2014-01-03 19:25:52 +00:00
kib	57c75d298a	Update the description for pmap_remove_pages() to match the modern times [1]. Assert that the pmap passed to pmap_remove_pages() is only active on current CPU. Submitted by: alc [1] Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:50:52 +00:00
kib	d891f2b2cb	Assert that accounting for the pmap resident pages does not underflow. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:49:05 +00:00
neel	e2a56f4497	Restructure the VMX code to enter and exit the guest. In large part this change hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used to enter guest context and vmx_exit_guest() is used to transition back into host context. Fix a longstanding race where a vcpu interrupt notification might be ignored if it happens after vmx_inject_interrupts() but before host interrupts are disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with host interrupts disabled to prevent this. Suggested by: grehan@	2014-01-01 21:17:08 +00:00
dim	968059a1fa	In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(), which has been unused since r189415. Reviewed by: alc MFC after: 3 days	2013-12-30 20:37:47 +00:00
neel	d6864450cf	Modify handling of writes to the vlapic LVT registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. This also implies that we need to keep a snapshot of the last value written to a LVT register. We can no longer rely on the LVT registers in the APIC page to be "clean" because the guest can write anything to it before the hypervisor has had a chance to sanitize it.	2013-12-28 00:20:55 +00:00
neel	31726cba84	Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. We can no longer rely on the value of 'icr_timer' on the APIC page in the callout handler. With APIC register virtualization the value of 'icr_timer' will be updated by the processor in guest-context before an APIC-write VM-exit. Clear the 'delivery status' bit in the ICRLO register in the write handler. With APIC register virtualization the write happens in guest-context and we cannot prevent a (buggy) guest from setting this bit.	2013-12-27 20:18:19 +00:00
dim	90c5ea9fad	In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about regval possibly being used uninitialized. Reviewed by: neel	2013-12-27 12:15:53 +00:00
neel	a369a289ee	Modify handling of write to the vlapic SVR register. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, mask all the LVT entries when the vlapic is software-disabled.	2013-12-27 07:01:42 +00:00
neel	4a551f35c9	Modify handling of writes to the vlapic ID, LDR and DFR registers. The handlers are now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, we need to ensure that the value of these registers is always correctly reflected in the virtual APIC page, because there is no VM exit when the guest reads these registers with APIC register virtualization.	2013-12-26 19:58:30 +00:00
neel	0c91ef8145	vlapic code restructuring to make it easy to support hardware-assist for APIC emulation. The vlapic initialization and cleanup is done via processor specific vmm_ops. This will allow the VT-x/SVM modules to layer any hardware-assist for APIC emulation or virtual interrupt delivery on top of the vlapic device model. Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic interrupts versus other events (e.g. NMI). This provides an opportunity to use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM) to deliver an interrupt to a guest without causing a VM-exit. Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the vlapic_xxx() counterparts directly. Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'. The 'Apic Page' is intended to be referenced from the Intel VMCS as the 'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.	2013-12-25 06:46:31 +00:00
jhb	63c019063a	Add a resume hook for bhyve that runs a function on all CPUs during resume. For Intel CPUs, invoke vmxon for CPUs that were in VMX mode at the time of suspend. Reviewed by: neel	2013-12-23 19:48:22 +00:00
jhb	8ab82a5fe1	Extend the support for local interrupts on the local APIC: - Add a generic routine to trigger an LVT interrupt that supports both fixed and NMI delivery modes. - Add an ioctl and bhyvectl command to trigger local interrupts inside a guest. In particular, a global NMI similar to that raised by SERR# or PERR# can be simulated by asserting LINT1 on all vCPUs. - Extend the LVT table in the vCPU local APIC to support CMCI. - Flesh out the local APIC error reporting a bit to cache errors and report them via ESR when ESR is written to. Add support for asserting the error LVT when an error occurs. Raise illegal vector errors when attempting to signal an invalid vector for an interrupt or when sending an IPI. - Ignore writes to reserved bits in LVT entries. - Export table entries the MADT and MP Table advertising the stock x86 config of LINT0 set to ExtInt and LINT1 wired to NMI. Reviewed by: neel (earlier version)	2013-12-23 19:29:07 +00:00
neel	e78d8c9833	Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE state before the requested state transition. This guarantees that there is exactly one ioctl() operating on a vcpu at any point in time and prevents unintended state transitions. More details available here: http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html Reviewed by: grehan Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com) MFC after: 3 days	2013-12-22 20:29:59 +00:00
neel	5e963f9c65	Consolidate the virtual apic initialization in a single function: vlapic_reset()	2013-12-22 00:08:00 +00:00
neel	fffd41bbf1	Re-arrange bits in the amd64/pmap 'pm_flags' field. The least significant 8 bits of 'pm_flags' are now used for the IPI vector to use for nested page table TLB shootdown. Previously we used IPI_AST to interrupt the host cpu which is functionally correct but could lead to misleading interrupt counts for AST handler. The AST handler was also doing a lot more than what is required for the nested page table TLB shootdown (EOI and IRET).	2013-12-20 05:50:22 +00:00
grehan	afd2340c15	Enable memory overcommit for AMD processors. - No emulation of A/D bits is required since AMD-V RVI supports A/D bits. - Enable pmap PT_RVI support(w/o PAT) which is required for memory over-commit support. - Other minor fixes: * Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while delivering an interrupt, EXITINTINFO has all the details that bhyve needs to inject the same interrupt. * SVM h/w decode assist code was incomplete - removed for now. * Some minor code clean-up (more coming). Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-12-18 23:39:42 +00:00
grehan	aacc216bbf	MFC @ r256071 This is the change where the bhyve_npt_pmap branch was merged in to head. The SVM changes to work with this will be in a follow-on submit.	2013-12-18 22:31:53 +00:00
neel	104d35e573	Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite() respectively. The vmcs_xxx() functions provide inline error checking of all accesses to the VMCS.	2013-12-18 06:24:21 +00:00
neel	e62c100b90	Add an API to deliver message signalled interrupts to vcpus. This allows callers treat the MSI 'addr' and 'data' fields as opaque and also lets bhyve implement multiple destination modes: physical, flat and clustered. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com) Reviewed by: grehan@	2013-12-16 19:59:31 +00:00
neel	4741c4b7e2	Fix typo when initializing the vlapic version register ('<<' instead of '<').	2013-12-11 06:28:44 +00:00
neel	3d4a180923	Fix x2apic support in bhyve. When the guest is bringing up the APs in the x2APIC mode a write to the ICR register will now trigger a return to userspace with an exitcode of VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC. Change the vlapic timer lock to be a spinlock because the vlapic can be accessed from within a critical section (vm run loop) when guest is using x2apic mode. Reviewed by: grehan@	2013-12-10 22:56:51 +00:00
jhb	8f255ea165	Move constants for indices in the local APIC's local vector table from apicvar.h to apicreg.h.	2013-12-09 21:08:52 +00:00
neel	e7ebb9541a	Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit. This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'. Discussed with: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 23:11:12 +00:00
neel	127d791c3e	If a vcpu disables its local apic and then executes a 'HLT' then spin down the vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore pending interrupts in the IRR if interrupts have been disabled by the guest. The interrupt cannot be injected into the guest in any case so resuming it is futile. With this change "halt" from a Linux guest works correctly. Reviewed by: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 22:18:36 +00:00
jhb	0f5dd6ee19	Fix a typo.	2013-12-05 21:58:02 +00:00
neel	d3a0e92020	The 'protection' field in the VM exit collateral for the PAGING exit is not used - get rid of it.	2013-12-03 01:21:21 +00:00
neel	a581c446d6	Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function has outgrown its original name. Originally this function simply sent an IPI to the host cpu that a vcpu was executing on but now it does a lot more than just that. Reviewed by: grehan@	2013-12-03 00:43:31 +00:00
eadler	44c01df173	Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this shifts into the sign bit. Instead use (1U << 31) which gets the expected result. This fix is not ideal as it assumes a 32 bit int, but does fix the issue for most cases. A similar change was made in OpenBSD. Discussed with: -arch, rdivacky Reviewed by: cperciva	2013-11-30 22:17:27 +00:00
pjd	4ac2e7d8d9	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days	2013-11-30 15:08:35 +00:00
neel	633860276f	Add support for level triggered interrupt pins on the vioapic. Prior to this commit level triggered interrupts would work as long as the pin was not shared among multiple interrupt sources. The vlapic now keeps track of level triggered interrupts in the trigger mode register and will forward the EOI for a level triggered interrupt to the vioapic. The vioapic in turn uses the EOI to sample the level on the pin and re-inject the vector if the pin is still asserted. The vhpet is the first consumer of level triggered interrupts and advertises that it can generate interrupts on pins 20 through 23 of the vioapic. Discussed with: grehan@	2013-11-27 22:18:08 +00:00
kib	f70df6871b	Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32 compilation results in inclusion of the header, a confict arises due to savefpu being union for i386, but used as struct in the pcb definition. The 32bit code should not need amd64 variant of the struct pcb anyway. For struct region_descriptor, use __uint64_t instead of unsigned long, as the base type for bit-fields. Unsigned long cannot have width 64 for -m32. The changes allowed to use sys/sysctl.h for cc -m32. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-26 19:38:42 +00:00
neel	89dbc92028	Add HPET device emulation to bhyve. bhyve supports a single timer block with 8 timers. The timers are all 32-bit and capable of being operated in periodic mode. All timers support interrupt delivery using MSI. Timers 0 and 1 also support legacy interrupt routing. At the moment the timers are not connected to any ioapic pins but that will be addressed in a subsequent commit. This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).	2013-11-25 19:04:51 +00:00
attilio	7ee4e910ce	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
neel	3b87354d1e	Add an ioctl to assert and deassert an ioapic pin atomically. This will be used to inject edge triggered legacy interrupts into the guest. Start using the new API in device models that use edge triggered interrupts: viz. the 8254 timer and the LPC/uart device emulation. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-23 03:56:03 +00:00
neel	023e2d37dd	Eliminate redundant information about the host cpu in bhyve's KTR trace points. This is always tracked by ktr(4) and can be displayed using the "-c" option of ktrdump(8). Discussed with: grehan	2013-11-22 18:57:22 +00:00
emaste	2740cfb802	Don't abort SMAP processing after an entry of length 0 Length 0 is not special and should just be skipped. This is the same behaviour as i386. Discussed with: jhb@ Sponsored by: The FreeBSD Foundation	2013-11-22 14:56:10 +00:00
andreast	3eb2c9b8dd	Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the CONCAT macros in SYS.h. Reviewed by: bde, kib	2013-11-21 21:25:58 +00:00
emaste	551ae45c15	Refactor amd64 startup SMAP parsing Extracted from the projects/uefi branch, this change is a reasonable cleanup and will reduce the diffs to review when bringing in the UEFI work. Reviewed by: kib@ Sponsored by: The FreeBSD Foundation	2013-11-21 19:20:08 +00:00
emaste	b29e0d359f	Disable amd64 boot time memory test by default The page presence memory test takes a long time on large memory systems and has little value on contemporary amd64 hardware. Sponsored by: The FreeBSD Foundation	2013-11-21 18:37:11 +00:00
gibbs	918d521635	Fix accounting for hw.realmem on the i386 and amd64 platforms. sys/i386/i386/machdep.c: sys/amd64/amd64/machdep.c: The value reported by FreeBSD as "real memory" when booting doesn't match what is later reported by sysctl as hw.realmem. This is due to the fact that the value printed during the boot process is fetched from smbios data (when possible), and accounts for holes in physical memory. On the other hand, the value of hw.realmem is unconditionally set to be one larger than the highest page of the physical address space. Fix this by setting hw.realmem to the same value printed during boot, this makes hw.realmem honour it's name and account properly for physical memory present in the system. Submitted by: Roger Pau Monné Reviewed by: gibbs	2013-11-15 16:05:55 +00:00
emaste	9dcbb8e88d	x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...) Debuggers may need to change PSL_RF. Note that tf_eflags is already stored in the signal context during signal handling and PSL_RF previously could be modified via sigreturn, so this change should not provide any new ability to userspace. For background see the thread at: http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html Reviewed by: jhb, kib Sponsored by: DARPA, AFRL	2013-11-14 15:37:20 +00:00
neel	384d86e888	Move the ioapic device model from userspace into vmm.ko. This is needed for upcoming in-kernel device emulations like the HPET. The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to manipulate the ioapic pin state. Discussed with: grehan@ Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-12 22:51:03 +00:00
kib	3583963461	Add bits for the AMD features from CPUID function 0x80000001 ECX, described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf. Partially based on the patch submitted by: Dmitry Luhtionov <dmitryluhtionov@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-08 16:32:30 +00:00
alc	6c60c88452	As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other words, every architecture is now auto-sizing the kmem arena. This revision changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes mandatory and the definition of VM_KMEM_SIZE becomes optional. Replace or eliminate all existing definitions of VM_KMEM_SIZE. With auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for clarity. Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to that of setting the tunable vm.kmem_size. Whereas the macros VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set the size of the kmem arena at compile-time without that value being overridden by auto-sizing. Update the nearby comments to reflect the kmem submap being replaced by the kmem arena. Stop duplicating the auto-sizing formula in every machine- dependent vmparam.h and place it in kmeminit() where auto-sizing takes place. Reviewed by: kib (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-11-08 16:25:00 +00:00
neel	94bdd999bb	Remove the 'vdev' abstraction that was meant to sit on top of device models in the kernel. This abstraction was redundant because the only device emulated inside vmm.ko is the local apic and it is always at a fixed guest physical address. Discussed with: grehan	2013-11-04 23:25:07 +00:00
neel	74368cc42d	Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these tracepoints are vcpu-specific. Add support for tracepoints that are global to the virtual machine - these tracepoints are called VM_CTRx().	2013-10-31 05:20:11 +00:00
markj	a5fb1fbfd8	Remove references to an unused fasttrap probe hook, and remove the corresponding x86 trap type. Userland DTrace probes are currently handled by the other fasttrap hooks (dtrace_pid_probe_ptr and dtrace_return_probe_ptr). Discussed with: rpaulo	2013-10-31 02:35:00 +00:00
grehan	12f698f0c7	MFC @ r256071 This is just prior to the bhyve_npt_pmap import so will allow just the change to be merged for easier debug.	2013-10-30 00:05:02 +00:00
neel	7ec78b8068	Remove unnecessary includes of <machine/pmap.h> Requested by: alc@	2013-10-29 02:25:18 +00:00
glebius	306cd242d2	Include XEN and HyperV into amd64 LINT.	2013-10-28 21:11:28 +00:00
kib	74b8996ebe	Import the driver for VT-d DMAR hardware, as specified in the revision 1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture Specification. The Extended Context and PASIDs from the rev. 2.2 are not supported, but I am not aware of any released hardware which implements them. Code does not use queued invalidation, see comments for the reason, and does not provide interrupt remapping services. Code implements the management of the guest address space per domain and allows to establish and tear down arbitrary mappings, but not partial unmapping. The superpages are created as needed, but not promoted. Faults are recorded, fault records could be obtained programmatically, and printed on the console. Implement the busdma(9) using DMARs. This busdma backend avoids bouncing and provides security against misbehaving hardware and driver bad programming, preventing leaks and corruption of the memory by wild DMA accesses. By default, the implementation is compiled into amd64 GENERIC kernel but disabled; to enable, set hw.dmar.enable=1 loader tunable. Code is written to work on i386, but testing there was low priority, and driver is not enabled in GENERIC. Even with the DMAR turned on, individual devices could be directed to use the bounce busdma with the hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable. If DMARs are capable of the pass-through translations, it is used, otherwise, an identity-mapping page table is constructed. The driver was tested on Xeon 5400/5500 chipset legacy machine, Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4), ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices. It also works with em(4) and igb(4), but there some fixes are needed for drivers, which are not committed yet. Intel GPUs do not work with DMAR (yet). Many thanks to John Baldwin, who explained me the newbus integration; Peter Holm, who did all testing and helped me to discover and understand several incredible bugs; and to Jim Harris for the access to the EDS and BWG and for listening when I have to explain my findings to somebody. Sponsored by: The FreeBSD Foundation MFC after: 1 month	2013-10-28 13:33:29 +00:00
kib	1e4f45b18a	Several small fixes for the amd64 minidump code. In report_progress(), use nitems(progress_track) instead of manually hard-coding array size. Wrap long line. In blk_write(), code verifies that ptr and pa cannot be non-zero simultaneously. The later check for the page-alignment of the ptr argument never triggers due to pa != 0 always implying ptr == NULL. I believe that the intent was to ensure that physicall address passed is page-aligned, since the address is (temporary) mapped for the duration of the page write. Clear the progress_track.visited fields when starting minidump. If minidump is restarted or taken second time during the system lifetime, progress is not printed otherwise, making operator suspectible to the dump status. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-10-27 16:31:12 +00:00
glebius	2c1ec831c9	Provide includes that are needed in these files, and before were read in implicitly via if.h -> if_var.h pollution. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2013-10-26 18:18:50 +00:00
neel	e477631325	The ASID allocation in SVM is incorrect because it allocates a single ASID for all vcpus belonging to a guest. This means that when different vcpus belonging to the same guest are executing on the same host cpu there may be "leakage" in the mappings created by one vcpu to another. The proper fix for this is being worked on and will be committed shortly. In the meantime workaround this bug by flushing the guest TLB entries on every VM entry. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-10-21 23:46:37 +00:00
neel	75369cb181	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days	2013-10-16 18:20:27 +00:00
grehan	8afdf0fcc3	Fix SVM handling of ASTPENDING, which manifested as a hang on console output (due to a missing interrupt). SVM does exit processing and then handles ASTPENDING which overwrites the already handled SVM exit cause and corrupts virtual machine state. For example, if the SVM exit was due to an I/O port access but the main loop detected an ASTPENDING, the exit would be processed as ASTPENDING and leave the device (e.g. emulated UART) for that I/O port in bad state. Submitted by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan	2013-10-16 05:43:03 +00:00
neel	2d5c7e881c	Fix the witness warning that warned against calling uiomove() while holding the 'vmmdev_mtx' in vmmdev_rw(). Rely on the 'si_threadcount' accounting to ensure that we never destroy the VM device node while it has operations in progress (e.g. ioctl, mmap etc). Reported by: grehan Reviewed by: grehan	2013-10-16 00:58:47 +00:00
gjb	1189d79d81	Document XENHVM and xenpci are mutually inclusive. Submitted by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-10-11 19:40:28 +00:00
dim	f72bec9791	In sys/amd64/amd64/pmap.c, fix several gcc warnings about uninitialized variables in reclaim_pv_chunk(). Approved by: re (marius) Reviewed by: neel, kib X-MFC-With: r256072	2013-10-08 20:04:35 +00:00
gibbs	9c8c76f921	Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id field. Perform vcpu enumeration for Xen PV and HVM environments and convert all Xen drivers to use vcpu_id instead of a hard coded assumption of the mapping algorithm (acpi or apic ID) in use. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) amd64/include/pcpu.h: i386/include/pcpu.h: Add vcpu_id to the amd64 and i386 pcpu structures. dev/xen/timer/timer.c x86/xen/xen_intr.c Use new vcpu_id instead of assuming acpi_id == vcpu_id. i386/xen/mp_machdep.c: i386/xen/mptable.c x86/xen/hvm.c: Perform Xen HVM and Xen full PV vcpu_id mapping. x86/xen/hvm.c: x86/acpica/madt.c Change SYSINIT ordering of acpi CPU enumeration so that it is guaranteed to be available at the time of Xen HVM vcpu id mapping.	2013-10-05 23:11:01 +00:00
neel	aed205d5cd	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries as mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For e.g. the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem. This is achieved by using the following mapping for EPT entries that need emulation of A/D bits: Bit Position Interpreted By PG_V 52 software (accessed bit emulation handler) PG_RW 53 software (dirty bit emulation handler) PG_A 0 hardware (aka EPT_PG_RD) PG_M 1 hardware (aka EPT_PG_WR) The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guest's with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable. Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho	2013-10-05 21:22:35 +00:00
jmg	cb6017acba	add aesni module to i386 and amd64 NOTES... Approved by: re (gjb)	2013-10-04 17:21:01 +00:00
grehan	b2189384ce	Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This is enough to get Ubuntu 12.0.4/13.0.4 to boot. Approved by: re@ (blanket)	2013-09-27 14:55:59 +00:00
kib	8e50b328e4	In pmap_clear_modify(), initialize pvh even for fictitious managed page, otherwise the small mappings loop would use uninitialized value. Note that currently pmap_clear_modify() is not called for fictitious pages. Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-24 13:52:47 +00:00
kib	28393634c5	Use the pv lists generation count to read-lock the pvh_global_lock in pmap_clear_modify(). Noted and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-24 12:26:43 +00:00
kib	2346155b1e	Ensure that the ERESTART return from the syscall reloads the registers, to make the restarted syscall instruction pass the correct arguments. PR: kern/182161 Reported by: Russ Cox <rsc@swtch.com> Sponsored by: The FreeBSD Foundation MFC after: 3 days Approved by: re (marius)	2013-09-24 12:24:48 +00:00
kib	776b37339a	Free both KVA and backing pages when freeing TSS memory. Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-23 20:14:15 +00:00
gjb	1944aacf15	Put 'device hyperv' back in amd64/GENERIC, incorrectly removed with r255736. Pointed out by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-09-21 01:07:27 +00:00
grehan	65054fa1ce	Reorder/regroup the vmm ioctl api definitions to allow some semblance of API stability and growth during the 10.* timeframe. Userland/kernel bhyve will have to be recompiled after this. Reviewed by: neel Approved by: re@ (blanket)	2013-09-21 00:27:53 +00:00
gibbs	7ed30adae7	Merge Xen PVHVM support into the GENERIC kernel config for both amd64 and i386. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: - Introduce two new CPU hooks for initialization and resume purposes. This allows us to get rid of the XENHVM ifdefs in mp_machdep, and also sets some hooks into common code that can be used by other hypervisor implementations. sys/amd64/conf/XENHVM: sys/i386/conf/XENHVM: - Remove these configs now that GENERIC has builtin support for Xen HVM. sys/kern/subr_smp.c: - Make sure there are no pending IPIs when suspending a system. sys/x86/xen/hvm.c: - Add cpu init and resume vectors that are called from mp_machdep using the new hooks. - Only clear the vcpu_info mapping data on resume. It is already clear for the BSP on a cold boot and is set correctly as APs are started. - Gate xen_hvm_init_cpu only to systems running under Xen. sys/x86/xen/xen_intr.c: - Gate the setup of event channels only to systems running under Xen.	2013-09-20 22:59:22 +00:00
davidch	e226cdd9b8	Substantial rewrite of bxe(4) to add support for the BCM57712 and BCM578XX controllers. Approved by: re MFC after: 4 weeks	2013-09-20 20:18:49 +00:00
neel	44c4dbefdb	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
gibbs	a9c07a6f67	Add support for suspend/resume/migration operations when running as a Xen PVHVM guest. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information. sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers. sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume. sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU. sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume. sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus. sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info. - Clear vcpu_info so it can be reset by APs when resuming from suspension. sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations. sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.	2013-09-20 05:06:03 +00:00
alc	88a4d0f31a	The pmap function pmap_clear_reference() is no longer used. Remove it. pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists. Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division	2013-09-20 04:30:18 +00:00
grehan	d919ede3d3	IFC @ r255692 Comment out IA32_MISC_ENABLE MSR access - this doesn't exist on AMD. Need to sort out how arch-specific MSRs will be handled.	2013-09-20 00:46:29 +00:00
grehan	073f54353f	Reconnect the hyperv drivers back into GENERIC now that the disengage driver issue has been resolved. Approved by: re@ (gjb)	2013-09-19 05:07:51 +00:00
pjd	667d7255be	Fix panic in ktrcapfail() when no capability rights are passed. While here, correct all consumers to pass NULL instead of 0 as we pass capability rights as pointers now, not uint64_t. Reported by: Daniel Peyrolon Tested by: Daniel Peyrolon Approved by: re (marius)	2013-09-18 19:26:08 +00:00
rdivacky	9e8e4eba85	Regen. Approved by: re (delphij)	2013-09-18 18:49:26 +00:00
rdivacky	fae003a069	Revert r255672, it has some serious flaws, leaking file references etc. Approved by: re (delphij)	2013-09-18 18:48:33 +00:00
rdivacky	b320aa22a6	Regen. Approved by: re (delphij)	2013-09-18 17:58:03 +00:00
rdivacky	d57db3eead	Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file. Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich. Approved by: re (delphij) Sponsored by: Google Summer of Code Submitted by: Yuri Victorovich <yuri at rawbw dot com> Tested by: Yuri Victorovich <yuri at rawbw dot com>	2013-09-18 17:56:04 +00:00
grehan	a83c14de5c	Hide TSC-deadline APIC timer support from guests. This mode isn't yet implemented in bhyve's APIC emulation. Reviewed by: neel Approved by: re@ (blanket)	2013-09-17 17:56:53 +00:00
neel	16db26f84b	Fix a bug in decoding an instruction that has an SIB byte as well as an immediate operand. The presence of an SIB byte in decoding the ModR/M field would cause 'imm_bytes' to not be set to the correct value. Fix this by initializing 'imm_bytes' independent of the ModR/M decoding. Reported by: grehan@ Approved by: re@	2013-09-17 16:06:07 +00:00
bryanv	5dc84522d4	Add vmx(4) to i386 and amd64 GENERIC Approved by: re (gjb)	2013-09-17 01:54:13 +00:00
kib	9867f4e99b	In pmap_copy(), when the copied region is mapped with superpage but does not cover entire superpage, avoid copying. Doing partial copy would require demotion, which is incompatible with the already held locks. Reported by: cperciva Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (delphij)	2013-09-16 06:15:15 +00:00
grehan	dab286d879	Pull the hyperv drivers from GENERIC until the fix to the disengage driver to make it only probe when running on hyperv is reviewed and tested. Approved by: re (rodrigc)	2013-09-14 20:38:22 +00:00
grehan	0d2d002359	Import Hyper-V paravirtualized drivers from projects/hyperv branch into head. Approved by: re@ (hrs) Obtained from: Microsoft, NetApp, and Citrix.	2013-09-13 18:47:58 +00:00
neel	348fe8d4ce	Fix a limitation in bhyve that would limit the number of virtual machines to the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a VT-d domain for a guest only if the administrator has explicitly configured one or more PCI passthru device(s). If there are no PCI passthru devices configured (the common case) then the number of virtual machines is no longer limited by the maximum number of VT-d domains. Reviewed by: grehan@ Approved by: re@	2013-09-11 07:11:14 +00:00
grehan	d5d1038c7e	IFC @ r255459	2013-09-11 00:19:16 +00:00
grehan	3d2b366a36	Go way past 11 and bump bhyve's max vCPUs to 16. This should be sufficient for 10.0 and will do until forthcoming work to avoid limitations in this area is complete. Thanks to Bela Lubkin at tidalscale for the headsup on the apic/cpu id/io apic ASL parameters that are actually hex values and broke when written as decimal when 11 vCPUs were configured. Approved by: re@	2013-09-10 03:48:18 +00:00
alc	4aecbb077c	Prior to r254304, we only began scanning the active page queue when the amount of free memory was close to the point at which we would begin reclaiming pages. Now, we continuously scan the active page queue, regardless of the amount of free memory. Consequently, we are continuously calling pmap_ts_referenced() on active pages. Prior to this change, pmap_ts_referenced() would always demote superpage mappings in order to obtain finer-grained reference information. This made sense because we were coming under memory pressure and would soon have to begin reclaiming pages. Now, however, with continuous scanning of the active page queue, these demotions are taking a toll on performance. For example, on one of my test machines, the running time for the HPCC Random Access benchmark (also known as GUPS) has increased by 54%. To address this problem, I have replaced the demotion with a heuristic for periodically clearing the reference flag on superpage mappings. Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-09-08 21:30:53 +00:00
neel	1306e6ef77	Allocate VPIDs by using the unit number allocator to keep do the bookkeeping. Also deal with VPID exhaustion by allocating out of a reserved range as the last resort.	2013-09-07 05:30:34 +00:00
grehan	0fa52a8a31	Mask off the vector from the MSI-x data word. Some o/s's set the trigger-mode level bit which results in an invalid vector and pass-thru interrupts not being delivered.	2013-09-07 03:33:36 +00:00
gibbs	437790b349	Implement PV IPIs for PVHVM guests and further converge PV and HVM IPI implmementations. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Submitted by: gibbs (misc cleanup, table driven config) Reviewed by: gibbs MFC after: 2 weeks sys/amd64/include/cpufunc.h: sys/amd64/amd64/pmap.c: Move invltlb_globpcid() into cpufunc.h so that it can be used by the Xen HVM version of tlb shootdown IPI handlers. sys/x86/xen/xen_intr.c: sys/xen/xen_intr.h: Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(), and remove the ipi vector parameter. This api allocates an event channel port that can be used for ipi services, but knows nothing of the actual ipi for which that port will be used. Removing the unused argument and cleaning up the comments surrounding its declaration helps clarify its actual role. sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: Implement a generic framework for amd64 and i386 that allows the implementation of certain CPU management functions to be selected at runtime. Currently this is only used for the ipi send function, which we optimize for Xen when running on a Xen hypervisor, but can easily be expanded to support more operations. sys/x86/xen/hvm.c: Implement Xen PV IPI handlers and operations, replacing native send IPI. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: sys/i386/include/smp.h: Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS is defined already for us in the xen interface files. NR_IPIS is only needed in one file per Xen platform and is easily inferred by the IPI vector table that is defined in those files. sys/i386/xen/mp_machdep.c: Restructure to more closely match the HVM implementation by performing table driven IPI setup.	2013-09-06 22:17:02 +00:00
bryanv	f46bf10bc5	Add vmx device to the i386 and amd64 NOTES files	2013-09-06 20:24:21 +00:00
kib	8439d9918f	Only lock pvh_global_lock read-only for pmap_page_wired_mappings(), pmap_is_modified() and pmap_is_referenced(), same as it was done for pmap_ts_referenced(). Consolidate identical code for pmap_is_modified() and pmap_is_referenced() into helper pmap_page_test_mappings(). Reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation	2013-09-06 16:53:48 +00:00
kib	b10a9f2e78	In pmap_ts_referenced(), when restarting the loop due to pv list generation changed, do not drop and immediately relock the pv list. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-09-06 16:48:34 +00:00
glebius	358d3d145a	On those machines, where sf_bufs do not represent any real object, make sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely empty functions. Reviewed by: alc, kib, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-09-06 05:37:49 +00:00
grehan	5110b054b2	Emulate reading of the IA32_MISC_ENABLE MSR, by returning the host MSR and masking off features that aren't supported. Linux reads this MSR to detect if NX has been disabled via BIOS.	2013-09-06 05:20:11 +00:00
grehan	37b6e3223f	Allow CPUID leaf 0xD to be read as zeroes. Linux reads this even though extended features aren't exposed. Support for 0xD will be expanded once AVX[2] is exposed to the guest in upcoming work.	2013-09-06 05:16:10 +00:00
pjd	029a6f5d92	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
kib	23ba68da73	Tidy up some loose ends in the PCID code: - Restore the pre-PCID TLB shootdown handlers for whole address space and single page invalidation asm code, and assign the IPI handler to them when PCID is not supported or disabled. Old handlers have linear control flow. But, still use the common return sequence. - Stop using pcpu for INVPCID descriptors in the invlrg handler. It is enough to allocate descriptors on the stack. As result, two SWAPGS instructions are shaved off from the code for Haswell+. - Fix the reverted condition in invlrng for checking of the PCID support [1], also in invlrng check that pmap is kernel pmap before performing other tests. For the kernel pmap, which provides global mappings, the INVLPG must be used for invalidation always. - Save the pre-computed pmap' %CR3 register in the struct pmap. This allows to remove several checks for pm_pcid validity when %CR3 is reloaded [2]. Noted by: gibbs [1] Discussed with: alc [2] Tested by: pho, flo Sponsored by: The FreeBSD Foundation	2013-09-04 23:31:29 +00:00
grehan	9394ef331e	IFC @ r255209	2013-09-04 20:55:56 +00:00
jhb	8f38cafe69	Add support for the 'invpcid' instruction to binutils and DDB's disassembler on amd64. MFC after: 1 month	2013-09-03 21:21:47 +00:00
kib	2dccc06e8e	Fix two build failures for non-tb configurations, UP [2] and when using gas [1]. Reported by: andreast [1], bf [2] Sponsored by: The FreeBSD Foundation	2013-08-31 19:13:21 +00:00
kib	6eb9415a25	The pm_save should be cleared on the pmap initialization, and not on the activation. Noted by: alc	2013-08-30 20:10:01 +00:00
kib	a2b5da0090	Implement support for the process-context identifiers ('PCID') on Intel CPUs. The feature tags TLB entries with the Id of the address space and allows to avoid TLB invalidation on the context switch, it is available only in the long mode. In the microbenchmarks, using the PCID decreased latency of the context switches by ~30% on SandyBridge class desktop CPUs, measured with the lat_ctx program from lmbench. If available, use INVPCID instruction when a TLB entry in non-current address space needs to be invalidated. The instruction is typically available on the Haswell. If needed, the use of PCID can be turned off with the vm.pmap.pcid_enabled loader tunable set to 0. The state of the feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl vm.pmap.pcid_save_cnt reports the number of context switches which avoided invalidating the TLB; compare with the total number of context switches, available as sysctl vm.stats.sys.v_swtch. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:59:49 +00:00
kib	f8c0849eff	Provide a wrapper for the INVPCID instruction, definition of the descriptor and symbolic names for the operation types. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:42:38 +00:00
gibbs	fcdbf70fd9	Implement vector callback for PVHVM and unify event channel implementations Re-structure Xen HVM support so that: - Xen is detected and hypercalls can be performed very early in system startup. - Xen interrupt services are implemented using FreeBSD's native interrupt delivery infrastructure. - the Xen interrupt service implementation is shared between PV and HVM guests. - Xen interrupt handlers can optionally use a filter handler in order to avoid the overhead of dispatch to an interrupt thread. - interrupt load can be distributed among all available CPUs. - the overhead of accessing the emulated local and I/O apics on HVM is removed for event channel port events. - a similar optimization can eventually, and fairly easily, be used to optimize MSI. Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure, and misc Xen cleanups: Sponsored by: Spectra Logic Corporation Unification of PV & HVM interrupt infrastructure, bug fixes, and misc Xen cleanups: Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D sys/x86/x86/local_apic.c: sys/amd64/include/apicvar.h: sys/i386/include/apicvar.h: sys/amd64/amd64/apic_vector.S: sys/i386/i386/apic_vector.s: sys/amd64/amd64/machdep.c: sys/i386/i386/machdep.c: sys/i386/xen/exception.s: sys/x86/include/segments.h: Reserve IDT vector 0x93 for the Xen event channel upcall interrupt handler. On Hypervisors that support the direct vector callback feature, we can request that this vector be called directly by an injected HVM interrupt event, instead of a simulated PCI interrupt on the Xen platform PCI device. This avoids all of the overhead of dealing with the emulated I/O APIC and local APIC. It also means that the Hypervisor can inject these events on any CPU, allowing upcalls for different ports to be handled in parallel. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: Map Xen per-vcpu area during AP startup. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: Increase the FreeBSD IRQ vector table to include space for event channel interrupt sources. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: Remove Xen HVM per-cpu variable data. These fields are now allocated via the dynamic per-cpu scheme. See xen_intr.c for details. sys/amd64/include/xen/hypercall.h: sys/dev/xen/blkback/blkback.c: sys/i386/include/xen/xenvar.h: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/xen/gnttab.c: Prefer FreeBSD primatives to Linux ones in Xen support code. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: sys/dev/xen/balloon/balloon.c: sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/console/xencons_ring.c: sys/dev/xen/control/control.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/xenpci/xenpci.c: sys/i386/i386/machdep.c: sys/i386/include/pmap.h: sys/i386/include/xen/xenfunc.h: sys/i386/isa/npx.c: sys/i386/xen/clock.c: sys/i386/xen/mp_machdep.c: sys/i386/xen/mptable.c: sys/i386/xen/xen_clock_util.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/xen_rtc.c: sys/xen/evtchn/evtchn_dev.c: sys/xen/features.c: sys/xen/gnttab.c: sys/xen/gnttab.h: sys/xen/hvm.h: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_if.m: sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusvar.h: sys/xen/xenstore/xenstore.c: sys/xen/xenstore/xenstore_dev.c: sys/xen/xenstore/xenstorevar.h: Pull common Xen OS support functions/settings into xen/xen-os.h. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: Remove constants, macros, and functions unused in FreeBSD's Xen support. sys/xen/xen-os.h: sys/i386/xen/xen_machdep.c: sys/x86/xen/hvm.c: Introduce new functions xen_domain(), xen_pv_domain(), and xen_hvm_domain(). These are used in favor of #ifdefs so that FreeBSD can dynamically detect and adapt to the presence of a hypervisor. The goal is to have an HVM optimized GENERIC, but more is necessary before this is possible. sys/amd64/amd64/machdep.c: sys/dev/xen/xenpci/xenpcivar.h: sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/sys/kernel.h: Refactor magic ioport, Hypercall table and Hypervisor shared information page setup, and move it to a dedicated HVM support module. HVM mode initialization is now triggered during the SI_SUB_HYPERVISOR phase of system startup. This currently occurs just after the kernel VM is fully setup which is just enough infrastructure to allow the hypercall table and shared info page to be properly mapped. sys/xen/hvm.h: sys/x86/xen/hvm.c: Add definitions and a method for configuring Hypervisor event delievery via a direct vector callback. sys/amd64/include/xen/xen-os.h: sys/x86/xen/hvm.c: sys/conf/files: sys/conf/files.amd64: sys/conf/files.i386: Adjust kernel build to reflect the refactoring of early Xen startup code and Xen interrupt services. sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: sys/dev/xen/control/control.c: sys/dev/xen/evtchn/evtchn_dev.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/xen/xenstore/xenstore.c: sys/xen/evtchn/evtchn_dev.c: sys/dev/xen/console/console.c: sys/dev/xen/console/xencons_ring.c Adjust drivers to use new xen_intr_*() API. sys/dev/xen/blkback/blkback.c: Since blkback defers all event handling to a taskqueue, convert this task queue to a "fast" taskqueue, and schedule it via an interrupt filter. This avoids an unnecessary ithread context switch. sys/xen/xenstore/xenstore.c: The xenstore driver is MPSAFE. Indicate as much when registering its interrupt handler. sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbusvar.h: Remove unused event channel APIs. sys/xen/evtchn.h: Remove all kernel Xen interrupt service API definitions from this file. It is now only used for structure and ioctl definitions related to the event channel userland device driver. Update the definitions in this file to match those from NetBSD. Implementing this interface will be necessary for Dom0 support. sys/xen/evtchn/evtchnvar.h: Add a header file for implemenation internal APIs related to managing event channels event delivery. This is used to allow, for example, the event channel userland device driver to access low-level routines that typical kernel consumers of event channel services should never access. sys/xen/interface/event_channel.h: sys/xen/xen_intr.h: Standardize on the evtchn_port_t type for referring to an event channel port id. In order to prevent low-level event channel APIs from leaking to kernel consumers who should not have access to this data, the type is defined twice: Once in the Xen provided event_channel.h, and again in xen/xen_intr.h. The double declaration is protected by __XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared twice within a given compilation unit. sys/xen/xen_intr.h: sys/xen/evtchn/evtchn.c: sys/x86/xen/xen_intr.c: sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/xenpci/xenpcivar.h: New implementation of Xen interrupt services. This is similar in many respects to the i386 PV implementation with the exception that events for bound to event channel ports (i.e. not IPI, virtual IRQ, or physical IRQ) are further optimized to avoid mask/unmask operations that aren't necessary for these edge triggered events. Stubs exist for supporting physical IRQ binding, but will need additional work before this implementation can be fully shared between PV and HVM. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: sys/i386/xen/mp_machdep.c sys/x86/xen/hvm.c: Add support for placing vcpu_info into an arbritary memory page instead of using HYPERVISOR_shared_info->vcpu_info. This allows the creation of domains with more than 32 vcpus. sys/i386/i386/machdep.c: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/exception.s: Add support for new event channle implementation.	2013-08-29 19:52:18 +00:00
alc	aa9a7bb9e6	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measureable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division	2013-08-29 15:49:05 +00:00
neel	99ab2bf08e	Add support for emulating the byte move instruction "mov r/m8, r8". This emulation is required when dumping MMIO space via the ddb "examine" command.	2013-08-27 16:49:20 +00:00
grehan	74160a9a77	Add in last remaining files to get AMD-SVM operational. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-23 00:37:26 +00:00
grehan	f9c9379bb7	HLT_IGNORED stat is used by both vmx and svm - move to common stats. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-22 22:29:27 +00:00
grehan	c954556ddf	Handle VM_PROT_NONE in nested page table code. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-22 22:26:46 +00:00
kib	05a9dff802	Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release(). Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:12:24 +00:00
kib	076969f54a	Use the generation count of the pv list to work around LOR between pmap lock and pv list lock, and use the shared locking on pvh_global_lock in pmap_remove_write(), same as it was done for pmap_ts_referenced(). Noted and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:05:31 +00:00
obrien	b450ec770a	The PADLOCK_RNG and RDRAND_RNG kernel options are now devices. Thus "device padlock_rng" and "device rdrand_rng" should be used instead of "options PADLOCK_RNG" & "options RDRAND_RNG". Requested by: so@ (des) Submitted by: obrien, arthurmesh@gmail.com Obtained from: Juniper Networks	2013-08-21 22:43:29 +00:00
jkim	f4e86b0a9f	Reimplement atomic operations on PDEs and PTEs in pmap.h. This change significantly reduces duplicate code and make it easier to read. Reviewed by: alc, bde	2013-08-21 22:40:29 +00:00
jkim	18751f3bf8	Remove empty lines before return statements for style consistency.	2013-08-21 22:05:58 +00:00
jkim	f354dbfd86	Implement atomic_swap() and atomic_testandset(). Reviewed by: arch, bde, jilles, kib	2013-08-21 22:03:06 +00:00
jkim	a0c0cc666e	- Remove the "a" constraint from main output operand for atomic_cmpset(). - Use "+" modifier for the "expect" because it is also an output (unused).	2013-08-21 21:30:06 +00:00
jkim	8c10ae99f8	Use '+' modifier for a memory operand that is both an input and an output. It was actually done in r86301 but reverted in r150182 because GCC 3.x was not able to handle it for a memory operand. Apparently, this problem was fixed in GCC 4.1+ and several contrib sources already rely on this feature.	2013-08-21 21:14:16 +00:00
jkim	dd0ac928f4	Remove bogus labels. No functional change.	2013-08-21 20:49:46 +00:00
jkim	2deff14808	Use consistent style. No functional change.	2013-08-21 20:43:50 +00:00
neel	44099f4092	Do not create superpage mappings in the iommu. This is a workaround to hide the fact that we do not have any code to demote a superpage mapping before we unmap a single page that is part of the superpage.	2013-08-20 06:46:40 +00:00
neel	15659a9ddf	Extract the location of the remapping hardware units from the ACPI DMAR table. Submitted by: Gopakumar T (gopakumar_thekkedath@yahoo.co.in)	2013-08-20 06:20:05 +00:00
neel	bea80a701c	Fix breakage caused by r254466 in minidumpsys(). r254466 increased the KVA from 512GB to 2TB which requires 4 PDP pages as opposed to a single one before the change. This broke minidumpsys() since it assumed that the entire KVA could be addressed via a single PDP page. Fix this by obtaining the address of the PDP page from the PML4 entry associated with the KVA being dumped. Reported by: pho Submitted by: kib Pointy hat to: neel	2013-08-20 02:09:26 +00:00
kib	17c7e12452	When code from r254064 in pmap_ts_referenced() drops pv lock and blocks on a pmap lock, pmap_release() might proceed in parallel and destroy the pmap mutex, since unlocked pv lock allows to remove pv entry owned by the pmap. For now, gate the pmap_release() on write-locked pvh_global_lock. Since pmap_ts_release() does not unlock the global lock, pmap_release() would not destroy pmap mutex until the pmap_ts_referenced() finished. We cannot enter pmap_ts_referenced() and encounter a pv entry for the destroyed pmap if pmap_release() passed the global lock gate, since pmap_remove_pages() would finish earlier. Reported by: jeff, pho Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-18 21:36:22 +00:00
pjd	ae1f4cb9ce	Add process descriptors support to the GENERIC kernel. It is already being used by the tools in base systems and with sandboxing more and more tools the usage should only increase. Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org> Sponsored by: Google Summer of Code 2013 MFC after: 1 month	2013-08-18 10:21:29 +00:00
neel	1038bfa7cf	Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB. Bump up the KVA size proportionally from 512GB to 2TB. The number of page table pages used by the direct map is now calculated at run time based on 'Maxmem'. This means the small memory systems will not see any additional tax in terms of page table pages for the direct map. However all amd64 systems, regardless of the memory size, will use 3 more pages to accomodate the bump in the KVA size. More details available here: http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html Tested with the following configurations: - Sandybridge server with 64GB of memory. - bhyve VM with 64MB of memory. - bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling right up against the 4TB maximum memory limit. Discussed on: hackers@, current@ Submitted by: Chris Torek (torek@torek.net)	2013-08-17 19:49:08 +00:00
jilles	fd29e78a68	libc: Access _logname_valid more efficiently. The variable _logname_valid is not exported via the version script; therefore, change C and i386/amd64 assembler code to remove indirection (which allowed interposition). This makes the code slightly smaller and faster. Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC, there is no place containing the address of each variable, so there is no possible definition for PIC_GOT.	2013-08-17 19:24:58 +00:00
brooks	2e54cfb79a	Use an ANSI C definition of initializecpucache() to match the declaration and the rest of the file.	2013-08-15 17:44:44 +00:00
jkim	20df47e877	Merge acpica_machdep.h for amd64 and i386 and move to x86. In fact, these two files were functionally identical.	2013-08-13 22:05:10 +00:00
jkim	c55a3ec3ad	Tidy up global locks for ACPICA. There is no functional change.	2013-08-13 21:34:03 +00:00
kib	4675fcfce0	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-10 17:36:42 +00:00
attilio	e9f37cac74	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
attilio	16c7563cf4	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
avg	6f05d223ad	follow up to r254051 - update powerpc/GENERIC64 as well, suggested by mdf - update comments so that they make sense after the change, suggested by jhb X-MFC after: never (change specific to head)	2013-08-09 08:11:09 +00:00
neel	541895143d	Use local variables with the appropriate types and eliminate a bunch of casts. This is a cosmetic change but it does help with a proposed change to increase the maximum size of physical memory supported on amd64 platforms. Submitted by: Chris Torek (torek@torek.net)	2013-08-08 03:17:39 +00:00
kib	8de1718b60	Split the pagequeues per NUMA domains, and split pageademon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-07 16:36:38 +00:00
kib	a3142db9ac	Change the pmap_ts_referenced() method of amd64 pmap to use shared pvh_global_lock. This allows the method to be executed in parallel, avoiding undue contention on the pvh_global_lock for the multithreaded pagedaemon. The pmap_ts_referenced() function has to inspect the page mappings for several pmaps, which need to be locked while pv list lock is owned. This contradicts to the lock order, where pmap lock is before pv list lock. Introduce the generation count for the pv list of the page or superpage, which indicate any change in the pv list, and, as usual, perform restart of the iteration if generation changed while pv lock was dropped for blocking acquire of a pmap lock. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-07 16:33:15 +00:00
avg	0255b98224	enable KDB_TRACE in GENERICs KDB_TRACE is not an alternative to DDB/etc, they are complementary. So I do not see any reason to not enable KDB_TRACE by default. X-MFC after: never (change specific to head)	2013-08-07 08:03:50 +00:00
jeff	de4ecca213	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
grehan	baae5efeab	IFC @ r254014	2013-08-07 00:09:49 +00:00
jeff	baea70e2d2	- Introduce a specific function, pmap_remove_kernel_pde, for removing huge pages in the kernel's address space. This works around several asserts from pmap_demote_pde_locked that did not apply and gave false warnings. Discovered by: pho Reviewed by: alc Sponsored by: EMC / Isilon Storage Division	2013-08-05 00:28:03 +00:00
grehan	be25b04bb5	Follow-up commit to fix CR0 issues. Maintain architectural state on CR vmexits by guaranteeing that EFER, CR0 and the VMCS entry controls are all in sync when transitioning to IA-32e mode. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)	2013-08-03 03:16:42 +00:00
grehan	fd121a9c7b	IFC @ r253862 - change the SI_SUB_RUN_SCHEDULER sysinits in hv_utilc and hv_netvsc_drv_freebsd.c to SI_SUB_KTHREAD_IDLE, since the former is no longer in FreeBSD. The use of these SYSINITs can probably be removed.	2013-08-01 22:09:57 +00:00
grehan	41f621778a	Moved clearing of vmm_initialized to avoid the case of unloading the module while VMs existed. This would result in EBUSY, but would prevent further operations on VMs resulting in the module being impossible to unload. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com) Reviewed by: grehan, neel	2013-08-01 05:59:28 +00:00
grehan	045ef38328	Correctly maintain the CR0/CR4 shadow registers. This was exposed with AP spinup of Linux, and booting OpenBSD, where the CR0 register is unconditionally written to prior to the longjump to enter protected mode. The CR-vmexit handling was not updating CPU state which resulted in a vmentry failure with invalid guest state. A follow-on submit will fix the CPU state issue, but this fix prevents the CR-vmexit prior to entering protected mode by properly initializing and maintaining CR* state. Reviewed by: neel Reported by: Gopakumar.T @ netapp	2013-08-01 01:18:51 +00:00
obrien	7999076e3e	Back out r253779 & r253786.	2013-07-31 17:21:18 +00:00
obrien	721ce839c7	Decouple yarrow from random(4) device. * Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option. The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow. * random(4) device doesn't really depend on rijndael-. Yarrow, however, does. Add random_adaptors.[ch] which is basically a store of random_adaptor's. random_adaptor is basically an adapter that plugs in to random(4). random_adaptor can only be plugged in to random(4) very early in bootup. Unplugging random_adaptor from random(4) is not supported, and is probably a bad idea anyway, due to potential loss of entropy pools. We currently have 3 random_adaptors: + yarrow + rdrand (ivy.c) + nehemeiah * Remove platform dependent logic from probe.c, and move it into corresponding registration routines of each random_adaptor provider. probe.c doesn't do anything other than picking a specific random_adaptor from a list of registered ones. * If the kernel doesn't have any random_adaptor adapters present then the creation of /dev/random is postponed until next random_adaptor is kldload'ed. * Fix randomdev_soft.c to refer to its own random_adaptor, instead of a system wide one. Submitted by: arthurmesh@gmail.com, obrien Obtained from: Juniper Networks Reviewed by: obrien	2013-07-29 20:26:27 +00:00
avg	4e6c4b2a36	Revert r253748,253749 This WIP should not have been committed yet. Pointyhat to: avg	2013-07-28 18:44:17 +00:00
avg	ea04b46439	put contents of cpu.h under _KERNEL no userland-serviceable parts inside MFC after: 20 days	2013-07-28 18:32:27 +00:00
avg	425573ccba	x86: detect mwait capabilities and extensions, when present Reviewed by: kib (earlier amd64-only version) MFC after: 2 weeks	2013-07-28 17:54:42 +00:00
jeff	c8d343ae8b	- Use kmem_malloc rather than kmem_alloc() for GDT/LDT/tss allocations etc. This eliminates some unusual uses of that API in favor of more typical uses of kmem_malloc(). Discussed with: kib/alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-07-26 19:06:14 +00:00
neel	8a3e57621d	Add support for emulation of the "or r/m, imm8" instruction. Submitted by: Zhixiang Yu (zxyu.core@gmail.com) Obtained from: GSoC 2013 (AHCI device emulation for bhyve)	2013-07-23 23:43:00 +00:00
neel	50750f842b	Fix a bug introduced in r252646 that causes a page with the PG_PTE_PAT bit set to be interpreted as a superpage. This is because PG_PTE_PAT is at the same bit position in PTE as PG_PS is in a PDE. This caused a number of regressions on amd64 systems: panic when starting X applications, freeze during shutdown etc. Pointy hat to: me Tested by: gperez@entel.upc.edu, joel, dumbbell Reviewed by: kib	2013-07-23 22:17:00 +00:00
grehan	e5517eb7cf	First cut at adding the hyperv drivers to GENERIC. The files inventory should probably have the modules split out into net/storage/common etc as the modules build is, but this will do for now.	2013-07-19 05:32:58 +00:00
kib	c833aa741f	MFi386: add ddb "show sysregs" command. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-15 06:30:57 +00:00

... 3 4 5 6 7 ...

7112 Commits