freebsd-nq

Author	SHA1	Message	Date
Tycho Nightingale	896d1f7723	Add support for emulating the move instruction: "mov r/m8, imm8". Reviewed by: neel	2014-06-26 17:15:41 +00:00
Peter Grehan	cf1d80d88c	Expose the amount of resident and wired memory from the guest's vmspace. This is different than the amount shown for the process e.g. by /usr/bin/top - that is the mappings faulted in by the mmap'd region of guest memory. The values can be fetched with bhyvectl # bhyvectl --get-stats --vm=myvm ... Resident memory 413749248 Wired memory 0 ... vmm_stat.[ch] - Modify the counter code in bhyve to allow direct setting of a counter as opposed to incrementing, and providing a callback to fetch a counter's value. Reviewed by: neel	2014-06-25 22:13:35 +00:00
Konstantin Belousov	633034fe0e	Add FPU_KERN_KTHR flag to fpu_kern_enter(9), which avoids saving FPU context into memory for the kernel threads which called fpu_kern_thread(9). This allows the fpu_kern_enter() callers to not check for is_fpu_kern_thread() to get the optimization. Apply the flag to padlock(4) and aesni(4). In aesni_cipher_process(), do not leak FPU context state on error. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-23 07:37:54 +00:00
Dmitry Chagin	2dedc1281a	Revert r266925 as it can lead to instant panic at fexecve(): To allow to run the interpreter itself add a new ELF branding type. Pointed out by: kib, mjg	2014-06-17 05:29:18 +00:00
Tycho Nightingale	a026dc3fcb	Bring an overly enthusiastic KASSERT inline with the Intel SDM. Reviewed by: neel	2014-06-16 22:59:18 +00:00
Attilio Rao	3ae10f7477	- Modify vm_page_unwire() and vm_page_enqueue() to directly accept the queue where to enqueue pages that are going to be unwired. - Add stronger checks to the enqueue/dequeue for the pagequeues when adding and removing pages to them. Of course, for unmanaged pages the queue parameter of vm_page_unwire() will be ignored, just as the active parameter today. This makes adding new pagequeues quicker. This change effectively modifies the KPI. __FreeBSD_version will be, however, bumped just when the full cache of free pages will be evicted. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2014-06-16 18:15:27 +00:00
Roger Pau Monné	ef409ede7b	amd64/i386: introduce APIC hooks for different APIC implementations. This is needed for Xen PV(H) guests, since there's no hardware lapic available on this kind of domains. This commit should not change functionality. Sponsored by: Citrix Systems R&D Reviewed by: jhb Approved by: gibbs amd64/include/cpu.h: amd64/amd64/mp_machdep.c: i386/include/cpu.h: i386/i386/mp_machdep.c: - Remove lapic_ipi_vectored hook from cpu_ops, since it's now implemented in the lapic hooks. amd64/amd64/mp_machdep.c: i386/i386/mp_machdep.c: - Use lapic_ipi_vectored directly, since it's now an inline function that will call the appropiate hook. x86/x86/local_apic.c: - Prefix bare metal public lapic functions with native_ and mark them as static. - Define default implementation of apic_ops. x86/include/apicvar.h: - Declare the apic_ops structure and create inline functions to access the hooks, so the change is transparent to existing users of the lapic_ functions. x86/xen/hvm.c: - Switch to use the new apic_ops.	2014-06-16 08:43:03 +00:00
Neel Natu	4e98fc9011	Disable global interrupts early so all the software state maintained by bhyve is sampled "atomically". Any interrupts after this point will be held pending by the CPU until the guest starts executing and will immediately trigger a #VMEXIT. Reviewed by: Anish Gupta (akgupt3@gmail.com)	2014-06-11 17:48:07 +00:00
Tycho Nightingale	5ebc578ba6	Replace enum forward declarations with complete definitions. Reviewed by: neel	2014-06-10 18:46:00 +00:00
Neel Natu	404874659f	Add helper functions to populate VM exit information for rendezvous and astpending exits. This is to reduce code duplication between VT-x and SVM implementations.	2014-06-10 16:45:58 +00:00
Neel Natu	0494cb1bcb	Turn on interrupt window exiting unconditionally when an ExtINT is being injected into the guest. This allows the hypervisor to inject another ExtINT or APIC vector as soon as the guest is able to process interrupts. This change is not to address any correctness issue but to guarantee that any pending APIC vector that was preempted by the ExtINT will be injected as soon as possible. Prior to this change such pending interrupts could be delayed until the next VM exit.	2014-06-10 01:38:02 +00:00
Peter Grehan	3787148758	Temporary fix for guest idle detection. Handle ExtINT injection for SVM. The HPET emulation will inject a legacy interrupt at startup, and if this isn't handled, will result in the HLT-exit code assuming there are outstanding ExtINTs and return without sleeping. svm_inj_interrupts() needs more changes to bring it up to date with the VT-x version: these are forthcoming. Reviewed by: neel	2014-06-09 21:02:48 +00:00
Neel Natu	051f2bd19d	Add reserved bit checking when doing %CR8 emulation and inject #GP if required. Pointed out by: grehan Reviewed by: tychon	2014-06-09 20:51:08 +00:00
Peter Grehan	1cc0e0eedb	Allow the TSC MSR to be accessed directly from the guest.	2014-06-07 23:08:06 +00:00
Peter Grehan	dc6610d553	Set the guest PAT MSR in the VMCB to power-on defaults. Linux guests accept the values in this register, while *BSD guests reprogram it. Default values of zero correspond to PAT_UNCACHEABLE, resulting in glacial performance. Thanks to Willem Jan Withagen for first reporting this and helping out with the investigation.	2014-06-07 23:05:12 +00:00
Neel Natu	5fcf252f41	Add ioctl(VM_REINIT) to reinitialize the virtual machine state maintained by vmm.ko. This allows the virtual machine to be restarted without having to destroy it first. Reviewed by: grehan	2014-06-07 21:36:52 +00:00
Alan Cox	dd05fa1945	Add a page size field to struct vm_page. Increase the page size field when a partially populated reservation becomes fully populated, and decrease this field when a fully populated reservation becomes partially populated. Use this field to simplify the implementation of pmap_enter_object() on amd64, arm, and i386. On all architectures where we support superpages, the cost of creating a superpage mapping is roughly the same as creating a base page mapping. For example, both kinds of mappings entail the creation of a single PTE and PV entry. With this in mind, use the page size field to make the implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little smarter. Previously, if MAP_PREFAULT_PARTIAL was specified to vm_map_pmap_enter(), that function would only map base pages. Now, it will create up to 96 base page or superpage mappings. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2014-06-07 17:12:26 +00:00
Tycho Nightingale	594db0024e	Support guest accesses to %cr8. Reviewed by: neel	2014-06-06 18:23:49 +00:00
Warner Losh	3f1afabf09	Restore comments accidentally removed. MFC after: 3 days	2014-06-06 04:08:55 +00:00
Peter Grehan	0df5b8cb8c	ins/outs support for SVM. Modelled on the Intel VT-x code. Remove CR2 save/restore - the guest restore/save is done in hardware, and there is no need to save/restore the host version (same as VT-x). Submitted by: neel (SVM segment descriptor 'P' bit code) Reviewed by: neel	2014-06-06 02:55:18 +00:00
Peter Grehan	72a458ccff	Allow the guest's CR2 value to be read/written. This is required for page-fault injection.	2014-06-05 06:29:18 +00:00
Peter Grehan	8c1da7e67b	Use API call when VM is detected as suspended. This fixes the (harmless) error message on exit: vmexit_suspend: invalid reason 217645057 Reviewed by: neel, Anish Gupta (akgupt3@gmail.com)	2014-06-03 22:26:46 +00:00
Peter Grehan	eee8190aab	Bring (almost) up-to-date with HEAD. - use the new virtual APIC page - update to current bhyve APIs Tested by Anish with multiple FreeBSD SMP VMs on a Phenom, and verified by myself with light FreeBSD VM testing on a Sempron 3850 APU. The issues reported with Linux guests are very likely to still be here, but this sync eliminates the skew between the project branch and CURRENT, and should help to determine the causes. Some follow-on commits will fix minor cosmetic issues. Submitted by: Anish Gupta (akgupt3@gmail.com)	2014-06-03 06:56:54 +00:00
Peter Grehan	6cec9cad76	MFC @ r266724 An SVM update will follow this.	2014-06-03 02:34:21 +00:00
Neel Natu	95ebc360ef	Activate vcpus from bhyve(8) using the ioctl VM_ACTIVATE_CPU instead of doing it implicitly in vmm.ko. Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and "--get-suspended-cpus" options. This is in preparation for being able to reset virtual machine state without having to destroy and recreate it.	2014-05-31 23:37:34 +00:00
Dmitry Chagin	5f56da1891	To allow to run the interpreter itself add a new ELF branding type. Allow Linux ABI to run ELF interpreter. MFC after: 3 days	2014-05-31 15:01:51 +00:00
Tycho Nightingale	11669a681c	If VMX isn't enabled so long as the lock bit isn't set yet in MSR IA32_FEATURE_CONTROL it still can be. Approved by: grehan (co-mentor)	2014-05-30 23:37:31 +00:00
Neel Natu	92754b1199	Remove bogus check for kmem_malloc() failure even though M_WAITOK is set. Requested by: jkim	2014-05-30 20:58:32 +00:00
Neel Natu	8e351f8a3c	Allocate a zeroed LDT. Failing to do this might result in the LDT appearing to run out of free descriptors because of random junk in the descriptor's 'sd_type' field. http://lists.freebsd.org/pipermail/freebsd-amd64/2014-May/016088.html Reviewed by: kib MFC after: 2 weeks	2014-05-30 18:59:37 +00:00
Konstantin Belousov	64e9726555	When usermode loaded non-default segment selector into the %gs, correctly prepare KGSBASE msr to restore the user descriptor base on the last swapgs during return to usermode. Reported and tested by: peterj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-05-29 16:18:31 +00:00
Mark Johnston	f2789bd5c7	Commit the rest of the changes that were intended to be part of r266826. X-MFC-with: r266826	2014-05-29 01:42:22 +00:00
John Baldwin	44a68c4e40	- Rework the XSAVE/XRSTOR emulation to only expose XCR0 features to the guest for which the rules regarding xsetbv emulation are known. In particular future extensions like AVX-512 have interdependencies among feature bits that could allow a guest to trigger a GP# in the host with the current approach of allowing anything the host supports. - Add proper checking of Intel MPX and AVX-512 XSAVE features in the xsetbv emulation and allow these features to be exposed to the guest if they are enabled in the host. - Expose a subset of known-safe features from leaf 0 of the structured extended features to guests if they are supported on the host including RDFSBASE/RDGSBASE, BMI1/2, AVX2, AVX-512, HLE, ERMS, and RTM. Aside from AVX-512, these features are all new instructions available for use in ring 3 with no additional hypervisor changes needed. Reviewed by: neel	2014-05-27 19:04:38 +00:00
Neel Natu	65ffa035a7	Add segment protection and limits violation checks in vie_calculate_gla() for 32-bit x86 guests. Tested using ins/outs executed in a FreeBSD/i386 guest.	2014-05-27 04:26:22 +00:00
Neel Natu	ae0780bbf1	Remove restriction on insb/insw/insl emulation. These instructions are properly emulated.	2014-05-25 02:05:23 +00:00
Neel Natu	5382c19d81	Do the linear address calculation for the ins/outs emulation using a new API function 'vie_calculate_gla()'. While the current implementation is simplistic it forms the basis of doing segmentation checks if the guest is in 32-bit protected mode.	2014-05-25 00:57:24 +00:00
Neel Natu	da11f4aa1d	Add libvmmapi functions vm_copyin() and vm_copyout() to copy into and out of the guest linear address space. These APIs in turn use a new ioctl 'VM_GLA2GPA' to convert the guest linear address to guest physical. Use the new copyin/copyout APIs when emulating ins/outs instruction in bhyve(8).	2014-05-24 23:12:30 +00:00
Neel Natu	e813a87350	Consolidate all the information needed by the guest page table walker into 'struct vm_guest_paging'. Check for canonical addressing in vmm_gla2gpa() and inject a protection fault into the guest if a violation is detected. If the page table walk is restarted in vmm_gla2gpa() then reset 'ptpphys' to point to the root of the page tables.	2014-05-24 20:26:57 +00:00
Neel Natu	37a723a5b3	When injecting a page fault into the guest also update the guest's %cr2 to indicate the faulting linear address. If the guest PML4 entry has the PG_PS bit set then inject a page fault into the guest with the PGEX_RSV bit set in the error_code. Get rid of redundant checks for the PG_RW violations when walking the page tables.	2014-05-24 19:13:25 +00:00
Neel Natu	a7424861fb	Check for alignment check violation when processing in/out string instructions.	2014-05-23 19:59:14 +00:00
Neel Natu	d17b5104a9	Add emulation of the "outsb" instruction. NetBSD guests use this to write to the UART FIFO. The emulation is constrained in a number of ways: 64-bit only, doesn't check for all exception conditions, limited to i/o ports emulated in userspace. Some of these constraints will be relaxed in followup commits. Requested by: grehan Reviewed by: tychon (partially and a much earlier version)	2014-05-23 05:15:17 +00:00
Neel Natu	c5e423dd2e	A Centos 6.4 guest will write 0xff to the 8259 mask register before beginning the proper ICWx initialization sequence. It assumes, probably correctly, that the boot firmware has done the 8259 initialization. Since grub-bhyve does not initialize the 8259 this write to the mask register takes a code path in which 'error' remains uninitialized (ready=0,icw_num=0). Fix this by initializing 'error' at the start of the function.	2014-05-23 05:04:50 +00:00
John Baldwin	0eb7ae8d0a	Don't permit users to request a subset of the AVX512 or MPX xsave masks. These masks are documented in the Intel Architecture Instruction Set Extensions Programming Reference (March 2014). Reviewed by: kib MFC after: 1 month	2014-05-22 18:22:02 +00:00
Neel Natu	ba6f5e23cc	Allow vmx_getdesc() and vmx_setdesc() to be called for a vcpu that is in the VCPU_RUNNING state. This will let the VMX exit handler inspect the vcpu's segment descriptors without having to exit the critical section.	2014-05-22 17:22:37 +00:00
Justin Hibbits	81e3caaf77	imagact_binmisc builds for all supported architectures, so enable it for all. Any bugs in execution will be dealt with as they crop up. MFC after: 3 weeks Relnotes: Yes	2014-05-22 05:04:40 +00:00
Neel Natu	fd949af642	Inject page fault into the guest if the page table walker detects an invalid translation for the guest linear address.	2014-05-22 03:14:54 +00:00
Neel Natu	f888763dd8	Add PG_RW check when translating a guest linear to guest physical address. Set the accessed and dirty bits in the page table entry. If it fails then restart the page table walk from the beginning. This might happen if another vcpu modifies the page tables simultaneously. Reviewed by: alc, kib	2014-05-20 20:30:28 +00:00
John Baldwin	674b6d6e0d	Add support for decoding the AMD SVM instructions.	2014-05-19 18:07:37 +00:00
Neel Natu	e4c8a13d61	Add PG_U (user/supervisor) checks when translating a guest linear address to a guest physical address. PG_PS (page size) field is valid only in a PDE or a PDPTE so it is now checked only in non-terminal paging entries. Ignore the upper 32-bits of the CR3 for PAE paging.	2014-05-19 03:50:07 +00:00
Peter Grehan	897bb47e7b	Make the vmx asm code dtrace-fbt-friendly by - inserting frame enter/leave sequences - restructuring the vmx_enter_guest routine so that it subsumes the vm_exit_guest block, which was the #vmexit RIP and not a callable routine. Reviewed by: neel MFC after: 3 weeks	2014-05-18 03:50:17 +00:00
John Baldwin	8b3949c344	Add support for decoding rdrand and rdseed.	2014-05-17 21:10:03 +00:00
John Baldwin	355d8a2f91	Add definitions for more structured extended features as well as XSAVE Extended Features for AVX512 and MPX (Memory Protection Extensions). Obtained from: Intel's Instruction Set Extensions Programming Reference (March 2014)	2014-05-16 17:45:09 +00:00
John Baldwin	b3e9732a76	Implement a PCI interrupt router to route PCI legacy INTx interrupts to the legacy 8259A PICs. - Implement an ICH-comptabile PCI interrupt router on the lpc device with 8 steerable pins configured via config space access to byte-wide registers at 0x60-63 and 0x68-6b. - For each configured PCI INTx interrupt, route it to both an I/O APIC pin and a PCI interrupt router pin. When a PCI INTx interrupt is asserted, ensure that both pins are asserted. - Provide an initial routing of PCI interrupt router (PIRQ) pins to 8259A pins (ISA IRQs) and initialize the interrupt line config register for the corresponding PCI function with the ISA IRQ as this matches existing hardware. - Add a global _PIC method for OSPM to select the desired interrupt routing configuration. - Update the _PRT methods for PCI bridges to provide both APIC and legacy PRT tables and return the appropriate table based on the configured routing configuration. Note that if the lpc device is not configured, no routing information is provided. - When the lpc device is enabled, provide ACPI PCI link devices corresponding to each PIRQ pin. - Add a VMM ioctl to adjust the trigger mode (edge vs level) for 8259A pins via the ELCR. - Mark the power management SCI as level triggered. - Don't hardcode the number of elements in Packages in the source for the DSDT. iasl(8) will fill in the actual number of elements, and this makes it simpler to generate a Package with a variable number of elements. Reviewed by: tycho	2014-05-15 14:16:55 +00:00
Neel Natu	f3db4c53e6	Increase the TSS limit by one byte. The processor requires an additional byte with all bits set to 1 beyond the I/O permission bitmap. Prior to this change accessing I/O ports [0xFFF8-0xFFFF] would trigger a #GP fault even though the I/O bitmap allowed access to those ports. For more details see section "I/O Permission Bit Map" in the Intel SDM, Vol 1. Reviewed by: kib	2014-05-14 22:24:09 +00:00
Neel Natu	055fc2cb5e	Virtual machine halt detection is turned on by default. Allow it to be disabled via the tunable 'hw.vmm.halt_detection'.	2014-05-05 16:19:24 +00:00
Nathan Whitehorn	a9d0ed68b3	Disable ACPI and P4TCC throttling by default, following discussion on freebsd-current. These CPU speed control techniques are usually unhelpful at best. For now, continue building the relevant code into GENERIC so that it can trivially be re-enabled at runtime if anyone wants it. MFC after: 1 month	2014-05-04 16:38:21 +00:00
Kenneth D. Merry	991554f2c4	Bring in the mpr(4) driver for LSI's MPT3 12Gb SAS controllers. This is derived from the mps(4) driver, but it supports only the 12Gb IT and IR hardware including the SAS 3004, SAS 3008 and SAS 3108. Some notes about this driver: o The 12Gb hardware can do "FastPath" I/O, and that capability is included in this driver. o WarpDrive functionality has been removed, since it isn't supported in the 12Gb driver interface. o The Scatter/Gather list handling code is significantly different between the 6Gb and 12Gb hardware. The 12Gb boards support IEEE Scatter/Gather lists. Thanks to LSI for developing and testing this driver for FreeBSD. share/man/man4/mpr.4: mpr(4) man page. sys/dev/mpr/*: mpr(4) driver files. sys/modules/Makefile, sys/modules/mpr/Makefile: Add a module Makefile for the mpr(4) driver. sys/conf/files: Add the mpr(4) driver. sys/amd64/conf/GENERIC, sys/i386/conf/GENERIC, sys/mips/conf/OCTEON1, sys/sparc64/conf/GENERIC: Add the mpr(4) driver to all config files that currently have the mps(4) driver. sys/ia64/conf/GENERIC: Add the mps(4) and mpr(4) drivers to the ia64 GENERIC config file. sys/i386/conf/XEN: Exclude the mpr module from building here. Submitted by: Steve McConnell <Stephen.McConnell@lsi.com> MFC after: 3 days Tested by: Chris Reeves <chrisr@spectralogic.com> Sponsored by: LSI, Spectra Logic Relnotes: LSI 12Gb SAS driver mpr(4) added	2014-05-02 20:25:09 +00:00
Eitan Adler	804e017089	lindev(4): finish the partial commit in r265212 lindev(4) was only used to provide /dev/full which is now a standard feature of FreeBSD. /dev/full was never linux-specific and provides a generally useful feature. Document this in UPDATING and bump __FreeBSD_version. This will be documented in the PH shortly. Reported by: jkim	2014-05-02 07:14:22 +00:00
Neel Natu	e50ce2aa06	Add logic in the HLT exit handler to detect if the guest has put all vcpus to sleep permanently by executing a HLT with interrupts disabled. When this condition is detected the guest with be suspended with a reason of VM_SUSPEND_HALT and the bhyve(8) process will exit. Tested by executing "halt" inside a RHEL7-beta guest. Discussed with: grehan@ Reviewed by: jhb@, tychon@	2014-05-02 00:33:56 +00:00
Neel Natu	2cb97c9dd6	Ignore writes to microcode update MSR. This MSR is accessed by RHEL7 guest. Add KTR tracepoints to annotate wrmsr and rdmsr VM exits.	2014-04-30 02:08:27 +00:00
Neel Natu	c6a0cc2e21	Some Linux guests will implement a 'halt' by disabling the APIC and executing the 'HLT' instruction. This condition was detected by 'vm_handle_hlt()' and converted into the SPINDOWN_CPU exitcode . The bhyve(8) process would exit the vcpu thread in response to a SPINDOWN_CPU and when the last vcpu was spun down it would reset the virtual machine via vm_suspend(VM_SUSPEND_RESET). This functionality was broken in r263780 in a way that made it impossible to kill the bhyve(8) process because it would loop forever in vm_handle_suspend(). Unbreak this by removing the code to spindown vcpus. Thus a 'halt' from a Linux guest will appear to be hung but this is consistent with the behavior on bare metal. The guest can be rebooted by using the bhyvectl options '--force-reset' or '--force-poweroff'. Reviewed by: grehan@	2014-04-29 18:42:56 +00:00
Neel Natu	f0fdcfe247	Allow a virtual machine to be forcibly reset or powered off. This is done by adding an argument to the VM_SUSPEND ioctl that specifies how the virtual machine should be suspended, viz. VM_SUSPEND_RESET or VM_SUSPEND_POWEROFF. The disposition of VM_SUSPEND is also made available to the exit handler via the 'u.suspended' member of 'struct vm_exit'. This capability is exposed via the '--force-reset' and '--force-poweroff' arguments to /usr/sbin/bhyvectl. Discussed with: grehan@	2014-04-28 22:06:40 +00:00
Ed Maste	b6a0a32b58	Report boot method (BIOS/UEFI) via sysctl machdep.bootmethod Sponsored by: The FreeBSD Foundation	2014-04-27 15:14:59 +00:00
Konstantin Belousov	dc74dde71e	Same as it was done in r263878 for invlrng_handler(), fix order of checks for special pcid values in invlpg_pcid_handler(). Forst check for special values, and only then do PCID-specific page invalidation. Minor fix to the style compliance, declare local variable at the function start. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-04-27 05:37:01 +00:00
Nathan Whitehorn	bacc56cc08	Don't need this now. VT does the same thing, but better. Submitted by: gjb	2014-04-27 02:28:32 +00:00
Nathan Whitehorn	8bcbebe741	Add vt_efifb to VT kernel configuration now that that actually works. This kernel will now boot on both BIOS and EFI systems without modification. Equivalent functionality in GENERIC requires making vt(9) the default console driver, which is probably appropriate at this point.	2014-04-27 02:22:21 +00:00
Neel Natu	63c9389af6	A VMCS is always inactive when it exits the vmx_run() loop. Remove redundant code and the misleading comment that suggest otherwise. Reviewed by: grehan@	2014-04-26 22:37:56 +00:00
Scott Long	60ad8150c7	Retire smp_active. It was racey and caused demonstrated problems with the cpufreq code. Replace its use with smp_started. There's at least one userland tool that still looks at the kern.smp.active sysctl, so preserve it but point it to smp_started as well. Discussed with: peter, jhb MFC after: 3 days Obtained from: Netflix	2014-04-26 20:27:54 +00:00
Glen Barber	8be1b6d975	Add a UEFI kernel configuration to include the VT kernel, and replace the vt_vga driver with vt_efifb. This is intended to help with snapshot builds only. There is no intention to MFC this commit. Sponsored by: The FreeBSD Foundation	2014-04-25 21:47:24 +00:00
Roger Pau Monné	df983f1d06	xen: fix copyright header Some of the code in xen-locore.S was picked from Cherry G. Mathew amd64 Xen PV branch, but I've failed to set the proper copyright, so do it now. Approved by: gibbs	2014-04-24 14:44:42 +00:00
Peter Grehan	8d1d7a9e5a	Allow the guest to read the TSC via MSR 0x10. NetBSD/amd64 does this, as does Linux on AMD CPUs. Reviewed by: neel MFC after: 3 weeks	2014-04-24 00:27:34 +00:00
Neel Natu	c5d216b786	Change the vlapic timer frequency to be in the ballpark of contemporary hardware. This also decouples the vlapic emulation from the host's TSC frequency. Requested by: grehan@	2014-04-23 16:50:40 +00:00
Tycho Nightingale	82c2c89084	Factor out common ioport handler code for better hygiene -- pointed out by neel@. Approved by: neel (co-mentor)	2014-04-22 16:13:56 +00:00
Tycho Nightingale	c46ff7fa0b	Add support for the PIT 'readback' command -- based on a patch by grehan@. Approved by: grehan (co-mentor)	2014-04-18 16:05:12 +00:00
Tycho Nightingale	d6aa08c3ef	Respect the destination operand size of the 'Input from Port' instruction. Approved by: grehan (co-mentor)	2014-04-18 15:22:56 +00:00
Tycho Nightingale	79d6ca331e	Add support for reading the PIT Counter 2 output signal via the NMI Status and Control register at port 0x61. Be more conservative about "catching up" callouts that were supposed to fire in the past by skipping an interrupt if it was scheduled too far in the past. Restore the PIT ACPI DSDT entries and add an entry for NMISC too. Approved by: neel (co-mentor)	2014-04-18 00:02:06 +00:00
John Baldwin	380174cbe4	Don't spindown the BSP if it executes hlt with the APIC disabled. A guest that doesn't use the APIC at all can trigger this, plus the BSP always needs to execute as it should trigger a reset, etc. Reviewed by: tychon	2014-04-15 20:53:53 +00:00
Tycho Nightingale	54f6330515	Local APIC access via 32-bit naturally-aligned loads is merely suggested in the SDM. Since some OSes have implemented otherwise don't be too rigorous in enforcing it. Approved by: grehan (co-mentor)	2014-04-15 17:06:26 +00:00
Tycho Nightingale	1354571279	Add support for emulating the byte move and sign extend instructions: "movsx r/m8, r32" and "movsx r/m8, r64". Approved by: grehan (co-mentor)	2014-04-15 15:11:10 +00:00
Tycho Nightingale	b96be57a2d	Add support for emulating the slave PIC. Reviewed by: grehan, jhb Approved by: grehan (co-mentor)	2014-04-14 19:00:20 +00:00
Neel Natu	81d597b736	There is no need to save and restore the host's return address in the 'struct vmxctx'. It is preserved on the host stack across a guest entry and exit and just restoring the host's '%rsp' is sufficient. Pointed out by: grehan@	2014-04-11 20:15:53 +00:00
Tycho Nightingale	e0f210e6ef	Account for the "plus 1" encoding of the CPUID Function 4 reported core per package and cache sharing values. Approved by: grehan (co-mentor)	2014-04-11 18:19:21 +00:00
Peter Grehan	201b1ccc22	Rework r264179. - remove redundant code - remove erroneous setting of the error return in vmmdev_ioctl() - use style(9) initialization - in vmx_inject_pir(), document the race condition that the final conditional statement was detecting, Tested with both gcc and clang builds. Reviewed by: neel	2014-04-10 19:15:58 +00:00
Sean Bruno	84cb72d1c6	Really, really, really only allow this option for amd64/i386 builds. Submitted by: imp@ and tinderbox	2014-04-09 18:44:54 +00:00
Warner Losh	0e30c5c0b4	Make the vmm code compile with gcc too. Not entirely sure things are correct for the pirbase test (since I'd have thought we'd need to do something even when the offset is 0 and that test looks like a misguided attempt to not use an uninitialized variable), but it is at least the same as today.	2014-04-05 22:43:23 +00:00
Ryan Stone	a86672509c	Re-write bhyve's I/O MMU handling in terms of PCI RID. Reviewed by: neel MFC after: 2 months Sponsored by: Sandvine Inc.	2014-04-01 15:54:03 +00:00
Ryan Stone	7036ae46bf	Revert PCI RID changes. My PCI RID changes somehow got intermixed with my PCI ARI patch when I committed it. I may have accidentally applied a patch to a non-clean working tree. Revert everything while I figure out what went wrong. Pointy hat to: rstone	2014-04-01 15:06:03 +00:00
Ryan Stone	956ed3830c	Re-write bhyve's I/O MMU handling in terms of PCI RIDs Reviewed by: neel Sponsored by: Sandvine Inc	2014-04-01 14:54:43 +00:00
Konstantin Belousov	65f99c74fb	Clear the kernel grab of the FPU state on fork. The pcb_save pointer is already correctly reset to the FPU user save area, only PCB_KERNFPU flag might leak from old thread state into the new state. For creation of the user-mode thread, the change is nop since corresponding syscall code does not use FPU. On the other hand, creation of a kernel thread forks from a thread selected arbitrary from proc0, which might use FPU. Reported and tested by: Chris Torek <torek@torek.net> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-29 11:56:33 +00:00
Konstantin Belousov	965cc255c9	Several fixes for the PCID implementation: - When clearing a bit for a cpuid in pmap->pm_save, ensure that the cpuid is not set in pm_active. The pm_save indicates which CPUs may have cached translations for given PCID, which implies that a CPU executing with the given pmap active have the translations cached. [1] - In smp_masked_invltlb(), pass pmap to smp_targeted_tlb_shootdown(). [1] - In invlrng_handler(), check for the special values of pcid (0 and -1) and do corresponding global or total invalidations before checking for performing PCID-specific range invalidation with INVPCID_ADDR. [2] - In invltlb_pcid_handler(), do not read %cr3 unless needed. [2] - Do minor style tweaks. [2] Submitted by: Henrik Gulbrandsen <henrik@gulbra.net> [1] Other parts sponsored by: The FreeBSD Foundation [2] Tested by: Henrik Gulbrandsen, pho MFC after: 1 week	2014-03-28 16:07:27 +00:00
Ed Maste	d1d4f00e9a	Update EFI framebuffer handoff from loader Sponsored by: The FreeBSD Foundation	2014-03-27 19:43:38 +00:00
Ed Maste	c018226f1a	amd64: Parse the EFI memory map if present With this change (and loader.efi from the projects/uefi branch) we can now boot under qemu using the OVMF UEFI firmware image with the limitation that a serial console is required. (This is largely r246337 from the projects/uefi branch.) Sponsored by: The FreeBSD Foundation	2014-03-27 18:23:02 +00:00
Neel Natu	b15a09c05e	Add an ioctl to suspend a virtual machine (VM_SUSPEND). The ioctl can be called from any context i.e., it is not required to be called from a vcpu thread. The ioctl simply sets a state variable 'vm->suspend' to '1' and returns. The vcpus inspect 'vm->suspend' in the run loop and if it is set to '1' the vcpu breaks out of the loop with a reason of 'VM_EXITCODE_SUSPENDED'. The suspend handler waits until all 'vm->active_cpus' have transitioned to 'vm->suspended_cpus' before returning to userspace. Discussed with: grehan	2014-03-26 23:34:27 +00:00
Warner Losh	3ad1a09169	Rather than require a makeoptions DEBUG to get debug correct, add it in kern.mk, but only if we're using clang. While this option is supported by both clang and gcc, in the future there may be changes to clang which change the defaults that require a tweak to build our kernel such that other tools in our tree will work. Set a good example by forcing -gdwarf-2 only for clang builds, and only if the user hasn't specified another dwarf level already. Update UPDATING to reflect the changed state of affairs. This also keeps us from having to update all the ARM kernels to add this, and also keeps us from in the future having to update all the MIPS kernels and is one less place the user will have to know to do something special for clang and one less thing developers will need to do when moving an architecture to clang. Reviewed by: ian@ MFC after: 1 week	2014-03-25 22:08:31 +00:00
Tycho Nightingale	e883c9bb40	Move the atpit device model from userspace into vmm.ko for better precision and lower latency. Approved by: grehan (co-mentor)	2014-03-25 19:20:34 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	1fb1da0366	Add change forgotten in r263475. Make dmaplimit accessible outside amd64/pmap.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 17:17:19 +00:00
Konstantin Belousov	52f3c44efe	Fix two issues with /dev/mem access on amd64, both causing kernel page faults. First, for accesses to direct map region should check for the limit by which direct map is instantiated. Second, for accesses to the kernel map, success returned from the kernacc(9) does not guarantee that consequent attempt to read or write to the checked address succeed, since other thread might invalidate the address meantime. Add a new thread private flag TDP_DEVMEMIO, which instructs vm_fault() to return error when fault happens on the MAP_ENTRY_NOFAULT entry, instead of panicing. The trap handler would then see a page fault from access, and recover in normal way, making /dev/mem access safer. Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed and having Giant locked does not solve issues for amd64. Note that at least the second issue exists on other architectures, and requires similar patching for md code. Reported and tested by: clusteradm (gjb, sbruno) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-03-21 14:25:09 +00:00
Warner Losh	f79309d29c	Remove vestiges of knowing the ISA bus, which we gave up on around 20 years ago. Remove redunant copy of isaregs.h.	2014-03-19 21:03:04 +00:00
Mark Johnston	3e0ba3a163	Only invoke fasttrap hooks for traps from user mode, and ensure that they're called with interrupts enabled. Calling fasttrap_pid_probe() with interrupts disabled can lead to deadlock if fasttrap writes to the process' address space. Reviewed by: rpaulo MFC after: 3 weeks	2014-03-19 01:27:56 +00:00
Warner Losh	d4f95c889d	In kernel config files, it is supposed to be 'options<space><tab>' not 'options<tab><tab>', per long standing (but recently not so strictly enforced) convention.	2014-03-18 14:41:18 +00:00
Neel Natu	22d822c6b0	When a vcpu is deactivated it must also unblock any rendezvous that may be blocked on it. This is done by issuing a wakeup after clearing the 'vcpuid' from 'active_cpus'. Also, use CPU_CLR_ATOMIC() to guarantee visibility of the updated 'active_cpus' across all host cpus.	2014-03-18 02:49:28 +00:00
Neel Natu	970955e479	Notify vcpus participating in the rendezvous of the pending event to ensure that they execute the rendezvous function as soon as possible.	2014-03-17 23:30:38 +00:00
Warner Losh	6a5b1a3544	Align all comments in config files on same column. This consistency helps when bits and pieces of GENERIC from i386 or amd64 are cut and pasted into other architecture's config files (which in the case of ARM had gotten rather akimbo).	2014-03-16 15:22:52 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Tycho Nightingale	0775fbb475	Fix a race wherein the source of an interrupt vector is wrongly attributed if an ExtINT arrives during interrupt injection. Also, fix a spurious interrupt if the PIC tries to raise an interrupt before the outstanding one is accepted. Finally, improve the PIC interrupt latency when another interrupt is raised immediately after the outstanding one is accepted by creating a vmexit rather than waiting for one to occur by happenstance. Approved by: neel (co-mentor)	2014-03-15 23:09:34 +00:00
Robert Watson	3dbe595b2d	Revert a small portion of r263198 left over from local testing: don't enable PCB groups and RSS by default [yet].	2014-03-15 00:59:23 +00:00
Robert Watson	7527624efa	Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)	2014-03-15 00:57:50 +00:00
Gleb Smirnoff	45c203fce2	Remove AppleTalk support. AppleTalk was a network transport protocol for Apple Macintosh devices in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was a legacy protocol and primary networking protocol is TCP/IP. The last Mac OS X release to support AppleTalk happened in 2009. The same year routing equipment vendors (namely Cisco) end their support. Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 06:29:43 +00:00
Gleb Smirnoff	2c284d9395	Remove IPX support. IPX was a network transport protocol in Novell's NetWare network operating system from late 80s and then 90s. The NetWare itself switched to TCP/IP as default transport in 1998. Later, in this century the Novell Open Enterprise Server became successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011. Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 02:58:48 +00:00
Warner Losh	22b8ff24b5	Delete stray clause 3 (Advertising clause) and renumber while i'm here. Approved by: alc@	2014-03-11 23:41:35 +00:00
Tycho Nightingale	1ed19b835a	Don't try to return a vector to a caller that only cares if a vector is pending or not. Approved by: neel (co-mentor)	2014-03-11 22:12:12 +00:00
Warner Losh	846351f8b3	Remove clause 3 (the advertising clause), per the regent's letter.	2014-03-11 17:20:50 +00:00
Tycho Nightingale	762fd20804	Replace the userspace atpic stub with a more functional vmm.ko model. New ioctls VM_ISA_ASSERT_IRQ, VM_ISA_DEASSERT_IRQ and VM_ISA_PULSE_IRQ can be used to manipulate the pic, and optionally the ioapic, pin state. Reviewed by: jhb, neel Approved by: neel (co-mentor)	2014-03-11 16:56:00 +00:00
Roger Pau Monné	079f7ef839	xen: add a hook to perform AP startup AP startup on PVH follows the PV method, so we need to add a hook in order to diverge from bare metal. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Add hook for start_all_aps on native (using native_start_all_aps defined in mp_machdep). amd64/amd64/mp_machdep.c: - Make some variables global because they will also be used by the Xen PVH AP startup code. - Use the start_all_aps hook to start APs. - Rename start_all_aps to native_start_all_aps. amd64/include/smp.h: - Add declaration for native_start_all_aps. x86/include/init.h: - Declare start_all_aps hook in init_ops. x86/xen/pv.c: - Pick external declarations from mp_machdep. - Introduce Xen PV code to start APs on PVH. - Set start_all_aps init hook to use the Xen PVH implementation.	2014-03-11 10:27:57 +00:00
Roger Pau Monné	5a036d7e02	xen: add hook for AP bootstrap memory reservation This hook will only be implemented for bare metal, Xen doesn't require any bootstrap code since APs are started in long mode with paging enabled. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Set mp_bootaddress hook for bare metal. x86/include/init.h: - Define mp_bootaddress in init_ops.	2014-03-11 10:26:16 +00:00
Roger Pau Monné	4d30a3fb95	xen: use the same hypercall mechanism for XEN and XENHVM Currently XEN (PV) and XENHVM (PVHVM) ports use different ways to issue hypercalls, unify this by filling the hypercall_page under HVM also. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/include/xen/hypercall.h: - Unify Xen hypercall code by always using the PV way. i386/i386/locore.s: - Define hypercall_page on i386 XENHVM. x86/xen/hvm.c: - Fill hypercall_page on XENHVM kernels using the HVM method (only when running as an HVM guest).	2014-03-11 10:24:13 +00:00
Roger Pau Monné	1e69553ed1	xen: implement hook to fetch and parse e820 memory map e820 memory map is fetched using a hypercall under Xen PVH, so add a hook to init_ops in oder to diverge from bare metal and implement a Xen variant. Approved by: gibbs Sponsored by: Citrix Systems R&D x86/include/init.h: - Add a parse_memmap hook to init_ops, that will be called to fetch and parse the memory map. amd64/amd64/machdep.c: - Decouple the fetch and the parse of the memmap, so the parse function can be shared with Xen code. - Move code around in order to implement the parse_memmap hook. amd64/include/pc/bios.h: - Declare bios_add_smap_entries (implemented in machdep.c). x86/xen/pv.c: - Implement fetching of e820 memmap when running as a PVH guest by using the XENMEM_memory_map hypercall.	2014-03-11 10:23:03 +00:00
Roger Pau Monné	5f05c79450	xen: implement an early timer for Xen PVH When running as a PVH guest, there's no emulated i8254, so we need to use the Xen PV timer as the early source for DELAY. This change allows for different implementations of the early DELAY function and implements a Xen variant for it. Approved by: gibbs Sponsored by: Citrix Systems R&D dev/xen/timer/timer.c: dev/xen/timer/timer.h: - Implement Xen early delay functions using the PV timer and declare them. x86/include/init.h: - Add hooks for early clock source initialization and early delay functions. i386/i386/machdep.c: pc98/pc98/machdep.c: amd64/amd64/machdep.c: - Set early delay hooks to use the i8254 on bare metal. - Use clock_init (that will in turn make use of init_ops) to initialize the early clock source. amd64/include/clock.h: i386/include/clock.h: - Declare i8254_delay and clock_init. i386/xen/clock.c: - Rename DELAY to i8254_delay. x86/isa/clock.c: - Introduce clock_init that will take care of initializing the early clock by making use of the init_ops hooks. - Move non ISA related delay functions to the newly introduced delay file. x86/x86/delay.c: - Add moved delay related functions. - Implement generic DELAY function that will use the init_ops hooks. x86/xen/pv.c: - Set PVH hooks for the early delay related functions in init_ops. conf/files.amd64: conf/files.i386: conf/files.pc98: - Add delay.c to the kernel build.	2014-03-11 10:20:42 +00:00
Roger Pau Monné	97baeefd5b	amd64: introduce hook for custom preload metadata parsers Add hooks to amd64 in order to have diverging implementations, since on Xen PV the metadata is passed to the kernel in a different form. Approbed by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/machdep.c: - Define init_ops for native. - Put native code inside of native_parse_preload_data hook. - Call the parse_preload_data in order to fill the metadata info. x86/include/init.h: - Declare the init_ops struct. x86/xen/pv.c: - Declare xen_init_ops that contains the Xen PV implementation of init_ops. - Implement the parse_preload_data for Xen PVH, the info is fetched from HYPERVISOR_start_info->cmd_line as provided by Xen.	2014-03-11 10:15:25 +00:00
Roger Pau Monné	1a9cdd373a	xen: add PV/PVH kernel entry point Add the PV/PVH entry point and the low level functions for PVH early initialization. Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/genassym.c: - Add __FreeBSD_version define to assym.s so it can be used for the Xen notes. amd64/amd64/locore.S: - Make bootstack global so it can be used from Xen kernel entry point. amd64/amd64/xen-locore.S: - Add Xen notes to the kernel. - Add the Xen PV entry point, that is going to call hammer_time_xen. amd64/include/asmacros.h: - Add ELFNOTE macros. i386/xen/xen_machdep.c: - Define HYPERVISOR_start_info for the XEN i386 PV port, which is going to be used in some shared code between PV and PVH. x86/xen/hvm.c: - Define HYPERVISOR_start_info for the PVH port. x86/xen/pv.c: - Introduce hammer_time_xen which is going to perform early setup for Xen PVH: - Setup shared Xen variables start_info, shared_info and xen_store. - Set guest type. - Create initial page tables as FreeBSD expects to find them. - Call into native init function (hammer_time). xen/xen-os.h: - Declare HYPERVISOR_start_info. conf/files.amd64: - Add amd64/amd64/locore.S and x86/xen/pv.c to the list of files.	2014-03-11 10:07:01 +00:00
Roger Pau Monné	e8da1c4877	amd64/i386: switch IPI handlers to C code. Move asm IPIs handlers to C code, so both Xen and native IPI handlers share the same code. Reviewed by: jhb Approved by: gibbs Sponsored by: Citrix Systems R&D amd64/amd64/apic_vector.S: i386/i386/apic_vector.s: - Remove asm coded IPI handlers and instead call the newly introduced C variants. amd64/amd64/mp_machdep.c: i386/i386/mp_machdep.c: - Add C coded clones to the asm IPI handlers (moved from x86/xen/hvm.c). i386/include/smp.h: amd64/include/smp.h: - Add prototypes for the C IPI handlers. x86/xen/hvm.c: - Move the C IPI handlers to mp_machdep and call those in the Xen IPI handlers. i386/xen/mp_machdep.c: - Add dummy IPI handlers to the i386 Xen PV port (this port doesn't support SMP).	2014-03-11 10:03:29 +00:00
Ed Maste	c75da776c2	Disable amd64 TLB Context ID (pcid) by default for now There are a number of reports of userspace application crashes that are "solved" by setting vm.pmap.pcid_enabled=0, including Java and the x11/mate-terminal port (PR ports/184362). I originally planned to disable this only in stable/10 (in r262753), but it has been pointed out that additional crash reports on HEAD are not likely to provide new insight into the problem. The feature can easily be enabled for testing.	2014-03-05 01:34:10 +00:00
Jung-uk Kim	1d22d877b8	Move fpusave() wrapper for suspend hander to sys/amd64/amd64/fpu.c. Inspired by: jhb	2014-03-04 21:35:57 +00:00
Jung-uk Kim	be2d4fcf68	Revert accidentally committed changes in 262748.	2014-03-04 20:16:00 +00:00
Jung-uk Kim	603bc162fb	Properly save and restore CR0. MFC after: 3 days	2014-03-04 20:07:36 +00:00
Jung-uk Kim	05acaa9f85	Remove dead code since r230426, fix a comment, and tidy up. Reported by: jhb MFC after: 3 days	2014-03-04 19:41:16 +00:00
Neel Natu	ef39d7e910	Fix a race between VMRUN() and vcpu_notify_event() due to 'vcpu->hostcpu' being updated outside of the vcpu_lock(). The race is benign and could potentially result in a missed notification about a pending interrupt to a vcpu. The interrupt would not be lost but rather delayed until the next VM exit. The vcpu's hostcpu is now updated concurrently with the vcpu state change. When the vcpu transitions to the RUNNING state the hostcpu is set to 'curcpu'. It is set to 'NOCPU' in all other cases. Reviewed by: grehan	2014-03-01 03:17:58 +00:00
John Baldwin	ad3e368726	Correct VMware capitalization. Submitted by: joeld	2014-02-28 21:33:40 +00:00
John Baldwin	722b6744a7	Workaround an apparent bug in VMWare Fusion's nested VT support where it triggers a VM exit with the exit reason of an external interrupt but without a valid interrupt set in the exit interrupt information. Tested by: Michael Dexter Reviewed by: neel MFC after: 1 week	2014-02-28 19:07:55 +00:00
Neel Natu	dc50650607	Queue pending exceptions in the 'struct vcpu' instead of directly updating the processor-specific VMCS or VMCB. The pending exception will be delivered right before entering the guest. The order of event injection into the guest is: - hardware exception - NMI - maskable interrupt In the Intel VT-x case, a pending NMI or interrupt will enable the interrupt window-exiting and inject it as soon as possible after the hardware exception is injected. Also since interrupts are inherently asynchronous, injecting them after the hardware exception should not affect correctness from the guest perspective. Rename the unused ioctl VM_INJECT_EVENT to VM_INJECT_EXCEPTION and restrict it to only deliver x86 hardware exceptions. This new ioctl is now used to inject a protection fault when the guest accesses an unimplemented MSR. Discussed with: grehan, jhb Reviewed by: jhb	2014-02-26 00:52:05 +00:00
Peter Grehan	bf775ebb60	MFC @ r259635 This brings in the "-w" option from bhyve to ignore unknown MSRs. It will make debugging Linux guests a bit easier. Suggested by: Willem Jan Withagen (wjw at digiware nl)	2014-02-25 06:29:56 +00:00
Alan Cox	f438547338	When the kernel is running in a virtual machine, it cannot rely upon the processor family to determine if the workaround for AMD Family 10h Erratum 383 should be enabled. To enable virtual machine migration among a heterogeneous collection of physical machines, the hypervisor may have been configured to report an older processor family with a reduced feature set. Effectively, the reported processor family and its features are like a "least common denominator" for the collection of machines. Therefore, when the kernel is running in a virtual machine, instead of relying upon the processor family, we now test for features that prove that the underlying processor is not affected by the erratum. (The features that we test for are unlikely to ever be emulated in software on an affected physical processor.) PR: 186061 Tested by: Simon Matter Discussed with: jhb, neel MFC after: 2 weeks	2014-02-22 18:53:42 +00:00
Neel Natu	159dd56f94	Add support for x2APIC virtualization assist in Intel VT-x. The vlapic.ops handler 'enable_x2apic_mode' is called when the vlapic mode is switched to x2APIC. The VT-x implementation of this handler turns off the APIC-access virtualization and enables the x2APIC virtualization in the VMCS. The x2APIC virtualization is done by allowing guest read access to a subset of MSRs in the x2APIC range. In non-root operation the processor will satisfy an 'rdmsr' access to these MSRs by reading from the virtual APIC page instead. The guest is also given write access to TPR, EOI and SELF_IPI MSRs which get special treatment in non-root operation. This is documented in the Intel SDM section titled "Virtualizing MSR-Based APIC Accesses". Enforce that APIC-write and APIC-access VM-exits are handled only if APIC-access virtualization is enabled. The one exception to this is SELF_IPI virtualization which may result in an APIC-write VM-exit.	2014-02-21 06:03:54 +00:00
Neel Natu	52e5c8a2ec	Simplify APIC mode switching from MMIO to x2APIC. In part this is done to simplify the implementation of the x2APIC virtualization assist in VT-x. Prior to this change the vlapic allowed the guest to change its mode from xAPIC to x2APIC. We don't allow that any more and the vlapic mode is locked when the virtual machine is created. This is not very constraining because operating systems already have to deal with BIOS setting up the APIC in x2APIC mode at boot. Fix a bug in the CPUID emulation where the x2APIC capability was leaking from the host to the guest. Ignore MMIO reads and writes to the vlapic in x2APIC mode. Similarly, ignore MSR accesses to the vlapic when it is in xAPIC mode. The default configuration of the vlapic is xAPIC. The "-x" option to bhyve(8) can be used to change the mode to x2APIC instead. Discussed with: grehan@	2014-02-20 01:48:25 +00:00
John Baldwin	a0efd3fb34	A first pass at adding support for injecting hardware exceptions for emulated instructions. - Add helper routines to inject interrupt information for a hardware exception from the VM exit callback routines. - Use the new routines to inject GP and UD exceptions for invalid operations when emulating the xsetbv instruction. - Don't directly manipulate the entry interrupt info when a user event is injected. Instead, store the event info in the vmx state and only apply it during a VM entry if a hardware exception or NMI is not already pending. - While here, use HANDLED/UNHANDLED instead of 1/0 in a couple of routines. Reviewed by: neel	2014-02-18 03:07:36 +00:00
Neel Natu	294d0d88fc	Handle writes to the SELF_IPI MSR by the guest when the vlapic is configured in x2apic mode. Reads to this MSR are currently ignored but should cause a general proctection exception to be injected into the vcpu. All accesses to the corresponding offset in xAPIC mode are ignored. Also, do not panic the host if there is mismatch between the trigger mode programmed in the TMR and the actual interrupt being delivered. Instead the anomaly is logged to aid debugging and to prevent a misbehaving guest from panicking the host.	2014-02-17 23:07:16 +00:00
Neel Natu	9c43cd07ec	Use spinlocks to lock accesses to the vioapic. This is necessary because if the vlapic is configured in x2apic mode the vioapic_process_eoi() function is called inside the critical section established by vm_run().	2014-02-17 22:57:51 +00:00
Dimitry Andric	f785676f2a	Upgrade our copy of llvm/clang to 3.4 release. This version supports all of the features in the current working draft of the upcoming C++ standard, provisionally named C++1y. The code generator's performance is greatly increased, and the loop auto-vectorizer is now enabled at -Os and -O2 in addition to -O3. The PowerPC backend has made several major improvements to code generation quality and compile time, and the X86, SPARC, ARM32, Aarch64 and SystemZ backends have all seen major feature work. Release notes for llvm and clang can be found here: <http://llvm.org/releases/3.4/docs/ReleaseNotes.html> <http://llvm.org/releases/3.4/tools/clang/docs/ReleaseNotes.html> MFC after: 1 month	2014-02-16 19:44:07 +00:00
Christian Brueffer	7f47cbd3ce	Retire the nve(4) driver; nfe(4) has been the default driver for NVIDIA nForce MCP adapters for a long time. Yays: jhb, remko, yongari Nays: none on the current and stable lists	2014-02-16 12:22:43 +00:00
Andriy Gapon	f25e50cf0b	provide fast versions of ffsl and flsl for i386; ffsll and flsll for amd64 Reviewed by: jhb MFC after: 10 days X-MFC note: consider thirdparty modules depending on these symbols Sponsored by: HybridCluster	2014-02-14 15:18:37 +00:00
John Baldwin	4edef187b8	Add support for managing PCI bus numbers. As with BARs and PCI-PCI bridge I/O windows, the default is to preserve the firmware-assigned resources. PCI bus numbers are only managed if NEW_PCIB is enabled and the architecture defines a PCI_RES_BUS resource type. - Add a helper API to create top-level PCI bus resource managers for each PCI domain/segment. Host-PCI bridge drivers use this API to allocate bus numbers from their associated domain. - Change the PCI bus and CardBus drivers to allocate a bus resource for their bus number from the parent PCI bridge device. - Change the PCI-PCI and PCI-CardBus bridge drivers to allocate the full range of bus numbers from secbus to subbus from their parent bridge. The drivers also always program their primary bus register. The bridge drivers also support growing their bus range by extending the bus resource and updating subbus to match the larger range. - Add support for managing PCI bus resources to the Host-PCI bridge drivers used for amd64 and i386 (acpi_pcib, mptable_pcib, legacy_pcib, and qpi_pcib). - Define a PCI_RES_BUS resource type for amd64 and i386. Reviewed by: imp MFC after: 1 month	2014-02-12 04:30:37 +00:00
John Baldwin	4f67a8c5e9	Don't waste a page of KVA for the boot-time memory test on x86. For amd64, reuse the first page of the crashdumpmap as CMAP1/CADDR1. For i386, remove CMAP1/CADDR1 entirely and reuse CMAP3/CADDR3 for the memory test. Reviewed by: alc, peter MFC after: 2 weeks	2014-02-11 22:02:40 +00:00
John Baldwin	abb023fb95	Add virtualized XSAVE support to bhyve which permits guests to use XSAVE and XSAVE-enabled features like AVX. - Store a per-cpu guest xcr0 register. When switching to the guest FPU state, switch to the guest xcr0 value. Note that the guest FPU state is saved and restored using the host's xcr0 value and xcr0 is saved/restored "inside" of saving/restoring the guest FPU state. - Handle VM exits for the xsetbv instruction by updating the guest xcr0. - Expose the XSAVE feature to the guest only if the host has enabled XSAVE, and only advertise XSAVE features enabled by the host to the guest. This ensures that the guest will only adjust FPU state that is a subset of the guest FPU state saved and restored by the host. Reviewed by: grehan	2014-02-08 16:37:54 +00:00
Neel Natu	bf73979dd9	Add a counter to differentiate between VM-exits due to nested paging faults and instruction emulation faults.	2014-02-08 06:22:09 +00:00
Neel Natu	62fbd7c27a	Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI). If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the CPU when the VM-exit is completed. No more NMIs will be recognized until the execution of an "iret". Prior to this change the NMI handler was dispatched via a software interrupt with interrupts enabled. This meant that an interrupt could be recognized by the processor before the NMI handler completed its execution. The "iret" issued by the interrupt handler would then cause the "blocking by NMI" to be cleared prematurely. This is now fixed by handling the NMI with interrupts disabled in addition to "blocking by NMI" already established by the VM-exit.	2014-02-08 05:04:34 +00:00
John Baldwin	00f3efe1bd	Add support for FreeBSD/i386 guests under bhyve. - Similar to the hack for bootinfo32.c in userboot, define _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot. This allows userboot to load 32-bit kernels and modules. - Copy the SMAP generation code out of bootinfo64.c and into its own file so it can be shared with bootinfo32.c to pass an SMAP to the i386 kernel. - Use uint32_t instead of u_long when aligning module metadata in bootinfo32.c in userboot, as otherwise the metadata used 64-bit alignment which corrupted the layout. - Populate the basemem and extmem members of the bootinfo struct passed to 32-bit kernels. - Fix the 32-bit stack in userboot to start at the top of the stack instead of the bottom so that there is room to grow before the kernel switches to its own stack. - Push a fake return address onto the 32-bit stack in addition to the arguments normally passed to exec() in the loader. This return address is needed to convince recover_bootinfo() in the 32-bit locore code that it is being invoked from a "new" boot block. - Add a routine to libvmmapi to setup a 32-bit flat mode register state including a GDT and TSS that is able to start the i386 kernel and update bhyveload to use it when booting an i386 kernel. - Use the guest register state to determine the CPU's current instruction mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long mode) in the instruction emulation code. Update the gla2gpa() routine used when fetching instructions to handle flat mode, 32-bit paging, and PAE paging in addition to long mode paging. Don't look for a REX prefix when the CPU is in 32-bit mode, and use the detected mode to enable the existing 32-bit mode code when decoding the mod r/m byte. Reviewed by: grehan, neel MFC after: 1 month	2014-02-05 04:39:03 +00:00
Tycho Nightingale	54e03e07b3	Add support for emulating the byte move and zero extend instructions: "mov r/m8, r32" and "mov r/m8, r64". Approved by: neel (co-mentor)	2014-02-05 02:01:08 +00:00
Peter Grehan	cde843b418	Changes to the SVM code to bring it up to r259205 - Convert VMM_CTR to VCPU_CTR KTR macros - Special handling of halt, save rflags for VMM layer to emulate halt for vcpu(sleep to be awakened by interrupt or stop it) - Cleanup of RVI exit handling code Submitted by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan	2014-02-04 07:13:56 +00:00
Peter Grehan	485ac45a53	MFC @ r259205 in preparation for some SVM updates. (for real this time)	2014-02-04 06:59:08 +00:00
Peter Grehan	e9ed7bc42d	Roll back botched partial MFC :(	2014-02-04 05:03:14 +00:00
Neel Natu	953c2c47eb	Avoid doing unnecessary nested TLB invalidations. Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu and per-hostcpu. In the degenerate case where 'N' vcpus were sharing a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations. Since an 'invept' invalidates mappings for all VPIDs the first 'invept' is sufficient. Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'. If it is known that an 'invept' is going to be done before entering the guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED counts the number of 'invvpid' invalidations that were avoided because they were subsumed by an 'invept'. Discussed with: grehan	2014-02-04 02:45:08 +00:00
Peter Grehan	b4bf798a37	MFC @ r259205 in preparation for some SVM updates.	2014-02-04 02:41:54 +00:00
John Baldwin	3cbf3585cb	Enhance the support for PCI legacy INTx interrupts and enable them in the virtio backends. - Add a new ioctl to export the count of pins on the I/O APIC from vmm to the hypervisor. - Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for ISA interrupts. - Populate the MP Table with I/O interrupt entries for any PCI INTx interrupts. - Create a _PRT table under the PCI root bridge in ACPI to route any PCI INTx interrupts appropriately. - Track which INTx interrupts are in use per-slot so that functions that share a slot attempt to distribute their INTx interrupts across the four available pins. - Implicitly mask INTx interrupts if either MSI or MSI-X is enabled and when the INTx DIS bit is set in a function's PCI command register. Either assert or deassert the associated I/O APIC pin when the state of one of those conditions changes. - Add INTx support to the virtio backends. - Always advertise the MSI capability in the virtio backends. Submitted by: neel (7) Reviewed by: neel MFC after: 2 weeks	2014-01-29 14:56:48 +00:00
John Baldwin	3d2ec11759	Add support for 'clac' and 'stac' to DDB's disassembler on amd64.	2014-01-27 18:53:18 +00:00
Neel Natu	30b94db8c0	Support level triggered interrupts with VT-x virtual interrupt delivery. The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector. If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger an EOI-induced VM-exit when it is doing EOI virtualization. The EOI-induced VM-exit results in the EOI being forwarded to the vioapic so that level triggered interrupts can be properly handled. Tested by: Anish Gupta (akgupt3@gmail.com)	2014-01-25 20:58:05 +00:00
Peter Grehan	062eef4911	Change RWX to XWR in comments to match intent and bit patterns in discussion of valid EPT pte protections. Discussed with: neel MFC after: 3 days	2014-01-25 06:58:41 +00:00
John Baldwin	e07ef9b0f6	Move <machine/apicvar.h> to <x86/apicvar.h>.	2014-01-23 20:10:22 +00:00
Neel Natu	36736912b6	Set "Interrupt Window Exiting" in the case where there is a vector to be injected into the vcpu but the VM-entry interruption information field already has the valid bit set. Pointed out by: David Reed (david.reed@tidalscale.com)	2014-01-23 06:06:50 +00:00
Neel Natu	c308b23b7a	Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler via a software interrupt. This is safe to do because the logical processor is already cognizant of the NMI and further NMIs are blocked until the host's NMI handler executes "iret".	2014-01-22 04:03:11 +00:00
Neel Natu	51f45d0146	There is no need to initialize the IOMMU if no passthru devices have been configured for bhyve to use. Suggested by: grehan@	2014-01-21 03:01:34 +00:00
Ed Maste	80f9f1580e	Add VT kernel configuration to ease testing of vt(9), aka Newcons	2014-01-19 18:46:38 +00:00
Neel Natu	48b2d828a2	Some processor's don't allow NMI injection if the STI_BLOCKING bit is set in the Guest Interruptibility-state field. However, there isn't any way to figure out which processors have this requirement. So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING are all clear. If any of these bits are set then enable "NMI window exiting" and inject the NMI in the VM-exit handler.	2014-01-18 21:47:12 +00:00
Bryan Venteicher	10c4018057	Add very simple virtio_random(4) driver to harvest entropy from host Reviewed by: markm (random bits only)	2014-01-18 06:14:38 +00:00
Neel Natu	e5a1d95089	If the guest exits due to a fault while it is executing IRET then restore the state of "Virtual NMI blocking" in the guest's interruptibility-state field before resuming the guest.	2014-01-18 02:20:10 +00:00
Neel Natu	160471d264	If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit in the Guest Interruptibility-state VMCS field. If we fail to do this then a subsequent VM-entry will fail because it is an error to inject an NMI into the guest while "NMI Blocking" is turned on. This is described in "Checks on Guest Non-Register State" in the Intel SDM. Submitted by: David Reed (david.reed@tidalscale.com)	2014-01-17 04:21:39 +00:00
Neel Natu	5b8a8cd1fe	Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous can be initiated in the context of a vcpu thread or from the bhyve(8) control process. The first use of this functionality is to update the vlapic trigger-mode register when the IOAPIC pin configuration is changed. Prior to this change we would update the TMR in the virtual-APIC page at the time of interrupt delivery. But this doesn't work with Posted Interrupts because there is no way to program the EOI_exit_bitmap[] in the VMCS of the target at the time of interrupt delivery. Discussed with: grehan@	2014-01-14 01:55:58 +00:00
Gavin Atkinson	56c63f28ed	Remove spaces from boot messages when we print the CPU ID/Family/Stepping to match the rest of the CPU identification lines, and once again fit into 80 columns in the usual case.	2014-01-11 22:41:10 +00:00
Neel Natu	176666c2c9	Enable "Posted Interrupt Processing" if supported by the CPU. This lets us inject interrupts into the guest without causing a VM-exit. This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir" to "0". The following sysctls provide information about this feature: - hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled) - hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification) Tested on a Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.	2014-01-11 04:22:00 +00:00
Neel Natu	f7d4742540	Enable the "Acknowledge Interrupt on VM exit" VM-exit control. This control is needed to enable "Posted Interrupts" and is present in all the Intel VT-x implementations supported by bhyve so enable it as the default. With this VM-exit control enabled the processor will acknowledge the APIC and store the vector number in the "VM-Exit Interruption Information" field. We now call the interrupt handler "by hand" through the IDT entry associated with the vector.	2014-01-11 03:14:05 +00:00
Neel Natu	add611fd4c	Don't expose 'vmm_ipinum' as a global.	2014-01-09 03:25:54 +00:00
Neel Natu	88c4b8d145	Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by hardware. It is possible to turn this feature off and fall back to software emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0. We now start handling two new types of VM-exits: APIC-access: This is a fault-like VM-exit and is triggered when the APIC register access is not accelerated (e.g. apic timer CCR). In response to this we do emulate the instruction that triggered the APIC-access exit. APIC-write: This is a trap-like VM-exit which does not require any instruction emulation but it does require the hypervisor to emulate the access to the specified register (e.g. icrlo register). Introduce 'vlapic_ops' which are function pointers to vector the various vlapic operations into processor-dependent code. The 'Virtual Interrupt Delivery' feature installs 'ops' for setting the IRR bits in the virtual APIC page and to return whether any interrupts are pending for this vcpu. Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.	2014-01-07 21:04:49 +00:00
Neel Natu	79c596309c	Fix a bug introduced in r260167 related to VM-exit tracing. Keep a copy of the 'rip' and the 'exit_reason' and use that when calling vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can be changed by 'vmx_exit_process()' and can lead to very misleading traces.	2014-01-07 18:53:14 +00:00
Neel Natu	4d1e82a88e	Allow vlapic_set_intr_ready() to return a value that indicates whether or not the vcpu should be kicked to process a pending interrupt. This will be useful in the implementation of the Posted Interrupt APICv feature. Change the return value of 'vlapic_pending_intr()' to indicate whether or not an interrupt is available to be delivered to the vcpu depending on the value of the PPR. Add KTR tracepoints to debug guest IPI delivery.	2014-01-07 00:38:22 +00:00
Neel Natu	c847a5062c	Split the VMCS setup between 'vmcs_init()' that does initialization and 'vmx_vminit()' that does customization. This makes it easier to turn on optional features (e.g. APICv) without having to keep adding new parameters to 'vmcs_set_defaults()'. Reviewed by: grehan@	2014-01-06 23:16:39 +00:00
Jens Schweikhardt	aa27ed4569	Correct a grammo in a comment; remove white space at EOL.	2014-01-06 17:23:22 +00:00
Neel Natu	5f8e2dfcb5	Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'. Pointed out by: rmh@	2014-01-03 19:29:33 +00:00
Neel Natu	0a9ae358fd	Fix a bug in the HPET emulation where a timer interrupt could be lost when the guest disables the HPET. The HPET timer interrupt is triggered from the callout handler associated with the timer. It is possible for the callout handler to be delayed before it gets a chance to execute. If the guest disables the HPET during this window then the handler never gets a chance to execute and the timer interrupt is lost. This is now fixed by injecting a timer interrupt into the guest if the callout time is detected to be in the past when the HPET is disabled.	2014-01-03 19:25:52 +00:00
Konstantin Belousov	27fd75d2c8	Update the description for pmap_remove_pages() to match the modern times [1]. Assert that the pmap passed to pmap_remove_pages() is only active on current CPU. Submitted by: alc [1] Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:50:52 +00:00
Konstantin Belousov	c0be75a58a	Assert that accounting for the pmap resident pages does not underflow. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:49:05 +00:00
Neel Natu	0492757c70	Restructure the VMX code to enter and exit the guest. In large part this change hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used to enter guest context and vmx_exit_guest() is used to transition back into host context. Fix a longstanding race where a vcpu interrupt notification might be ignored if it happens after vmx_inject_interrupts() but before host interrupts are disabled in vmx_resume/vmx_launch. We now called vmx_inject_interrupts() with host interrupts disabled to prevent this. Suggested by: grehan@	2014-01-01 21:17:08 +00:00
Dimitry Andric	24da2fe3d0	In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(), which has been unused since r189415. Reviewed by: alc MFC after: 3 days	2013-12-30 20:37:47 +00:00
Neel Natu	7c05bc3124	Modify handling of writes to the vlapic LVT registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. This also implies that we need to keep a snapshot of the last value written to a LVT register. We can no longer rely on the LVT registers in the APIC page to be "clean" because the guest can write anything to it before the hypervisor has had a chance to sanitize it.	2013-12-28 00:20:55 +00:00
Neel Natu	fafe884473	Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. We can no longer rely on the value of 'icr_timer' on the APIC page in the callout handler. With APIC register virtualization the value of 'icr_timer' will be updated by the processor in guest-context before an APIC-write VM-exit. Clear the 'delivery status' bit in the ICRLO register in the write handler. With APIC register virtualization the write happens in guest-context and we cannot prevent a (buggy) guest from setting this bit.	2013-12-27 20:18:19 +00:00
Dimitry Andric	6f0c167fe2	In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about regval possibly being used uninitialized. Reviewed by: neel	2013-12-27 12:15:53 +00:00
Neel Natu	2c52dcd9a8	Modify handling of write to the vlapic SVR register. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, mask all the LVT entries when the vlapic is software-disabled.	2013-12-27 07:01:42 +00:00
Neel Natu	3f0ddc7c5c	Modify handling of writes to the vlapic ID, LDR and DFR registers. The handlers are now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, we need to ensure that the value of these registers is always correctly reflected in the virtual APIC page, because there is no VM exit when the guest reads these registers with APIC register virtualization.	2013-12-26 19:58:30 +00:00
Neel Natu	de5ea6b65e	vlapic code restructuring to make it easy to support hardware-assist for APIC emulation. The vlapic initialization and cleanup is done via processor specific vmm_ops. This will allow the VT-x/SVM modules to layer any hardware-assist for APIC emulation or virtual interrupt delivery on top of the vlapic device model. Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic interrupts versus other events (e.g. NMI). This provides an opportunity to use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM) to deliver an interrupt to a guest without causing a VM-exit. Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the vlapic_xxx() counterparts directly. Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'. The 'Apic Page' is intended to be referenced from the Intel VMCS as the 'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.	2013-12-25 06:46:31 +00:00
John Baldwin	63e62d390d	Add a resume hook for bhyve that runs a function on all CPUs during resume. For Intel CPUs, invoke vmxon for CPUs that were in VMX mode at the time of suspend. Reviewed by: neel	2013-12-23 19:48:22 +00:00
John Baldwin	330baf58c6	Extend the support for local interrupts on the local APIC: - Add a generic routine to trigger an LVT interrupt that supports both fixed and NMI delivery modes. - Add an ioctl and bhyvectl command to trigger local interrupts inside a guest. In particular, a global NMI similar to that raised by SERR# or PERR# can be simulated by asserting LINT1 on all vCPUs. - Extend the LVT table in the vCPU local APIC to support CMCI. - Flesh out the local APIC error reporting a bit to cache errors and report them via ESR when ESR is written to. Add support for asserting the error LVT when an error occurs. Raise illegal vector errors when attempting to signal an invalid vector for an interrupt or when sending an IPI. - Ignore writes to reserved bits in LVT entries. - Export table entries the MADT and MP Table advertising the stock x86 config of LINT0 set to ExtInt and LINT1 wired to NMI. Reviewed by: neel (earlier version)	2013-12-23 19:29:07 +00:00
Neel Natu	f80330a820	Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE state before the requested state transition. This guarantees that there is exactly one ioctl() operating on a vcpu at any point in time and prevents unintended state transitions. More details available here: http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html Reviewed by: grehan Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com) MFC after: 3 days	2013-12-22 20:29:59 +00:00
Neel Natu	a783578566	Consolidate the virtual apic initialization in a single function: vlapic_reset()	2013-12-22 00:08:00 +00:00
Neel Natu	5515bb73e6	Re-arrange bits in the amd64/pmap 'pm_flags' field. The least significant 8 bits of 'pm_flags' are now used for the IPI vector to use for nested page table TLB shootdown. Previously we used IPI_AST to interrupt the host cpu which is functionally correct but could lead to misleading interrupt counts for AST handler. The AST handler was also doing a lot more than what is required for the nested page table TLB shootdown (EOI and IRET).	2013-12-20 05:50:22 +00:00
Peter Grehan	a0b78f096a	Enable memory overcommit for AMD processors. - No emulation of A/D bits is required since AMD-V RVI supports A/D bits. - Enable pmap PT_RVI support(w/o PAT) which is required for memory over-commit support. - Other minor fixes: * Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while delivering an interrupt, EXITINTINFO has all the details that bhyve needs to inject the same interrupt. * SVM h/w decode assist code was incomplete - removed for now. * Some minor code clean-up (more coming). Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-12-18 23:39:42 +00:00
Peter Grehan	d8ced94511	MFC @ r256071 This is the change where the bhyve_npt_pmap branch was merged in to head. The SVM changes to work with this will be in a follow-on submit.	2013-12-18 22:31:53 +00:00
Neel Natu	3de8386283	Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite() respectively. The vmcs_xxx() functions provide inline error checking of all accesses to the VMCS.	2013-12-18 06:24:21 +00:00
Neel Natu	4f8be175d5	Add an API to deliver message signalled interrupts to vcpus. This allows callers treat the MSI 'addr' and 'data' fields as opaque and also lets bhyve implement multiple destination modes: physical, flat and clustered. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com) Reviewed by: grehan@	2013-12-16 19:59:31 +00:00
Neel Natu	a83011d2e7	Fix typo when initializing the vlapic version register ('<<' instead of '<').	2013-12-11 06:28:44 +00:00
Neel Natu	becd984900	Fix x2apic support in bhyve. When the guest is bringing up the APs in the x2APIC mode a write to the ICR register will now trigger a return to userspace with an exitcode of VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC. Change the vlapic timer lock to be a spinlock because the vlapic can be accessed from within a critical section (vm run loop) when guest is using x2apic mode. Reviewed by: grehan@	2013-12-10 22:56:51 +00:00
John Baldwin	316032ad20	Move constants for indices in the local APIC's local vector table from apicvar.h to apicreg.h.	2013-12-09 21:08:52 +00:00
Neel Natu	fb03ca4e42	Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit. This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'. Discussed with: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 23:11:12 +00:00
Neel Natu	1c05219285	If a vcpu disables its local apic and then executes a 'HLT' then spin down the vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore pending interrupts in the IRR if interrupts have been disabled by the guest. The interrupt cannot be injected into the guest in any case so resuming it is futile. With this change "halt" from a Linux guest works correctly. Reviewed by: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 22:18:36 +00:00
John Baldwin	5c79f1f9df	Fix a typo.	2013-12-05 21:58:02 +00:00
Neel Natu	7a3c80aa55	The 'protection' field in the VM exit collateral for the PAGING exit is not used - get rid of it.	2013-12-03 01:21:21 +00:00
Neel Natu	2282187475	Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function has outgrown its original name. Originally this function simply sent an IPI to the host cpu that a vcpu was executing on but now it does a lot more than just that. Reviewed by: grehan@	2013-12-03 00:43:31 +00:00
Eitan Adler	7a22215c53	Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this shifts into the sign bit. Instead use (1U << 31) which gets the expected result. This fix is not ideal as it assumes a 32 bit int, but does fix the issue for most cases. A similar change was made in OpenBSD. Discussed with: -arch, rdivacky Reviewed by: cperciva	2013-11-30 22:17:27 +00:00
Pawel Jakub Dawidek	f2b525e6b9	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days	2013-11-30 15:08:35 +00:00
Neel Natu	b5b28fc9dc	Add support for level triggered interrupt pins on the vioapic. Prior to this commit level triggered interrupts would work as long as the pin was not shared among multiple interrupt sources. The vlapic now keeps track of level triggered interrupts in the trigger mode register and will forward the EOI for a level triggered interrupt to the vioapic. The vioapic in turn uses the EOI to sample the level on the pin and re-inject the vector if the pin is still asserted. The vhpet is the first consumer of level triggered interrupts and advertises that it can generate interrupts on pins 20 through 23 of the vioapic. Discussed with: grehan@	2013-11-27 22:18:08 +00:00
Konstantin Belousov	291bfc8d24	Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32 compilation results in inclusion of the header, a confict arises due to savefpu being union for i386, but used as struct in the pcb definition. The 32bit code should not need amd64 variant of the struct pcb anyway. For struct region_descriptor, use __uint64_t instead of unsigned long, as the base type for bit-fields. Unsigned long cannot have width 64 for -m32. The changes allowed to use sys/sysctl.h for cc -m32. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-26 19:38:42 +00:00
Neel Natu	08e3ff329a	Add HPET device emulation to bhyve. bhyve supports a single timer block with 8 timers. The timers are all 32-bit and capable of being operated in periodic mode. All timers support interrupt delivery using MSI. Timers 0 and 1 also support legacy interrupt routing. At the moment the timers are not connected to any ioapic pins but that will be addressed in a subsequent commit. This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).	2013-11-25 19:04:51 +00:00
Attilio Rao	54366c0bd7	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
Neel Natu	ac7304a758	Add an ioctl to assert and deassert an ioapic pin atomically. This will be used to inject edge triggered legacy interrupts into the guest. Start using the new API in device models that use edge triggered interrupts: viz. the 8254 timer and the LPC/uart device emulation. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-23 03:56:03 +00:00
Neel Natu	af480303a9	Eliminate redundant information about the host cpu in bhyve's KTR trace points. This is always tracked by ktr(4) and can be displayed using the "-c" option of ktrdump(8). Discussed with: grehan	2013-11-22 18:57:22 +00:00
Ed Maste	7b7d8599fe	Don't abort SMAP processing after an entry of length 0 Length 0 is not special and should just be skipped. This is the same behaviour as i386. Discussed with: jhb@ Sponsored by: The FreeBSD Foundation	2013-11-22 14:56:10 +00:00
Andreas Tobler	d2ef321a59	Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the CONCAT macros in SYS.h. Reviewed by: bde, kib	2013-11-21 21:25:58 +00:00
Ed Maste	ff89f4778a	Refactor amd64 startup SMAP parsing Extracted from the projects/uefi branch, this change is a reasonable cleanup and will reduce the diffs to review when bringing in the UEFI work. Reviewed by: kib@ Sponsored by: The FreeBSD Foundation	2013-11-21 19:20:08 +00:00
Ed Maste	aff122d6aa	Disable amd64 boot time memory test by default The page presence memory test takes a long time on large memory systems and has little value on contemporary amd64 hardware. Sponsored by: The FreeBSD Foundation	2013-11-21 18:37:11 +00:00
Justin T. Gibbs	4fd76feafd	Fix accounting for hw.realmem on the i386 and amd64 platforms. sys/i386/i386/machdep.c: sys/amd64/amd64/machdep.c: The value reported by FreeBSD as "real memory" when booting doesn't match what is later reported by sysctl as hw.realmem. This is due to the fact that the value printed during the boot process is fetched from smbios data (when possible), and accounts for holes in physical memory. On the other hand, the value of hw.realmem is unconditionally set to be one larger than the highest page of the physical address space. Fix this by setting hw.realmem to the same value printed during boot, this makes hw.realmem honour it's name and account properly for physical memory present in the system. Submitted by: Roger Pau Monné Reviewed by: gibbs	2013-11-15 16:05:55 +00:00
Ed Maste	3d271aaab0	x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...) Debuggers may need to change PSL_RF. Note that tf_eflags is already stored in the signal context during signal handling and PSL_RF previously could be modified via sigreturn, so this change should not provide any new ability to userspace. For background see the thread at: http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html Reviewed by: jhb, kib Sponsored by: DARPA, AFRL	2013-11-14 15:37:20 +00:00
Neel Natu	565bbb8698	Move the ioapic device model from userspace into vmm.ko. This is needed for upcoming in-kernel device emulations like the HPET. The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to manipulate the ioapic pin state. Discussed with: grehan@ Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-12 22:51:03 +00:00
Konstantin Belousov	6f8a44a5dd	Add bits for the AMD features from CPUID function 0x80000001 ECX, described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf. Partially based on the patch submitted by: Dmitry Luhtionov <dmitryluhtionov@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-08 16:32:30 +00:00
Alan Cox	c70af4875e	As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other words, every architecture is now auto-sizing the kmem arena. This revision changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes mandatory and the definition of VM_KMEM_SIZE becomes optional. Replace or eliminate all existing definitions of VM_KMEM_SIZE. With auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for clarity. Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to that of setting the tunable vm.kmem_size. Whereas the macros VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set the size of the kmem arena at compile-time without that value being overridden by auto-sizing. Update the nearby comments to reflect the kmem submap being replaced by the kmem arena. Stop duplicating the auto-sizing formula in every machine- dependent vmparam.h and place it in kmeminit() where auto-sizing takes place. Reviewed by: kib (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-11-08 16:25:00 +00:00
Neel Natu	03cd05011f	Remove the 'vdev' abstraction that was meant to sit on top of device models in the kernel. This abstraction was redundant because the only device emulated inside vmm.ko is the local apic and it is always at a fixed guest physical address. Discussed with: grehan	2013-11-04 23:25:07 +00:00
Neel Natu	513c8d338d	Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these tracepoints are vcpu-specific. Add support for tracepoints that are global to the virtual machine - these tracepoints are called VM_CTRx().	2013-10-31 05:20:11 +00:00
Mark Johnston	57170f49f2	Remove references to an unused fasttrap probe hook, and remove the corresponding x86 trap type. Userland DTrace probes are currently handled by the other fasttrap hooks (dtrace_pid_probe_ptr and dtrace_return_probe_ptr). Discussed with: rpaulo	2013-10-31 02:35:00 +00:00
Peter Grehan	064bee341e	MFC @ r256071 This is just prior to the bhyve_npt_pmap import so will allow just the change to be merged for easier debug.	2013-10-30 00:05:02 +00:00
Neel Natu	e2f5d9a129	Remove unnecessary includes of <machine/pmap.h> Requested by: alc@	2013-10-29 02:25:18 +00:00
Gleb Smirnoff	69eb2b176c	Include XEN and HyperV into amd64 LINT.	2013-10-28 21:11:28 +00:00
Konstantin Belousov	86be9f0dd5	Import the driver for VT-d DMAR hardware, as specified in the revision 1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture Specification. The Extended Context and PASIDs from the rev. 2.2 are not supported, but I am not aware of any released hardware which implements them. Code does not use queued invalidation, see comments for the reason, and does not provide interrupt remapping services. Code implements the management of the guest address space per domain and allows to establish and tear down arbitrary mappings, but not partial unmapping. The superpages are created as needed, but not promoted. Faults are recorded, fault records could be obtained programmatically, and printed on the console. Implement the busdma(9) using DMARs. This busdma backend avoids bouncing and provides security against misbehaving hardware and driver bad programming, preventing leaks and corruption of the memory by wild DMA accesses. By default, the implementation is compiled into amd64 GENERIC kernel but disabled; to enable, set hw.dmar.enable=1 loader tunable. Code is written to work on i386, but testing there was low priority, and driver is not enabled in GENERIC. Even with the DMAR turned on, individual devices could be directed to use the bounce busdma with the hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable. If DMARs are capable of the pass-through translations, it is used, otherwise, an identity-mapping page table is constructed. The driver was tested on Xeon 5400/5500 chipset legacy machine, Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4), ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices. It also works with em(4) and igb(4), but there some fixes are needed for drivers, which are not committed yet. Intel GPUs do not work with DMAR (yet). Many thanks to John Baldwin, who explained me the newbus integration; Peter Holm, who did all testing and helped me to discover and understand several incredible bugs; and to Jim Harris for the access to the EDS and BWG and for listening when I have to explain my findings to somebody. Sponsored by: The FreeBSD Foundation MFC after: 1 month	2013-10-28 13:33:29 +00:00
Konstantin Belousov	e20f049b87	Several small fixes for the amd64 minidump code. In report_progress(), use nitems(progress_track) instead of manually hard-coding array size. Wrap long line. In blk_write(), code verifies that ptr and pa cannot be non-zero simultaneously. The later check for the page-alignment of the ptr argument never triggers due to pa != 0 always implying ptr == NULL. I believe that the intent was to ensure that physicall address passed is page-aligned, since the address is (temporary) mapped for the duration of the page write. Clear the progress_track.visited fields when starting minidump. If minidump is restarted or taken second time during the system lifetime, progress is not printed otherwise, making operator suspectible to the dump status. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-10-27 16:31:12 +00:00
Gleb Smirnoff	eedc7fd9e8	Provide includes that are needed in these files, and before were read in implicitly via if.h -> if_var.h pollution. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2013-10-26 18:18:50 +00:00
Neel Natu	ab76fd5833	The ASID allocation in SVM is incorrect because it allocates a single ASID for all vcpus belonging to a guest. This means that when different vcpus belonging to the same guest are executing on the same host cpu there may be "leakage" in the mappings created by one vcpu to another. The proper fix for this is being worked on and will be committed shortly. In the meantime workaround this bug by flushing the guest TLB entries on every VM entry. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-10-21 23:46:37 +00:00
Neel Natu	49cc03da31	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days	2013-10-16 18:20:27 +00:00
Peter Grehan	4599af439f	Fix SVM handling of ASTPENDING, which manifested as a hang on console output (due to a missing interrupt). SVM does exit processing and then handles ASTPENDING which overwrites the already handled SVM exit cause and corrupts virtual machine state. For example, if the SVM exit was due to an I/O port access but the main loop detected an ASTPENDING, the exit would be processed as ASTPENDING and leave the device (e.g. emulated UART) for that I/O port in bad state. Submitted by: Anish Gupta (akgupt3@gmail.com) Reviewed by: grehan	2013-10-16 05:43:03 +00:00
Neel Natu	d38cae4aad	Fix the witness warning that warned against calling uiomove() while holding the 'vmmdev_mtx' in vmmdev_rw(). Rely on the 'si_threadcount' accounting to ensure that we never destroy the VM device node while it has operations in progress (e.g. ioctl, mmap etc). Reported by: grehan Reviewed by: grehan	2013-10-16 00:58:47 +00:00
Glen Barber	6b48eebec6	Document XENHVM and xenpci are mutually inclusive. Submitted by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-10-11 19:40:28 +00:00
Dimitry Andric	9cba9d0157	In sys/amd64/amd64/pmap.c, fix several gcc warnings about uninitialized variables in reclaim_pv_chunk(). Approved by: re (marius) Reviewed by: neel, kib X-MFC-With: r256072	2013-10-08 20:04:35 +00:00
Justin T. Gibbs	5fdd34ee20	Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id field. Perform vcpu enumeration for Xen PV and HVM environments and convert all Xen drivers to use vcpu_id instead of a hard coded assumption of the mapping algorithm (acpi or apic ID) in use. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) amd64/include/pcpu.h: i386/include/pcpu.h: Add vcpu_id to the amd64 and i386 pcpu structures. dev/xen/timer/timer.c x86/xen/xen_intr.c Use new vcpu_id instead of assuming acpi_id == vcpu_id. i386/xen/mp_machdep.c: i386/xen/mptable.c x86/xen/hvm.c: Perform Xen HVM and Xen full PV vcpu_id mapping. x86/xen/hvm.c: x86/acpica/madt.c Change SYSINIT ordering of acpi CPU enumeration so that it is guaranteed to be available at the time of Xen HVM vcpu id mapping.	2013-10-05 23:11:01 +00:00
Neel Natu	318224bbe6	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries as mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For e.g. the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem. This is achieved by using the following mapping for EPT entries that need emulation of A/D bits: Bit Position Interpreted By PG_V 52 software (accessed bit emulation handler) PG_RW 53 software (dirty bit emulation handler) PG_A 0 hardware (aka EPT_PG_RD) PG_M 1 hardware (aka EPT_PG_WR) The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guest's with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable. Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho	2013-10-05 21:22:35 +00:00
John-Mark Gurney	29904f46d6	add aesni module to i386 and amd64 NOTES... Approved by: re (gjb)	2013-10-04 17:21:01 +00:00
Peter Grehan	e58d944482	Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This is enough to get Ubuntu 12.0.4/13.0.4 to boot. Approved by: re@ (blanket)	2013-09-27 14:55:59 +00:00
Konstantin Belousov	4cb8b041d1	In pmap_clear_modify(), initialize pvh even for fictitious managed page, otherwise the small mappings loop would use uninitialized value. Note that currently pmap_clear_modify() is not called for fictitious pages. Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-24 13:52:47 +00:00
Konstantin Belousov	fecfc089e4	Use the pv lists generation count to read-lock the pvh_global_lock in pmap_clear_modify(). Noted and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-24 12:26:43 +00:00
Konstantin Belousov	75f50c53f1	Ensure that the ERESTART return from the syscall reloads the registers, to make the restarted syscall instruction pass the correct arguments. PR: kern/182161 Reported by: Russ Cox <rsc@swtch.com> Sponsored by: The FreeBSD Foundation MFC after: 3 days Approved by: re (marius)	2013-09-24 12:24:48 +00:00
Konstantin Belousov	ad43b98491	Free both KVA and backing pages when freeing TSS memory. Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-23 20:14:15 +00:00
Glen Barber	91aff61084	Put 'device hyperv' back in amd64/GENERIC, incorrectly removed with r255736. Pointed out by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-09-21 01:07:27 +00:00
Peter Grehan	36f23e3c20	Reorder/regroup the vmm ioctl api definitions to allow some semblance of API stability and growth during the 10.* timeframe. Userland/kernel bhyve will have to be recompiled after this. Reviewed by: neel Approved by: re@ (blanket)	2013-09-21 00:27:53 +00:00
Justin T. Gibbs	566a5f5020	Merge Xen PVHVM support into the GENERIC kernel config for both amd64 and i386. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: - Introduce two new CPU hooks for initialization and resume purposes. This allows us to get rid of the XENHVM ifdefs in mp_machdep, and also sets some hooks into common code that can be used by other hypervisor implementations. sys/amd64/conf/XENHVM: sys/i386/conf/XENHVM: - Remove these configs now that GENERIC has builtin support for Xen HVM. sys/kern/subr_smp.c: - Make sure there are no pending IPIs when suspending a system. sys/x86/xen/hvm.c: - Add cpu init and resume vectors that are called from mp_machdep using the new hooks. - Only clear the vcpu_info mapping data on resume. It is already clear for the BSP on a cold boot and is set correctly as APs are started. - Gate xen_hvm_init_cpu only to systems running under Xen. sys/x86/xen/xen_intr.c: - Gate the setup of event channels only to systems running under Xen.	2013-09-20 22:59:22 +00:00
David Christensen	4e4007688c	Substantial rewrite of bxe(4) to add support for the BCM57712 and BCM578XX controllers. Approved by: re MFC after: 4 weeks	2013-09-20 20:18:49 +00:00
Neel Natu	74d1d2b7cc	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
Justin T. Gibbs	428b7ca290	Add support for suspend/resume/migration operations when running as a Xen PVHVM guest. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information. sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers. sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume. sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU. sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume. sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus. sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info. - Clear vcpu_info so it can be reset by APs when resuming from suspension. sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations. sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.	2013-09-20 05:06:03 +00:00
Alan Cox	deb179bb4c	The pmap function pmap_clear_reference() is no longer used. Remove it. pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists. Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division	2013-09-20 04:30:18 +00:00
Peter Grehan	ef90af83a5	IFC @ r255692 Comment out IA32_MISC_ENABLE MSR access - this doesn't exist on AMD. Need to sort out how arch-specific MSRs will be handled.	2013-09-20 00:46:29 +00:00
Peter Grehan	d83d73618f	Reconnect the hyperv drivers back into GENERIC now that the disengage driver issue has been resolved. Approved by: re@ (gjb)	2013-09-19 05:07:51 +00:00
Pawel Jakub Dawidek	3fded357af	Fix panic in ktrcapfail() when no capability rights are passed. While here, correct all consumers to pass NULL instead of 0 as we pass capability rights as pointers now, not uint64_t. Reported by: Daniel Peyrolon Tested by: Daniel Peyrolon Approved by: re (marius)	2013-09-18 19:26:08 +00:00
Roman Divacky	69d912af45	Regen. Approved by: re (delphij)	2013-09-18 18:49:26 +00:00
Roman Divacky	b12698e1a1	Revert r255672, it has some serious flaws, leaking file references etc. Approved by: re (delphij)	2013-09-18 18:48:33 +00:00
Roman Divacky	70ccaaf58e	Regen. Approved by: re (delphij)	2013-09-18 17:58:03 +00:00
Roman Divacky	253c75c0de	Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file. Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich. Approved by: re (delphij) Sponsored by: Google Summer of Code Submitted by: Yuri Victorovich <yuri at rawbw dot com> Tested by: Yuri Victorovich <yuri at rawbw dot com>	2013-09-18 17:56:04 +00:00
Peter Grehan	517e21d3e7	Hide TSC-deadline APIC timer support from guests. This mode isn't yet implemented in bhyve's APIC emulation. Reviewed by: neel Approved by: re@ (blanket)	2013-09-17 17:56:53 +00:00
Neel Natu	0f9d5dc758	Fix a bug in decoding an instruction that has an SIB byte as well as an immediate operand. The presence of an SIB byte in decoding the ModR/M field would cause 'imm_bytes' to not be set to the correct value. Fix this by initializing 'imm_bytes' independent of the ModR/M decoding. Reported by: grehan@ Approved by: re@	2013-09-17 16:06:07 +00:00
Bryan Venteicher	03c6abfd1c	Add vmx(4) to i386 and amd64 GENERIC Approved by: re (gjb)	2013-09-17 01:54:13 +00:00
Konstantin Belousov	70b9173019	In pmap_copy(), when the copied region is mapped with superpage but does not cover entire superpage, avoid copying. Doing partial copy would require demotion, which is incompatible with the already held locks. Reported by: cperciva Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (delphij)	2013-09-16 06:15:15 +00:00
Peter Grehan	b90fcf02f2	Pull the hyperv drivers from GENERIC until the fix to the disengage driver to make it only probe when running on hyperv is reviewed and tested. Approved by: re (rodrigc)	2013-09-14 20:38:22 +00:00
Peter Grehan	ab7fb3bca7	Import Hyper-V paravirtualized drivers from projects/hyperv branch into head. Approved by: re@ (hrs) Obtained from: Microsoft, NetApp, and Citrix.	2013-09-13 18:47:58 +00:00
Neel Natu	0f1ef0ec80	Fix a limitation in bhyve that would limit the number of virtual machines to the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a VT-d domain for a guest only if the administrator has explicitly configured one or more PCI passthru device(s). If there are no PCI passthru devices configured (the common case) then the number of virtual machines is no longer limited by the maximum number of VT-d domains. Reviewed by: grehan@ Approved by: re@	2013-09-11 07:11:14 +00:00
Peter Grehan	47823319c3	IFC @ r255459	2013-09-11 00:19:16 +00:00
Peter Grehan	8d39ed16c2	Go way past 11 and bump bhyve's max vCPUs to 16. This should be sufficient for 10.0 and will do until forthcoming work to avoid limitations in this area is complete. Thanks to Bela Lubkin at tidalscale for the headsup on the apic/cpu id/io apic ASL parameters that are actually hex values and broke when written as decimal when 11 vCPUs were configured. Approved by: re@	2013-09-10 03:48:18 +00:00
Alan Cox	70c4180f1c	Prior to r254304, we only began scanning the active page queue when the amount of free memory was close to the point at which we would begin reclaiming pages. Now, we continuously scan the active page queue, regardless of the amount of free memory. Consequently, we are continuously calling pmap_ts_referenced() on active pages. Prior to this change, pmap_ts_referenced() would always demote superpage mappings in order to obtain finer-grained reference information. This made sense because we were coming under memory pressure and would soon have to begin reclaiming pages. Now, however, with continuous scanning of the active page queue, these demotions are taking a toll on performance. For example, on one of my test machines, the running time for the HPCC Random Access benchmark (also known as GUPS) has increased by 54%. To address this problem, I have replaced the demotion with a heuristic for periodically clearing the reference flag on superpage mappings. Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-09-08 21:30:53 +00:00
Neel Natu	45e51299b3	Allocate VPIDs by using the unit number allocator to keep do the bookkeeping. Also deal with VPID exhaustion by allocating out of a reserved range as the last resort.	2013-09-07 05:30:34 +00:00
Peter Grehan	8a02f69652	Mask off the vector from the MSI-x data word. Some o/s's set the trigger-mode level bit which results in an invalid vector and pass-thru interrupts not being delivered.	2013-09-07 03:33:36 +00:00
Justin T. Gibbs	e44af46e4c	Implement PV IPIs for PVHVM guests and further converge PV and HVM IPI implmementations. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Submitted by: gibbs (misc cleanup, table driven config) Reviewed by: gibbs MFC after: 2 weeks sys/amd64/include/cpufunc.h: sys/amd64/amd64/pmap.c: Move invltlb_globpcid() into cpufunc.h so that it can be used by the Xen HVM version of tlb shootdown IPI handlers. sys/x86/xen/xen_intr.c: sys/xen/xen_intr.h: Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(), and remove the ipi vector parameter. This api allocates an event channel port that can be used for ipi services, but knows nothing of the actual ipi for which that port will be used. Removing the unused argument and cleaning up the comments surrounding its declaration helps clarify its actual role. sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: Implement a generic framework for amd64 and i386 that allows the implementation of certain CPU management functions to be selected at runtime. Currently this is only used for the ipi send function, which we optimize for Xen when running on a Xen hypervisor, but can easily be expanded to support more operations. sys/x86/xen/hvm.c: Implement Xen PV IPI handlers and operations, replacing native send IPI. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: sys/i386/include/smp.h: Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS is defined already for us in the xen interface files. NR_IPIS is only needed in one file per Xen platform and is easily inferred by the IPI vector table that is defined in those files. sys/i386/xen/mp_machdep.c: Restructure to more closely match the HVM implementation by performing table driven IPI setup.	2013-09-06 22:17:02 +00:00
Bryan Venteicher	ddb4ffd0c6	Add vmx device to the i386 and amd64 NOTES files	2013-09-06 20:24:21 +00:00
Konstantin Belousov	9430f833ca	Only lock pvh_global_lock read-only for pmap_page_wired_mappings(), pmap_is_modified() and pmap_is_referenced(), same as it was done for pmap_ts_referenced(). Consolidate identical code for pmap_is_modified() and pmap_is_referenced() into helper pmap_page_test_mappings(). Reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation	2013-09-06 16:53:48 +00:00
Konstantin Belousov	3e4f32be7d	In pmap_ts_referenced(), when restarting the loop due to pv list generation changed, do not drop and immediately relock the pv list. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-09-06 16:48:34 +00:00
Gleb Smirnoff	e16477e8d9	On those machines, where sf_bufs do not represent any real object, make sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely empty functions. Reviewed by: alc, kib, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-09-06 05:37:49 +00:00
Peter Grehan	76c35ba80f	Emulate reading of the IA32_MISC_ENABLE MSR, by returning the host MSR and masking off features that aren't supported. Linux reads this MSR to detect if NX has been disabled via BIOS.	2013-09-06 05:20:11 +00:00
Peter Grehan	8b7e3e3022	Allow CPUID leaf 0xD to be read as zeroes. Linux reads this even though extended features aren't exposed. Support for 0xD will be expanded once AVX[2] is exposed to the guest in upcoming work.	2013-09-06 05:16:10 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Konstantin Belousov	6aceaa3e17	Tidy up some loose ends in the PCID code: - Restore the pre-PCID TLB shootdown handlers for whole address space and single page invalidation asm code, and assign the IPI handler to them when PCID is not supported or disabled. Old handlers have linear control flow. But, still use the common return sequence. - Stop using pcpu for INVPCID descriptors in the invlrg handler. It is enough to allocate descriptors on the stack. As result, two SWAPGS instructions are shaved off from the code for Haswell+. - Fix the reverted condition in invlrng for checking of the PCID support [1], also in invlrng check that pmap is kernel pmap before performing other tests. For the kernel pmap, which provides global mappings, the INVLPG must be used for invalidation always. - Save the pre-computed pmap' %CR3 register in the struct pmap. This allows to remove several checks for pm_pcid validity when %CR3 is reloaded [2]. Noted by: gibbs [1] Discussed with: alc [2] Tested by: pho, flo Sponsored by: The FreeBSD Foundation	2013-09-04 23:31:29 +00:00
Peter Grehan	46ed9e4908	IFC @ r255209	2013-09-04 20:55:56 +00:00
John Baldwin	dffe0dc4d2	Add support for the 'invpcid' instruction to binutils and DDB's disassembler on amd64. MFC after: 1 month	2013-09-03 21:21:47 +00:00
Konstantin Belousov	f27d53b8f2	Fix two build failures for non-tb configurations, UP [2] and when using gas [1]. Reported by: andreast [1], bf [2] Sponsored by: The FreeBSD Foundation	2013-08-31 19:13:21 +00:00
Konstantin Belousov	1099068118	The pm_save should be cleared on the pmap initialization, and not on the activation. Noted by: alc	2013-08-30 20:10:01 +00:00
Konstantin Belousov	37eed8419c	Implement support for the process-context identifiers ('PCID') on Intel CPUs. The feature tags TLB entries with the Id of the address space and allows to avoid TLB invalidation on the context switch, it is available only in the long mode. In the microbenchmarks, using the PCID decreased latency of the context switches by ~30% on SandyBridge class desktop CPUs, measured with the lat_ctx program from lmbench. If available, use INVPCID instruction when a TLB entry in non-current address space needs to be invalidated. The instruction is typically available on the Haswell. If needed, the use of PCID can be turned off with the vm.pmap.pcid_enabled loader tunable set to 0. The state of the feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl vm.pmap.pcid_save_cnt reports the number of context switches which avoided invalidating the TLB; compare with the total number of context switches, available as sysctl vm.stats.sys.v_swtch. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:59:49 +00:00
Konstantin Belousov	5f5703ef52	Provide a wrapper for the INVPCID instruction, definition of the descriptor and symbolic names for the operation types. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:42:38 +00:00
Justin T. Gibbs	76acc41fb7	Implement vector callback for PVHVM and unify event channel implementations Re-structure Xen HVM support so that: - Xen is detected and hypercalls can be performed very early in system startup. - Xen interrupt services are implemented using FreeBSD's native interrupt delivery infrastructure. - the Xen interrupt service implementation is shared between PV and HVM guests. - Xen interrupt handlers can optionally use a filter handler in order to avoid the overhead of dispatch to an interrupt thread. - interrupt load can be distributed among all available CPUs. - the overhead of accessing the emulated local and I/O apics on HVM is removed for event channel port events. - a similar optimization can eventually, and fairly easily, be used to optimize MSI. Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure, and misc Xen cleanups: Sponsored by: Spectra Logic Corporation Unification of PV & HVM interrupt infrastructure, bug fixes, and misc Xen cleanups: Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D sys/x86/x86/local_apic.c: sys/amd64/include/apicvar.h: sys/i386/include/apicvar.h: sys/amd64/amd64/apic_vector.S: sys/i386/i386/apic_vector.s: sys/amd64/amd64/machdep.c: sys/i386/i386/machdep.c: sys/i386/xen/exception.s: sys/x86/include/segments.h: Reserve IDT vector 0x93 for the Xen event channel upcall interrupt handler. On Hypervisors that support the direct vector callback feature, we can request that this vector be called directly by an injected HVM interrupt event, instead of a simulated PCI interrupt on the Xen platform PCI device. This avoids all of the overhead of dealing with the emulated I/O APIC and local APIC. It also means that the Hypervisor can inject these events on any CPU, allowing upcalls for different ports to be handled in parallel. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: Map Xen per-vcpu area during AP startup. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: Increase the FreeBSD IRQ vector table to include space for event channel interrupt sources. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: Remove Xen HVM per-cpu variable data. These fields are now allocated via the dynamic per-cpu scheme. See xen_intr.c for details. sys/amd64/include/xen/hypercall.h: sys/dev/xen/blkback/blkback.c: sys/i386/include/xen/xenvar.h: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/xen/gnttab.c: Prefer FreeBSD primatives to Linux ones in Xen support code. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: sys/dev/xen/balloon/balloon.c: sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/console/xencons_ring.c: sys/dev/xen/control/control.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/xenpci/xenpci.c: sys/i386/i386/machdep.c: sys/i386/include/pmap.h: sys/i386/include/xen/xenfunc.h: sys/i386/isa/npx.c: sys/i386/xen/clock.c: sys/i386/xen/mp_machdep.c: sys/i386/xen/mptable.c: sys/i386/xen/xen_clock_util.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/xen_rtc.c: sys/xen/evtchn/evtchn_dev.c: sys/xen/features.c: sys/xen/gnttab.c: sys/xen/gnttab.h: sys/xen/hvm.h: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_if.m: sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusvar.h: sys/xen/xenstore/xenstore.c: sys/xen/xenstore/xenstore_dev.c: sys/xen/xenstore/xenstorevar.h: Pull common Xen OS support functions/settings into xen/xen-os.h. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: Remove constants, macros, and functions unused in FreeBSD's Xen support. sys/xen/xen-os.h: sys/i386/xen/xen_machdep.c: sys/x86/xen/hvm.c: Introduce new functions xen_domain(), xen_pv_domain(), and xen_hvm_domain(). These are used in favor of #ifdefs so that FreeBSD can dynamically detect and adapt to the presence of a hypervisor. The goal is to have an HVM optimized GENERIC, but more is necessary before this is possible. sys/amd64/amd64/machdep.c: sys/dev/xen/xenpci/xenpcivar.h: sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/sys/kernel.h: Refactor magic ioport, Hypercall table and Hypervisor shared information page setup, and move it to a dedicated HVM support module. HVM mode initialization is now triggered during the SI_SUB_HYPERVISOR phase of system startup. This currently occurs just after the kernel VM is fully setup which is just enough infrastructure to allow the hypercall table and shared info page to be properly mapped. sys/xen/hvm.h: sys/x86/xen/hvm.c: Add definitions and a method for configuring Hypervisor event delievery via a direct vector callback. sys/amd64/include/xen/xen-os.h: sys/x86/xen/hvm.c: sys/conf/files: sys/conf/files.amd64: sys/conf/files.i386: Adjust kernel build to reflect the refactoring of early Xen startup code and Xen interrupt services. sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: sys/dev/xen/control/control.c: sys/dev/xen/evtchn/evtchn_dev.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/xen/xenstore/xenstore.c: sys/xen/evtchn/evtchn_dev.c: sys/dev/xen/console/console.c: sys/dev/xen/console/xencons_ring.c Adjust drivers to use new xen_intr_*() API. sys/dev/xen/blkback/blkback.c: Since blkback defers all event handling to a taskqueue, convert this task queue to a "fast" taskqueue, and schedule it via an interrupt filter. This avoids an unnecessary ithread context switch. sys/xen/xenstore/xenstore.c: The xenstore driver is MPSAFE. Indicate as much when registering its interrupt handler. sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbusvar.h: Remove unused event channel APIs. sys/xen/evtchn.h: Remove all kernel Xen interrupt service API definitions from this file. It is now only used for structure and ioctl definitions related to the event channel userland device driver. Update the definitions in this file to match those from NetBSD. Implementing this interface will be necessary for Dom0 support. sys/xen/evtchn/evtchnvar.h: Add a header file for implemenation internal APIs related to managing event channels event delivery. This is used to allow, for example, the event channel userland device driver to access low-level routines that typical kernel consumers of event channel services should never access. sys/xen/interface/event_channel.h: sys/xen/xen_intr.h: Standardize on the evtchn_port_t type for referring to an event channel port id. In order to prevent low-level event channel APIs from leaking to kernel consumers who should not have access to this data, the type is defined twice: Once in the Xen provided event_channel.h, and again in xen/xen_intr.h. The double declaration is protected by __XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared twice within a given compilation unit. sys/xen/xen_intr.h: sys/xen/evtchn/evtchn.c: sys/x86/xen/xen_intr.c: sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/xenpci/xenpcivar.h: New implementation of Xen interrupt services. This is similar in many respects to the i386 PV implementation with the exception that events for bound to event channel ports (i.e. not IPI, virtual IRQ, or physical IRQ) are further optimized to avoid mask/unmask operations that aren't necessary for these edge triggered events. Stubs exist for supporting physical IRQ binding, but will need additional work before this implementation can be fully shared between PV and HVM. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: sys/i386/xen/mp_machdep.c sys/x86/xen/hvm.c: Add support for placing vcpu_info into an arbritary memory page instead of using HYPERVISOR_shared_info->vcpu_info. This allows the creation of domains with more than 32 vcpus. sys/i386/i386/machdep.c: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/exception.s: Add support for new event channle implementation.	2013-08-29 19:52:18 +00:00
Alan Cox	51321f7c31	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measureable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division	2013-08-29 15:49:05 +00:00
Neel Natu	6f6ebf3c3f	Add support for emulating the byte move instruction "mov r/m8, r8". This emulation is required when dumping MMIO space via the ddb "examine" command.	2013-08-27 16:49:20 +00:00
Peter Grehan	df5e6de3e3	Add in last remaining files to get AMD-SVM operational. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-23 00:37:26 +00:00
Peter Grehan	0bddaa8d25	HLT_IGNORED stat is used by both vmx and svm - move to common stats. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-22 22:29:27 +00:00
Peter Grehan	8b62d4719a	Handle VM_PROT_NONE in nested page table code. Submitted by: Anish Gupta (akgupt3@gmail.com)	2013-08-22 22:26:46 +00:00
Konstantin Belousov	e68c64f0ba	Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release(). Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:12:24 +00:00
Konstantin Belousov	b544368a22	Use the generation count of the pv list to work around LOR between pmap lock and pv list lock, and use the shared locking on pvh_global_lock in pmap_remove_write(), same as it was done for pmap_ts_referenced(). Noted and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:05:31 +00:00
David E. O'Brien	46be218dce	The PADLOCK_RNG and RDRAND_RNG kernel options are now devices. Thus "device padlock_rng" and "device rdrand_rng" should be used instead of "options PADLOCK_RNG" & "options RDRAND_RNG". Requested by: so@ (des) Submitted by: obrien, arthurmesh@gmail.com Obtained from: Juniper Networks	2013-08-21 22:43:29 +00:00
Jung-uk Kim	1533b9f714	Reimplement atomic operations on PDEs and PTEs in pmap.h. This change significantly reduces duplicate code and make it easier to read. Reviewed by: alc, bde	2013-08-21 22:40:29 +00:00
Jung-uk Kim	d36eb3f1c4	Remove empty lines before return statements for style consistency.	2013-08-21 22:05:58 +00:00
Jung-uk Kim	8a1ee2d346	Implement atomic_swap() and atomic_testandset(). Reviewed by: arch, bde, jilles, kib	2013-08-21 22:03:06 +00:00
Jung-uk Kim	da255e4c7f	- Remove the "a" constraint from main output operand for atomic_cmpset(). - Use "+" modifier for the "expect" because it is also an output (unused).	2013-08-21 21:30:06 +00:00
Jung-uk Kim	fe94be3da7	Use '+' modifier for a memory operand that is both an input and an output. It was actually done in r86301 but reverted in r150182 because GCC 3.x was not able to handle it for a memory operand. Apparently, this problem was fixed in GCC 4.1+ and several contrib sources already rely on this feature.	2013-08-21 21:14:16 +00:00
Jung-uk Kim	c1c84ce1bf	Remove bogus labels. No functional change.	2013-08-21 20:49:46 +00:00
Jung-uk Kim	ee93d1173a	Use consistent style. No functional change.	2013-08-21 20:43:50 +00:00
Neel Natu	b98940e5eb	Do not create superpage mappings in the iommu. This is a workaround to hide the fact that we do not have any code to demote a superpage mapping before we unmap a single page that is part of the superpage.	2013-08-20 06:46:40 +00:00
Neel Natu	f77e982952	Extract the location of the remapping hardware units from the ACPI DMAR table. Submitted by: Gopakumar T (gopakumar_thekkedath@yahoo.co.in)	2013-08-20 06:20:05 +00:00
Neel Natu	15e683837c	Fix breakage caused by r254466 in minidumpsys(). r254466 increased the KVA from 512GB to 2TB which requires 4 PDP pages as opposed to a single one before the change. This broke minidumpsys() since it assumed that the entire KVA could be addressed via a single PDP page. Fix this by obtaining the address of the PDP page from the PML4 entry associated with the KVA being dumped. Reported by: pho Submitted by: kib Pointy hat to: neel	2013-08-20 02:09:26 +00:00
Konstantin Belousov	d91f339823	When code from r254064 in pmap_ts_referenced() drops pv lock and blocks on a pmap lock, pmap_release() might proceed in parallel and destroy the pmap mutex, since unlocked pv lock allows to remove pv entry owned by the pmap. For now, gate the pmap_release() on write-locked pvh_global_lock. Since pmap_ts_release() does not unlock the global lock, pmap_release() would not destroy pmap mutex until the pmap_ts_referenced() finished. We cannot enter pmap_ts_referenced() and encounter a pv entry for the destroyed pmap if pmap_release() passed the global lock gate, since pmap_remove_pages() would finish earlier. Reported by: jeff, pho Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-18 21:36:22 +00:00
Pawel Jakub Dawidek	417ffc66fa	Add process descriptors support to the GENERIC kernel. It is already being used by the tools in base systems and with sandboxing more and more tools the usage should only increase. Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org> Sponsored by: Google Summer of Code 2013 MFC after: 1 month	2013-08-18 10:21:29 +00:00
Neel Natu	0ef2ab3ab8	Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB. Bump up the KVA size proportionally from 512GB to 2TB. The number of page table pages used by the direct map is now calculated at run time based on 'Maxmem'. This means the small memory systems will not see any additional tax in terms of page table pages for the direct map. However all amd64 systems, regardless of the memory size, will use 3 more pages to accomodate the bump in the KVA size. More details available here: http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html Tested with the following configurations: - Sandybridge server with 64GB of memory. - bhyve VM with 64MB of memory. - bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling right up against the 4TB maximum memory limit. Discussed on: hackers@, current@ Submitted by: Chris Torek (torek@torek.net)	2013-08-17 19:49:08 +00:00
Jilles Tjoelker	0f3a4d8051	libc: Access _logname_valid more efficiently. The variable _logname_valid is not exported via the version script; therefore, change C and i386/amd64 assembler code to remove indirection (which allowed interposition). This makes the code slightly smaller and faster. Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC, there is no place containing the address of each variable, so there is no possible definition for PIC_GOT.	2013-08-17 19:24:58 +00:00
Brooks Davis	cd234300d3	Use an ANSI C definition of initializecpucache() to match the declaration and the rest of the file.	2013-08-15 17:44:44 +00:00
Jung-uk Kim	38da30b419	Merge acpica_machdep.h for amd64 and i386 and move to x86. In fact, these two files were functionally identical.	2013-08-13 22:05:10 +00:00
Jung-uk Kim	3bd12ca8f1	Tidy up global locks for ACPICA. There is no functional change.	2013-08-13 21:34:03 +00:00
Konstantin Belousov	c325e866f4	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-10 17:36:42 +00:00
Attilio Rao	e946b94934	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Andriy Gapon	9ba0691bdd	follow up to r254051 - update powerpc/GENERIC64 as well, suggested by mdf - update comments so that they make sense after the change, suggested by jhb X-MFC after: never (change specific to head)	2013-08-09 08:11:09 +00:00
Neel Natu	f263e391a3	Use local variables with the appropriate types and eliminate a bunch of casts. This is a cosmetic change but it does help with a proposed change to increase the maximum size of physical memory supported on amd64 platforms. Submitted by: Chris Torek (torek@torek.net)	2013-08-08 03:17:39 +00:00
Konstantin Belousov	449c2e92c9	Split the pagequeues per NUMA domains, and split pageademon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-07 16:36:38 +00:00
Konstantin Belousov	872d995f76	Change the pmap_ts_referenced() method of amd64 pmap to use shared pvh_global_lock. This allows the method to be executed in parallel, avoiding undue contention on the pvh_global_lock for the multithreaded pagedaemon. The pmap_ts_referenced() function has to inspect the page mappings for several pmaps, which need to be locked while pv list lock is owned. This contradicts to the lock order, where pmap lock is before pv list lock. Introduce the generation count for the pv list of the page or superpage, which indicate any change in the pv list, and, as usual, perform restart of the iteration if generation changed while pv lock was dropped for blocking acquire of a pmap lock. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-07 16:33:15 +00:00
Andriy Gapon	818d282e7b	enable KDB_TRACE in GENERICs KDB_TRACE is not an alternative to DDB/etc, they are complementary. So I do not see any reason to not enable KDB_TRACE by default. X-MFC after: never (change specific to head)	2013-08-07 08:03:50 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Peter Grehan	40f65a4df5	IFC @ r254014	2013-08-07 00:09:49 +00:00
Jeff Roberson	2c0b86b48f	- Introduce a specific function, pmap_remove_kernel_pde, for removing huge pages in the kernel's address space. This works around several asserts from pmap_demote_pde_locked that did not apply and gave false warnings. Discovered by: pho Reviewed by: alc Sponsored by: EMC / Isilon Storage Division	2013-08-05 00:28:03 +00:00
Peter Grehan	80a902ef7d	Follow-up commit to fix CR0 issues. Maintain architectural state on CR vmexits by guaranteeing that EFER, CR0 and the VMCS entry controls are all in sync when transitioning to IA-32e mode. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)	2013-08-03 03:16:42 +00:00
Peter Grehan	672ed870a7	IFC @ r253862 - change the SI_SUB_RUN_SCHEDULER sysinits in hv_utilc and hv_netvsc_drv_freebsd.c to SI_SUB_KTHREAD_IDLE, since the former is no longer in FreeBSD. The use of these SYSINITs can probably be removed.	2013-08-01 22:09:57 +00:00
Peter Grehan	81ef6611ed	Moved clearing of vmm_initialized to avoid the case of unloading the module while VMs existed. This would result in EBUSY, but would prevent further operations on VMs resulting in the module being impossible to unload. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com) Reviewed by: grehan, neel	2013-08-01 05:59:28 +00:00
Peter Grehan	aaaa065629	Correctly maintain the CR0/CR4 shadow registers. This was exposed with AP spinup of Linux, and booting OpenBSD, where the CR0 register is unconditionally written to prior to the longjump to enter protected mode. The CR-vmexit handling was not updating CPU state which resulted in a vmentry failure with invalid guest state. A follow-on submit will fix the CPU state issue, but this fix prevents the CR-vmexit prior to entering protected mode by properly initializing and maintaining CR* state. Reviewed by: neel Reported by: Gopakumar.T @ netapp	2013-08-01 01:18:51 +00:00
David E. O'Brien	0e6a0799a9	Back out r253779 & r253786.	2013-07-31 17:21:18 +00:00
David E. O'Brien	99ff83da74	Decouple yarrow from random(4) device. * Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option. The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow. * random(4) device doesn't really depend on rijndael-. Yarrow, however, does. Add random_adaptors.[ch] which is basically a store of random_adaptor's. random_adaptor is basically an adapter that plugs in to random(4). random_adaptor can only be plugged in to random(4) very early in bootup. Unplugging random_adaptor from random(4) is not supported, and is probably a bad idea anyway, due to potential loss of entropy pools. We currently have 3 random_adaptors: + yarrow + rdrand (ivy.c) + nehemeiah * Remove platform dependent logic from probe.c, and move it into corresponding registration routines of each random_adaptor provider. probe.c doesn't do anything other than picking a specific random_adaptor from a list of registered ones. * If the kernel doesn't have any random_adaptor adapters present then the creation of /dev/random is postponed until next random_adaptor is kldload'ed. * Fix randomdev_soft.c to refer to its own random_adaptor, instead of a system wide one. Submitted by: arthurmesh@gmail.com, obrien Obtained from: Juniper Networks Reviewed by: obrien	2013-07-29 20:26:27 +00:00
Andriy Gapon	a29cc9a34b	Revert r253748,253749 This WIP should not have been committed yet. Pointyhat to: avg	2013-07-28 18:44:17 +00:00
Andriy Gapon	366d8bfb7b	put contents of cpu.h under _KERNEL no userland-serviceable parts inside MFC after: 20 days	2013-07-28 18:32:27 +00:00
Andriy Gapon	a69e8d609e	x86: detect mwait capabilities and extensions, when present Reviewed by: kib (earlier amd64-only version) MFC after: 2 weeks	2013-07-28 17:54:42 +00:00
Jeff Roberson	2f84c08eee	- Use kmem_malloc rather than kmem_alloc() for GDT/LDT/tss allocations etc. This eliminates some unusual uses of that API in favor of more typical uses of kmem_malloc(). Discussed with: kib/alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-07-26 19:06:14 +00:00
Neel Natu	84e169c6c3	Add support for emulation of the "or r/m, imm8" instruction. Submitted by: Zhixiang Yu (zxyu.core@gmail.com) Obtained from: GSoC 2013 (AHCI device emulation for bhyve)	2013-07-23 23:43:00 +00:00
Neel Natu	113326a772	Fix a bug introduced in r252646 that causes a page with the PG_PTE_PAT bit set to be interpreted as a superpage. This is because PG_PTE_PAT is at the same bit position in PTE as PG_PS is in a PDE. This caused a number of regressions on amd64 systems: panic when starting X applications, freeze during shutdown etc. Pointy hat to: me Tested by: gperez@entel.upc.edu, joel, dumbbell Reviewed by: kib	2013-07-23 22:17:00 +00:00
Peter Grehan	15b996d742	First cut at adding the hyperv drivers to GENERIC. The files inventory should probably have the modules split out into net/storage/common etc as the modules build is, but this will do for now.	2013-07-19 05:32:58 +00:00
Konstantin Belousov	0f6bcda4cd	MFi386: add ddb "show sysregs" command. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-15 06:30:57 +00:00
Konstantin Belousov	0cdd261571	Clear m->object for the page taken from the delayed free list for reuse as the pv chink page in reclaim_pv_chunk(). Having non-NULL m->object is wrong for page not owned by an object and confuses both vm_page_free_toq() and vm_page_remove() when the page is freed later. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2013-07-10 09:24:03 +00:00
Xin LI	1fdeb1651c	Import HighPoint DC Series Data Center HBA (DC7280 and R750) driver. This driver works for FreeBSD/i386 and FreeBSD/amd64 platforms. Many thanks to HighPoint for providing this driver. MFC after: 1 day	2013-07-06 07:49:41 +00:00
Neel Natu	be28275d00	If a superpage mapping is being removed then we need to ignore the PG_PDE_PAT bit when looking up the vm_page associated with the superpage's physical address. If the caching attribute for the mapping is write combining or write protected then the PG_PDE_PAT bit will be set and thus cause an 'off-by-one' error when looking up the vm_page. Fix this by using the PG_PS_FRAME mask to compute the physical address for a superpage mapping instead of PG_FRAME. This is a theoretical issue at this point since non-writeback attributes are currently used only for fictitious mappings and fictitious mappings are not subject to promotion. Discussed with: alc, kib MFC after: 2 weeks	2013-07-03 23:21:25 +00:00
Neel Natu	de16308c48	Verify that all bytes in the instruction buffer are consumed during decoding. Suggested by: grehan	2013-07-03 23:05:17 +00:00
Peter Grehan	e60f5d779e	Ignore guest PAT settings by default in EPT mappings. From experimentation, other hypervisors also do this. Diagnosed by: tycho nightingale at pluribusnetworks com Reviewed by: neel	2013-07-01 20:05:43 +00:00
Konstantin Belousov	70a7dd5d5b	Fix issues with zeroing and fetching the counters, on x86 and ppc64. Issues were noted by Bruce Evans and are present on all architectures. On i386, a counter fetch should use atomic read of 64bit value, otherwise carry from the increment on other CPU could be lost for the given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not available on the machine, it cannot be SMP and it is enough to disable preemption around read to avoid the split read. On x86 the counter increment is not atomic on purpose, which makes it possible for the store of the incremented result to override just zeroed per-cpu slot. The effect would be a counter going off by arbitrary value after zeroing. Perform the counter zeroing on the same processor which does the increments, making the operations mutually exclusive. On i386, same as for the fetching, if the cmpxchg8b is not available, machine is not SMP and we disable preemption for zeroing. PowerPC64 is treated the same as amd64. For other architectures, the changes made to allow the compilation to succeed, without fixing the issues with zeroing or fetching. It should be possible to handle them by using the 64bit loads and stores atomic WRT preemption (assuming the architectures also converted from using critical sections to proper asm). If architecture does not provide the facility, using global (spin) mutex would be non-optimal but working solution. Noted by: bde Sponsored by: The FreeBSD Foundation	2013-07-01 02:48:27 +00:00
Peter Grehan	560d5eda2c	Make sure all CPUID values are handled, instead of exiting the bhyve process when an unhandled one is encountered. Hide some additional capabilities from the guest (e.g. debug store). This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on AP spinup (where CPUID is used when sync'ing the TSCs) and the issue with the Java build where CPUIDs are issued from a guest userspace. Submitted by: tycho nightingale at pluribusnetworks com Reviewed by: neel Reported by: many	2013-06-28 06:05:33 +00:00
Jung-uk Kim	b1ddd13145	Move definitions required by userland applications out of acpica_machdep.h.	2013-06-27 00:22:40 +00:00
Konstantin Belousov	9dbb63fe03	Allow immediate operand. Sponsored by: The FreeBSD Foundation	2013-06-20 14:30:04 +00:00
Konstantin Belousov	c788f92509	Some clarifications and updates for the comments, mostly retrieved from Bruce Evans. Trim the trailing spaces. MFC after: 1 week	2013-06-19 05:05:16 +00:00
Sergey Kandaurov	1e2751ddeb	Fix a gcc warning uncovered after r251745. Reported by: Sergey V. Dyatko Reviewed by: neel	2013-06-18 23:31:09 +00:00
Justin T. Gibbs	a8f6ac0573	Upgrade Xen interface headers to Xen 4.2.1. Move FreeBSD from interface version 0x00030204 to 0x00030208. Updates are required to our grant table implementation before we can bump this further. sys/xen/hvm.h: Replace the implementation of hvm_get_parameter(), formerly located in sys/xen/interface/hvm/params.h. Linux has a similar file which primarily stores this function. sys/xen/xenstore/xenstore.c: Include new xen/hvm.h header file to get hvm_get_parameter(). sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: Correctly protect function definition and variables from being included into assembly files in xen-os.h Xen memory barriers are now prefixed with "xen_" to avoid conflicts with OS native primatives. Define Xen memory barriers in terms of the native FreeBSD primatives. Sponsored by: Spectra Logic Corporation Reviewed by: Roger Pau Monné Tested by: Roger Pau Monné Obtained from: Roger Pau Monné (bug fixes)	2013-06-14 23:43:44 +00:00
Sergey Kandaurov	82f2974a69	Replace cpusetffs_obj with CPU_FFS, missed in r251703. Reported by: bdrewery, O. Hartmann	2013-06-14 10:26:38 +00:00
Neel Natu	8f1664b724	Remove unused macros PTESHIFT, PDESHIFT, PDPESHIFT and PML4ESHIFT. Reviewed by: alc	2013-06-14 00:03:43 +00:00

... 5 6 7 8 9 ...

7049 Commits