freebsd-dev

Author	SHA1	Message	Date
Neel Natu	bf73979dd9	Add a counter to differentiate between VM-exits due to nested paging faults and instruction emulation faults.	2014-02-08 06:22:09 +00:00
Neel Natu	62fbd7c27a	Fix a bug in the handling of VM-exits caused by non-maskable interrupts (NMI). If a VM-exit is caused by an NMI then "blocking by NMI" is in effect on the CPU when the VM-exit is completed. No more NMIs will be recognized until the execution of an "iret". Prior to this change the NMI handler was dispatched via a software interrupt with interrupts enabled. This meant that an interrupt could be recognized by the processor before the NMI handler completed its execution. The "iret" issued by the interrupt handler would then cause the "blocking by NMI" to be cleared prematurely. This is now fixed by handling the NMI with interrupts disabled in addition to "blocking by NMI" already established by the VM-exit.	2014-02-08 05:04:34 +00:00
John Baldwin	00f3efe1bd	Add support for FreeBSD/i386 guests under bhyve. - Similar to the hack for bootinfo32.c in userboot, define _MACHINE_ELF_WANT_32BIT in the load_elf32 file handlers in userboot. This allows userboot to load 32-bit kernels and modules. - Copy the SMAP generation code out of bootinfo64.c and into its own file so it can be shared with bootinfo32.c to pass an SMAP to the i386 kernel. - Use uint32_t instead of u_long when aligning module metadata in bootinfo32.c in userboot, as otherwise the metadata used 64-bit alignment which corrupted the layout. - Populate the basemem and extmem members of the bootinfo struct passed to 32-bit kernels. - Fix the 32-bit stack in userboot to start at the top of the stack instead of the bottom so that there is room to grow before the kernel switches to its own stack. - Push a fake return address onto the 32-bit stack in addition to the arguments normally passed to exec() in the loader. This return address is needed to convince recover_bootinfo() in the 32-bit locore code that it is being invoked from a "new" boot block. - Add a routine to libvmmapi to setup a 32-bit flat mode register state including a GDT and TSS that is able to start the i386 kernel and update bhyveload to use it when booting an i386 kernel. - Use the guest register state to determine the CPU's current instruction mode (32-bit vs 64-bit) and paging mode (flat, 32-bit, PAE, or long mode) in the instruction emulation code. Update the gla2gpa() routine used when fetching instructions to handle flat mode, 32-bit paging, and PAE paging in addition to long mode paging. Don't look for a REX prefix when the CPU is in 32-bit mode, and use the detected mode to enable the existing 32-bit mode code when decoding the mod r/m byte. Reviewed by: grehan, neel MFC after: 1 month	2014-02-05 04:39:03 +00:00
Tycho Nightingale	54e03e07b3	Add support for emulating the byte move and zero extend instructions: "mov r/m8, r32" and "mov r/m8, r64". Approved by: neel (co-mentor)	2014-02-05 02:01:08 +00:00
Neel Natu	953c2c47eb	Avoid doing unnecessary nested TLB invalidations. Prior to this change the cached value of 'pm_eptgen' was tracked per-vcpu and per-hostcpu. In the degenerate case where 'N' vcpus were sharing a single hostcpu this could result in 'N - 1' unnecessary TLB invalidations. Since an 'invept' invalidates mappings for all VPIDs the first 'invept' is sufficient. Fix this by moving the 'eptgen[MAXCPU]' array from 'vmxctx' to 'struct vmx'. If it is known that an 'invept' is going to be done before entering the guest then it is safe to skip the 'invvpid'. The stat VPU_INVVPID_SAVED counts the number of 'invvpid' invalidations that were avoided because they were subsumed by an 'invept'. Discussed with: grehan	2014-02-04 02:45:08 +00:00
John Baldwin	3cbf3585cb	Enhance the support for PCI legacy INTx interrupts and enable them in the virtio backends. - Add a new ioctl to export the count of pins on the I/O APIC from vmm to the hypervisor. - Use pins on the I/O APIC >= 16 for PCI interrupts leaving 0-15 for ISA interrupts. - Populate the MP Table with I/O interrupt entries for any PCI INTx interrupts. - Create a _PRT table under the PCI root bridge in ACPI to route any PCI INTx interrupts appropriately. - Track which INTx interrupts are in use per-slot so that functions that share a slot attempt to distribute their INTx interrupts across the four available pins. - Implicitly mask INTx interrupts if either MSI or MSI-X is enabled and when the INTx DIS bit is set in a function's PCI command register. Either assert or deassert the associated I/O APIC pin when the state of one of those conditions changes. - Add INTx support to the virtio backends. - Always advertise the MSI capability in the virtio backends. Submitted by: neel (7) Reviewed by: neel MFC after: 2 weeks	2014-01-29 14:56:48 +00:00
John Baldwin	3d2ec11759	Add support for 'clac' and 'stac' to DDB's disassembler on amd64.	2014-01-27 18:53:18 +00:00
Neel Natu	30b94db8c0	Support level triggered interrupts with VT-x virtual interrupt delivery. The VMCS field EOI_bitmap[] is an array of 256 bits - one for each vector. If a bit is set to '1' in the EOI_bitmap[] then the processor will trigger an EOI-induced VM-exit when it is doing EOI virtualization. The EOI-induced VM-exit results in the EOI being forwarded to the vioapic so that level triggered interrupts can be properly handled. Tested by: Anish Gupta (akgupt3@gmail.com)	2014-01-25 20:58:05 +00:00
Peter Grehan	062eef4911	Change RWX to XWR in comments to match intent and bit patterns in discussion of valid EPT pte protections. Discussed with: neel MFC after: 3 days	2014-01-25 06:58:41 +00:00
John Baldwin	e07ef9b0f6	Move <machine/apicvar.h> to <x86/apicvar.h>.	2014-01-23 20:10:22 +00:00
Neel Natu	36736912b6	Set "Interrupt Window Exiting" in the case where there is a vector to be injected into the vcpu but the VM-entry interruption information field already has the valid bit set. Pointed out by: David Reed (david.reed@tidalscale.com)	2014-01-23 06:06:50 +00:00
Neel Natu	c308b23b7a	Handle a VM-exit due to a NMI properly by vectoring to the host's NMI handler via a software interrupt. This is safe to do because the logical processor is already cognizant of the NMI and further NMIs are blocked until the host's NMI handler executes "iret".	2014-01-22 04:03:11 +00:00
Neel Natu	51f45d0146	There is no need to initialize the IOMMU if no passthru devices have been configured for bhyve to use. Suggested by: grehan@	2014-01-21 03:01:34 +00:00
Ed Maste	80f9f1580e	Add VT kernel configuration to ease testing of vt(9), aka Newcons	2014-01-19 18:46:38 +00:00
Neel Natu	48b2d828a2	Some processors don't allow NMI injection if the STI_BLOCKING bit is set in the Guest Interruptibility-state field. However, there isn't any way to figure out which processors have this requirement. So, inject a pending NMI only if NMI_BLOCKING, MOVSS_BLOCKING, STI_BLOCKING are all clear. If any of these bits are set then enable "NMI window exiting" and inject the NMI in the VM-exit handler.	2014-01-18 21:47:12 +00:00
Bryan Venteicher	10c4018057	Add very simple virtio_random(4) driver to harvest entropy from host Reviewed by: markm (random bits only)	2014-01-18 06:14:38 +00:00
Neel Natu	e5a1d95089	If the guest exits due to a fault while it is executing IRET then restore the state of "Virtual NMI blocking" in the guest's interruptibility-state field before resuming the guest.	2014-01-18 02:20:10 +00:00
Neel Natu	160471d264	If a VM-exit happens during an NMI injection then clear the "NMI Blocking" bit in the Guest Interruptibility-state VMCS field. If we fail to do this then a subsequent VM-entry will fail because it is an error to inject an NMI into the guest while "NMI Blocking" is turned on. This is described in "Checks on Guest Non-Register State" in the Intel SDM. Submitted by: David Reed (david.reed@tidalscale.com)	2014-01-17 04:21:39 +00:00
Neel Natu	5b8a8cd1fe	Add an API to rendezvous all active vcpus in a virtual machine. The rendezvous can be initiated in the context of a vcpu thread or from the bhyve(8) control process. The first use of this functionality is to update the vlapic trigger-mode register when the IOAPIC pin configuration is changed. Prior to this change we would update the TMR in the virtual-APIC page at the time of interrupt delivery. But this doesn't work with Posted Interrupts because there is no way to program the EOI_exit_bitmap[] in the VMCS of the target at the time of interrupt delivery. Discussed with: grehan@	2014-01-14 01:55:58 +00:00
Gavin Atkinson	56c63f28ed	Remove spaces from boot messages when we print the CPU ID/Family/Stepping to match the rest of the CPU identification lines, and once again fit into 80 columns in the usual case.	2014-01-11 22:41:10 +00:00
Neel Natu	176666c2c9	Enable "Posted Interrupt Processing" if supported by the CPU. This lets us inject interrupts into the guest without causing a VM-exit. This feature can be disabled by setting the tunable "hw.vmm.vmx.use_apic_pir" to "0". The following sysctls provide information about this feature: - hw.vmm.vmx.posted_interrupts (0 if disabled, 1 if enabled) - hw.vmm.vmx.posted_interrupt_vector (vector number used for vcpu notification) Tested on an Intel Xeon E5-2620v2 courtesy of Allan Jude at ScaleEngine.	2014-01-11 04:22:00 +00:00
Neel Natu	f7d4742540	Enable the "Acknowledge Interrupt on VM exit" VM-exit control. This control is needed to enable "Posted Interrupts" and is present in all the Intel VT-x implementations supported by bhyve so enable it as the default. With this VM-exit control enabled the processor will acknowledge the APIC and store the vector number in the "VM-Exit Interruption Information" field. We now call the interrupt handler "by hand" through the IDT entry associated with the vector.	2014-01-11 03:14:05 +00:00
Neel Natu	add611fd4c	Don't expose 'vmm_ipinum' as a global.	2014-01-09 03:25:54 +00:00
Neel Natu	88c4b8d145	Use the 'Virtual Interrupt Delivery' feature of Intel VT-x if supported by hardware. It is possible to turn this feature off and fall back to software emulation of the APIC by setting the tunable hw.vmm.vmx.use_apic_vid to 0. We now start handling two new types of VM-exits: APIC-access: This is a fault-like VM-exit and is triggered when the APIC register access is not accelerated (e.g. apic timer CCR). In response to this we do emulate the instruction that triggered the APIC-access exit. APIC-write: This is a trap-like VM-exit which does not require any instruction emulation but it does require the hypervisor to emulate the access to the specified register (e.g. icrlo register). Introduce 'vlapic_ops' which are function pointers to vector the various vlapic operations into processor-dependent code. The 'Virtual Interrupt Delivery' feature installs 'ops' for setting the IRR bits in the virtual APIC page and to return whether any interrupts are pending for this vcpu. Tested on an "Intel Xeon E5-2620 v2" courtesy of Allan Jude at ScaleEngine.	2014-01-07 21:04:49 +00:00
Neel Natu	79c596309c	Fix a bug introduced in r260167 related to VM-exit tracing. Keep a copy of the 'rip' and the 'exit_reason' and use that when calling vmx_exit_trace(). This is because both the 'rip' and 'exit_reason' can be changed by 'vmx_exit_process()' and can lead to very misleading traces.	2014-01-07 18:53:14 +00:00
Neel Natu	4d1e82a88e	Allow vlapic_set_intr_ready() to return a value that indicates whether or not the vcpu should be kicked to process a pending interrupt. This will be useful in the implementation of the Posted Interrupt APICv feature. Change the return value of 'vlapic_pending_intr()' to indicate whether or not an interrupt is available to be delivered to the vcpu depending on the value of the PPR. Add KTR tracepoints to debug guest IPI delivery.	2014-01-07 00:38:22 +00:00
Neel Natu	c847a5062c	Split the VMCS setup between 'vmcs_init()' that does initialization and 'vmx_vminit()' that does customization. This makes it easier to turn on optional features (e.g. APICv) without having to keep adding new parameters to 'vmcs_set_defaults()'. Reviewed by: grehan@	2014-01-06 23:16:39 +00:00
Jens Schweikhardt	aa27ed4569	Correct a grammo in a comment; remove white space at EOL.	2014-01-06 17:23:22 +00:00
Neel Natu	5f8e2dfcb5	Use the same label name for ENTRY() and END() macros for 'vmx_enter_guest'. Pointed out by: rmh@	2014-01-03 19:29:33 +00:00
Neel Natu	0a9ae358fd	Fix a bug in the HPET emulation where a timer interrupt could be lost when the guest disables the HPET. The HPET timer interrupt is triggered from the callout handler associated with the timer. It is possible for the callout handler to be delayed before it gets a chance to execute. If the guest disables the HPET during this window then the handler never gets a chance to execute and the timer interrupt is lost. This is now fixed by injecting a timer interrupt into the guest if the callout time is detected to be in the past when the HPET is disabled.	2014-01-03 19:25:52 +00:00
Konstantin Belousov	27fd75d2c8	Update the description for pmap_remove_pages() to match the modern times [1]. Assert that the pmap passed to pmap_remove_pages() is only active on current CPU. Submitted by: alc [1] Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:50:52 +00:00
Konstantin Belousov	c0be75a58a	Assert that accounting for the pmap resident pages does not underflow. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-01-02 18:49:05 +00:00
Neel Natu	0492757c70	Restructure the VMX code to enter and exit the guest. In large part this change hides the setjmp/longjmp semantics of VM enter/exit. vmx_enter_guest() is used to enter guest context and vmx_exit_guest() is used to transition back into host context. Fix a longstanding race where a vcpu interrupt notification might be ignored if it happens after vmx_inject_interrupts() but before host interrupts are disabled in vmx_resume/vmx_launch. We now call vmx_inject_interrupts() with host interrupts disabled to prevent this. Suggested by: grehan@	2014-01-01 21:17:08 +00:00
Dimitry Andric	24da2fe3d0	In sys/amd64/amd64/pmap.c, remove static function pmap_is_current(), which has been unused since r189415. Reviewed by: alc MFC after: 3 days	2013-12-30 20:37:47 +00:00
Neel Natu	7c05bc3124	Modify handling of writes to the vlapic LVT registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. This also implies that we need to keep a snapshot of the last value written to a LVT register. We can no longer rely on the LVT registers in the APIC page to be "clean" because the guest can write anything to it before the hypervisor has had a chance to sanitize it.	2013-12-28 00:20:55 +00:00
Neel Natu	fafe884473	Modify handling of writes to the vlapic ICR_TIMER, DCR_TIMER, ICRLO and ESR registers. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. We can no longer rely on the value of 'icr_timer' on the APIC page in the callout handler. With APIC register virtualization the value of 'icr_timer' will be updated by the processor in guest-context before an APIC-write VM-exit. Clear the 'delivery status' bit in the ICRLO register in the write handler. With APIC register virtualization the write happens in guest-context and we cannot prevent a (buggy) guest from setting this bit.	2013-12-27 20:18:19 +00:00
Dimitry Andric	6f0c167fe2	In sys/amd64/vmm/intel/vmx.c, silence a (incorrect) gcc warning about regval possibly being used uninitialized. Reviewed by: neel	2013-12-27 12:15:53 +00:00
Neel Natu	2c52dcd9a8	Modify handling of write to the vlapic SVR register. The handler is now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, mask all the LVT entries when the vlapic is software-disabled.	2013-12-27 07:01:42 +00:00
Neel Natu	3f0ddc7c5c	Modify handling of writes to the vlapic ID, LDR and DFR registers. The handlers are now called after the register value is updated in the virtual APIC page. This will make it easier to handle APIC-write VM-exits with APIC register virtualization turned on. Additionally, we need to ensure that the value of these registers is always correctly reflected in the virtual APIC page, because there is no VM exit when the guest reads these registers with APIC register virtualization.	2013-12-26 19:58:30 +00:00
Neel Natu	de5ea6b65e	vlapic code restructuring to make it easy to support hardware-assist for APIC emulation. The vlapic initialization and cleanup is done via processor specific vmm_ops. This will allow the VT-x/SVM modules to layer any hardware-assist for APIC emulation or virtual interrupt delivery on top of the vlapic device model. Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic interrupts versus other events (e.g. NMI). This provides an opportunity to use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM) to deliver an interrupt to a guest without causing a VM-exit. Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the vlapic_xxx() counterparts directly. Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'. The 'Apic Page' is intended to be referenced from the Intel VMCS as the 'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.	2013-12-25 06:46:31 +00:00
John Baldwin	63e62d390d	Add a resume hook for bhyve that runs a function on all CPUs during resume. For Intel CPUs, invoke vmxon for CPUs that were in VMX mode at the time of suspend. Reviewed by: neel	2013-12-23 19:48:22 +00:00
John Baldwin	330baf58c6	Extend the support for local interrupts on the local APIC: - Add a generic routine to trigger an LVT interrupt that supports both fixed and NMI delivery modes. - Add an ioctl and bhyvectl command to trigger local interrupts inside a guest. In particular, a global NMI similar to that raised by SERR# or PERR# can be simulated by asserting LINT1 on all vCPUs. - Extend the LVT table in the vCPU local APIC to support CMCI. - Flesh out the local APIC error reporting a bit to cache errors and report them via ESR when ESR is written to. Add support for asserting the error LVT when an error occurs. Raise illegal vector errors when attempting to signal an invalid vector for an interrupt or when sending an IPI. - Ignore writes to reserved bits in LVT entries. - Export table entries in the MADT and MP Table advertising the stock x86 config of LINT0 set to ExtInt and LINT1 wired to NMI. Reviewed by: neel (earlier version)	2013-12-23 19:29:07 +00:00
Neel Natu	f80330a820	Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE state before the requested state transition. This guarantees that there is exactly one ioctl() operating on a vcpu at any point in time and prevents unintended state transitions. More details available here: http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html Reviewed by: grehan Reported by: Markiyan Kushnir (markiyan.kushnir at gmail.com) MFC after: 3 days	2013-12-22 20:29:59 +00:00
Neel Natu	a783578566	Consolidate the virtual apic initialization in a single function: vlapic_reset()	2013-12-22 00:08:00 +00:00
Neel Natu	5515bb73e6	Re-arrange bits in the amd64/pmap 'pm_flags' field. The least significant 8 bits of 'pm_flags' are now used for the IPI vector to use for nested page table TLB shootdown. Previously we used IPI_AST to interrupt the host cpu which is functionally correct but could lead to misleading interrupt counts for AST handler. The AST handler was also doing a lot more than what is required for the nested page table TLB shootdown (EOI and IRET).	2013-12-20 05:50:22 +00:00
Neel Natu	3de8386283	Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite() respectively. The vmcs_xxx() functions provide inline error checking of all accesses to the VMCS.	2013-12-18 06:24:21 +00:00
Neel Natu	4f8be175d5	Add an API to deliver message signalled interrupts to vcpus. This allows callers to treat the MSI 'addr' and 'data' fields as opaque and also lets bhyve implement multiple destination modes: physical, flat and clustered. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com) Reviewed by: grehan@	2013-12-16 19:59:31 +00:00
Neel Natu	a83011d2e7	Fix typo when initializing the vlapic version register ('<<' instead of '<').	2013-12-11 06:28:44 +00:00
Neel Natu	becd984900	Fix x2apic support in bhyve. When the guest is bringing up the APs in the x2APIC mode a write to the ICR register will now trigger a return to userspace with an exitcode of VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC. Change the vlapic timer lock to be a spinlock because the vlapic can be accessed from within a critical section (vm run loop) when guest is using x2apic mode. Reviewed by: grehan@	2013-12-10 22:56:51 +00:00
John Baldwin	316032ad20	Move constants for indices in the local APIC's local vector table from apicvar.h to apicreg.h.	2013-12-09 21:08:52 +00:00
Neel Natu	fb03ca4e42	Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit. This decouples the guest's 'hz' from the host's 'hz' setting. For example, it is now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'. Discussed with: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 23:11:12 +00:00
Neel Natu	1c05219285	If a vcpu disables its local apic and then executes a 'HLT' then spin down the vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore pending interrupts in the IRR if interrupts have been disabled by the guest. The interrupt cannot be injected into the guest in any case so resuming it is futile. With this change "halt" from a Linux guest works correctly. Reviewed by: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 22:18:36 +00:00
John Baldwin	5c79f1f9df	Fix a typo.	2013-12-05 21:58:02 +00:00
Neel Natu	7a3c80aa55	The 'protection' field in the VM exit collateral for the PAGING exit is not used - get rid of it.	2013-12-03 01:21:21 +00:00
Neel Natu	2282187475	Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function has outgrown its original name. Originally this function simply sent an IPI to the host cpu that a vcpu was executing on but now it does a lot more than just that. Reviewed by: grehan@	2013-12-03 00:43:31 +00:00
Eitan Adler	7a22215c53	Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this shifts into the sign bit. Instead use (1U << 31) which gets the expected result. This fix is not ideal as it assumes a 32 bit int, but does fix the issue for most cases. A similar change was made in OpenBSD. Discussed with: -arch, rdivacky Reviewed by: cperciva	2013-11-30 22:17:27 +00:00
Pawel Jakub Dawidek	f2b525e6b9	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days	2013-11-30 15:08:35 +00:00
Neel Natu	b5b28fc9dc	Add support for level triggered interrupt pins on the vioapic. Prior to this commit level triggered interrupts would work as long as the pin was not shared among multiple interrupt sources. The vlapic now keeps track of level triggered interrupts in the trigger mode register and will forward the EOI for a level triggered interrupt to the vioapic. The vioapic in turn uses the EOI to sample the level on the pin and re-inject the vector if the pin is still asserted. The vhpet is the first consumer of level triggered interrupts and advertises that it can generate interrupts on pins 20 through 23 of the vioapic. Discussed with: grehan@	2013-11-27 22:18:08 +00:00
Konstantin Belousov	291bfc8d24	Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32 compilation results in inclusion of the header, a conflict arises due to savefpu being union for i386, but used as struct in the pcb definition. The 32bit code should not need amd64 variant of the struct pcb anyway. For struct region_descriptor, use __uint64_t instead of unsigned long, as the base type for bit-fields. Unsigned long cannot have width 64 for -m32. The changes allowed to use sys/sysctl.h for cc -m32. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-26 19:38:42 +00:00
Neel Natu	08e3ff329a	Add HPET device emulation to bhyve. bhyve supports a single timer block with 8 timers. The timers are all 32-bit and capable of being operated in periodic mode. All timers support interrupt delivery using MSI. Timers 0 and 1 also support legacy interrupt routing. At the moment the timers are not connected to any ioapic pins but that will be addressed in a subsequent commit. This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).	2013-11-25 19:04:51 +00:00
Attilio Rao	54366c0bd7	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
Neel Natu	ac7304a758	Add an ioctl to assert and deassert an ioapic pin atomically. This will be used to inject edge triggered legacy interrupts into the guest. Start using the new API in device models that use edge triggered interrupts: viz. the 8254 timer and the LPC/uart device emulation. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-23 03:56:03 +00:00
Neel Natu	af480303a9	Eliminate redundant information about the host cpu in bhyve's KTR trace points. This is always tracked by ktr(4) and can be displayed using the "-c" option of ktrdump(8). Discussed with: grehan	2013-11-22 18:57:22 +00:00
Ed Maste	7b7d8599fe	Don't abort SMAP processing after an entry of length 0. Length 0 is not special and should just be skipped. This is the same behaviour as i386. Discussed with: jhb@ Sponsored by: The FreeBSD Foundation	2013-11-22 14:56:10 +00:00
Andreas Tobler	d2ef321a59	Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the CONCAT macros in SYS.h. Reviewed by: bde, kib	2013-11-21 21:25:58 +00:00
Ed Maste	ff89f4778a	Refactor amd64 startup SMAP parsing. Extracted from the projects/uefi branch, this change is a reasonable cleanup and will reduce the diffs to review when bringing in the UEFI work. Reviewed by: kib@ Sponsored by: The FreeBSD Foundation	2013-11-21 19:20:08 +00:00
Ed Maste	aff122d6aa	Disable amd64 boot time memory test by default. The page presence memory test takes a long time on large memory systems and has little value on contemporary amd64 hardware. Sponsored by: The FreeBSD Foundation	2013-11-21 18:37:11 +00:00
Justin T. Gibbs	4fd76feafd	Fix accounting for hw.realmem on the i386 and amd64 platforms. sys/i386/i386/machdep.c: sys/amd64/amd64/machdep.c: The value reported by FreeBSD as "real memory" when booting doesn't match what is later reported by sysctl as hw.realmem. This is due to the fact that the value printed during the boot process is fetched from smbios data (when possible), and accounts for holes in physical memory. On the other hand, the value of hw.realmem is unconditionally set to be one larger than the highest page of the physical address space. Fix this by setting hw.realmem to the same value printed during boot; this makes hw.realmem honour its name and account properly for physical memory present in the system. Submitted by: Roger Pau Monné Reviewed by: gibbs	2013-11-15 16:05:55 +00:00
Ed Maste	3d271aaab0	x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...) Debuggers may need to change PSL_RF. Note that tf_eflags is already stored in the signal context during signal handling and PSL_RF previously could be modified via sigreturn, so this change should not provide any new ability to userspace. For background see the thread at: http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html Reviewed by: jhb, kib Sponsored by: DARPA, AFRL	2013-11-14 15:37:20 +00:00
Neel Natu	565bbb8698	Move the ioapic device model from userspace into vmm.ko. This is needed for upcoming in-kernel device emulations like the HPET. The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to manipulate the ioapic pin state. Discussed with: grehan@ Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-12 22:51:03 +00:00
Konstantin Belousov	6f8a44a5dd	Add bits for the AMD features from CPUID function 0x80000001 ECX, described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf. Partially based on the patch submitted by: Dmitry Luhtionov <dmitryluhtionov@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-08 16:32:30 +00:00
Alan Cox	c70af4875e	As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other words, every architecture is now auto-sizing the kmem arena. This revision changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes mandatory and the definition of VM_KMEM_SIZE becomes optional. Replace or eliminate all existing definitions of VM_KMEM_SIZE. With auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for clarity. Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to that of setting the tunable vm.kmem_size. Whereas the macros VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set the size of the kmem arena at compile-time without that value being overridden by auto-sizing. Update the nearby comments to reflect the kmem submap being replaced by the kmem arena. Stop duplicating the auto-sizing formula in every machine- dependent vmparam.h and place it in kmeminit() where auto-sizing takes place. Reviewed by: kib (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-11-08 16:25:00 +00:00
Neel Natu	03cd05011f	Remove the 'vdev' abstraction that was meant to sit on top of device models in the kernel. This abstraction was redundant because the only device emulated inside vmm.ko is the local apic and it is always at a fixed guest physical address. Discussed with: grehan	2013-11-04 23:25:07 +00:00
Neel Natu	513c8d338d	Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these tracepoints are vcpu-specific. Add support for tracepoints that are global to the virtual machine - these tracepoints are called VM_CTRx().	2013-10-31 05:20:11 +00:00
Mark Johnston	57170f49f2	Remove references to an unused fasttrap probe hook, and remove the corresponding x86 trap type. Userland DTrace probes are currently handled by the other fasttrap hooks (dtrace_pid_probe_ptr and dtrace_return_probe_ptr). Discussed with: rpaulo	2013-10-31 02:35:00 +00:00
Neel Natu	e2f5d9a129	Remove unnecessary includes of <machine/pmap.h> Requested by: alc@	2013-10-29 02:25:18 +00:00
Gleb Smirnoff	69eb2b176c	Include XEN and HyperV into amd64 LINT.	2013-10-28 21:11:28 +00:00
Konstantin Belousov	86be9f0dd5	Import the driver for VT-d DMAR hardware, as specified in the revision 1.3 of Intel® Virtualization Technology for Directed I/O Architecture Specification. The Extended Context and PASIDs from the rev. 2.2 are not supported, but I am not aware of any released hardware which implements them. Code does not use queued invalidation, see comments for the reason, and does not provide interrupt remapping services. Code implements the management of the guest address space per domain and allows to establish and tear down arbitrary mappings, but not partial unmapping. The superpages are created as needed, but not promoted. Faults are recorded, fault records could be obtained programmatically, and printed on the console. Implement the busdma(9) using DMARs. This busdma backend avoids bouncing and provides security against misbehaving hardware and driver bad programming, preventing leaks and corruption of the memory by wild DMA accesses. By default, the implementation is compiled into amd64 GENERIC kernel but disabled; to enable, set hw.dmar.enable=1 loader tunable. Code is written to work on i386, but testing there was low priority, and driver is not enabled in GENERIC. Even with the DMAR turned on, individual devices could be directed to use the bounce busdma with the hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable. If DMARs are capable of the pass-through translations, it is used, otherwise, an identity-mapping page table is constructed. The driver was tested on Xeon 5400/5500 chipset legacy machine, Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4), ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices. It also works with em(4) and igb(4), but there some fixes are needed for drivers, which are not committed yet. Intel GPUs do not work with DMAR (yet).
Many thanks to John Baldwin, who explained the newbus integration to me; Peter Holm, who did all testing and helped me to discover and understand several incredible bugs; and to Jim Harris for the access to the EDS and BWG and for listening when I have to explain my findings to somebody. Sponsored by: The FreeBSD Foundation MFC after: 1 month	2013-10-28 13:33:29 +00:00
Konstantin Belousov	e20f049b87	Several small fixes for the amd64 minidump code. In report_progress(), use nitems(progress_track) instead of manually hard-coding array size. Wrap long line. In blk_write(), code verifies that ptr and pa cannot be non-zero simultaneously. The later check for the page-alignment of the ptr argument never triggers due to pa != 0 always implying ptr == NULL. I believe that the intent was to ensure that the physical address passed is page-aligned, since the address is (temporarily) mapped for the duration of the page write. Clear the progress_track.visited fields when starting minidump. If minidump is restarted or taken a second time during the system lifetime, progress is not printed otherwise, making the operator suspicious of the dump status. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-10-27 16:31:12 +00:00
Gleb Smirnoff	eedc7fd9e8	Provide includes that are needed in these files, and before were read in implicitly via if.h -> if_var.h pollution. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2013-10-26 18:18:50 +00:00
Neel Natu	49cc03da31	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days	2013-10-16 18:20:27 +00:00
Neel Natu	d38cae4aad	Fix the witness warning that warned against calling uiomove() while holding the 'vmmdev_mtx' in vmmdev_rw(). Rely on the 'si_threadcount' accounting to ensure that we never destroy the VM device node while it has operations in progress (e.g. ioctl, mmap etc). Reported by: grehan Reviewed by: grehan	2013-10-16 00:58:47 +00:00
Glen Barber	6b48eebec6	Document XENHVM and xenpci are mutually inclusive. Submitted by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-10-11 19:40:28 +00:00
Dimitry Andric	9cba9d0157	In sys/amd64/amd64/pmap.c, fix several gcc warnings about uninitialized variables in reclaim_pv_chunk(). Approved by: re (marius) Reviewed by: neel, kib X-MFC-With: r256072	2013-10-08 20:04:35 +00:00
Justin T. Gibbs	5fdd34ee20	Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id field. Perform vcpu enumeration for Xen PV and HVM environments and convert all Xen drivers to use vcpu_id instead of a hard coded assumption of the mapping algorithm (acpi or apic ID) in use. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) amd64/include/pcpu.h: i386/include/pcpu.h: Add vcpu_id to the amd64 and i386 pcpu structures. dev/xen/timer/timer.c x86/xen/xen_intr.c Use new vcpu_id instead of assuming acpi_id == vcpu_id. i386/xen/mp_machdep.c: i386/xen/mptable.c x86/xen/hvm.c: Perform Xen HVM and Xen full PV vcpu_id mapping. x86/xen/hvm.c: x86/acpica/madt.c Change SYSINIT ordering of acpi CPU enumeration so that it is guaranteed to be available at the time of Xen HVM vcpu id mapping.	2013-10-05 23:11:01 +00:00
Neel Natu	318224bbe6	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries are mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For example, the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem.
This is achieved by using the following mapping for EPT entries that need emulation of A/D bits:
  Bit    Position  Interpreted By
  PG_V   52        software (accessed bit emulation handler)
  PG_RW  53        software (dirty bit emulation handler)
  PG_A   0         hardware (aka EPT_PG_RD)
  PG_M   1         hardware (aka EPT_PG_WR)
The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guests with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guests that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable.
Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks to John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho	2013-10-05 21:22:35 +00:00
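The runtime selection of the dirty-bit mask described in the commit above can be sketched as follows. This is a minimal illustration, not the code from amd64/pmap: the `sketch_` type, field, and function names are hypothetical, and only the bit positions (6, 9, and 1) come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two pmap types named in the commit message. */
enum sketch_pt_type { SKETCH_PT_X86, SKETCH_PT_EPT };

struct sketch_pmap {
	enum sketch_pt_type type;
	bool emulate_ad_bits;	/* assumed flag: CPU lacks hardware EPT A/D bits */
};

/* Bit positions quoted in the commit message. */
#define SKETCH_X86_PG_M   (1ULL << 6)	/* dirty bit in a regular x86 PTE */
#define SKETCH_EPT_PG_M   (1ULL << 9)	/* dirty bit in a nested (EPT) PTE */
#define SKETCH_EPT_PG_WR  (1ULL << 1)	/* plays the role of PG_M under A/D emulation */

/*
 * Pick the dirty-bit mask from the pmap type at runtime, in the spirit of
 * the 'pmap_modified_bit(pmap)' helper mentioned in the commit message.
 */
static uint64_t
sketch_modified_bit(const struct sketch_pmap *pmap)
{
	if (pmap->type == SKETCH_PT_X86)
		return (SKETCH_X86_PG_M);
	if (pmap->emulate_ad_bits)
		return (SKETCH_EPT_PG_WR);
	return (SKETCH_EPT_PG_M);
}

/* A caller tests a PTE against the mask instead of a compile-time PG_M. */
static bool
sketch_pte_is_dirty(const struct sketch_pmap *pmap, uint64_t pte)
{
	return ((pte & sketch_modified_bit(pmap)) != 0);
}
```

The point of the design is that callers no longer hard-code PG_M: the same PTE value can be dirty under one pmap type and clean under another.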
John-Mark Gurney	29904f46d6	add aesni module to i386 and amd64 NOTES... Approved by: re (gjb)	2013-10-04 17:21:01 +00:00
Peter Grehan	e58d944482	Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This is enough to get Ubuntu 12.04/13.04 to boot. Approved by: re@ (blanket)	2013-09-27 14:55:59 +00:00
Konstantin Belousov	4cb8b041d1	In pmap_clear_modify(), initialize pvh even for fictitious managed page, otherwise the small mappings loop would use uninitialized value. Note that currently pmap_clear_modify() is not called for fictitious pages. Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-24 13:52:47 +00:00
Konstantin Belousov	fecfc089e4	Use the pv lists generation count to read-lock the pvh_global_lock in pmap_clear_modify(). Noted and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-24 12:26:43 +00:00
Konstantin Belousov	75f50c53f1	Ensure that the ERESTART return from the syscall reloads the registers, to make the restarted syscall instruction pass the correct arguments. PR: kern/182161 Reported by: Russ Cox <rsc@swtch.com> Sponsored by: The FreeBSD Foundation MFC after: 3 days Approved by: re (marius)	2013-09-24 12:24:48 +00:00
Konstantin Belousov	ad43b98491	Free both KVA and backing pages when freeing TSS memory. Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (marius)	2013-09-23 20:14:15 +00:00
Glen Barber	91aff61084	Put 'device hyperv' back in amd64/GENERIC, incorrectly removed with r255736. Pointed out by: gibbs Approved by: re (delphij) Sponsored by: The FreeBSD Foundation	2013-09-21 01:07:27 +00:00
Peter Grehan	36f23e3c20	Reorder/regroup the vmm ioctl api definitions to allow some semblance of API stability and growth during the 10.* timeframe. Userland/kernel bhyve will have to be recompiled after this. Reviewed by: neel Approved by: re@ (blanket)	2013-09-21 00:27:53 +00:00
Justin T. Gibbs	566a5f5020	Merge Xen PVHVM support into the GENERIC kernel config for both amd64 and i386. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: - Introduce two new CPU hooks for initialization and resume purposes. This allows us to get rid of the XENHVM ifdefs in mp_machdep, and also sets some hooks into common code that can be used by other hypervisor implementations. sys/amd64/conf/XENHVM: sys/i386/conf/XENHVM: - Remove these configs now that GENERIC has builtin support for Xen HVM. sys/kern/subr_smp.c: - Make sure there are no pending IPIs when suspending a system. sys/x86/xen/hvm.c: - Add cpu init and resume vectors that are called from mp_machdep using the new hooks. - Only clear the vcpu_info mapping data on resume. It is already clear for the BSP on a cold boot and is set correctly as APs are started. - Gate xen_hvm_init_cpu only to systems running under Xen. sys/x86/xen/xen_intr.c: - Gate the setup of event channels only to systems running under Xen.	2013-09-20 22:59:22 +00:00
David Christensen	4e4007688c	Substantial rewrite of bxe(4) to add support for the BCM57712 and BCM578XX controllers. Approved by: re MFC after: 4 weeks	2013-09-20 20:18:49 +00:00
Neel Natu	74d1d2b7cc	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
Justin T. Gibbs	428b7ca290	Add support for suspend/resume/migration operations when running as a Xen PVHVM guest. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that there are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information. sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers. sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume. sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU. sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume. sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus. sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info.
- Clear vcpu_info so it can be reset by APs when resuming from suspension. sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations. sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.	2013-09-20 05:06:03 +00:00
Alan Cox	deb179bb4c	The pmap function pmap_clear_reference() is no longer used. Remove it. pmap_clear_reference() has had exactly one caller in the kernel for several years, more precisely, since FreeBSD 8. Now, that call no longer exists. Approved by: re (kib) Sponsored by: EMC / Isilon Storage Division	2013-09-20 04:30:18 +00:00
Peter Grehan	d83d73618f	Reconnect the hyperv drivers back into GENERIC now that the disengage driver issue has been resolved. Approved by: re@ (gjb)	2013-09-19 05:07:51 +00:00
Pawel Jakub Dawidek	3fded357af	Fix panic in ktrcapfail() when no capability rights are passed. While here, correct all consumers to pass NULL instead of 0 as we pass capability rights as pointers now, not uint64_t. Reported by: Daniel Peyrolon Tested by: Daniel Peyrolon Approved by: re (marius)	2013-09-18 19:26:08 +00:00
Roman Divacky	69d912af45	Regen. Approved by: re (delphij)	2013-09-18 18:49:26 +00:00
Roman Divacky	b12698e1a1	Revert r255672, it has some serious flaws, leaking file references etc. Approved by: re (delphij)	2013-09-18 18:48:33 +00:00
Roman Divacky	70ccaaf58e	Regen. Approved by: re (delphij)	2013-09-18 17:58:03 +00:00
Roman Divacky	253c75c0de	Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue to implement epoll subset of functionality. The kqueue user data are 32bit on i386 which is not enough for epoll user data so this patch overrides kqueue fileops to maintain enough space in struct file. Initial patch developed by me in 2007 and then extended and finished by Yuri Victorovich. Approved by: re (delphij) Sponsored by: Google Summer of Code Submitted by: Yuri Victorovich <yuri at rawbw dot com> Tested by: Yuri Victorovich <yuri at rawbw dot com>	2013-09-18 17:56:04 +00:00
Peter Grehan	517e21d3e7	Hide TSC-deadline APIC timer support from guests. This mode isn't yet implemented in bhyve's APIC emulation. Reviewed by: neel Approved by: re@ (blanket)	2013-09-17 17:56:53 +00:00
Neel Natu	0f9d5dc758	Fix a bug in decoding an instruction that has an SIB byte as well as an immediate operand. The presence of an SIB byte in decoding the ModR/M field would cause 'imm_bytes' to not be set to the correct value. Fix this by initializing 'imm_bytes' independent of the ModR/M decoding. Reported by: grehan@ Approved by: re@	2013-09-17 16:06:07 +00:00
Bryan Venteicher	03c6abfd1c	Add vmx(4) to i386 and amd64 GENERIC Approved by: re (gjb)	2013-09-17 01:54:13 +00:00
Konstantin Belousov	70b9173019	In pmap_copy(), when the copied region is mapped with superpage but does not cover entire superpage, avoid copying. Doing partial copy would require demotion, which is incompatible with the already held locks. Reported by: cperciva Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (delphij)	2013-09-16 06:15:15 +00:00
Peter Grehan	b90fcf02f2	Pull the hyperv drivers from GENERIC until the fix to the disengage driver to make it only probe when running on hyperv is reviewed and tested. Approved by: re (rodrigc)	2013-09-14 20:38:22 +00:00
Peter Grehan	ab7fb3bca7	Import Hyper-V paravirtualized drivers from projects/hyperv branch into head. Approved by: re@ (hrs) Obtained from: Microsoft, NetApp, and Citrix.	2013-09-13 18:47:58 +00:00
Neel Natu	0f1ef0ec80	Fix a limitation in bhyve that would limit the number of virtual machines to the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a VT-d domain for a guest only if the administrator has explicitly configured one or more PCI passthru device(s). If there are no PCI passthru devices configured (the common case) then the number of virtual machines is no longer limited by the maximum number of VT-d domains. Reviewed by: grehan@ Approved by: re@	2013-09-11 07:11:14 +00:00
Peter Grehan	47823319c3	IFC @ r255459	2013-09-11 00:19:16 +00:00
Peter Grehan	8d39ed16c2	Go way past 11 and bump bhyve's max vCPUs to 16. This should be sufficient for 10.0 and will do until forthcoming work to avoid limitations in this area is complete. Thanks to Bela Lubkin at tidalscale for the headsup on the apic/cpu id/io apic ASL parameters that are actually hex values and broke when written as decimal when 11 vCPUs were configured. Approved by: re@	2013-09-10 03:48:18 +00:00
Alan Cox	70c4180f1c	Prior to r254304, we only began scanning the active page queue when the amount of free memory was close to the point at which we would begin reclaiming pages. Now, we continuously scan the active page queue, regardless of the amount of free memory. Consequently, we are continuously calling pmap_ts_referenced() on active pages. Prior to this change, pmap_ts_referenced() would always demote superpage mappings in order to obtain finer-grained reference information. This made sense because we were coming under memory pressure and would soon have to begin reclaiming pages. Now, however, with continuous scanning of the active page queue, these demotions are taking a toll on performance. For example, on one of my test machines, the running time for the HPCC Random Access benchmark (also known as GUPS) has increased by 54%. To address this problem, I have replaced the demotion with a heuristic for periodically clearing the reference flag on superpage mappings. Reviewed by: kib Approved by: re (glebius) Sponsored by: EMC / Isilon Storage Division	2013-09-08 21:30:53 +00:00
Neel Natu	45e51299b3	Allocate VPIDs by using the unit number allocator to do the bookkeeping. Also deal with VPID exhaustion by allocating out of a reserved range as the last resort.	2013-09-07 05:30:34 +00:00
Peter Grehan	8a02f69652	Mask off the vector from the MSI-x data word. Some o/s's set the trigger-mode level bit which results in an invalid vector and pass-thru interrupts not being delivered.	2013-09-07 03:33:36 +00:00
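The masking step in the commit above can be illustrated with a short sketch. The function and macro names are hypothetical; the layout assumed here is the standard x86 MSI data format, where the vector occupies bits 7:0 and the level and trigger-mode flags sit in bits 14 and 15.

```c
#include <stdint.h>

/* Assumed x86 MSI data word layout (names are illustrative only). */
#define SKETCH_MSI_DATA_VECTOR_MASK  0xffu       /* bits 7:0: interrupt vector */
#define SKETCH_MSI_DATA_LEVEL        (1u << 14)  /* level assert flag */
#define SKETCH_MSI_DATA_TRIGGER      (1u << 15)  /* trigger mode flag */

/*
 * Extract only the vector from a guest-programmed MSI-X data word.
 * Using the whole word would produce an out-of-range vector whenever
 * the guest sets the level or trigger-mode flags.
 */
static inline uint8_t
sketch_msi_data_to_vector(uint32_t data)
{
	return ((uint8_t)(data & SKETCH_MSI_DATA_VECTOR_MASK));
}
```

For instance, a data word of 0x4041 has the level flag set, and masking yields vector 0x41 rather than the invalid value 0x4041.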
Justin T. Gibbs	e44af46e4c	Implement PV IPIs for PVHVM guests and further converge PV and HVM IPI implementations. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Submitted by: gibbs (misc cleanup, table driven config) Reviewed by: gibbs MFC after: 2 weeks sys/amd64/include/cpufunc.h: sys/amd64/amd64/pmap.c: Move invltlb_globpcid() into cpufunc.h so that it can be used by the Xen HVM version of tlb shootdown IPI handlers. sys/x86/xen/xen_intr.c: sys/xen/xen_intr.h: Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(), and remove the ipi vector parameter. This api allocates an event channel port that can be used for ipi services, but knows nothing of the actual ipi for which that port will be used. Removing the unused argument and cleaning up the comments surrounding its declaration helps clarify its actual role. sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: Implement a generic framework for amd64 and i386 that allows the implementation of certain CPU management functions to be selected at runtime. Currently this is only used for the ipi send function, which we optimize for Xen when running on a Xen hypervisor, but can easily be expanded to support more operations. sys/x86/xen/hvm.c: Implement Xen PV IPI handlers and operations, replacing native send IPI. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: sys/i386/include/smp.h: Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS is defined already for us in the xen interface files. NR_IPIS is only needed in one file per Xen platform and is easily inferred by the IPI vector table that is defined in those files. sys/i386/xen/mp_machdep.c: Restructure to more closely match the HVM implementation by performing table driven IPI setup.	2013-09-06 22:17:02 +00:00
Bryan Venteicher	ddb4ffd0c6	Add vmx device to the i386 and amd64 NOTES files	2013-09-06 20:24:21 +00:00
Konstantin Belousov	9430f833ca	Only lock pvh_global_lock read-only for pmap_page_wired_mappings(), pmap_is_modified() and pmap_is_referenced(), same as it was done for pmap_ts_referenced(). Consolidate identical code for pmap_is_modified() and pmap_is_referenced() into helper pmap_page_test_mappings(). Reviewed by: alc Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation	2013-09-06 16:53:48 +00:00
Konstantin Belousov	3e4f32be7d	In pmap_ts_referenced(), when restarting the loop due to pv list generation changed, do not drop and immediately relock the pv list. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-09-06 16:48:34 +00:00
Gleb Smirnoff	e16477e8d9	On those machines, where sf_bufs do not represent any real object, make sf_buf_alloc()/sf_buf_free() inlines, to save two calls to absolutely empty functions. Reviewed by: alc, kib, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-09-06 05:37:49 +00:00
Peter Grehan	76c35ba80f	Emulate reading of the IA32_MISC_ENABLE MSR, by returning the host MSR and masking off features that aren't supported. Linux reads this MSR to detect if NX has been disabled via BIOS.	2013-09-06 05:20:11 +00:00
Peter Grehan	8b7e3e3022	Allow CPUID leaf 0xD to be read as zeroes. Linux reads this even though extended features aren't exposed. Support for 0xD will be expanded once AVX[2] is exposed to the guest in upcoming work.	2013-09-06 05:16:10 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. 
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine a few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that.
This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
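The encoding described in the commit above can be sketched in a few lines of standalone C. This is a hedged illustration and not the Capsicum implementation: every `sk_`/`SK_` name is hypothetical, and the placement of the index marker at bits 57 to 61 is an assumption derived from the description ("top two bits", then "the next five bits"). Where the commit says mixed-element rights are caught by an assertion, this sketch returns -1 instead so that the behaviour can be checked.

```c
#include <stdarg.h>
#include <stdint.h>
#include <string.h>

#define SK_CAP_RIGHTS_VERSION 0
#define SK_NELEMS (SK_CAP_RIGHTS_VERSION + 2)

struct sk_cap_rights {
	uint64_t cr_rights[SK_NELEMS];
};

/* One marker bit in bits 57..61 selects the array element (assumed layout). */
#define SK_CAPRIGHT(idx, bit) ((1ULL << (57 + (idx))) | (bit))

#define SK_CAP_LOOKUP SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_FCHMOD SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define SK_CAP_PDKILL SK_CAPRIGHT(1, 0x0000000000000800ULL)

/* Return the array index encoded in a right, or -1 if it is not unique. */
static int
sk_right_index(uint64_t right)
{
	uint64_t marker = (right >> 57) & 0x1fULL;
	int idx = 0;

	/* Exactly one marker bit must be set; two bits mean mixed elements. */
	if (marker == 0 || (marker & (marker - 1)) != 0)
		return (-1);
	while ((marker & 1ULL) == 0) {
		marker >>= 1;
		idx++;
	}
	return (idx);
}

/* Empty rights: top two bits of element 0 hold (number of elements - 2). */
static void
sk_rights_init(struct sk_cap_rights *r)
{
	int i;

	memset(r, 0, sizeof(*r));
	for (i = 0; i < SK_NELEMS; i++)
		r->cr_rights[i] = SK_CAPRIGHT(i, 0ULL);
	r->cr_rights[0] |= (uint64_t)SK_CAP_RIGHTS_VERSION << 62;
}

/* Variadic worker; the list is terminated by 0ULL, appended by the macro. */
static int
sk_rights_vset(struct sk_cap_rights *r, ...)
{
	va_list ap;
	int idx, error = 0;

	va_start(ap, r);
	for (;;) {
		uint64_t right = va_arg(ap, unsigned long long);

		if (right == 0)
			break;
		idx = sk_right_index(right);
		if (idx < 0 || idx >= SK_NELEMS) {
			error = -1;	/* rights from different elements were OR-ed */
			break;
		}
		r->cr_rights[idx] |= right;
	}
	va_end(ap);
	return (error);
}

#define sk_rights_set(r, ...) sk_rights_vset((r), __VA_ARGS__, 0ULL)

/* Check whether a single right (or same-element alias) is fully present. */
static int
sk_rights_has(const struct sk_cap_rights *r, uint64_t right)
{
	int idx = sk_right_index(right);

	if (idx < 0 || idx >= SK_NELEMS)
		return (0);
	return ((r->cr_rights[idx] & right) == right);
}
```

With this layout, `sk_rights_set(&r, SK_CAP_LOOKUP, SK_CAP_PDKILL)` succeeds because each argument carries its own index marker, while `sk_rights_set(&r, SK_CAP_LOOKUP | SK_CAP_PDKILL)` is rejected because the combined value has two marker bits set.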
Konstantin Belousov	6aceaa3e17	Tidy up some loose ends in the PCID code: - Restore the pre-PCID TLB shootdown handlers for whole address space and single page invalidation asm code, and assign the IPI handler to them when PCID is not supported or disabled. Old handlers have linear control flow. But, still use the common return sequence. - Stop using pcpu for INVPCID descriptors in the invlrng handler. It is enough to allocate descriptors on the stack. As a result, two SWAPGS instructions are shaved off from the code for Haswell+. - Fix the inverted condition in invlrng for checking of the PCID support [1], also in invlrng check that pmap is kernel pmap before performing other tests. For the kernel pmap, which provides global mappings, the INVLPG must be used for invalidation always. - Save the pre-computed pmap's %CR3 register in the struct pmap. This allows removing several checks for pm_pcid validity when %CR3 is reloaded [2]. Noted by: gibbs [1] Discussed with: alc [2] Tested by: pho, flo Sponsored by: The FreeBSD Foundation	2013-09-04 23:31:29 +00:00
Peter Grehan	46ed9e4908	IFC @ r255209	2013-09-04 20:55:56 +00:00
John Baldwin	dffe0dc4d2	Add support for the 'invpcid' instruction to binutils and DDB's disassembler on amd64. MFC after: 1 month	2013-09-03 21:21:47 +00:00
Konstantin Belousov	f27d53b8f2	Fix two build failures for non-tb configurations, UP [2] and when using gas [1]. Reported by: andreast [1], bf [2] Sponsored by: The FreeBSD Foundation	2013-08-31 19:13:21 +00:00
Konstantin Belousov	1099068118	The pm_save should be cleared on the pmap initialization, and not on the activation. Noted by: alc	2013-08-30 20:10:01 +00:00
Konstantin Belousov	37eed8419c	Implement support for the process-context identifiers ('PCID') on Intel CPUs. The feature tags TLB entries with the Id of the address space and allows to avoid TLB invalidation on the context switch, it is available only in the long mode. In the microbenchmarks, using the PCID decreased latency of the context switches by ~30% on SandyBridge class desktop CPUs, measured with the lat_ctx program from lmbench. If available, use INVPCID instruction when a TLB entry in non-current address space needs to be invalidated. The instruction is typically available on the Haswell. If needed, the use of PCID can be turned off with the vm.pmap.pcid_enabled loader tunable set to 0. The state of the feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl vm.pmap.pcid_save_cnt reports the number of context switches which avoided invalidating the TLB; compare with the total number of context switches, available as sysctl vm.stats.sys.v_swtch. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:59:49 +00:00
Konstantin Belousov	5f5703ef52	Provide a wrapper for the INVPCID instruction, definition of the descriptor and symbolic names for the operation types. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:42:38 +00:00
Justin T. Gibbs	76acc41fb7	Implement vector callback for PVHVM and unify event channel implementations Re-structure Xen HVM support so that: - Xen is detected and hypercalls can be performed very early in system startup. - Xen interrupt services are implemented using FreeBSD's native interrupt delivery infrastructure. - the Xen interrupt service implementation is shared between PV and HVM guests. - Xen interrupt handlers can optionally use a filter handler in order to avoid the overhead of dispatch to an interrupt thread. - interrupt load can be distributed among all available CPUs. - the overhead of accessing the emulated local and I/O apics on HVM is removed for event channel port events. - a similar optimization can eventually, and fairly easily, be used to optimize MSI. Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure, and misc Xen cleanups: Sponsored by: Spectra Logic Corporation Unification of PV & HVM interrupt infrastructure, bug fixes, and misc Xen cleanups: Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D sys/x86/x86/local_apic.c: sys/amd64/include/apicvar.h: sys/i386/include/apicvar.h: sys/amd64/amd64/apic_vector.S: sys/i386/i386/apic_vector.s: sys/amd64/amd64/machdep.c: sys/i386/i386/machdep.c: sys/i386/xen/exception.s: sys/x86/include/segments.h: Reserve IDT vector 0x93 for the Xen event channel upcall interrupt handler. On Hypervisors that support the direct vector callback feature, we can request that this vector be called directly by an injected HVM interrupt event, instead of a simulated PCI interrupt on the Xen platform PCI device. This avoids all of the overhead of dealing with the emulated I/O APIC and local APIC. It also means that the Hypervisor can inject these events on any CPU, allowing upcalls for different ports to be handled in parallel. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: Map Xen per-vcpu area during AP startup. 
sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: Increase the FreeBSD IRQ vector table to include space for event channel interrupt sources. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: Remove Xen HVM per-cpu variable data. These fields are now allocated via the dynamic per-cpu scheme. See xen_intr.c for details. sys/amd64/include/xen/hypercall.h: sys/dev/xen/blkback/blkback.c: sys/i386/include/xen/xenvar.h: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/xen/gnttab.c: Prefer FreeBSD primitives to Linux ones in Xen support code. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: sys/dev/xen/balloon/balloon.c: sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/console/xencons_ring.c: sys/dev/xen/control/control.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/xenpci/xenpci.c: sys/i386/i386/machdep.c: sys/i386/include/pmap.h: sys/i386/include/xen/xenfunc.h: sys/i386/isa/npx.c: sys/i386/xen/clock.c: sys/i386/xen/mp_machdep.c: sys/i386/xen/mptable.c: sys/i386/xen/xen_clock_util.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/xen_rtc.c: sys/xen/evtchn/evtchn_dev.c: sys/xen/features.c: sys/xen/gnttab.c: sys/xen/gnttab.h: sys/xen/hvm.h: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_if.m: sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusvar.h: sys/xen/xenstore/xenstore.c: sys/xen/xenstore/xenstore_dev.c: sys/xen/xenstore/xenstorevar.h: Pull common Xen OS support functions/settings into xen/xen-os.h. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: Remove constants, macros, and functions unused in FreeBSD's Xen support. sys/xen/xen-os.h: sys/i386/xen/xen_machdep.c: sys/x86/xen/hvm.c: Introduce new functions xen_domain(), xen_pv_domain(), and xen_hvm_domain(). These are used in favor of #ifdefs so that FreeBSD can dynamically detect and adapt to the presence of a hypervisor. 
The goal is to have an HVM optimized GENERIC, but more is necessary before this is possible. sys/amd64/amd64/machdep.c: sys/dev/xen/xenpci/xenpcivar.h: sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/sys/kernel.h: Refactor magic ioport, Hypercall table and Hypervisor shared information page setup, and move it to a dedicated HVM support module. HVM mode initialization is now triggered during the SI_SUB_HYPERVISOR phase of system startup. This currently occurs just after the kernel VM is fully set up, which is just enough infrastructure to allow the hypercall table and shared info page to be properly mapped. sys/xen/hvm.h: sys/x86/xen/hvm.c: Add definitions and a method for configuring Hypervisor event delivery via a direct vector callback. sys/amd64/include/xen/xen-os.h: sys/x86/xen/hvm.c: sys/conf/files: sys/conf/files.amd64: sys/conf/files.i386: Adjust kernel build to reflect the refactoring of early Xen startup code and Xen interrupt services. sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: sys/dev/xen/control/control.c: sys/dev/xen/evtchn/evtchn_dev.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/xen/xenstore/xenstore.c: sys/xen/evtchn/evtchn_dev.c: sys/dev/xen/console/console.c: sys/dev/xen/console/xencons_ring.c Adjust drivers to use new xen_intr_*() API. sys/dev/xen/blkback/blkback.c: Since blkback defers all event handling to a taskqueue, convert this task queue to a "fast" taskqueue, and schedule it via an interrupt filter. This avoids an unnecessary ithread context switch. sys/xen/xenstore/xenstore.c: The xenstore driver is MPSAFE. Indicate as much when registering its interrupt handler. sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbusvar.h: Remove unused event channel APIs. sys/xen/evtchn.h: Remove all kernel Xen interrupt service API definitions from this file. It is now only used for structure and ioctl definitions related to the event channel userland device driver. 
Update the definitions in this file to match those from NetBSD. Implementing this interface will be necessary for Dom0 support. sys/xen/evtchn/evtchnvar.h: Add a header file for implementation internal APIs related to managing event channels event delivery. This is used to allow, for example, the event channel userland device driver to access low-level routines that typical kernel consumers of event channel services should never access. sys/xen/interface/event_channel.h: sys/xen/xen_intr.h: Standardize on the evtchn_port_t type for referring to an event channel port id. In order to prevent low-level event channel APIs from leaking to kernel consumers who should not have access to this data, the type is defined twice: Once in the Xen provided event_channel.h, and again in xen/xen_intr.h. The double declaration is protected by __XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared twice within a given compilation unit. sys/xen/xen_intr.h: sys/xen/evtchn/evtchn.c: sys/x86/xen/xen_intr.c: sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/xenpci/xenpcivar.h: New implementation of Xen interrupt services. This is similar in many respects to the i386 PV implementation with the exception that events bound to event channel ports (i.e. not IPI, virtual IRQ, or physical IRQ) are further optimized to avoid mask/unmask operations that aren't necessary for these edge triggered events. Stubs exist for supporting physical IRQ binding, but will need additional work before this implementation can be fully shared between PV and HVM. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: sys/i386/xen/mp_machdep.c sys/x86/xen/hvm.c: Add support for placing vcpu_info into an arbitrary memory page instead of using HYPERVISOR_shared_info->vcpu_info. This allows the creation of domains with more than 32 vcpus. sys/i386/i386/machdep.c: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/exception.s: Add support for new event channel implementation.	2013-08-29 19:52:18 +00:00
Alan Cox	51321f7c31	Significantly reduce the cost, i.e., run time, of calls to madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new pmap function, pmap_advise(), that operates on a range of virtual addresses within the specified pmap, allowing for a more efficient implementation of MADV_DONTNEED and MADV_FREE. Previously, the implementation of MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as pmap_clear_reference(). Intuitively, the problem with this implementation is that the pmap-level locks are acquired and released and the page table traversed repeatedly, once for each resident page in the range that was specified to madvise(2). A more subtle flaw with the previous implementation is that pmap_clear_reference() would clear the reference bit on all mappings to the specified page, not just the mapping in the range specified to madvise(2). Since our malloc(3) makes heavy use of madvise(2), this change can have a measurable impact. For example, the system time for completing a parallel "buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%. Note: This change only contains pmap_advise() implementations for a subset of our supported architectures. I will commit implementations for the remaining architectures after further testing. For now, a stub function is sufficient because of the advisory nature of pmap_advise(). Discussed with: jeff, jhb, kib Tested by: pho (i386), marcel (ia64) Sponsored by: EMC / Isilon Storage Division	2013-08-29 15:49:05 +00:00
Neel Natu	6f6ebf3c3f	Add support for emulating the byte move instruction "mov r/m8, r8". This emulation is required when dumping MMIO space via the ddb "examine" command.	2013-08-27 16:49:20 +00:00
Konstantin Belousov	e68c64f0ba	Revert r254501. Instead, reuse the type stability of the struct pmap which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE zone. Initialize the pmap lock in the vmspace zone init function, and remove pmap lock initialization and destruction from pmap_pinit() and pmap_release(). Suggested and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:12:24 +00:00
Konstantin Belousov	b544368a22	Use the generation count of the pv list to work around LOR between pmap lock and pv list lock, and use the shared locking on pvh_global_lock in pmap_remove_write(), same as it was done for pmap_ts_referenced(). Noted and reviewed by: alc (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-22 18:05:31 +00:00
David E. O'Brien	46be218dce	The PADLOCK_RNG and RDRAND_RNG kernel options are now devices. Thus "device padlock_rng" and "device rdrand_rng" should be used instead of "options PADLOCK_RNG" & "options RDRAND_RNG". Requested by: so@ (des) Submitted by: obrien, arthurmesh@gmail.com Obtained from: Juniper Networks	2013-08-21 22:43:29 +00:00
Jung-uk Kim	1533b9f714	Reimplement atomic operations on PDEs and PTEs in pmap.h. This change significantly reduces duplicate code and makes it easier to read. Reviewed by: alc, bde	2013-08-21 22:40:29 +00:00
Jung-uk Kim	d36eb3f1c4	Remove empty lines before return statements for style consistency.	2013-08-21 22:05:58 +00:00
Jung-uk Kim	8a1ee2d346	Implement atomic_swap() and atomic_testandset(). Reviewed by: arch, bde, jilles, kib	2013-08-21 22:03:06 +00:00
Jung-uk Kim	da255e4c7f	- Remove the "a" constraint from main output operand for atomic_cmpset(). - Use "+" modifier for the "expect" because it is also an output (unused).	2013-08-21 21:30:06 +00:00
Jung-uk Kim	fe94be3da7	Use '+' modifier for a memory operand that is both an input and an output. It was actually done in r86301 but reverted in r150182 because GCC 3.x was not able to handle it for a memory operand. Apparently, this problem was fixed in GCC 4.1+ and several contrib sources already rely on this feature.	2013-08-21 21:14:16 +00:00
Jung-uk Kim	c1c84ce1bf	Remove bogus labels. No functional change.	2013-08-21 20:49:46 +00:00
Jung-uk Kim	ee93d1173a	Use consistent style. No functional change.	2013-08-21 20:43:50 +00:00
Neel Natu	b98940e5eb	Do not create superpage mappings in the iommu. This is a workaround to hide the fact that we do not have any code to demote a superpage mapping before we unmap a single page that is part of the superpage.	2013-08-20 06:46:40 +00:00
Neel Natu	f77e982952	Extract the location of the remapping hardware units from the ACPI DMAR table. Submitted by: Gopakumar T (gopakumar_thekkedath@yahoo.co.in)	2013-08-20 06:20:05 +00:00
Neel Natu	15e683837c	Fix breakage caused by r254466 in minidumpsys(). r254466 increased the KVA from 512GB to 2TB which requires 4 PDP pages as opposed to a single one before the change. This broke minidumpsys() since it assumed that the entire KVA could be addressed via a single PDP page. Fix this by obtaining the address of the PDP page from the PML4 entry associated with the KVA being dumped. Reported by: pho Submitted by: kib Pointy hat to: neel	2013-08-20 02:09:26 +00:00
Konstantin Belousov	d91f339823	When code from r254064 in pmap_ts_referenced() drops pv lock and blocks on a pmap lock, pmap_release() might proceed in parallel and destroy the pmap mutex, since unlocked pv lock allows to remove pv entry owned by the pmap. For now, gate the pmap_release() on write-locked pvh_global_lock. Since pmap_ts_referenced() does not unlock the global lock, pmap_release() would not destroy pmap mutex until the pmap_ts_referenced() finished. We cannot enter pmap_ts_referenced() and encounter a pv entry for the destroyed pmap if pmap_release() passed the global lock gate, since pmap_remove_pages() would finish earlier. Reported by: jeff, pho Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-18 21:36:22 +00:00
Pawel Jakub Dawidek	417ffc66fa	Add process descriptors support to the GENERIC kernel. It is already being used by the tools in base systems and with sandboxing more and more tools the usage should only increase. Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org> Sponsored by: Google Summer of Code 2013 MFC after: 1 month	2013-08-18 10:21:29 +00:00
Neel Natu	0ef2ab3ab8	Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB. Bump up the KVA size proportionally from 512GB to 2TB. The number of page table pages used by the direct map is now calculated at run time based on 'Maxmem'. This means the small memory systems will not see any additional tax in terms of page table pages for the direct map. However all amd64 systems, regardless of the memory size, will use 3 more pages to accommodate the bump in the KVA size. More details available here: http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html Tested with the following configurations: - Sandybridge server with 64GB of memory. - bhyve VM with 64MB of memory. - bhyve VM with 8GB of memory with the memory segment above 4GB cuddling right up against the 4TB maximum memory limit. Discussed on: hackers@, current@ Submitted by: Chris Torek (torek@torek.net)	2013-08-17 19:49:08 +00:00
Jilles Tjoelker	0f3a4d8051	libc: Access _logname_valid more efficiently. The variable _logname_valid is not exported via the version script; therefore, change C and i386/amd64 assembler code to remove indirection (which allowed interposition). This makes the code slightly smaller and faster. Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC, there is no place containing the address of each variable, so there is no possible definition for PIC_GOT.	2013-08-17 19:24:58 +00:00
Brooks Davis	cd234300d3	Use an ANSI C definition of initializecpucache() to match the declaration and the rest of the file.	2013-08-15 17:44:44 +00:00
Jung-uk Kim	38da30b419	Merge acpica_machdep.h for amd64 and i386 and move to x86. In fact, these two files were functionally identical.	2013-08-13 22:05:10 +00:00
Jung-uk Kim	3bd12ca8f1	Tidy up global locks for ACPICA. There is no functional change.	2013-08-13 21:34:03 +00:00
Konstantin Belousov	c325e866f4	Different consumers of the struct vm_page abuse pageq member to keep additional information, when the page is guaranteed to not belong to a paging queue. Usually, this results in a lot of type casts which make reasoning about the code correctness harder. Sometimes m->object is used instead of pageq, which could cause real and confusing bugs if non-NULL m->object is leaked. See r141955 and r253140 for examples. Change the pageq member into a union containing explicitly-typed members. Use them instead of type-punning or abusing m->object in x86 pmaps, uma and vm_page_alloc_contig(). Requested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-10 17:36:42 +00:00
Attilio Rao	e946b94934	On all the architectures, avoid preallocating the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid pre-allocating the KVA for such nodes. In order to do so, allow the operations derived from vm_radix_insert() to fail and handle all the resulting failures. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the 2 concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed, which is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Andriy Gapon	9ba0691bdd	follow up to r254051 - update powerpc/GENERIC64 as well, suggested by mdf - update comments so that they make sense after the change, suggested by jhb X-MFC after: never (change specific to head)	2013-08-09 08:11:09 +00:00
Neel Natu	f263e391a3	Use local variables with the appropriate types and eliminate a bunch of casts. This is a cosmetic change but it does help with a proposed change to increase the maximum size of physical memory supported on amd64 platforms. Submitted by: Chris Torek (torek@torek.net)	2013-08-08 03:17:39 +00:00
Konstantin Belousov	449c2e92c9	Split the pagequeues per NUMA domains, and split pagedaemon process into threads each processing queue in a single domain. The structure of the pagedaemons and queues is kept intact, most of the changes come from the need for code to find an owning page queue for given page, calculated from the segment containing the page. The tie between NUMA domain and pagedaemon thread/pagequeue split is rather arbitrary, the multithreaded daemon could be allowed for the single-domain machines, or one domain might be split into several page domains, to further increase concurrency. Right now, each pagedaemon thread tries to reach the global target, precalculated at the start of the pass. This is not optimal, since it could cause excessive page deactivation and freeing. The code should be changed to re-check the global page deficit state in the loop after some number of iterations. The pagedaemons reach the quorum before starting the OOM, since one thread's inability to meet the target is normal for split queues. Only when all pagedaemons fail to produce enough reusable pages, OOM is started by single selected thread. Launder is modified to take into account the segments layout with regard to the region for which cleaning is performed. Based on the preliminary patch by jeff, sponsored by EMC / Isilon Storage Division. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation	2013-08-07 16:36:38 +00:00
Konstantin Belousov	872d995f76	Change the pmap_ts_referenced() method of amd64 pmap to use shared pvh_global_lock. This allows the method to be executed in parallel, avoiding undue contention on the pvh_global_lock for the multithreaded pagedaemon. The pmap_ts_referenced() function has to inspect the page mappings for several pmaps, which need to be locked while pv list lock is owned. This contradicts to the lock order, where pmap lock is before pv list lock. Introduce the generation count for the pv list of the page or superpage, which indicate any change in the pv list, and, as usual, perform restart of the iteration if generation changed while pv lock was dropped for blocking acquire of a pmap lock. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-07 16:33:15 +00:00
Andriy Gapon	818d282e7b	enable KDB_TRACE in GENERICs KDB_TRACE is not an alternative to DDB/etc, they are complementary. So I do not see any reason to not enable KDB_TRACE by default. X-MFC after: never (change specific to head)	2013-08-07 08:03:50 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Jeff Roberson	2c0b86b48f	- Introduce a specific function, pmap_remove_kernel_pde, for removing huge pages in the kernel's address space. This works around several asserts from pmap_demote_pde_locked that did not apply and gave false warnings. Discovered by: pho Reviewed by: alc Sponsored by: EMC / Isilon Storage Division	2013-08-05 00:28:03 +00:00
Peter Grehan	80a902ef7d	Follow-up commit to fix CR0 issues. Maintain architectural state on CR vmexits by guaranteeing that EFER, CR0 and the VMCS entry controls are all in sync when transitioning to IA-32e mode. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)	2013-08-03 03:16:42 +00:00
Peter Grehan	672ed870a7	IFC @ r253862 - change the SI_SUB_RUN_SCHEDULER sysinits in hv_utilc and hv_netvsc_drv_freebsd.c to SI_SUB_KTHREAD_IDLE, since the former is no longer in FreeBSD. The use of these SYSINITs can probably be removed.	2013-08-01 22:09:57 +00:00
Peter Grehan	81ef6611ed	Moved clearing of vmm_initialized to avoid the case of unloading the module while VMs existed. This would result in EBUSY, but would prevent further operations on VMs resulting in the module being impossible to unload. Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com) Reviewed by: grehan, neel	2013-08-01 05:59:28 +00:00
Peter Grehan	aaaa065629	Correctly maintain the CR0/CR4 shadow registers. This was exposed with AP spinup of Linux, and booting OpenBSD, where the CR0 register is unconditionally written to prior to the longjump to enter protected mode. The CR-vmexit handling was not updating CPU state which resulted in a vmentry failure with invalid guest state. A follow-on submit will fix the CPU state issue, but this fix prevents the CR-vmexit prior to entering protected mode by properly initializing and maintaining CR* state. Reviewed by: neel Reported by: Gopakumar.T @ netapp	2013-08-01 01:18:51 +00:00
David E. O'Brien	0e6a0799a9	Back out r253779 & r253786.	2013-07-31 17:21:18 +00:00
David E. O'Brien	99ff83da74	Decouple yarrow from random(4) device. * Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option. The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow. * random(4) device doesn't really depend on rijndael-. Yarrow, however, does. Add random_adaptors.[ch] which is basically a store of random_adaptor's. random_adaptor is basically an adapter that plugs in to random(4). random_adaptor can only be plugged in to random(4) very early in bootup. Unplugging random_adaptor from random(4) is not supported, and is probably a bad idea anyway, due to potential loss of entropy pools. We currently have 3 random_adaptors: + yarrow + rdrand (ivy.c) + nehemiah * Remove platform dependent logic from probe.c, and move it into corresponding registration routines of each random_adaptor provider. probe.c doesn't do anything other than picking a specific random_adaptor from a list of registered ones. * If the kernel doesn't have any random_adaptor adapters present then the creation of /dev/random is postponed until next random_adaptor is kldload'ed. * Fix randomdev_soft.c to refer to its own random_adaptor, instead of a system wide one. Submitted by: arthurmesh@gmail.com, obrien Obtained from: Juniper Networks Reviewed by: obrien	2013-07-29 20:26:27 +00:00
Andriy Gapon	a29cc9a34b	Revert r253748,253749 This WIP should not have been committed yet. Pointyhat to: avg	2013-07-28 18:44:17 +00:00
Andriy Gapon	366d8bfb7b	put contents of cpu.h under _KERNEL no userland-serviceable parts inside MFC after: 20 days	2013-07-28 18:32:27 +00:00
Andriy Gapon	a69e8d609e	x86: detect mwait capabilities and extensions, when present Reviewed by: kib (earlier amd64-only version) MFC after: 2 weeks	2013-07-28 17:54:42 +00:00
Jeff Roberson	2f84c08eee	- Use kmem_malloc rather than kmem_alloc() for GDT/LDT/tss allocations etc. This eliminates some unusual uses of that API in favor of more typical uses of kmem_malloc(). Discussed with: kib/alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-07-26 19:06:14 +00:00
Neel Natu	84e169c6c3	Add support for emulation of the "or r/m, imm8" instruction. Submitted by: Zhixiang Yu (zxyu.core@gmail.com) Obtained from: GSoC 2013 (AHCI device emulation for bhyve)	2013-07-23 23:43:00 +00:00
Neel Natu	113326a772	Fix a bug introduced in r252646 that causes a page with the PG_PTE_PAT bit set to be interpreted as a superpage. This is because PG_PTE_PAT is at the same bit position in PTE as PG_PS is in a PDE. This caused a number of regressions on amd64 systems: panic when starting X applications, freeze during shutdown etc. Pointy hat to: me Tested by: gperez@entel.upc.edu, joel, dumbbell Reviewed by: kib	2013-07-23 22:17:00 +00:00
Peter Grehan	15b996d742	First cut at adding the hyperv drivers to GENERIC. The files inventory should probably have the modules split out into net/storage/common etc as the modules build is, but this will do for now.	2013-07-19 05:32:58 +00:00
Konstantin Belousov	0f6bcda4cd	MFi386: add ddb "show sysregs" command. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-15 06:30:57 +00:00
Konstantin Belousov	0cdd261571	Clear m->object for the page taken from the delayed free list for reuse as the pv chunk page in reclaim_pv_chunk(). Having non-NULL m->object is wrong for page not owned by an object and confuses both vm_page_free_toq() and vm_page_remove() when the page is freed later. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2013-07-10 09:24:03 +00:00
Xin LI	1fdeb1651c	Import HighPoint DC Series Data Center HBA (DC7280 and R750) driver. This driver works for FreeBSD/i386 and FreeBSD/amd64 platforms. Many thanks to HighPoint for providing this driver. MFC after: 1 day	2013-07-06 07:49:41 +00:00
Neel Natu	be28275d00	If a superpage mapping is being removed then we need to ignore the PG_PDE_PAT bit when looking up the vm_page associated with the superpage's physical address. If the caching attribute for the mapping is write combining or write protected then the PG_PDE_PAT bit will be set and thus cause an 'off-by-one' error when looking up the vm_page. Fix this by using the PG_PS_FRAME mask to compute the physical address for a superpage mapping instead of PG_FRAME. This is a theoretical issue at this point since non-writeback attributes are currently used only for fictitious mappings and fictitious mappings are not subject to promotion. Discussed with: alc, kib MFC after: 2 weeks	2013-07-03 23:21:25 +00:00
Neel Natu	de16308c48	Verify that all bytes in the instruction buffer are consumed during decoding. Suggested by: grehan	2013-07-03 23:05:17 +00:00
Peter Grehan	e60f5d779e	Ignore guest PAT settings by default in EPT mappings. From experimentation, other hypervisors also do this. Diagnosed by: tycho nightingale at pluribusnetworks com Reviewed by: neel	2013-07-01 20:05:43 +00:00
Konstantin Belousov	70a7dd5d5b	Fix issues with zeroing and fetching the counters, on x86 and ppc64. Issues were noted by Bruce Evans and are present on all architectures. On i386, a counter fetch should use atomic read of 64bit value, otherwise carry from the increment on other CPU could be lost for the given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not available on the machine, it cannot be SMP and it is enough to disable preemption around read to avoid the split read. On x86 the counter increment is not atomic on purpose, which makes it possible for the store of the incremented result to override just zeroed per-cpu slot. The effect would be a counter going off by arbitrary value after zeroing. Perform the counter zeroing on the same processor which does the increments, making the operations mutually exclusive. On i386, same as for the fetching, if the cmpxchg8b is not available, machine is not SMP and we disable preemption for zeroing. PowerPC64 is treated the same as amd64. For other architectures, the changes made to allow the compilation to succeed, without fixing the issues with zeroing or fetching. It should be possible to handle them by using the 64bit loads and stores atomic WRT preemption (assuming the architectures also converted from using critical sections to proper asm). If architecture does not provide the facility, using global (spin) mutex would be non-optimal but working solution. Noted by: bde Sponsored by: The FreeBSD Foundation	2013-07-01 02:48:27 +00:00
Peter Grehan	560d5eda2c	Make sure all CPUID values are handled, instead of exiting the bhyve process when an unhandled one is encountered. Hide some additional capabilities from the guest (e.g. debug store). This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on AP spinup (where CPUID is used when sync'ing the TSCs) and the issue with the Java build where CPUIDs are issued from a guest userspace. Submitted by: tycho nightingale at pluribusnetworks com Reviewed by: neel Reported by: many	2013-06-28 06:05:33 +00:00
Jung-uk Kim	b1ddd13145	Move definitions required by userland applications out of acpica_machdep.h.	2013-06-27 00:22:40 +00:00
Konstantin Belousov	9dbb63fe03	Allow immediate operand. Sponsored by: The FreeBSD Foundation	2013-06-20 14:30:04 +00:00
Konstantin Belousov	c788f92509	Some clarifications and updates for the comments, mostly retrieved from Bruce Evans. Trim the trailing spaces. MFC after: 1 week	2013-06-19 05:05:16 +00:00
Sergey Kandaurov	1e2751ddeb	Fix a gcc warning uncovered after r251745. Reported by: Sergey V. Dyatko Reviewed by: neel	2013-06-18 23:31:09 +00:00
Justin T. Gibbs	a8f6ac0573	Upgrade Xen interface headers to Xen 4.2.1. Move FreeBSD from interface version 0x00030204 to 0x00030208. Updates are required to our grant table implementation before we can bump this further. sys/xen/hvm.h: Replace the implementation of hvm_get_parameter(), formerly located in sys/xen/interface/hvm/params.h. Linux has a similar file which primarily stores this function. sys/xen/xenstore/xenstore.c: Include new xen/hvm.h header file to get hvm_get_parameter(). sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: Correctly protect function definition and variables from being included into assembly files in xen-os.h Xen memory barriers are now prefixed with "xen_" to avoid conflicts with OS native primitives. Define Xen memory barriers in terms of the native FreeBSD primitives. Sponsored by: Spectra Logic Corporation Reviewed by: Roger Pau Monné Tested by: Roger Pau Monné Obtained from: Roger Pau Monné (bug fixes)	2013-06-14 23:43:44 +00:00
Sergey Kandaurov	82f2974a69	Replace cpusetffs_obj with CPU_FFS, missed in r251703. Reported by: bdrewery, O. Hartmann	2013-06-14 10:26:38 +00:00
Neel Natu	8f1664b724	Remove unused macros PTESHIFT, PDESHIFT, PDPESHIFT and PML4ESHIFT. Reviewed by: alc	2013-06-14 00:03:43 +00:00
Jeff Roberson	17a2737732	- Add a BIT_FFS() macro and use it to replace cpusetffs_obj() Discussed with: attilio Sponsored by: EMC / Isilon Storage Division	2013-06-13 20:46:03 +00:00
Konstantin Belousov	9138579845	Assert that interrupts are enabled in the trap handlers on x86 before calling generic code to deliver signals. Discussed with: bde Tested by: pho MFC after: 1 week	2013-06-03 17:40:05 +00:00
Konstantin Belousov	cb5bfd1240	Use slightly more idiomatic expression to get the address of array. Tested by: dim, pgj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-27 18:39:39 +00:00
Konstantin Belousov	87b94d9a92	The _MC_HASFPXSTATE and _MC_IA32_HASFPXSTATE flags have the same bit value on purpose, but the ia32 context handling code is logically more correct to use the _MC_IA32_HASFPXSTATE name for the flag. Tested by: dim, pgj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-27 18:36:46 +00:00
Konstantin Belousov	e9249a80f6	The ia32_get_mcontext() does not need to set PCB_FULL_IRET. The usermode context state is not changed by the get operation, and get_mcontext() does not require full iret as well. Tested by: dim, pgj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-27 18:31:15 +00:00
Konstantin Belousov	80b5691a76	When reporting the fault details, also print %rsp. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-27 18:29:20 +00:00
Konstantin Belousov	6806ce6ec8	When handling an exception from the attempt to load the faulting context on return from the trap handler, re-enable the interrupts on i386 and amd64. The trap return path has to disable interrupts since the sequence of loading the machine state is not atomic. The trap() function which transfers the control to the special handler would enable the interrupt, but an iret loads the previous eflags with PSL_I clear. Then, the special handler calls trap() on its own, which now sees the original eflags with PSL_I set and does not enable interrupts. The end result is that signal delivery and process exiting code could be executed with interrupts disabled, which is generally wrong and triggers several assertions. For amd64, the interrupts are enabled conditionally based on PSL_I in the eflags of the outer frame, as it is already done for doreti_iret_fault. For i386, the interrupts are enabled unconditionally, the ast loop could have opened a window with interrupts enabled just before the iret anyway. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-27 18:26:08 +00:00

6738 Commits