freebsd-dev

Author	SHA1	Message	Date
Neel Natu	1c05219285	If a vcpu disables its local apic and then executes a 'HLT' then spin down the vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore pending interrupts in the IRR if interrupts have been disabled by the guest. The interrupt cannot be injected into the guest in any case so resuming it is futile. With this change "halt" from a Linux guest works correctly. Reviewed by: grehan@ Tested by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-12-07 22:18:36 +00:00
Neel Natu	7a3c80aa55	The 'protection' field in the VM exit collateral for the PAGING exit is not used - get rid of it.	2013-12-03 01:21:21 +00:00
Neel Natu	2282187475	Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function has outgrown its original name. Originally this function simply sent an IPI to the host cpu that a vcpu was executing on but now it does a lot more than just that. Reviewed by: grehan@	2013-12-03 00:43:31 +00:00
Konstantin Belousov	291bfc8d24	Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32 compilation results in inclusion of the header, a confict arises due to savefpu being union for i386, but used as struct in the pcb definition. The 32bit code should not need amd64 variant of the struct pcb anyway. For struct region_descriptor, use __uint64_t instead of unsigned long, as the base type for bit-fields. Unsigned long cannot have width 64 for -m32. The changes allowed to use sys/sysctl.h for cc -m32. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-26 19:38:42 +00:00
Neel Natu	08e3ff329a	Add HPET device emulation to bhyve. bhyve supports a single timer block with 8 timers. The timers are all 32-bit and capable of being operated in periodic mode. All timers support interrupt delivery using MSI. Timers 0 and 1 also support legacy interrupt routing. At the moment the timers are not connected to any ioapic pins but that will be addressed in a subsequent commit. This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).	2013-11-25 19:04:51 +00:00
Neel Natu	ac7304a758	Add an ioctl to assert and deassert an ioapic pin atomically. This will be used to inject edge triggered legacy interrupts into the guest. Start using the new API in device models that use edge triggered interrupts: viz. the 8254 timer and the LPC/uart device emulation. Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-23 03:56:03 +00:00
Andreas Tobler	d2ef321a59	Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the CONCAT macros in SYS.h. Reviewed by: bde, kib	2013-11-21 21:25:58 +00:00
Neel Natu	565bbb8698	Move the ioapic device model from userspace into vmm.ko. This is needed for upcoming in-kernel device emulations like the HPET. The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to manipulate the ioapic pin state. Discussed with: grehan@ Submitted by: Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)	2013-11-12 22:51:03 +00:00
Alan Cox	c70af4875e	As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other words, every architecture is now auto-sizing the kmem arena. This revision changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes mandatory and the definition of VM_KMEM_SIZE becomes optional. Replace or eliminate all existing definitions of VM_KMEM_SIZE. With auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for clarity. Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to that of setting the tunable vm.kmem_size. Whereas the macros VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set the size of the kmem arena at compile-time without that value being overridden by auto-sizing. Update the nearby comments to reflect the kmem submap being replaced by the kmem arena. Stop duplicating the auto-sizing formula in every machine- dependent vmparam.h and place it in kmeminit() where auto-sizing takes place. Reviewed by: kib (an earlier version) Sponsored by: EMC / Isilon Storage Division	2013-11-08 16:25:00 +00:00
Neel Natu	49cc03da31	Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose 'invpcid' instruction to the guest. Currently bhyve will try to enable this capability unconditionally if it is available. Consolidate code in bhyve to set the capabilities so it is no longer duplicated in BSP and AP bringup. Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid' instruction is available. Reviewed by: grehan MFC after: 3 days	2013-10-16 18:20:27 +00:00
Justin T. Gibbs	5fdd34ee20	Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id field. Perform vcpu enumeration for Xen PV and HVM environments and convert all Xen drivers to use vcpu_id instead of a hard coded assumption of the mapping algorithm (acpi or apic ID) in use. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) amd64/include/pcpu.h: i386/include/pcpu.h: Add vcpu_id to the amd64 and i386 pcpu structures. dev/xen/timer/timer.c x86/xen/xen_intr.c Use new vcpu_id instead of assuming acpi_id == vcpu_id. i386/xen/mp_machdep.c: i386/xen/mptable.c x86/xen/hvm.c: Perform Xen HVM and Xen full PV vcpu_id mapping. x86/xen/hvm.c: x86/acpica/madt.c Change SYSINIT ordering of acpi CPU enumeration so that it is guaranteed to be available at the time of Xen HVM vcpu id mapping.	2013-10-05 23:11:01 +00:00
Neel Natu	318224bbe6	Merge projects/bhyve_npt_pmap into head. Make the amd64/pmap code aware of nested page table mappings used by bhyve guests. This allows bhyve to associate each guest with its own vmspace and deal with nested page faults in the context of that vmspace. This also enables features like accessed/dirty bit tracking, swapping to disk and transparent superpage promotions of guest memory. Guest vmspace: Each bhyve guest has a unique vmspace to represent the physical memory allocated to the guest. Each memory segment allocated by the guest is mapped into the guest's address space via the 'vmspace->vm_map' and is backed by an object of type OBJT_DEFAULT. pmap types: The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT. The PT_X86 pmap type is used by the vmspace associated with the host kernel as well as user processes executing on the host. The PT_EPT pmap is used by the vmspace associated with a bhyve guest. Page Table Entries: The EPT page table entries as mostly similar in functionality to regular page table entries although there are some differences in terms of what bits are used to express that functionality. For e.g. the dirty bit is represented by bit 9 in the nested PTE as opposed to bit 6 in the regular x86 PTE. Therefore the bitmask representing the dirty bit is now computed at runtime based on the type of the pmap. Thus PG_M that was previously a macro now becomes a local variable that is initialized at runtime using 'pmap_modified_bit(pmap)'. An additional wrinkle associated with EPT mappings is that older Intel processors don't have hardware support for tracking accessed/dirty bits in the PTE. This means that the amd64/pmap code needs to emulate these bits to provide proper accounting to the VM subsystem. This is achieved by using the following mapping for EPT entries that need emulation of A/D bits: Bit Position Interpreted By PG_V 52 software (accessed bit emulation handler) PG_RW 53 software (dirty bit emulation handler) PG_A 0 hardware (aka EPT_PG_RD) PG_M 1 hardware (aka EPT_PG_WR) The idea to use the mapping listed above for A/D bit emulation came from Alan Cox (alc@). The final difference with respect to x86 PTEs is that some EPT implementations do not support superpage mappings. This is recorded in the 'pm_flags' field of the pmap. TLB invalidation: The amd64/pmap code has a number of ways to do invalidation of mappings that may be cached in the TLB: single page, multiple pages in a range or the entire TLB. All of these funnel into a single EPT invalidation routine called 'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and sends an IPI to the host cpus that are executing the guest's vcpus. On a subsequent entry into the guest it will detect that the EPT has changed and invalidate the mappings from the TLB. Guest memory access: Since the guest memory is no longer wired we need to hold the host physical page that backs the guest physical page before we can access it. The helper functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose. PCI passthru: Guest's with PCI passthru devices will wire the entire guest physical address space. The MMIO BAR associated with the passthru device is backed by a vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that have one or more PCI passthru devices attached to them. Limitations: There isn't a way to map a guest physical page without execute permissions. This is because the amd64/pmap code interprets the guest physical mappings as user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U shares the same bit position as EPT_PG_EXECUTE all guest mappings become automatically executable. Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews as well as their support and encouragement. Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing object for pci passthru mmio regions. Special thanks to Peter Holm for testing the patch on short notice. Approved by: re Discussed with: grehan Reviewed by: alc, kib Tested by: pho	2013-10-05 21:22:35 +00:00
Peter Grehan	36f23e3c20	Reorder/regroup the vmm ioctl api definitions to allow some semblance of API stability and growth during the 10.* timeframe. Userland/kernel bhyve will have to be recompiled after this. Reviewed by: neel Approved by: re@ (blanket)	2013-09-21 00:27:53 +00:00
Justin T. Gibbs	566a5f5020	Merge Xen PVHVM support into the GENERIC kernel config for both amd64 and i386. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: - Introduce two new CPU hooks for initialization and resume purposes. This allows us to get rid of the XENHVM ifdefs in mp_machdep, and also sets some hooks into common code that can be used by other hypervisor implementations. sys/amd64/conf/XENHVM: sys/i386/conf/XENHVM: - Remove these configs now that GENERIC has builtin support for Xen HVM. sys/kern/subr_smp.c: - Make sure there are no pending IPIs when suspending a system. sys/x86/xen/hvm.c: - Add cpu init and resume vectors that are called from mp_machdep using the new hooks. - Only clear the vcpu_info mapping data on resume. It is already clear for the BSP on a cold boot and is set correctly as APs are started. - Gate xen_hvm_init_cpu only to systems running under Xen. sys/x86/xen/xen_intr.c: - Gate the setup of event channels only to systems running under Xen.	2013-09-20 22:59:22 +00:00
Neel Natu	74d1d2b7cc	Merge the following changes from projects/bhyve_npt_pmap: - add fields to 'struct pmap' that are required to manage nested page tables. - add a parameter to 'vmspace_alloc()' that can be used to override the default pmap initialization routine 'pmap_pinit()'. These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap' in anticipation of the upcoming KBI freeze for 10.0. Reviewed by: kib@, alc@ Approved by: re (glebius)	2013-09-20 17:06:49 +00:00
Justin T. Gibbs	428b7ca290	Add support for suspend/resume/migration operations when running as a Xen PVHVM guest. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information. sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers. sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume. sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU. sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume. sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus. sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info. - Clear vcpu_info so it can be reset by APs when resuming from suspension. sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations. sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.	2013-09-20 05:06:03 +00:00
Peter Grehan	8d39ed16c2	Go way past 11 and bump bhyve's max vCPUs to 16. This should be sufficient for 10.0 and will do until forthcoming work to avoid limitations in this area is complete. Thanks to Bela Lubkin at tidalscale for the headsup on the apic/cpu id/io apic ASL parameters that are actually hex values and broke when written as decimal when 11 vCPUs were configured. Approved by: re@	2013-09-10 03:48:18 +00:00
Justin T. Gibbs	e44af46e4c	Implement PV IPIs for PVHVM guests and further converge PV and HVM IPI implmementations. Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Submitted by: gibbs (misc cleanup, table driven config) Reviewed by: gibbs MFC after: 2 weeks sys/amd64/include/cpufunc.h: sys/amd64/amd64/pmap.c: Move invltlb_globpcid() into cpufunc.h so that it can be used by the Xen HVM version of tlb shootdown IPI handlers. sys/x86/xen/xen_intr.c: sys/xen/xen_intr.h: Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(), and remove the ipi vector parameter. This api allocates an event channel port that can be used for ipi services, but knows nothing of the actual ipi for which that port will be used. Removing the unused argument and cleaning up the comments surrounding its declaration helps clarify its actual role. sys/amd64/amd64/mp_machdep.c: sys/amd64/include/cpu.h: sys/i386/i386/mp_machdep.c: sys/i386/include/cpu.h: Implement a generic framework for amd64 and i386 that allows the implementation of certain CPU management functions to be selected at runtime. Currently this is only used for the ipi send function, which we optimize for Xen when running on a Xen hypervisor, but can easily be expanded to support more operations. sys/x86/xen/hvm.c: Implement Xen PV IPI handlers and operations, replacing native send IPI. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: sys/i386/include/smp.h: Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS is defined already for us in the xen interface files. NR_IPIS is only needed in one file per Xen platform and is easily inferred by the IPI vector table that is defined in those files. sys/i386/xen/mp_machdep.c: Restructure to more closely match the HVM implementation by performing table driven IPI setup.	2013-09-06 22:17:02 +00:00
Gleb Smirnoff	e16477e8d9	On those machines, where sf_bufs do not represent any real object, make sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely empty functions. Reviewed by: alc, kib, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2013-09-06 05:37:49 +00:00
Konstantin Belousov	6aceaa3e17	Tidy up some loose ends in the PCID code: - Restore the pre-PCID TLB shootdown handlers for whole address space and single page invalidation asm code, and assign the IPI handler to them when PCID is not supported or disabled. Old handlers have linear control flow. But, still use the common return sequence. - Stop using pcpu for INVPCID descriptors in the invlrg handler. It is enough to allocate descriptors on the stack. As result, two SWAPGS instructions are shaved off from the code for Haswell+. - Fix the reverted condition in invlrng for checking of the PCID support [1], also in invlrng check that pmap is kernel pmap before performing other tests. For the kernel pmap, which provides global mappings, the INVLPG must be used for invalidation always. - Save the pre-computed pmap' %CR3 register in the struct pmap. This allows to remove several checks for pm_pcid validity when %CR3 is reloaded [2]. Noted by: gibbs [1] Discussed with: alc [2] Tested by: pho, flo Sponsored by: The FreeBSD Foundation	2013-09-04 23:31:29 +00:00
Konstantin Belousov	37eed8419c	Implement support for the process-context identifiers ('PCID') on Intel CPUs. The feature tags TLB entries with the Id of the address space and allows to avoid TLB invalidation on the context switch, it is available only in the long mode. In the microbenchmarks, using the PCID decreased latency of the context switches by ~30% on SandyBridge class desktop CPUs, measured with the lat_ctx program from lmbench. If available, use INVPCID instruction when a TLB entry in non-current address space needs to be invalidated. The instruction is typically available on the Haswell. If needed, the use of PCID can be turned off with the vm.pmap.pcid_enabled loader tunable set to 0. The state of the feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl vm.pmap.pcid_save_cnt reports the number of context switches which avoided invalidating the TLB; compare with the total number of context switches, available as sysctl vm.stats.sys.v_swtch. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:59:49 +00:00
Konstantin Belousov	5f5703ef52	Provide a wrapper for the INVPCID instruction, definition of the descriptor and symbolic names for the operation types. Sponsored by: The FreeBSD Foundation Reviewed by: alc Tested by: pho, bf	2013-08-30 07:42:38 +00:00
Justin T. Gibbs	76acc41fb7	Implement vector callback for PVHVM and unify event channel implementations Re-structure Xen HVM support so that: - Xen is detected and hypercalls can be performed very early in system startup. - Xen interrupt services are implemented using FreeBSD's native interrupt delivery infrastructure. - the Xen interrupt service implementation is shared between PV and HVM guests. - Xen interrupt handlers can optionally use a filter handler in order to avoid the overhead of dispatch to an interrupt thread. - interrupt load can be distributed among all available CPUs. - the overhead of accessing the emulated local and I/O apics on HVM is removed for event channel port events. - a similar optimization can eventually, and fairly easily, be used to optimize MSI. Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure, and misc Xen cleanups: Sponsored by: Spectra Logic Corporation Unification of PV & HVM interrupt infrastructure, bug fixes, and misc Xen cleanups: Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D sys/x86/x86/local_apic.c: sys/amd64/include/apicvar.h: sys/i386/include/apicvar.h: sys/amd64/amd64/apic_vector.S: sys/i386/i386/apic_vector.s: sys/amd64/amd64/machdep.c: sys/i386/i386/machdep.c: sys/i386/xen/exception.s: sys/x86/include/segments.h: Reserve IDT vector 0x93 for the Xen event channel upcall interrupt handler. On Hypervisors that support the direct vector callback feature, we can request that this vector be called directly by an injected HVM interrupt event, instead of a simulated PCI interrupt on the Xen platform PCI device. This avoids all of the overhead of dealing with the emulated I/O APIC and local APIC. It also means that the Hypervisor can inject these events on any CPU, allowing upcalls for different ports to be handled in parallel. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: Map Xen per-vcpu area during AP startup. sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: Increase the FreeBSD IRQ vector table to include space for event channel interrupt sources. sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: Remove Xen HVM per-cpu variable data. These fields are now allocated via the dynamic per-cpu scheme. See xen_intr.c for details. sys/amd64/include/xen/hypercall.h: sys/dev/xen/blkback/blkback.c: sys/i386/include/xen/xenvar.h: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/xen/gnttab.c: Prefer FreeBSD primatives to Linux ones in Xen support code. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: sys/dev/xen/balloon/balloon.c: sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/console/xencons_ring.c: sys/dev/xen/control/control.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/xenpci/xenpci.c: sys/i386/i386/machdep.c: sys/i386/include/pmap.h: sys/i386/include/xen/xenfunc.h: sys/i386/isa/npx.c: sys/i386/xen/clock.c: sys/i386/xen/mp_machdep.c: sys/i386/xen/mptable.c: sys/i386/xen/xen_clock_util.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/xen_rtc.c: sys/xen/evtchn/evtchn_dev.c: sys/xen/features.c: sys/xen/gnttab.c: sys/xen/gnttab.h: sys/xen/hvm.h: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_if.m: sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusvar.h: sys/xen/xenstore/xenstore.c: sys/xen/xenstore/xenstore_dev.c: sys/xen/xenstore/xenstorevar.h: Pull common Xen OS support functions/settings into xen/xen-os.h. sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: Remove constants, macros, and functions unused in FreeBSD's Xen support. sys/xen/xen-os.h: sys/i386/xen/xen_machdep.c: sys/x86/xen/hvm.c: Introduce new functions xen_domain(), xen_pv_domain(), and xen_hvm_domain(). These are used in favor of #ifdefs so that FreeBSD can dynamically detect and adapt to the presence of a hypervisor. The goal is to have an HVM optimized GENERIC, but more is necessary before this is possible. sys/amd64/amd64/machdep.c: sys/dev/xen/xenpci/xenpcivar.h: sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/sys/kernel.h: Refactor magic ioport, Hypercall table and Hypervisor shared information page setup, and move it to a dedicated HVM support module. HVM mode initialization is now triggered during the SI_SUB_HYPERVISOR phase of system startup. This currently occurs just after the kernel VM is fully setup which is just enough infrastructure to allow the hypercall table and shared info page to be properly mapped. sys/xen/hvm.h: sys/x86/xen/hvm.c: Add definitions and a method for configuring Hypervisor event delievery via a direct vector callback. sys/amd64/include/xen/xen-os.h: sys/x86/xen/hvm.c: sys/conf/files: sys/conf/files.amd64: sys/conf/files.i386: Adjust kernel build to reflect the refactoring of early Xen startup code and Xen interrupt services. sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: sys/dev/xen/control/control.c: sys/dev/xen/evtchn/evtchn_dev.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/xen/xenstore/xenstore.c: sys/xen/evtchn/evtchn_dev.c: sys/dev/xen/console/console.c: sys/dev/xen/console/xencons_ring.c Adjust drivers to use new xen_intr_*() API. sys/dev/xen/blkback/blkback.c: Since blkback defers all event handling to a taskqueue, convert this task queue to a "fast" taskqueue, and schedule it via an interrupt filter. This avoids an unnecessary ithread context switch. sys/xen/xenstore/xenstore.c: The xenstore driver is MPSAFE. Indicate as much when registering its interrupt handler. sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbusvar.h: Remove unused event channel APIs. sys/xen/evtchn.h: Remove all kernel Xen interrupt service API definitions from this file. It is now only used for structure and ioctl definitions related to the event channel userland device driver. Update the definitions in this file to match those from NetBSD. Implementing this interface will be necessary for Dom0 support. sys/xen/evtchn/evtchnvar.h: Add a header file for implemenation internal APIs related to managing event channels event delivery. This is used to allow, for example, the event channel userland device driver to access low-level routines that typical kernel consumers of event channel services should never access. sys/xen/interface/event_channel.h: sys/xen/xen_intr.h: Standardize on the evtchn_port_t type for referring to an event channel port id. In order to prevent low-level event channel APIs from leaking to kernel consumers who should not have access to this data, the type is defined twice: Once in the Xen provided event_channel.h, and again in xen/xen_intr.h. The double declaration is protected by __XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared twice within a given compilation unit. sys/xen/xen_intr.h: sys/xen/evtchn/evtchn.c: sys/x86/xen/xen_intr.c: sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/xenpci/xenpcivar.h: New implementation of Xen interrupt services. This is similar in many respects to the i386 PV implementation with the exception that events for bound to event channel ports (i.e. not IPI, virtual IRQ, or physical IRQ) are further optimized to avoid mask/unmask operations that aren't necessary for these edge triggered events. Stubs exist for supporting physical IRQ binding, but will need additional work before this implementation can be fully shared between PV and HVM. sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: sys/i386/xen/mp_machdep.c sys/x86/xen/hvm.c: Add support for placing vcpu_info into an arbritary memory page instead of using HYPERVISOR_shared_info->vcpu_info. This allows the creation of domains with more than 32 vcpus. sys/i386/i386/machdep.c: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/exception.s: Add support for new event channle implementation.	2013-08-29 19:52:18 +00:00
Jung-uk Kim	1533b9f714	Reimplement atomic operations on PDEs and PTEs in pmap.h. This change significantly reduces duplicate code and make it easier to read. Reviewed by: alc, bde	2013-08-21 22:40:29 +00:00
Jung-uk Kim	d36eb3f1c4	Remove empty lines before return statements for style consistency.	2013-08-21 22:05:58 +00:00
Jung-uk Kim	8a1ee2d346	Implement atomic_swap() and atomic_testandset(). Reviewed by: arch, bde, jilles, kib	2013-08-21 22:03:06 +00:00
Jung-uk Kim	da255e4c7f	- Remove the "a" constraint from main output operand for atomic_cmpset(). - Use "+" modifier for the "expect" because it is also an output (unused).	2013-08-21 21:30:06 +00:00
Jung-uk Kim	fe94be3da7	Use '+' modifier for a memory operand that is both an input and an output. It was actually done in r86301 but reverted in r150182 because GCC 3.x was not able to handle it for a memory operand. Apparently, this problem was fixed in GCC 4.1+ and several contrib sources already rely on this feature.	2013-08-21 21:14:16 +00:00
Jung-uk Kim	c1c84ce1bf	Remove bogus labels. No functional change.	2013-08-21 20:49:46 +00:00
Jung-uk Kim	ee93d1173a	Use consistent style. No functional change.	2013-08-21 20:43:50 +00:00
Neel Natu	0ef2ab3ab8	Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB. Bump up the KVA size proportionally from 512GB to 2TB. The number of page table pages used by the direct map is now calculated at run time based on 'Maxmem'. This means the small memory systems will not see any additional tax in terms of page table pages for the direct map. However all amd64 systems, regardless of the memory size, will use 3 more pages to accomodate the bump in the KVA size. More details available here: http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html Tested with the following configurations: - Sandybridge server with 64GB of memory. - bhyve VM with 64MB of memory. - bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling right up against the 4TB maximum memory limit. Discussed on: hackers@, current@ Submitted by: Chris Torek (torek@torek.net)	2013-08-17 19:49:08 +00:00
Jilles Tjoelker	0f3a4d8051	libc: Access _logname_valid more efficiently. The variable _logname_valid is not exported via the version script; therefore, change C and i386/amd64 assembler code to remove indirection (which allowed interposition). This makes the code slightly smaller and faster. Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC, there is no place containing the address of each variable, so there is no possible definition for PIC_GOT.	2013-08-17 19:24:58 +00:00
Jung-uk Kim	38da30b419	Merge acpica_machdep.h for amd64 and i386 and move to x86. In fact, these two files were functionally identical.	2013-08-13 22:05:10 +00:00
Jung-uk Kim	3bd12ca8f1	Tidy up global locks for ACPICA. There is no functional change.	2013-08-13 21:34:03 +00:00
Konstantin Belousov	872d995f76	Change the pmap_ts_referenced() method of amd64 pmap to use shared pvh_global_lock. This allows the method to be executed in parallel, avoiding undue contention on the pvh_global_lock for the multithreaded pagedaemon. The pmap_ts_referenced() function has to inspect the page mappings for several pmaps, which need to be locked while pv list lock is owned. This contradicts to the lock order, where pmap lock is before pv list lock. Introduce the generation count for the pv list of the page or superpage, which indicate any change in the pv list, and, as usual, perform restart of the iteration if generation changed while pv lock was dropped for blocking acquire of a pmap lock. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-07 16:33:15 +00:00
Andriy Gapon	a29cc9a34b	Revert r253748,253749 This WIP should not have been committed yet. Pointyhat to: avg	2013-07-28 18:44:17 +00:00
Andriy Gapon	366d8bfb7b	put contents of cpu.h under _KERNEL no userland-serviceable parts inside MFC after: 20 days	2013-07-28 18:32:27 +00:00
Andriy Gapon	a69e8d609e	x86: detect mwait capabilities and extensions, when present Reviewed by: kib (earlier amd64-only version) MFC after: 2 weeks	2013-07-28 17:54:42 +00:00
Konstantin Belousov	70a7dd5d5b	Fix issues with zeroing and fetching the counters, on x86 and ppc64. Issues were noted by Bruce Evans and are present on all architectures. On i386, a counter fetch should use atomic read of 64bit value, otherwise carry from the increment on other CPU could be lost for the given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not available on the machine, it cannot be SMP and it is enough to disable preemption around read to avoid the split read. On x86 the counter increment is not atomic on purpose, which makes it possible for the store of the incremented result to override just zeroed per-cpu slot. The effect would be a counter going off by arbitrary value after zeroing. Perform the counter zeroing on the same processor which does the increments, making the operations mutually exclusive. On i386, same as for the fetching, if the cmpxchg8b is not available, machine is not SMP and we disable preemption for zeroing. PowerPC64 is treated the same as amd64. For other architectures, the changes made to allow the compilation to succeed, without fixing the issues with zeroing or fetching. It should be possible to handle them by using the 64bit loads and stores atomic WRT preemption (assuming the architectures also converted from using critical sections to proper asm). If architecture does not provide the facility, using global (spin) mutex would be non-optimal but working solution. Noted by: bde Sponsored by: The FreeBSD Foundation	2013-07-01 02:48:27 +00:00
Jung-uk Kim	b1ddd13145	Move definitions required by userland applications out of acpica_machdep.h.	2013-06-27 00:22:40 +00:00
Konstantin Belousov	9dbb63fe03	Allow immediate operand. Sponsored by: The FreeBSD Foundation	2013-06-20 14:30:04 +00:00
Justin T. Gibbs	a8f6ac0573	Upgrade Xen interface headers to Xen 4.2.1. Move FreeBSD from interface version 0x00030204 to 0x00030208. Updates are required to our grant table implementation before we can bump this further. sys/xen/hvm.h: Replace the implementation of hvm_get_parameter(), formerly located in sys/xen/interface/hvm/params.h. Linux has a similar file which primarily stores this function. sys/xen/xenstore/xenstore.c: Include new xen/hvm.h header file to get hvm_get_parameter(). sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: Correctly protect function definition and variables from being included into assembly files in xen-os.h Xen memory barriers are now prefixed with "xen_" to avoid conflicts with OS native primatives. Define Xen memory barriers in terms of the native FreeBSD primatives. Sponsored by: Spectra Logic Corporation Reviewed by: Roger Pau Monné Tested by: Roger Pau Monné Obtained from: Roger Pau Monné (bug fixes)	2013-06-14 23:43:44 +00:00
Neel Natu	8f1664b724	Remove unused macros PTESHIFT, PDESHIFT, PDPESHIFT and PML4ESHIFT. Reviewed by: alc	2013-06-14 00:03:43 +00:00
Konstantin Belousov	4f493a25f4	Fix the hardware watchpoints on SMP amd64. Load the updated %dr registers also on other CPUs, besides the CPU which happens to execute the ddb. The debugging registers are stored in the pcpu area, together with the command which is executed by the IPI stop handler upon resume. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-05-21 11:24:32 +00:00
Marcel Moolenaar	cb34ed4434	Add basic support for FDT to i386 & amd64. This change includes: 1. Common headers for fdt.h and ofw_machdep.h under x86/include with indirections under i386/include and amd64/include. 2. New modinfo for loader provided FDT blob. 3. Common x86_init_fdt() called from hammer_time() on amd64 and init386() on i386. 4. Split-off FDT specific low-level console functions from FDT bus methods for the uart(4) driver. The low-level console logic has been moved to uart_cpu_fdt.c and is used for arm, mips & powerpc only. The FDT bus methods are shared across all architectures. 5. Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the fdt_pic_table[] arrays. Both are empty right now. FDT addresses are I/O ports on x86. Since the core FDT code does not handle different address spaces, adding support for both I/O ports and memory addresses requires some thought and discussion. It may be better to use a compile-time option that controls this. Obtained from: Juniper Networks, Inc.	2013-05-21 03:05:49 +00:00
Dmitry Chagin	d127f15308	Retire write-only PCB_GS32BIT pcb flag on amd64.	2013-05-09 21:42:43 +00:00
Attilio Rao	941646f5ec	Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in order to match the MAXCPU concept. The change should also be useful for consolidation and consistency. Sponsored by: EMC / Isilon storage division Obtained from: jeff Reviewed by: alc	2013-05-07 22:46:24 +00:00
Peter Grehan	d3c11f40a5	Add RIP-relative addressing to the instruction decoder. Rework the guest register fetch code to allow the RIP to be extracted from the VMCS while the kernel decoder is functioning. Hit by the OpenBSD local-apic code. Submitted by: neel Reviewed by: grehan Obtained from: NetApp	2013-04-25 04:56:43 +00:00
Gabor Kovesdan	a8b5c2a0aa	- Correct spelling in comments Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)	2013-04-17 11:56:11 +00:00
Neel Natu	d5408b1d26	If vmm.ko could not be initialized correctly then prevent the creation of virtual machines subsequently. Submitted by: Chris Torek	2013-04-12 01:16:52 +00:00
Neel Natu	1472b87f2f	Unsynchronized TSCs on the host require special handling in bhyve: - use clock_gettime(2) as the time base for the emulated ACPI timer instead of directly using rdtsc(). - don't advertise the invariant TSC capability to the guest to discourage it from using the TSC as its time base. Discussed with: jhb@ (about making 'smp_tsc' a global) Reported by: Dan Mack on freebsd-virtualization@ Obtained from: NetApp	2013-04-10 05:59:07 +00:00
Gleb Smirnoff	4e76af6a41	Merge from projects/counters: counter(9). Introduce counter(9) API, that implements fast and raceless counters, provided (but not limited to) for gathering of statistical data. See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html for more details. In collaboration with: kib Reviewed by: luigi Tested by: ae, ray Sponsored by: Nginx, Inc.	2013-04-08 19:40:53 +00:00
Gleb Smirnoff	17dece86fe	Merge from projects/counters: Pad struct pcpu so that its size is denominator of PAGE_SIZE. This is done to reduce memory waste in UMA_PCPU_ZONE zones. Sponsored by: Nginx, Inc.	2013-04-08 19:19:10 +00:00
Neel Natu	66f71b7d24	Allow caller to skip 'guest linear address' validation when doing instruction decode. This is to accomodate hardware assist implementations that do not provide the 'guest linear address' as part of nested page fault collateral. Submitted by: Anish Gupta (akgupt3 at gmail dot com)	2013-03-28 21:26:19 +00:00
Attilio Rao	774d251d99	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. Further technical notes broken into mealpieces can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)	2013-03-18 00:25:02 +00:00
Attilio Rao	b38d37f7b5	Merge from vmc-playground branch: Rename the pv_entry_t iterator from pv_list to pv_next. Besides being more correct technically (as the name seems to suggest this is a list while it is an iterator), it will also be needed by vm_radix work to avoid a nameclash on macro expansions. Sponsored by: EMC / Isilon storage division Reviewed by: alc, jeff Tested by: flo, pho, jhb, davide	2013-03-02 14:19:08 +00:00
Konstantin Belousov	31a53cd036	Convert machine/elf.h, machine/frame.h, machine/sigframe.h, machine/signal.h and machine/ucontext.h into common x86 includes, copying from amd64 and merging with i386. Kernel-only compat definitions are kept in the i386/include/sigframe.h and i386/include/signal.h, to reduce amd64 kernel namespace pollution. The amd64 compat uses its own definitions so far. The _MACHINE_ELF_WANT_32BIT definition is to allow the sys/boot/userboot/userboot/elf32_freebsd.c to use i386 ELF definitions on the amd64 compile host. The same hack could be usefully abused by other code too.	2013-02-20 17:39:52 +00:00
Neel Natu	485b3300cc	Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'. Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING) that called 'sched_bind()' on behalf of the user thread. The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn runs afoul of the assertion '(td_pinned == 0)' in userret(). Using the cpuset affinity to implement pinning of the vcpu threads works with both 4BSD and ULE schedulers and has the happy side-effect of getting rid of a bunch of code in vmm.ko. Discussed with: grehan	2013-02-11 20:36:07 +00:00
Neel Natu	6d62a48f47	Compute the number of initial kernel page table pages (NKPT) dynamically. This eliminates the need to recompile the kernel when the default value of NKPT is not big enough - for e.g. when loading large kernel modules or memory disk images from the loader. If NKPT is defined in the kernel configuration file then it overrides the dynamic calculation. Reviewed by: alc, kib	2013-02-06 04:53:00 +00:00
Andriy Gapon	548b201607	x86 suspend/resume: suspend pics and pseudo-pics in reverse order - change 'pics' from STAILQ to TAILQ - ensure that Local APIC is always first in 'pics' Reviewed by: jhb Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>, KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp> MFC after: 12 days	2013-02-02 12:02:42 +00:00
Neel Natu	8faceb3292	Add emulation support for instruction "88/r: mov r/m8, r8". This instruction moves a byte from a register to a memory location. Tested by: tycho nightingale at pluribusnetworks com	2013-01-30 04:09:09 +00:00
Neel Natu	912a3e678a	Add svn properties to the recently merged bhyve source files. The pre-commit hook will not allow any commits without the svn:keywords property in head.	2013-01-20 03:42:49 +00:00
Neel Natu	03429e45a7	Revert changes for x2apic support from projects/bhyve. During the early days of bhyve it did not support instruction emulation which necessitated the use of x2apic to access the local apic. This is no longer the case and the dependency on x2apic has gone away. The x2apic patches can be considered independently of bhyve and will be merged into head via projects/x2apic. Discussed with: grehan	2013-01-06 05:37:26 +00:00
Neel Natu	014a52f3a6	There is no need for 'start_emulating()' and 'stop_emulating()' to be defined in <machine/cpufunc.h> so remove them from there. Obtained from: NetApp	2013-01-04 02:49:12 +00:00
Neel Natu	682b847ede	Prefer x2apic mode when running inside a virtual machine. Provide a tunable 'machdep.x2apic_desired' to let the administrator override the default behavior. Provide a read-only sysctl 'machdep.x2apic' to let the administrator know whether the kernel is using x2apic or legacy mmio to access local apic. Tested with Parallels Desktop 8 and bhyve hypervisors. Also tested running on bare metal Intel Xeon E5-2658. Obtained from: NetApp Discussed with: jhb, attilio, avg, grehan	2012-12-16 00:57:14 +00:00
Neel Natu	48a29f4e07	Cleanup the user-space paging exit handler now that the unified instruction emulation is in place. Obtained from: NetApp	2012-11-28 13:34:44 +00:00
Neel Natu	ba9b7bf73a	Revamp the x86 instruction emulation in bhyve. On a nested page table fault the hypervisor will: - fetch the instruction using the guest %rip and %cr3 - decode the instruction in 'struct vie' - emulate the instruction in host kernel context for local apic accesses - any other type of mmio access is punted up to user-space (e.g. ioapic) The decoded instruction is passed as collateral to the user-space process that is handling the PAGING exit. The emulation code is fleshed out to include more addressing modes (e.g. SIB) and more types of operands (e.g. imm8). The source code is unified into a single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well as /usr/sbin/bhyve. Reviewed by: grehan Obtained from: NetApp	2012-11-28 00:02:17 +00:00
Neel Natu	a10c6f5544	IFC @ r242684	2012-11-11 03:26:14 +00:00
Konstantin Belousov	2773649d2f	Provide the reading and display of the Standard Extended Features, introduced with the IvyBridge CPUs. Provide the definitions for new bits in CR3 and CR4 registers. Tested by: avg, Michael Moll <kvedulv@kvedulv.de> MFC after: 2 weeks	2012-11-01 15:14:37 +00:00
Neel Natu	f352ff0ca8	Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner. Also add a stats counter to count the number of NMIs delivered per vcpu. Obtained from: NetApp	2012-10-24 02:54:21 +00:00
Peter Grehan	13ec93719a	Add the guest physical address and r/w/x bits to the paging exit in preparation for a rework of bhyve MMIO handling. Reviewed by: neel Obtained from: NetApp	2012-10-12 23:12:19 +00:00
Neel Natu	75dd336603	Provide per-vcpu locks instead of relying on a single big lock. This also gets rid of all the witness.watch warnings related to calling malloc(M_WAITOK) while holding a mutex. Reviewed by: grehan	2012-10-12 18:32:44 +00:00
Neel Natu	cdc5b9e7b1	Fix warnings generated by 'debug.witness.watch' during VM creation and destruction for calling malloc() with M_WAITOK while holding a mutex. Do not allow vmm.ko to be unloaded until all virtual machines are destroyed.	2012-10-11 19:39:54 +00:00
Attilio Rao	3a4730256a	Add an unified macro to deny ability from the compiler to reorder instruction loads/stores at its will. The macro __compiler_membar() is currently supported for both gcc and clang, but kernel compilation will fail otherwise. Reviewed by: bde, kib Discussed with: dim, theraven MFC after: 2 weeks	2012-10-09 14:32:30 +00:00
Attilio Rao	af2bdacafb	Reverts r234074,234105,234564,234723,234989,235231-235232 and part of r234247. Use, instead, the static intializer introduced in r239923 for x86 and sparc64 intr_cpus, unwinding the code to the initial version. Reviewed by: marius	2012-10-09 12:22:43 +00:00
Neel Natu	f7d51510f1	Change vm_malloc() to map pages in the guest physical address space in 4KB chunks. This breaks the assumption that the entire memory segment is contiguously allocated in the host physical address space. This also paves the way to satisfy the 4KB page allocations by requesting free pages from the VM subsystem as opposed to hard-partitioning host memory at boot time.	2012-10-04 02:27:14 +00:00
Neel Natu	bda273f21e	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. Rewrite vm_gpa2hpa() to get the GPA to HPA mapping by querying the nested page tables.	2012-10-03 00:46:30 +00:00
Neel Natu	341f19c949	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether or not the address range had already been allocated. Replace this instead with an explicit API 'vm_gpa_available()' that returns TRUE if a page is available for allocation in guest physical address space.	2012-09-29 01:15:45 +00:00
John Baldwin	960b5a7080	- Re-shuffle the <machine/pc/bios.h> headers to move all kernel-specific bits under #ifdef _KERNEL but leave definitions for various structures defined by standards ($PIR table, SMAP entries, etc.) available to userland. - Consolidate duplicate SMBIOS table structure definitions in ipmi(4) and smbios(4) in <machine/pc/bios.h> and make them available to userland. MFC after: 2 weeks	2012-09-28 11:59:32 +00:00
Neel Natu	e90273829b	Add ioctls to control the X2APIC capability exposed by the virtual machine to the guest. At the moment this simply sets the state in the 'vcpu' instance but there is no code that acts upon these settings.	2012-09-25 19:08:51 +00:00
Neel Natu	edf89256dd	Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an AP needs to be activated by spinning up an execution context for it. The local apic emulation is now completely done in the hypervisor and it will detect writes to the ICR_LO register that try to bring up the AP. In response to such writes it will return to userspace with an exit code of SPINUP_AP. Reviewed by: grehan	2012-09-25 02:33:25 +00:00
Neel Natu	98ed632c63	Stash the 'vm_exit' information in each 'struct vcpu'. There is no functional change at this time but this paves the way for vm exit handler functions to easily modify the exit reason going forward.	2012-09-24 19:32:24 +00:00
Dag-Erling Smørgrav	e2082935f0	As discussed on -current, remove the hardcoded default maxswzone. MFC after: 3 weeks	2012-08-14 17:01:21 +00:00
Konstantin Belousov	0220d04fe3	Add lfence(). MFC after: 1 week	2012-08-01 17:24:53 +00:00
Konstantin Belousov	59c1a8315c	Forcibly shut up clang warning about NULL pointer dereference. MFC after: 3 weeks	2012-07-23 19:16:31 +00:00
Konstantin Belousov	83b22b05e6	Introduce curpcb magic variable, similar to curthread, which is MD amd64. It is implemented as __pure2 inline with non-volatile asm read from pcpu, which allows a compiler to cache its results. Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb. Note that __curthread() uses magic value 0 as an offsetof(struct pcpu, pc_curthread). It seems to be done this way due to machine/pcpu.h needs to be processed before sys/pcpu.h, because machine/pcpu.h contributes machine-depended fields to the struct pcpu definition. As result, machine/pcpu.h cannot use struct pcpu yet. The __curpcb() also uses a magic constant instead of offsetof(struct pcpu, pc_curpcb) for the same reason. The constants are now defined as symbols and CTASSERTs are added to ensure that future KBI changes do not break the code. Requested and reviewed by: bde MFC after: 3 weeks	2012-07-19 19:09:12 +00:00
Konstantin Belousov	bc84db6267	On AMD64, provide siginfo.si_code for floating point errors when error occurs using the SSE math processor. Update comments describing the handling of the exception status bits in coprocessors control words. Remove GET_FPU_CW and GET_FPU_SW macros which were used only once. Prefer to use curpcb to access pcb_save over the longer path of referencing pcb through the thread structure. Based on the submission by: Ed Alley <wea llnl gov> PR: amd64/169927 Reviewed by: bde MFC after: 3 weeks	2012-07-18 15:43:47 +00:00
Konstantin Belousov	333d0c6060	Add support for the XSAVEOPT instruction use. Our XSAVE/XRSTOR usage mostly meets the guidelines set by the Intel SDM: 1. We use XRSTOR and XSAVE from the same CPL using the same linear address for the store area 2. Contrary to the recommendations, we cannot zero the FPU save area for a new thread, since fork semantic requires the copy of the previous state. This advice seemingly contradicts to the advice from the item 6. 3. We do use XSAVEOPT in the context switch code only, and the area for XSAVEOPT already always contains the data saved by XSAVE. 4. We do not modify the save area between XRSTOR, when the area is loaded into FPU context, and XSAVE. We always spit the fpu context into save area and start emulation when directly writing into FPU context. 5. We do not use segmented addressing to access save area, or rather, always address it using %ds basing. 6. XSAVEOPT can be only executed in the area which was previously loaded with XRSTOR, since context switch code checks for FPU use by outgoing thread before saving, and thread which stopped emulation forcibly get context loaded with XRSTOR. 7. The PCB cannot be paged out while FPU emulation is turned off, since stack of the executing thread is never swapped out. The context switch code is patched to issue XSAVEOPT instead of XSAVE if supported. This approach eliminates one conditional in the context switch code, which would be needed otherwise. For user-visible machine context to have proper data, fpugetregs() checks for unsaved extension blocks and manually copies pristine FPU state into them, according to the description provided by CPUID leaf 0xd. MFC after: 1 month	2012-07-14 15:48:30 +00:00
Peter Grehan	b652778e42	IFC @ r238370	2012-07-11 19:54:21 +00:00
John Baldwin	d706ec297a	Add a clts() wrapper around the 'clts' instruction to <machine/cpufunc.h> on x86 and use that to implement stop_emulating() in the fpu/npx code. Reimplement start_emulating() in the non-XEN case by using load_cr0() and rcr0() instead of the 'lmsw' and 'smsw' instructions. Intel explicitly discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in the description of these instructions in Volume 2 of the ADM. Reviewed by: kib MFC after: 1 month	2012-07-09 20:55:39 +00:00
John Baldwin	7574a595f2	Now that our assembler supports the xsave family of instructions, use them natively rather than hand-assembled versions. For xgetbv/xsetbv, add a wrapper API to deal with xcr* registers: rxcr() and load_xcr(). Reviewed by: kib MFC after: 1 month	2012-07-05 18:19:35 +00:00
Alan Cox	2bde6e3518	Optimize reserve_pv_entries() using the popcnt instruction.	2012-06-30 20:25:12 +00:00
Konstantin Belousov	aea810386d	Implement mechanism to export some kernel timekeeping data to usermode, using shared page. The structures and functions have vdso prefix, to indicate the intended location of the code in some future. The versioned per-algorithm data is exported in the format of struct vdso_timehands, which mostly repeats the content of in-kernel struct timehands. Usermode reading of the structure can be lockless. Compatibility export for 32bit processes on 64bit host is also provided. Kernel also provides usermode with indication about currently used timecounter, so that libc can fall back to syscall if configured timecounter is unknown to usermode code. The shared data updates are initiated both from the tc_windup(), where a fast task is queued to do the update, and from sysctl handlers which change timecounter. A manual override switch kern.timecounter.fast_gettime allows to turn off the mechanism. Only x86 architectures export the real algorithm data, and there, only for tsc timecounter. HPET counters page could be exported as well, but I prefer to not further glue the kernel and libc ABI there until proper vdso-based solution is developed. Minimal stubs neccessary for non-x86 architectures to still compile are provided. Discussed with: bde Reviewed by: jhb Tested by: flo MFC after: 1 month	2012-06-22 07:06:40 +00:00
Konstantin Belousov	232aa31fb9	Reserve AT_TIMEKEEP auxv entry for providing usermode the pointer to timekeeping information. MFC after: 1 week	2012-06-22 06:38:31 +00:00
Alan Cox	6031c68de4	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks	2012-06-16 18:56:19 +00:00
Jung-uk Kim	6ad799103d	- Remove unused code for CR3 and CR4. - Fix few style(9) nits while I am here.	2012-06-13 22:53:56 +00:00
Jung-uk Kim	acd7df97cc	- Fix resumectx() prototypes to reflect reality. - For i386, simply jump to resumectx() with PCB in %ecx. - Fix a style(9) nit while I am here.	2012-06-13 21:03:01 +00:00
Mitsuru IWASAKI	77c80e2e5b	Share IPI init and startup code of mp_machdep.c with acpi_wakeup.c as ipi_startup().	2012-06-12 00:14:54 +00:00
Mitsuru IWASAKI	fb864578af	Add x86/acpica/acpi_wakeup.c for amd64 and i386. Difference of suspend/resume procedures are minimized among them. common: - Add global cpuset suspended_cpus to indicate APs are suspended/resumed. - Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used). - Add some variables in acpi_wakecode.S in order to minimize the difference among amd64 and i386. - Disable load_cr3() because now CR3 is restored in resumectx(). amd64: - Add suspend/resume related members (such as MSR) in PCB. - Modify savectx() for above new PCB members. - Merge acpi_switch.S into cpu_switch.S as resumectx(). i386: - Merge(and remove) suspendctx() into savectx() in order to match with amd64 code. Reviewed by: attilio@, acpi@	2012-06-09 00:37:26 +00:00
Konstantin Belousov	fa9f322df9	Use plain store for atomic_store_rel on x86, instead of implicitly locked xchg instruction. IA32 memory model guarantees that store has release semantic, since stores cannot pass loads or stores. Reviewed by: bde, jhb Tested by: pho MFC after: 2 weeks	2012-06-02 18:10:16 +00:00
Bjoern A. Zeeb	920b965865	MFp4 bz_ipv6_fast: in_cksum.h required ip.h to be included for struct ip. To be able to use some general checksum functions like in_addword() in a non-IPv4 context, limit the (also exported to user space) IPv4 specific functions to the times, when the ip.h header is present and IPVERSION is defined (to 4). We should consider more general checksum (updating) functions to also allow easier incremental checksum updates in the L3/4 stack and firewalls, as well as ponder further requirements by certain NIC drivers needing slightly different pseudo values in offloading cases. Thinking in terms of a better "library". Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days	2012-05-24 22:00:48 +00:00
Alan Cox	246d751711	Rename pmap_collect() to pmap_pv_reclaim() and rewrite it such that it no longer uses the active and inactive paging queues. Instead, the pmap now maintains an LRU-ordered list of pv entry pages, and pmap_pv_reclaim() uses this list to select pv entries for reclamation. Note: The old pmap_collect() tried to avoid reclaiming mappings for pages that have either a hold_count or a busy field that is non-zero. However, this isn't necessary for correctness, and the locking in pmap_collect() was insufficient to guarantee that such mappings weren't reclaimed. The new pmap_pv_reclaim() doesn't even try. Reviewed by: kib MFC after: 6 weeks	2012-05-18 05:36:04 +00:00
Attilio Rao	b8be27bf29	Revert part of r234723 by re-enabling the SMP protection for intr_bind() on x86. This has been requested by jhb and I strongly disagree with this, but as long as he is the x86 and interrupt subsystem maintainer I will follow his directives. The disagreement cames from what we should really consider as a public KPI. IMHO, if we really need a selection between the kernel functions, we may need an explicit protection like _KERNEL_KPI, which defines which subset of the kernel function might really be considered as part of the KPI (for thirdy part modules) and which not. As long as we don't have this mechanism I just consider any possible function as usable by thirdy part code, thus intr_bind() included. MFC after: 1 week	2012-05-03 21:44:01 +00:00
Dimitry Andric	460378bf13	Add a convenience macro for the returns_twice attribute, and apply it to the prototypes of the appropriate functions (getcontext, savectx, setjmp, sigsetjmp and vfork). MFC after: 2 weeks	2012-04-29 11:04:31 +00:00
Peter Grehan	cd942e0f25	MSI-x interrupt support for PCI pass-thru devices. Includes instruction emulation for memory r/w access. This opens the door for io-apic, local apic, hpet timer, and legacy device emulation. Submitted by: ryan dot berryhill at sandvine dot com Reviewed by: grehan Obtained from: Sandvine	2012-04-28 16:28:00 +00:00
Robert Millan	74269bb439	Increase DFLDSIZ from 128 MiB to 32 GiB. On amd64 there's plenty of virtual memory available, so there is no need to be so conservative about it. Reviewed by: arch	2012-04-27 22:27:21 +00:00
Attilio Rao	70dbd1604c	Clean up the intr* MD KPI from the SMP dependency, removing a cause of discrepancy between modules and kernel, but deal with SMP differences within the functions themselves. As an added bonus this also helps in terms of code readability. Requested by: gibbs Reviewed by: jhb, marius MFC after: 1 week	2012-04-26 20:24:25 +00:00
Peter Grehan	38f1b189cd	IFC @ r234692 sys/amd64/include/cpufunc.h sys/amd64/include/fpu.h sys/amd64/amd64/fpu.c sys/amd64/vmm/vmm.c - Add API to allow vmm FPU state init/save/restore. FP stuff discussed with: kib	2012-04-26 07:52:28 +00:00
Andriy Gapon	f84633cdcc	bump INTRCNT_COUNT values to reflect actual numbers of IPI counters Maybe the numbers should be conditionalized on COUNT_IPIS Reviewed by: jhb MFC after: 1 week	2012-04-13 07:15:40 +00:00
John Baldwin	435803f3c7	Move the legacy(4) driver to x86.	2012-03-30 19:10:14 +00:00
John Baldwin	0d95597ca9	Use a more proper fix for enabling HT MSI mapping windows on Host-PCI bridges. Rather than blindly enabling the windows on all of them, only enable the window when an MSI interrupt is enabled for a device behind the bridge, similar to what already happens for HT PCI-PCI bridges. To implement this, each x86 Host-PCI bridge driver has to be able to locate it's actual backing device on bus 0. For ACPI, use the _ADR method to find the slot and function of the device. For the non-ACPI case, the legacy(4) driver already scans bus 0 looking for Host-PCI bridge devices. Now it saves the slot and function of each bridge that it finds as ivars that the Host-PCI bridge driver can then use in its pcib_map_msi() method. This fixes machines where non-MSI interrupts were broken by the previous round of HT MSI changes. Tested by: bapt MFC after: 1 week	2012-03-29 19:03:22 +00:00
John Baldwin	1f22be4547	- Rename VM_MEMATTR_UNCACHED to VM_MEMATTR_WEAK_UNCACHEABLE on x86 to be less ambiguous and more clearly identify what it means. This attribute is what Intel refers to as UC-, and it's only difference relative to normal UC memory is that a WC MTRR will override a UC- PAT entry causing the memory to be treated as WC, whereas a UC PAT entry will always override the MTRR. - Remove the VM_MEMATTR_UNCACHED alias from powerpc.	2012-03-29 16:51:22 +00:00
Fabien Thomas	f5f9340b98	Add software PMC support. New kernel events can be added at various location for sampling or counting. This will for example allow easy system profiling whatever the processor is with known tools like pmcstat(8). Simultaneous usage of software PMC and hardware PMC is possible, for example looking at the lock acquire failure, page fault while sampling on instructions. Sponsored by: NETASQ MFC after: 1 month	2012-03-28 20:58:30 +00:00
Alan Cox	5730afc9b6	Handle spurious page faults that may occur in no-fault sections of the kernel. When access restrictions are added to a page table entry, we flush the corresponding virtual address mapping from the TLB. In contrast, when access restrictions are removed from a page table entry, we do not flush the virtual address mapping from the TLB. This is exactly as recommended in AMD's documentation. In effect, when access restrictions are removed from a page table entry, AMD's MMUs will transparently refresh a stale TLB entry. In short, this saves us from having to perform potentially costly TLB flushes. In contrast, Intel's MMUs are allowed to generate a spurious page fault based upon the stale TLB entry. Usually, such spurious page faults are handled by vm_fault() without incident. However, when we are executing no-fault sections of the kernel, we are not allowed to execute vm_fault(). This change introduces special-case handling for spurious page faults that occur in no-fault sections of the kernel. In collaboration with: kib Tested by: gibbs (an earlier version) I would also like to acknowledge Hiroki Sato's assistance in diagnosing this problem. MFC after: 1 week	2012-03-22 04:52:51 +00:00
Tijl Coosemans	dfb1c11345	Copy amd64 sysarch.h to x86 and merge with i386 sysarch.h. Replace amd64/i386/pc98 sysarch.h with stubs.	2012-03-19 21:57:31 +00:00
Tijl Coosemans	2c7879ea84	Copy i386 specialreg.h to x86 and merge with amd64 specialreg.h. Replace amd64/i386/pc98 specialreg.h with stubs.	2012-03-19 21:34:11 +00:00
Tijl Coosemans	68156ad982	Copy i386 psl.h to x86 and replace amd64/i386/pc98 psl.h with stubs.	2012-03-19 21:29:57 +00:00
Tijl Coosemans	bcde3b9f67	Move userland bits (and some common kernel bits) from amd64 and i386 segments.h to a new x86 segments.h. Add __packed attribute to some structs (just to be sure). Also make it clear that i386 GDT and LDT entries are used in ia64 code.	2012-03-19 21:24:50 +00:00
Tijl Coosemans	01cd19680d	Copy i386 reg.h to x86 and merge with amd64 reg.h. Replace i386/amd64/pc98 reg.h with stubs. The tREGISTER macros are only made visible on i386. These macros are deprecated and should not be available on amd64. The i386 and amd64 versions of struct reg have been renamed to struct __reg32 and struct __reg64. During compilation either __reg32 or __reg64 is defined as reg depending on the machine architecture. On amd64 the i386 struct is also available as struct reg32 which is used in COMPAT_FREEBSD32 code. Most of compat/ia32/ia32_reg.h is now IA64 only. Reviewed by: kib (previous version)	2012-03-18 19:06:38 +00:00
Tijl Coosemans	23341c174c	Use exact width integer types in amd64/i386 reg.h to prepare for a merge. The only real change is replacing long with int on i386.	2012-03-18 18:44:42 +00:00
Tijl Coosemans	786645078b	Move userland bits of i386 npx.h and amd64 fpu.h to x86 fpu.h. Remove FPU types from compat/ia32/ia32_reg.h that are no longer needed. Create machine/npx.h on amd64 to allow compiling i386 code that uses this header. The original npx.h and fpu.h define struct envxmm differently. Both definitions have been included in the new x86 header as struct __envxmm32 and struct __envxmm64. During compilation either __envxmm32 or __envxmm64 is defined as envxmm depending on machine architecture. On amd64 the i386 struct is also available as struct envxmm32. Reviewed by: kib	2012-03-16 20:24:30 +00:00
Tijl Coosemans	d8a023328d	Copy amd64 ptrace.h to x86 and merge with i386 ptrace.h. Replace amd64/i386/pc98 ptrace.h with stubs. For amd64 PT_GETXSTATE and PT_SETXSTATE have been redefined to match the i386 values. The old values are still supported but should no longer be used. Reviewed by: kib	2012-03-04 20:24:28 +00:00
Tijl Coosemans	8b4a1ed0de	Copy amd64 trap.h to x86 and replace amd64/i386/pc98 trap.h with stubs.	2012-03-04 14:12:57 +00:00
Tijl Coosemans	ee0d5ab989	Copy amd64 float.h to x86 and merge with i386 float.h. Replace amd64/i386/pc98 float.h with stubs.	2012-03-04 14:00:32 +00:00
Tijl Coosemans	5b2a5decd1	Copy amd64 stdarg.h to x86 and replace amd64/i386/pc98 stdarg.h with stubs.	2012-02-28 22:30:58 +00:00
Tijl Coosemans	f85ac30a3d	Copy amd64 setjmp.h to x86 and replace amd64/i386/pc98 setjmp.h with stubs.	2012-02-28 22:17:52 +00:00
Tijl Coosemans	95b1d16df5	Copy amd64 endian.h to x86 and merge with i386 endian.h. Replace amd64/i386/pc98 endian.h with stubs. In __bswap64_const(x) the conflict between 0xffUL and 0xffULL has been resolved by reimplementing the macro in terms of __bswap32(x). As a side effect __bswap64_var(x) is now implemented using two bswap instructions on i386 and should be much faster. __bswap32_const(x) has been reimplemented in terms of __bswap16(x) for consistency.	2012-02-28 19:39:54 +00:00
Tijl Coosemans	8770e9db97	Copy amd64 _stdint.h to x86 and merge with i386 _stdint.h. Replace amd64/i386/pc98 _stdint.h with stubs.	2012-02-28 18:38:33 +00:00
Tijl Coosemans	8cfa93e4be	Copy amd64 _limits.h to x86 and merge with i386 _limits.h. Replace amd64/i386/pc98 _limits.h with stubs.	2012-02-28 18:24:28 +00:00
Tijl Coosemans	8f77be2b4c	Copy amd64 _types.h to x86 and merge with i386 _types.h. Replace existing amd64/i386/pc98 _types.h with stubs.	2012-02-28 18:15:28 +00:00
John Baldwin	c7e8722ca0	Resort the IDT_DTRACE_RET constant after it was changed to be less than IDT_SYSCALL.	2012-02-27 17:29:37 +00:00
John Baldwin	a61dd1bda9	Correct function prototype for read_rflags().	2012-02-27 17:28:47 +00:00
Konstantin Belousov	94818d19c3	Move xrstor/xsave/xsetbv into fpu.c and reorder them. Requested by: bde MFC after: 1 month	2012-01-30 07:53:33 +00:00
Konstantin Belousov	a045432a58	Synchronize the struct sigcontext definitions on x86 with mcontext_t. Pointed out by: bde MFC after: 1 month	2012-01-30 07:51:52 +00:00
Konstantin Belousov	5be9d54a2b	Order newly added functions alphabetically. Requested by: bde MFC after: 3 days	2012-01-25 12:43:27 +00:00
David Schultz	2ee7b1d4ae	Add C11 macros describing subnormal numbers to float.h. Reviewed by: bde	2012-01-23 06:36:41 +00:00
Konstantin Belousov	8c6f8f3d5b	Add support for the extended FPU states on amd64, both for native 64bit and 32bit ABIs. As a side-effect, it enables AVX on capable CPUs. In particular: - Query the CPU support for XSAVE, list of the supported extensions and the required size of FPU save area. The hw.use_xsave tunable is provided for disabling XSAVE, and hw.xsave_mask may be used to select the enabled extensions. - Remove the FPU save area from PCB and dynamically allocate the (run-time sized) user save area on the top of the kernel stack, right above the PCB. Reorganize the thread0 PCB initialization to postpone it after BSP is queried for save area size. - The dumppcb, stoppcbs and susppcbs now do not carry the FPU state as well. FPU state is only useful for suspend, where it is saved in dynamically allocated suspfpusave area. - Use XSAVE and XRSTOR to save/restore FPU state, if supported and enabled. - Define new mcontext_t flag _MC_HASFPXSTATE, indicating that mcontext_t has a valid pointer to out-of-struct extended FPU state. Signal handlers are supplied with stack-allocated fpu state. The sigreturn(2) and setcontext(2) syscall honour the flag, allowing the signal handlers to inspect and manipilate extended state in the interrupted context. - The getcontext(2) never returns extended state, since there is no place in the fixed-sized mcontext_t to place variable-sized save area. And, since mcontext_t is embedded into ucontext_t, makes it impossible to fix in a reasonable way. Instead of extending getcontext(2) syscall, provide a sysarch(2) facility to query extended FPU state. - Add ptrace(2) support for getting and setting extended state; while there, implement missed PT_I386_{GET,SET}XMMREGS for 32bit binaries. - Change fpu_kern KPI to not expose struct fpu_kern_ctx layout to consumers, making it opaque. Internally, struct fpu_kern_ctx now contains a space for the extended state. Convert in-kernel consumers of fpu_kern KPI both on i386 and amd64. First version of the support for AVX was submitted by Tim Bird <tim.bird am sony com> on behalf of Sony. This version was written from scratch. Tested by: pho (previous version), Yamagi Burmeister <lists yamagi org> MFC after: 1 month	2012-01-21 17:45:27 +00:00
Konstantin Belousov	6db9cf559f	Add definitions for the FPU extended state header, legacy extended state and AVX state. MFC after: 1 week	2012-01-17 17:07:13 +00:00
Konstantin Belousov	e568229f50	Modernize the fpusave structures definitions by using uint*_t types. MFC after: 1 week	2012-01-17 16:53:41 +00:00
Konstantin Belousov	dd4f5d2437	Implement xsetbv(), xsave() and xrstor() providing C access to the similarly named CPU instructions. Since our in-tree binutils gas is not aware of the instructions, and I have to use the byte-sequence to encode them, hardcode the r/m operand as (%rdi). This way, first argument of the pseudo-function is already placed into proper register. MFC after: 1 week	2012-01-17 07:30:36 +00:00
Konstantin Belousov	79937651ef	Add definitions related to XCR0. MFC after: 1 week	2012-01-17 07:23:43 +00:00
Konstantin Belousov	5ba2a4998c	Add macro IS_BSP() to check whether the current CPU is BSP. MFC after: 1 week	2012-01-17 07:21:23 +00:00
Sean Bruno	80dbff4e99	IFC to head to catch up the bhyve branch Approved by: grehan@	2012-01-04 02:01:27 +00:00
Ed Schouten	53627e400f	Replace __signed by signed. The signed keyword is an integral part of the C syntax. There's no need to use __signed.	2011-12-13 13:38:03 +00:00
Peter Grehan	3ee1a36e2e	IFC @ r227804 Pull in the virtio drivers from head.	2011-11-22 02:27:59 +00:00
David Chisnall	38d1ac34ff	Fix SIGATOMIC_M{IN,AX} on x86-64. These are meant to be the minimum values that are allowed in a sig_atomic_t, but it looks like they were just copied from the x86 versions, so these definitions violate the C and C++ specs. Mismatch was spotted by the libc++ test suite. Approved by: dim (mentor)	2011-11-12 20:16:06 +00:00
Konstantin Belousov	e9862e9b9e	Attempt to improve formatting and content of several comments for amd64 and i386 MD code. Based on suggestions by: bde MFC after: 1 week	2011-11-09 18:25:50 +00:00
Ryan Stone	166808c625	Fix the DTrace pid return trap interrupt vector. Previously we were using 31, but that vector is reserved. Without this fix, running dtrace -p <pid> would either cause the target process to crash or the kernel to page fault. Obtained from: rpaulo MFC after: 3days	2011-11-07 01:53:25 +00:00
Peter Grehan	70d8f36aa4	IFC @ r226824	2011-10-27 04:56:53 +00:00
David Schultz	a50079b7ff	People porting FreeBSD to new architectures ought not have to implement a deprecated FPU control interface in addition to the standard one. To make this clearer, further deprecate ieeefp.h by not declaring the function prototypes except on architectures that implement them already. Currently i386 and amd64 implement the ieeefp.h interface for compatibility, and for fp[gs]etprec(), which doesn't exist on most other hardware. Powerpc, sparc64, and ia64 partially implement it and probably shouldn't, and other architectures don't implement it at all.	2011-10-21 06:41:46 +00:00
Konstantin Belousov	6bfe4c78c8	Remove unused define. MFC after: 1 month	2011-10-07 16:09:44 +00:00
Peter Grehan	fab4c373af	IFC @ r225592 sys/dev/bvm/bvm_console.c - move up to the new alt-break order.	2011-09-15 22:14:35 +00:00
Konstantin Belousov	20aee906b4	Put amd64_syscall() prototype in md_var.h. Requested by: jhb Reviewed by: alc, jhb Approved by: re (bz) MFC after: 2 weeks	2011-09-15 09:54:07 +00:00
Attilio Rao	786ef92b7b	Bump MAXCPU for amd64, ia64 and XLP mips appropriately. From now on, default values for FreeBSD will be 64 maxiumum supported CPUs on amd64 and ia64 and 128 for XLP. All the other architectures seem already capped appropriately (with the exception of sparc64 which needs further support on jalapeno flavour). Bump __FreeBSD_version in order to reflect KBI/KPI brekage introduced during the infrastructure cleanup for supporting MAXCPU > 32. This covers cpumask_t retiral too. The switch is considered completed at the present time, so for whatever bug you may experience that is reconducible to that area, please report immediately. Requested by: marcel, jchandra Tested by: pluknet, sbruno Approved by: re (kib)	2011-07-19 13:00:30 +00:00
Attilio Rao	68b739cd6f	Add the possibility to specify from kernel configs MAXCPU value. This patch is going to help in cases like mips flavours where you want a more granular support on MAXCPU. No MFC is previewed for this patch. Tested by: pluknet Approved by: re (kib)	2011-07-19 00:37:24 +00:00
Peter Grehan	bd2228ab3e	IFC @ r224187	2011-07-18 22:00:21 +00:00
Jung-uk Kim	f0b28f005e	Correct cpu_monitor() and cpu_mwait() for amd64. These instructions take %rcx as "extensions" in long mode. If any unused bit is set in %rcx, these instructions cause general protection fault. Fix style nits and synchronize i386 with amd64.	2011-07-05 18:42:10 +00:00
Peter Grehan	23300944db	IFC @ r223696 to pick up dfr's userboot	2011-06-30 17:37:42 +00:00
Peter Grehan	a5615c9044	IFC @ r222830	2011-06-28 06:26:03 +00:00
John Baldwin	1368987ae4	Move {amd64,i386}/pci/pci_bus.c and {amd64,i386}/include/pci_cfgreg.h to the x86 tree. The $PIR code is still only enabled on i386 and not amd64. While here, make the qpi(4) driver on conditional on 'device pci'.	2011-06-22 21:04:13 +00:00
John Baldwin	e8f40e32eb	Oops, missed these in 223424. Reported by: jkim	2011-06-22 18:48:07 +00:00
Andriy Gapon	234dab4a82	remove code for dynamic offlining/onlining of CPUs on x86 The code has definitely been broken for SCHED_ULE, which is a default scheduler. It may have been broken for SCHED_4BSD in more subtle ways, e.g. with manually configured CPU affinities and for interrupt devilery purposes. We still provide a way to disable individual CPUs or all hyperthreading "twin" CPUs before SMP startup. See the UPDATING entry for details. Interaction between building CPU topology and disabling CPUs still remains fuzzy: topology is first built using all availble CPUs and then the disabled CPUs should be "subtracted" from it. That doesn't work well if the resulting topology becomes non-uniform. This work is done in cooperation with Attilio Rao who in addition to reviewing also provided parts of code. PR: kern/145385 Discussed with: gcooper, ambrisko, mdf, sbruno Reviewed by: attilio Tested by: pho, pluknet X-MFC after: never	2011-06-08 08:12:15 +00:00
Attilio Rao	74e4245e3f	Bring back the number of CPU to 32.	2011-06-07 08:05:23 +00:00
Peter Grehan	87c3644c64	IFC @ r222256	2011-05-24 15:39:34 +00:00
Attilio Rao	5f6b159db7	MFC	2011-05-18 16:01:29 +00:00
Jung-uk Kim	2b052e43be	Update CPUID bits to reflect AMD Bulldozer and Intel Sandy Bridge features. Note AMD dropped SSE5 extensions in order to avoid ISA overlap with Intel AVX instructions. The SSE5 bit was recycled as XOP extended instruction bit, CVT16 was deprecated in favor of F16C (half-precision float conversion instructions for AVX), and the remaining FMA4 (4-operand FMA instructions) gained a separate CPUID bit. Replace non-existent references with today's CPUID specifications.	2011-05-17 22:36:16 +00:00
John Baldwin	34a6b2d627	First cut at porting the kernel portions of 221828 and 221905 from the BHyVe reference branch to HEAD.	2011-05-14 20:35:01 +00:00
Peter Grehan	6c4c7d0f96	bhyve import part 2 of 2, guest kernel changes. This branch is now considered frozen: future bhyve development will take place in a branch off -CURRENT. sys/dev/bvm/bvm_console.c sys/dev/bvm/bvm_dbg.c - simple console driver/gdb debug port used for bringup. supported by user-space bhyve executable sys/conf/options.amd64 sys/amd64/amd64/minidump_machdep.c - allow NKPT to be set in the kernel config file sys/amd64/conf/GENERIC - mptable config options; bhyve user-space executable creates an mptable with number of CPUs, and optional vendor extension - add bvm console/debug - set NKPT to 512 to allow loading of large RAM disks from the loader - include kdb/gdb sys/amd64/amd64/local_apic.c sys/amd64/amd64/apic_vector.S sys/amd64/include/specialreg.h - if x2apic mode available, use MSRs to access the local APIC, otherwise fall back to 'classic' MMIO mode sys/amd64/amd64/mp_machdep.c - support AP spinup on CPU models that don't have real-mode support by overwriting the real-mode page with a message that supplies the bhyve user-space executable with enough information to start the AP directly in 64-bit mode. sys/amd64/amd64/vm_machdep.c - insert pause statements into cpu shutdown busy-wait loops sys/dev/blackhole/blackhole.c sys/modules/blackhole/Makefile - boot-time loadable module that claims all PCI bus/slot/funcs specified in an env var that are to be used for PCI passthrough sys/amd64/amd64/intr_machdep.c - allow round-robin assignment of device interrupts to CPUs to be disabled from the loader sys/amd64/include/bus.h - convert string ins/outs instructions to loops of individual in/out since bhyve doesn't support these yet sys/kern/subr_bus.c - if the device was no created with a fixed devclass, then remove it's association with the devclass it was associated with during probe. Otherwise, new drivers do not get a chance to probe/attach since the device will stay married to the first driver that it probed successfully but failed to attach. Sponsored by: NetApp, Inc.	2011-05-14 18:37:24 +00:00
Attilio Rao	b2aa562e7b	MFC	2011-05-13 20:58:48 +00:00
Matthew D Fleming	cfb00e5aa7	Move the ZERO_REGION_SIZE to a machine-dependent file, as on many architectures (i386, for example) the virtual memory space may be constrained enough that 2MB is a large chunk. Use 64K for arches other than amd64 and ia64, with special handling for sparc64 due to differing hardware. Also commit the comment changes to kmem_init_zero_region() that I missed due to not saving the file. (Darn the unfamiliar development environment). Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you see fit. Requested by: alc MFC after: 1 week MFC with: r221853	2011-05-13 19:35:01 +00:00
Peter Grehan	366f60834f	Import of bhyve hypervisor and utilities, part 1. vmm.ko - kernel module for VT-x, VT-d and hypervisor control bhyve - user-space sequencer and i/o emulation vmmctl - dump of hypervisor register state libvmm - front-end to vmm.ko chardev interface bhyve was designed and implemented by Neel Natu. Thanks to the following folk from NetApp who helped to make this available: Joe CaraDonna Peter Snyder Jeff Heller Sandeep Mann Steve Miller Brian Pawlowski	2011-05-13 04:54:01 +00:00
Attilio Rao	bd55ede060	MFC	2011-05-09 18:53:13 +00:00
Jung-uk Kim	65e7d70b09	Implement boot-time TSC synchronization test for SMP. This test is executed when the user has indicated that the system has synchronized TSCs or it has P-state invariant TSCs. For the former case, we may clear the tunable if it fails the test to prevent accidental foot-shooting. For the latter case, we may set it if it passes the test to notify the user that it may be usable.	2011-05-09 17:34:00 +00:00
Attilio Rao	aa8b9e0706	MFC	2011-05-06 22:45:33 +00:00
Andriy Gapon	fdf30d59a6	prepare code that does topology detection for amd cpus for bulldozer This also introduces a new detection path for family 10h and newer pre-bulldozer cpus, pre-10h hardware should not be affected. Tested by: Gary Jennejohn <gljennjohn@googlemail.com> (with pre-10h hardware) MFC after: 2 weeks	2011-05-06 13:51:54 +00:00
Attilio Rao	71a19bdc64	Commit the support for removing cpumask_t and replacing it directly with cpuset_t objects. That is going to offer the underlying support for a simple bump of MAXCPU and then support for number of cpus > 32 (as it is today). Right now, cpumask_t is an int, 32 bits on all our supported architecture. cpumask_t on the other side is implemented as an array of longs, and easilly extendible by definition. The architectures touched by this commit are the following: - amd64 - i386 - pc98 - arm - ia64 - XEN while the others are still missing. Userland is believed to be fully converted with the changes contained here. Some technical notes: - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future) - per-cpu members, which are now converted to cpuset_t, needs to be accessed avoiding migration, because the size of cpuset_t should be considered unknown - size of cpuset_t objects is different from kernel and userland (this is primirally done in order to leave some more space in userland to cope with KBI extensions). If you need to access kernel cpuset_t from the userland please refer to example in this patch on how to do that correctly (kgdb may be a good source, for example). - Support for other architectures is going to be added soon - Only MAXCPU for amd64 is bumped now The patch has been tested by sbruno and Nicholas Esborn on opteron 4 x 12 pack CPUs. More testing on big SMP is expected to came soon. pluknet tested the patch with his 8-ways on both amd64 and i386. Tested by: pluknet, sbruno, gianni, Nicholas Esborn Reviewed by: jeff, jhb, sbruno	2011-05-05 14:39:14 +00:00
Attilio Rao	8c0ef2464e	Revert md_assert_preempt() introduction. Discussed with: jeff, jhb	2011-05-04 20:29:40 +00:00
Attilio Rao	f1edea81ac	Add the function md_assert_nopreempt(), which is a very consistent function on the possibility of a thread to not preempt. As this function is very tied to x86 (interrupts disabled checkings) it is not intended to be used in MI code.	2011-04-30 23:12:37 +00:00
Jung-uk Kim	c34e9dbee1	Define "Hypervisor Present" bit. This bit is used by several hypervisors to identify CPUs running under emulation. Currently QEMU-KVM, Xen-HVM, VMware, and MS Hyper-V are known to set this bit. MFC after: 3 days	2011-04-28 22:23:39 +00:00
Konstantin Belousov	3136faa59d	Make pmap_invalidate_cache_range() available for consumption on amd64. Add pmap_invalidate_cache_pages() method on x86. It flushes the CPU cache for the set of pages, which are not neccessary mapped. Since its supposed use is to prepare the move of the pages ownership to a device that does not snoop all CPU accesses to the main memory (read GPU in GMCH), do not rely on CPU self-snoop feature. amd64 implementation takes advantage of the direct map. On i386, extract the helper pmap_flush_page() from pmap_page_set_memattr(), and use it to make a temporary mapping of the flushed page. Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2011-04-18 21:24:42 +00:00
Jung-uk Kim	0e72764232	Add a function rdtsc32() to read lower 32 bits from TSC and discard upper 32 bits. Some times compiler inserts unnecessary instructions to preserve unused upper 32 bits even when it is casted to a 32-bit value. It reduces such compiler mistakes where every cycle counts.	2011-04-14 16:53:32 +00:00
Jung-uk Kim	4854ae249c	Consistently use __volatile as the rest of this file.	2011-04-14 16:19:41 +00:00
Jung-uk Kim	f5ac47f44c	Prefer C99 standard integers to reduce diff from i386 version.	2011-04-14 16:14:35 +00:00
Jung-uk Kim	dd3e254ebd	Add forgotten declarations for tsc_perf_stat from the previous commit.	2011-04-12 22:22:01 +00:00
Jung-uk Kim	3731174954	Add definitions for CPUID instruction 6, ECX information.	2011-04-12 22:12:23 +00:00
Alan Cox	63740078e0	Amd64 doesn't have a lazypmap ipi.	2011-03-27 16:18:51 +00:00
Alan Cox	1587dfd730	Move an external declaration to the appropriate header file.	2011-03-26 06:21:05 +00:00
Jeff Roberson	e4cd31dd3c	- Merge changes to the base system to support OFED. These include a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND, and other miscellaneous small features.	2011-03-21 09:40:01 +00:00
Jung-uk Kim	38b8542ca9	Deprecate tsc_present as the last of its real consumers finally disappeared.	2011-03-15 17:19:52 +00:00
Jung-uk Kim	bc34c87e81	Deprecate rarely used tsc_is_broken. Instead, we zero out tsc_freq because it is almost always used with tsc_freq any way.	2011-03-10 20:02:58 +00:00
Alan Cox	e6ffa21488	Remove pmap fields that are either unused or not fully implemented. Discussed with: kib	2011-02-17 15:36:29 +00:00
Dmitry Chagin	dc4f0a9e11	To avoid excessive code duplication create wrapper for fill regs from stack frame. Change the trap() code to use newly created function instead of explicit regs assignment.	2011-02-16 17:50:21 +00:00
Jung-uk Kim	2fea643112	Add reader/writer lock around mem_range_attr_get() and mem_range_attr_set(). Compile sys/dev/mem/memutil.c for all supported platforms and remove now unnecessary dev_mem_md_init(). Consistently define mem_range_softc from mem.c for all platforms. Add missing #include guards for machine/memdev.h and sys/memrange.h. Clean up some nearby style(9) nits. MFC after: 1 month	2011-01-17 22:58:28 +00:00
Konstantin Belousov	50a57dfbec	Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h. Update the outdated comments describing MAXSLP and the process selection algorithm for swap out. Comments wording and reviewed by: alc	2011-01-09 12:50:44 +00:00
Tijl Coosemans	d22e78d6b9	Copy powerpc/include/_inttypes.h to x86 and replace i386/amd64/pc98 headers with stubs. Approved by: kib (mentor)	2011-01-08 18:09:48 +00:00
Konstantin Belousov	6297a3d843	Create shared (readonly) page. Each ABI may specify the use of page by setting SV_SHP flag and providing pointer to the vm object and mapping address. Provide simple allocator to carve space in the page, tailored to put the code with alignment restrictions. Enable shared page use for amd64, both native and 32bit FreeBSD binaries. Page is private mapped at the top of the user address space, moving a start of the stack one page down. Move signal trampoline code from the top of the stack to the shared page. Reviewed by: alc	2011-01-08 16:13:44 +00:00
Tijl Coosemans	a56e818f29	On mixed 32/64 bit architectures (mips, powerpc) use __LP64__ rather than architecture macros (__mips_n64, __powerpc64__) when 64 bit types (and corresponding macros) are different from 32 bit. [1] Correct the type of INT64_MIN, INT64_MAX and UINT64_MAX. Define (U)INTMAX_C as an alias for (U)INT64_C matching the type definition for (u)intmax_t. Do this on all architectures for consistency. Suggested by: bde [1] Approved by: kib (mentor)	2011-01-08 12:43:05 +00:00
Tijl Coosemans	9858863cd4	Fix types of some values in machine/_limits.h. On some architectures UCHAR_MAX and USHRT_MAX had type unsigned int. However, lacking integer suffixes for types smaller than int, their type should correspond to that of an object of type unsigned char (or short) when used in an expression with objects of type int. In that case unsigned char (short) are promoted to int (i.e. signed) so the type of UCHAR_MAX and USHRT_MAX should also be int. Where MIN/MAX constants implicitly have the correct type the suffix has been removed. While here, correct some comments. Reviewed by: bde Approved by: kib (mentor)	2011-01-08 11:13:34 +00:00
Konstantin Belousov	39198f15ee	Add AT_STACKPROT elf aux vector. Will be used to inform rtld about the initial stack protection set by the kernel image activator.	2011-01-07 14:22:34 +00:00
Jung-uk Kim	50e3cec377	Increase size of pcb_flags to four bytes. Requested by: bde, jhb	2010-12-22 19:57:03 +00:00
Jung-uk Kim	e6c006d96a	Improve PCB flags handling and make it more robust. Add two new functions for manipulating pcb_flags. These inline functions are very similar to atomic_set_char(9) and atomic_clear_char(9) but without unnecessary LOCK prefix for SMP. Add comments about the rationale[1]. Use these functions wherever possible. Although there are some places where it is not strictly necessary (e.g., a PCB is copied to create a new PCB), it is done across the board for sake of consistency. Turn pcb_full_iret into a PCB flag as it is safe now. Move rarely used fields before pcb_flags and reduce size of pcb_flags to one byte. Fix some style(9) nits in pcb.h while I am in the neighborhood. Reviewed by: kib Submitted by: kib[1] MFC after: 2 months	2010-12-22 00:18:42 +00:00
Tijl Coosemans	81bd5041a2	Merge amd64 and i386 bus.h and move the resulting header to x86. Replace the original amd64 and i386 headers with stubs. Rename (AMD64\|I386)_BUS_SPACE_* to X86_BUS_SPACE_* everywhere. Reviewed by: imp (previous version), jhb Approved by: kib (mentor)	2010-12-20 16:39:43 +00:00
Konstantin Belousov	7222d2fbee	Inform a compiler which asm statements in the x86 implementation of atomics change eflags. Reviewed by: jhb MFC after: 2 weeks	2010-12-18 16:41:11 +00:00
Jung-uk Kim	dd7d207dcb	Merge sys/amd64/amd64/tsc.c and sys/i386/i386/tsc.c and move to sys/x86/x86. Discussed with: avg	2010-12-08 00:09:24 +00:00
Konstantin Belousov	0f0170e66a	Retire write-only PCB_FULLCTX pcb flag on amd64. Reminded by: Petr Salinger <Petr.Salinger seznam cz> Tested by: pho MFC after: 1 week	2010-12-07 12:17:43 +00:00
Rebecca Cran	c90f7d9b44	Revert r216134. This checkin broke platforms where bus_space are macros: they need to be a single statement, and do { } while (0) doesn't work in this situation so revert until a solution can be devised.	2010-12-03 07:09:23 +00:00
Rebecca Cran	15b4888a24	Disallow passing in a count of zero bytes to the bus_space(9) functions. Passing a count of zero on i386 and amd64 for [I386\|AMD64]_BUS_SPACE_MEM causes a crash/hang since the 'loop' instruction decrements the counter before checking if it's zero. PR: kern/80980 Discussed with: jhb	2010-12-02 22:19:30 +00:00
Alan Cox	686b00d691	Make the size of the direct map easily configurable. Changing NDMPML4E now suffices. Increase the size of the direct map to 1TB. An earler version of this patch was tested by sbruno@.	2010-11-26 19:36:26 +00:00
Konstantin Belousov	5c6eb03790	Remove npxgetregs(), npxsetregs(), fpugetregs() and fpusetregs() functions, they are unused. Remove 'user' from npxgetuserregs() etc. names. For {npx,fpu}{get,set}regs(), always use pcb->pcb_user_save for FPU context storage. This eliminates the need for ugly copying with overwrite of the newly added and reserved fields in ucontext on i386 to satisfy alignment requirements for fpusave() and fpurstor(). pc98 version was copied from i386. Suggested and reviewed by: bde Tested by: pho (i386 and amd64) MFC after: 1 week	2010-11-26 14:50:42 +00:00
Tijl Coosemans	ce4ec51dbe	Merge amd64/i386 _align.h by aligning on the size of register_t (copied from powerpc). Reviewed by: imp, jhb Approved by: kib (mentor)	2010-11-26 10:59:20 +00:00
Andriy Gapon	9b984feb3d	specialreg.h: add definitions for some useful bits found in CPUID.6 EAX and ECX CPUID.6 is defined as Thermal and Power Management Leaf by both Intel and AMD. Reviewed by: jhb MFC after: 7 days	2010-11-23 13:55:30 +00:00
Andriy Gapon	b43d292565	specialreg.h: add definitions for MPERF/APERF pair of MSRs These MSRs can be used to determine actual (average) performance as compared to a maximum defined performance. Availability of these MSRs is indicated by bit0 in CPUID.6.ECX on both Intel and AMD processors. MFC after: 5 days	2010-11-19 15:07:36 +00:00
Andriy Gapon	7af7c7624a	specialreg.h: add AMD-specific "Hardware Configuration Register" MSR It seems that this MSR has been available in a range of AMD processors families for quite a while now. Note1: not all AMD MSRs that are found in amd64 specialreg.h are also in the i386 version. Note2: perhaps some additional name component is needed to distinguish AMD-specific MSRs. MFC after: 5 days	2010-11-19 15:00:20 +00:00
Andriy Gapon	8fd6d51347	specialreg.h: add definition for AMD Core Performance Boost bit This bit indicates availability of the feature. MFC after: 4 days	2010-11-19 14:46:17 +00:00
Jung-uk Kim	19da400c64	Move identical copies of apm_bios.h to sys/x86/include, replace them with stubs, and adjust PC98 stub accordingly. Reviewed by: imp, nyan	2010-11-11 19:36:21 +00:00
Andriy Gapon	290e14f881	amd64: introduce minidump version 2 After KVA space was increased to 512GB on amd64 it became impractical to use PTEs as entries in the minidump map of dumped pages, because size of that map alone would already be 1GB. Instead, we now use PDEs as page map entries and employ two stage lookup in libkvm: virtual address -> PDE -> PTE -> physical address. PTEs are now dumped as regular pages. Fixed page map size now is 2MB. libkvm keeps support for accessing amd64 minidumps of version 1. Support for 1GB pages is added. Many thanks to Alan Cox for his guidance, numerous reviews, suggestions, enhancments and corrections. Reviewed by: alc [kernel part] MFC after: 15 days	2010-11-11 18:35:28 +00:00
John Baldwin	961135ead8	- Remove <machine/mutex.h>. Most of the headers were empty, and the contents of the ones that were not empty were stale and unused. - Now that <machine/mutex.h> no longer exists, there is no need to allow it to override various helper macros in <sys/mutex.h>. - Rename various helper macros for low-level operations on mutexes to live in the _mtx_* or __mtx_* namespaces. While here, change the names to more closely match the real API functions they are backing. - Drop support for including <sys/mutex.h> in assembly source files. Suggested by: bde (1, 2)	2010-11-09 20:46:41 +00:00
Attilio Rao	fcb250f392	Move the mptable.h under x86/include/. Sponsored by: Sandvine Incorporated MFC after: 14 days	2010-11-09 20:28:09 +00:00
John Baldwin	32c3d3b6e6	Move <machine/apicreg.h> to <x86/apicreg.h>.	2010-11-01 18:18:46 +00:00
John Baldwin	5ecdb3c46b	Move the <machine/mca.h> header to <x86/mca.h>.	2010-11-01 17:40:35 +00:00
Alan Cox	92ababa777	[1] According to the x86 architectural specifications, no virtual-to- physical page mapping should span two or more MTRRs of different types. Add a pmap function, pmap_demote_DMAP(), by which the MTRR module can ensure that the direct map region doesn't have such a mapping. [2] Fix a couple of nearby style errors in amd64_mrset(). [3] Re-enable the use of 1GB page mappings for implementing the direct map. (See also r197580 and r213897.) Tested by: kib@ on a Westmere-family processor [3] MFC after: 3 weeks	2010-10-27 16:46:37 +00:00
John Baldwin	c6390f7ac5	Use intr_disable() and intr_restore() instead of frobbing the flags register directly to disable interrupts. Reviewed by: bde (earlier version) MFC after: 2 weeks	2010-10-25 15:28:03 +00:00
Konstantin Belousov	3f506a78ce	Display PCID capability of CPU and add CPUID define for it. MFC after: 1 week	2010-10-05 15:31:56 +00:00
Andriy Gapon	0b750af1b1	amd64: reduce VM_KMEM_SIZE_SCALE to 1 allowing kernel to use more memory KVA space is abundant on amd64, so there is no reason to limit kernel map size to a fraction of available physical memory. In fact, it could be larger than physical memory. This should help with memory auto-tuning for ZFS and shouldn't affect other workloads. This should reduce number of circumstances for "kmem_map too small" panics, but probably won't eliminate them entirely due to potential kmem fragmentation. In fact, you might want/need to limit maximum ARC size after this commit if you need to resrve more memory for applications. This change was discussed on arch@ and nobody said "don't do it". MFC after: 6 weeks	2010-09-17 07:36:32 +00:00
Alexander Motin	a157e42516	Refactor timer management code with priority to one-shot operation mode. The main goal of this is to generate timer interrupts only when there is some work to do. When CPU is busy interrupts are generating at full rate of hz + stathz to fullfill scheduler and timekeeping requirements. But when CPU is idle, only minimum set of interrupts (down to 8 interrupts per second per CPU now), needed to handle scheduled callouts is executed. This allows significantly increase idle CPU sleep time, increasing effect of static power-saving technologies. Also it should reduce host CPU load on virtualized systems, when guest system is idle. There is set of tunables, also available as writable sysctls, allowing to control wanted event timer subsystem behavior: kern.eventtimer.timer - allows to choose event timer hardware to use. On x86 there is up to 4 different kinds of timers. Depending on whether chosen timer is per-CPU, behavior of other options slightly differs. kern.eventtimer.periodic - allows to choose periodic and one-shot operation mode. In periodic mode, current timer hardware taken as the only source of time for time events. This mode is quite alike to previous kernel behavior. One-shot mode instead uses currently selected time counter hardware to schedule all needed events one by one and program timer to generate interrupt exactly in specified time. Default value depends of chosen timer capabilities, but one-shot mode is preferred, until other is forced by user or hardware. kern.eventtimer.singlemul - in periodic mode specifies how much times higher timer frequency should be, to not strictly alias hardclock() and statclock() events. Default values are 2 and 4, but could be reduced to 1 if extra interrupts are unwanted. kern.eventtimer.idletick - makes each CPU to receive every timer interrupt independently of whether they busy or not. By default this options is disabled. If chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generating. As soon as this patch modifies cpu_idle() on some platforms, I have also refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions (if supported) under high sleep/wakeup rate, as fast alternative to other methods. It allows SMP scheduler to wake up sleeping CPUs much faster without using IPI, significantly increasing performance on some highly task-switching loads. Tested by: many (on i386, amd64, sparc64 and powerc) H/W donated by: Gheorghe Ardelean Sponsored by: iXsystems, Inc.	2010-09-13 07:25:35 +00:00
Roman Divacky	27d4fea6c5	Change the parameter passed to the inline assembly to u_short as we are dealing with 16bit segment registers. Change mov to movw. Approved by: rpaulo (mentor) Reviewed by: kib, rink	2010-09-03 14:25:17 +00:00
Rui Paulo	cba3269417	Register an interrupt vector for DTrace return probes. There is some code missing in lapic to make sure that we don't overwrite this entry, but this will be done on a sequent commit. Sponsored by: The FreeBSD Foundation	2010-08-28 08:03:29 +00:00
Rui Paulo	8a8d8fa3d1	Add two DTrace trap type values. Used by fasttrap. Sponsored by: The FreeBSD Foundation	2010-08-24 13:13:24 +00:00
Konstantin Belousov	ee235befcb	Supply some useful information to the started image using ELF aux vectors. In particular, provide pagesize and pagesizes array, the canary value for SSP use, number of host CPUs and osreldate. Tested by: marius (sparc64) MFC after: 1 month	2010-08-17 08:55:45 +00:00
John Baldwin	d9d8d1449d	Add a new ipi_cpu() function to the MI IPI API that can be used to send an IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that constructed a mask for a single CPU with calls to ipi_cpu() instead. This will matter more in the future when we transition from cpumask_t to cpuset_t for CPU masks in which case building a CPU mask is more expensive. Submitted by: peter, sbruno Reviewed by: rookie Obtained from: Yahoo! (x86) MFC after: 1 month	2010-08-06 15:36:59 +00:00
Jung-uk Kim	6305bb243c	Rearrange struct pcb. r177532 (CVS r1.64 of pcb.h) moved pcb_flags to make better use of cache lines by placing it before pcb_save (now pcb_user_save), which is moved to the end of pcb since r210777.	2010-08-02 18:12:30 +00:00
Jung-uk Kim	a2d2c83668	- Merge savectx2() with savectx() and struct xpcb with struct pcb. [1] savectx() is only used for panic dump (dumppcb) and kdb (stoppcbs). Thus, saving additional information does not hurt and it may be even beneficial. Unfortunately, struct pcb has grown larger to accommodate more data. Move 512-byte long pcb_user_save to the end of struct pcb while I am here. - savectx() now saves FPU state unconditionally and copy it to the PCB of FPU thread if necessary. This gives panic dump and kdb a chance to take a look at the current FPU state even if the FPU is "supposedly" not used. - Resuming CPU now unconditionally reinitializes FPU. If the saved FPU state was irrelevant, it could be in an unknown state. Suggested by: bde [1]	2010-08-02 17:35:00 +00:00
Xin LI	a3bc0a4e5c	Improve cputemp(4) driver wrt newer Intel processors, especially Xeon 5500/5600 series: - Utilize IA32_TEMPERATURE_TARGET, a.k.a. Tj(target) in place of Tj(max) when a sane value is available, as documented in Intel whitepaper "CPU Monitoring With DTS/PECI"; (By sane value we mean 70C - 100C for now); - Print the probe results when booting verbose; - Replace cpu_mask with cpu_stepping; - Use CPUID_* macros instead of rolling our own. Approved by: rpaulo MFC after: 1 month	2010-07-29 19:08:22 +00:00
John Baldwin	536af0d751	Mark the __curthread() functions as __pure2 and remove the volatile keyword from the inline assembly. This allows the compiler to cache invocations of curthread since it's value does not change within a thread context. Submitted by: zec (i386) MFC after: 1 week	2010-07-29 18:44:10 +00:00
John Baldwin	a955c461ad	The corrected error count field is dependent on CMCI, not TES. MFC after: 1 week	2010-07-28 21:52:09 +00:00
John Baldwin	a3870a1826	Very rough first cut at NUMA support for the physical page allocator. For now it uses a very dumb first-touch allocation policy. This will change in the future. - Each architecture indicates the maximum number of supported memory domains via a new VM_NDOMAIN parameter in <machine/vmparam.h>. - Each cpu now has a PCPU_GET(domain) member to indicate the memory domain a CPU belongs to. Domain values are dense and numbered from 0. - When a platform supports multiple domains, the default freelist (VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain. The MD code is required to populate an array of mem_affinity structures. Each entry in the array defines a range of memory (start and end) and a domain for the range. Multiple entries may be present for a single domain. The list is terminated by an entry where all fields are zero. This array of structures is used to split up phys_avail[] regions that fall in VM_FREELIST_DEFAULT into per-domain freelists. - Each memory domain has a separate lookup-array of freelists that is used when fulfulling a physical memory allocation. Right now the per-domain freelists are listed in a round-robin order for each domain. In the future a table such as the ACPI SLIT table may be used to order the per-domain lookup lists based on the penalty for each memory domain relative to a specific domain. The lookup lists may be examined via a new vm.phys.lookup_lists sysctl. - The first-touch policy is implemented by using PCPU_GET(domain) to pick a lookup list when allocating memory. Reviewed by: alc	2010-07-27 20:33:50 +00:00
Konstantin Belousov	87d45a0392	When compat32 binary asks for the value of hw.machine_arch, report the name of 32bit sibling architecture instead of the host one. Do the same for hw.machine on amd64. Add a safety belt debug.adaptive_machine_arch sysctl, to turn the substitution off. Reviewed by: jhb, nwhitehorn MFC after: 2 weeks	2010-07-22 09:13:49 +00:00
Alexander Motin	fcc06be1b2	Move functions declaration to MI code, following implementation.	2010-07-15 17:49:35 +00:00
Warner Losh	1003cfe94d	Remove obsolete undef of COPY_SIGCODE. It appears to have not been used in FreeBSD in quite some time (maybe since before 4.4-lite :) Submitted by: bde	2010-07-13 15:06:13 +00:00
Konstantin Belousov	2680dac9e1	For both i386 and amd64 pmap, - change the type of pm_active to cpumask_t, which it is; - in pmap_remove_pages(), compare with PCPU(curpmap), instead of dereferencing the long chain of pointers [1]. For amd64 pmap, remove the unneeded checks for validity of curpmap in pmap_activate(), since curpmap should be always valid after r209789. Submitted by: alc [1] Reviewed by: alc MFC after: 3 weeks	2010-07-09 20:05:56 +00:00
Rui Paulo	80599de862	Fix style issues with the previous commit, namely use-tab-instead-of-space and don't use underscores in macro variables. Pointed out by: bde	2010-07-07 12:08:58 +00:00
Rui Paulo	8923c96ee1	Introduce USD_{SET,GET}{BASE,LIMIT}. These help setting up the user segment descriptor hi and lo values. Idea from Solaris. Reviewed by: kib	2010-07-06 16:56:27 +00:00
Konstantin Belousov	595473a587	Clear DF bit in eflags/rflags on the kernel entry. The i386 and amd64 ABI specifies the DF should be zero, and newer compilers do not clear DF before using DF-sensitive instructions. The DF clearing for signal handlers was done some time ago. MFC after: 1 week	2010-06-23 20:44:07 +00:00
Alexander Motin	875b8844be	Implement new event timers infrastructure. It provides unified APIs for writing event timer drivers, for choosing best possible drivers by machine independent code and for operating them to supply kernel with hardclock(), statclock() and profclock() events in unified fashion on various hardware. Infrastructure provides support for both per-CPU (independent for every CPU core) and global timers in periodic and one-shot modes. MI management code at this moment uses only periodic mode, but one-shot mode use planned for later, as part of tickless kernel project. For this moment infrastructure used on i386 and amd64 architectures. Other archs are welcome to follow, while their current operation should not be affected. This patch updates existing drivers (i8254, RTC and LAPIC) for the new order, and adds event timers support into the HPET driver. These drivers have different capabilities: LAPIC - per-CPU timer, supports periodic and one-shot operation, may freeze in C3 state, calibrated on first use, so may be not exactly precise. HPET - depending on hardware can work as per-CPU or global, supports periodic and one-shot operation, usually provides several event timers. i8254 - global, limited to periodic mode, because same hardware used also as time counter. RTC - global, supports only periodic mode, set of frequencies in Hz limited by powers of 2. Depending on hardware capabilities, drivers preferred in following orders, either LAPIC, HPETs, i8254, RTC or HPETs, LAPIC, i8254, RTC. User may explicitly specify wanted timers via loader tunables or sysctls: kern.eventtimer.timer1 and kern.eventtimer.timer2. If requested driver is unavailable or unoperational, system will try to replace it. If no more timers available or "NONE" specified for second, system will operate using only one timer, multiplying it's frequency by few times and uing respective dividers to honor hz, stathz and profhz values, set during initial setup.	2010-06-20 21:33:29 +00:00
Alexander Motin	d364638110	Merge COUNT_XINVLTLB_HITS and COUNT_IPIS kernel options from i386 to amd64. This information can be very valuable for CPU sleep-time (and respectively idle power consumption) optimization. Add counters for timer-related IPIs. Reviewed by: jhb@ (previous version)	2010-06-17 11:54:49 +00:00
John Baldwin	61d3f0bab2	Restore the machine check register banks on resume. For banks being monitored via CMCI, reset the interrupt threshold to 1 on resume. Reviewed by: jkim MFC after: 2 weeks	2010-06-15 18:51:41 +00:00
Kenneth D. Merry	7c049a853c	MFC 199549, 199997, 204158, 207673, and 208901. Bring in a number of netfront changes: r199549 \| jhb Remove commented out reference to if_watchdog and an assignment of zero to if_timer. Reviewed by: scottl r199997 \| gibbs Add media ioctl support and link notifications so that devd will attempt to run dhclient on a netfront (xn) device that is setup for DHCP in /etc/rc.conf. PR: kern/136251 (fixed differently than the submitted patch) r204158 \| kmacy - make printf conditional - fix witness warnings by making configuration lock a mutex r207673 \| joel Switch to our preferred 2-clause BSD license. Approved by: kmacy r208901 \| ken A number of netfront fixes and stability improvements: - Re-enable TSO. This was broken previously due to CSUM_TSO clearing the CSUM_TCP flag, so our checksum flags were incorrectly set going to the netback driver. That was fixed in r206844 in tcp_output.c, so we can turn TSO back on here. - Fix the way transmit slots are calculated, so that we can't overfill the ring. - Avoid sending packets with more fragments/segments than netback can handle. The Linux netback code can only handle packets of MAX_SKB_FRAGS, which turns out to be 18 on machines with 4K pages. We can easily generate packets with 32 or so fragments with TSO turned on. Right now the solution is just to drop the packets (since netback doesn't seem to handle it gracefully), but we should come up with a way to allow a driver to tell the TCP stack the maximum number of fragments it can handle in a single packet. - Fix the way the consumer is tracked in the receive path. It could get out of sync fairly easily. - Use standard Xen ring macros to make it clearer how netfront is using the rings. - Get rid of Linux-ish negative errno return values. - Added more documentation to the driver. - Refactored code to make it easier to read. - Some other minor fixes. Reviewed by: gibbs Sponsored by: Spectra Logic Approved by: re (bz)	2010-06-11 19:17:36 +00:00
Konstantin Belousov	6cf9a08d2c	Introduce the x86 kernel interfaces to allow kernel code to use FPU/SSE hardware. Caller should provide a save area that is chained into the stack of the areas; pcb save_area for usermode FPU state is on top. The pcb now contains a pointer to the current FPU saved area, used during FPUDNA handling and context switches. There is also a facility to allow the kernel thread to use pcb save_area. Change the dreaded warnings "npxdna in kernel mode!" into the panics when FPU usage is not registered. KPI discussed with: fabient Tested by: pho, fabient Hardware provided by: Sentex Communications MFC after: 1 month	2010-06-05 15:59:59 +00:00
Attilio Rao	875e0aa40d	MFC r207329, r208716: - Extract the IODEV_PIO interface from ia64 and make it MI. - On i386 and amd64 the old behaviour is kept but multithreaded processes must use the new interface in order to work well. - Support for the other architectures is greatly improved. Sponsored by: Sandvine Incorporated Approved by: re (kib, bz)	2010-06-01 21:19:58 +00:00
John Baldwin	58ccad7ddc	Add support for corrected machine check interrupts. CMCI is a new local APIC interrupt that fires when a threshold of corrected machine check events is reached. CMCI also includes a count of events when reporting corrected errors in the bank's status register. Note that individual banks may or may not support CMCI. If they do, each bank includes its own threshold register that determines when the interrupt fires. Currently the code uses a very simple strategy where it doubles the threshold on each interrupt until it succeeds in throttling the interrupt to occur only once a minute (this interval can be tuned via sysctl). The threshold is also adjusted on each hourly poll which will lower the threshold once events stop occurring. Tested by: Sailaja Bangaru sbappana at yahoo com MFC after: 1 month	2010-05-24 15:45:05 +00:00

... 3 4 5 6 7 ...

1904 Commits