freebsd-dev

Author	SHA1	Message	Date
Neel Natu	3565b59ec0	Create sysctl node 'hw.vmm.vmx' and populate it with oids that expose the VMX hardware capabilities. Obtained from: NetApp	2013-04-13 21:41:51 +00:00
Konstantin Belousov	fcb29b9210	Fix the name of the pcb member in the comments. Submitted by: Oliver Pinter <oliver.pntr@gmail.com> MFC after: 3 days	2013-04-13 15:20:33 +00:00
Neel Natu	26d66b9d58	Use the MAKEDEV_CHECKNAME flag to check for an invalid device name and return an error instead of panicking. Obtained from: NetApp	2013-04-13 05:11:21 +00:00
Edward Tomasz Napierala	8ed9860914	Remove ctl(4) from GENERIC. Also remove 'options CTL_DISABLE' and kern.cam.ctl.disable tunable; those were introduced as a workaround to make it possible to boot GENERIC on low memory machines. With ctl(4) being built as a module and automatically loaded by ctladm(8), this makes CTL work out of the box. Reviewed by: ken Sponsored by: FreeBSD Foundation	2013-04-12 16:25:03 +00:00
Neel Natu	d5408b1d26	If vmm.ko could not be initialized correctly then prevent the creation of virtual machines subsequently. Submitted by: Chris Torek	2013-04-12 01:16:52 +00:00
Neel Natu	150369ab7c	Make the code to check if VMX is enabled more readable by using macros instead of magic numbers. Discussed with: Chris Torek	2013-04-11 04:29:45 +00:00
Neel Natu	1472b87f2f	Unsynchronized TSCs on the host require special handling in bhyve: - use clock_gettime(2) as the time base for the emulated ACPI timer instead of directly using rdtsc(). - don't advertise the invariant TSC capability to the guest to discourage it from using the TSC as its time base. Discussed with: jhb@ (about making 'smp_tsc' a global) Reported by: Dan Mack on freebsd-virtualization@ Obtained from: NetApp	2013-04-10 05:59:07 +00:00
Gleb Smirnoff	4e76af6a41	Merge from projects/counters: counter(9). Introduce counter(9) API, that implements fast and raceless counters, provided (but not limited to) for gathering of statistical data. See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html for more details. In collaboration with: kib Reviewed by: luigi Tested by: ae, ray Sponsored by: Nginx, Inc.	2013-04-08 19:40:53 +00:00
Gleb Smirnoff	17dece86fe	Merge from projects/counters: Pad struct pcpu so that its size is denominator of PAGE_SIZE. This is done to reduce memory waste in UMA_PCPU_ZONE zones. Sponsored by: Nginx, Inc.	2013-04-08 19:19:10 +00:00
Peter Grehan	117e8f378e	Don't panic when a valid divisor of 1 has been requested. Obtained from: NetApp	2013-04-05 22:16:31 +00:00
Alexander Motin	45f6d66569	Remove all legacy ATA code parts, not used since options ATA_CAM enabled in most kernels before FreeBSD 9.0. Remove such modules and respective kernel options: atadisk, ataraid, atapicd, atapifd, atapist, atapicam. Remove the atacontrol utility and some man pages. Remove useless now options ATA_CAM. No objections: current@, stable@ MFC after: never	2013-04-04 07:12:24 +00:00
Neel Natu	77d8fd9bb3	Add counter to keep track of the number of timer interrupts generated by the local apic for each virtual cpu.	2013-03-31 03:56:48 +00:00
Neel Natu	b5aaf7b22b	Add some more stats to keep track of all the reasons that a vcpu is exiting.	2013-03-30 17:46:03 +00:00
Neel Natu	66f71b7d24	Allow caller to skip 'guest linear address' validation when doing instruction decode. This is to accomodate hardware assist implementations that do not provide the 'guest linear address' as part of nested page fault collateral. Submitted by: Anish Gupta (akgupt3 at gmail dot com)	2013-03-28 21:26:19 +00:00
Konstantin Belousov	ee75e7de7b	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminate the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitely requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitely specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested. In the rework, filesystem metadata is not the subject to maxbufspace limit anymore. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, is accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not worked, because buffer_map fragmentation does not allow the limit to be reached. By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks	2013-03-19 14:13:12 +00:00
Attilio Rao	774d251d99	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. Further technical notes broken into mealpieces can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)	2013-03-18 00:25:02 +00:00
Neel Natu	3f23d3ca9f	Fix the '-Wtautological-compare' warning emitted by clang for comparing the unsigned enum type with a negative value. Obtained from: NetApp	2013-03-16 22:53:05 +00:00
Neel Natu	61592433eb	Allow vmm stats to be specific to the underlying hardware assist technology. This can be done by using the new macros VMM_STAT_INTEL() and VMM_STAT_AMD(). Statistic counters that are common across the two are defined using VMM_STAT(). Suggested by: Anish Gupta Discussed with: grehan Obtained from: NetApp	2013-03-16 22:40:20 +00:00
Konstantin Belousov	e8a4a618cf	Add pmap function pmap_copy_pages(), which copies the content of the pages around, taking array of vm_page_t both for source and destination. Starting offsets and total transfer size are specified. The function implements optimal algorithm for copying using the platform-specific optimizations. For instance, on the architectures were the direct map is available, no transient mappings are created, for i386 the per-cpu ephemeral page frame is used. The code was typically borrowed from the pmap_copy_page() for the same architecture. Only i386/amd64, powerpc aim and arm/arm-v6 implementations were tested at the time of commit. High-level code, not committed yet to the tree, ensures that the use of the function is only allowed after explicit enablement. For sparc64, the existing code has known issues and a stab is added instead, to allow the kernel linking. Sponsored by: The FreeBSD Foundation Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6) MFC after: 2 weeks	2013-03-14 20:18:12 +00:00
Alan Cox	9f585991ba	The kernel pmap is statically allocated, so there is really no need to explicitly initialize its pm_root field to zero. Sponsored by: EMC / Isilon Storage Division	2013-03-10 21:07:44 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Bryan Venteicher	0cfbcf8c7b	Remove the virtio dependency entry for the VirtIO device drivers. This will prevent the kernel from linking if the device driver are included without the virtio module. Remove pci and scbus for the same reason. Also explain the relationship and necessity of the virtio and virtio_pci modules. Currently in FreeBSD, we only support VirtIO PCI, but it could be replaced with a different interface (like MMIO) and the device (network, block, etc) will still function. Requested by: luigi Approved by: grehan (mentor) MFC after: 3 days	2013-03-06 07:17:53 +00:00
Kenneth D. Merry	3a45b4781a	Re-enable CTL in GENERIC on i386 and amd64, but turn on the CTL disable tunable by default. This will allow GENERIC configurations to boot on small memory boxes, but not require end users who want to use CTL to recompile their kernel. They can simply set kern.cam.ctl.disable=0 in loader.conf. The eventual solution to the memory usage problem is to change the way CTL allocates memory to be more configurable, but this should fix things for small memory situations in the mean time. UPDATING: Explain the change in the CTL configuration, and how users can enable CTL if they would like to use it. sys/conf/options: Add a new option, CTL_DISABLE, that prevents CTL from initializing. ctl.c: If CTL_DISABLE is turned on, don't initialize. i386/conf/GENERIC, amd64/conf/GENERIC: Re-enable device ctl, and add the CTL_DISABLE option.	2013-03-04 21:18:45 +00:00
Attilio Rao	b38d37f7b5	Merge from vmc-playground branch: Rename the pv_entry_t iterator from pv_list to pv_next. Besides being more correct technically (as the name seems to suggest this is a list while it is an iterator), it will also be needed by vm_radix work to avoid a nameclash on macro expansions. Sponsored by: EMC / Isilon storage division Reviewed by: alc, jeff Tested by: flo, pho, jhb, davide	2013-03-02 14:19:08 +00:00
Adrian Chadd	fe138cc2af	Disable the ctl driver in GENERIC. It unfortunately steals a fair chunk of RAM at startup even if it's not actively used, which prevents FreeBSD VMs of 128MB from successfully booting and running.	2013-03-02 08:12:41 +00:00
Davide Italiano	acccf7d8b4	MFcalloutng: When CPU becomes idle, cpu_idleclock() calculates time to the next timer event in order to reprogram hw timer. Return that time in sbintime_t to the caller and pass it to acpi_cpu_idle(), where it can be used as one more factor (quite precise) to extimate furter sleep time and choose optimal sleep state. This is a preparatory change for further callout improvements will be committed in the next days. The commmit is not targeted for MFC.	2013-02-28 10:46:54 +00:00
Attilio Rao	dc1558d1cd	Merge from vmobj-rwlock: VM_OBJECT_LOCKED() macro is only used to implement a custom version of lock assertions right now (which likely spread out thanks to copy and paste). Remove it and implement actual assertions. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-27 18:12:13 +00:00
Konstantin Belousov	31a53cd036	Convert machine/elf.h, machine/frame.h, machine/sigframe.h, machine/signal.h and machine/ucontext.h into common x86 includes, copying from amd64 and merging with i386. Kernel-only compat definitions are kept in the i386/include/sigframe.h and i386/include/signal.h, to reduce amd64 kernel namespace pollution. The amd64 compat uses its own definitions so far. The _MACHINE_ELF_WANT_32BIT definition is to allow the sys/boot/userboot/userboot/elf32_freebsd.c to use i386 ELF definitions on the amd64 compile host. The same hack could be usefully abused by other code too.	2013-02-20 17:39:52 +00:00
Jung-uk Kim	00a54dfb1c	Consistently use round_page(x) rather than roundup(x, PAGE_SIZE). There is no functional change.	2013-02-15 22:43:08 +00:00
Konstantin Belousov	bf94adb3e1	Print slightly more useful information on the 'bad pte' panic. No objections from: alc MFC after: 1 week	2013-02-14 19:22:15 +00:00
Konstantin Belousov	252b1f6e22	Assert that user address is never qremoved. No objections from: alc MFC after: 1 week	2013-02-14 19:21:20 +00:00
Neel Natu	25448de222	Requests for invalid CPUID leaves should map to the highest known leaf instead. Reviewed by: grehan Obtained from: NetApp	2013-02-13 23:22:17 +00:00
Neel Natu	485b3300cc	Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'. Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING) that called 'sched_bind()' on behalf of the user thread. The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn runs afoul of the assertion '(td_pinned == 0)' in userret(). Using the cpuset affinity to implement pinning of the vcpu threads works with both 4BSD and ULE schedulers and has the happy side-effect of getting rid of a bunch of code in vmm.ko. Discussed with: grehan	2013-02-11 20:36:07 +00:00
Neel Natu	6d62a48f47	Compute the number of initial kernel page table pages (NKPT) dynamically. This eliminates the need to recompile the kernel when the default value of NKPT is not big enough - for e.g. when loading large kernel modules or memory disk images from the loader. If NKPT is defined in the kernel configuration file then it overrides the dynamic calculation. Reviewed by: alc, kib	2013-02-06 04:53:00 +00:00
Andriy Gapon	1a89ca4cf5	cpususpend_handler: mark AP as resumed only after fully setting up lapic Reviewed by: jhb Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>, KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp> MFC after: 12 days	2013-02-02 12:04:32 +00:00
Andriy Gapon	548b201607	x86 suspend/resume: suspend pics and pseudo-pics in reverse order - change 'pics' from STAILQ to TAILQ - ensure that Local APIC is always first in 'pics' Reviewed by: jhb Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>, KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp> MFC after: 12 days	2013-02-02 12:02:42 +00:00
Eitan Adler	4752ed3d7f	Remove support for plip from the GENERIC kernel as no systems in the last 10 years require this support. Discussed with: db Discussed with: kib Reviewed by: imp Reviewed by: jhb Reviewed by: -hackers Approved by: cperciva (mentor)	2013-02-01 20:17:11 +00:00
Neel Natu	2b89a04496	Fix a broken assumption in the passthru implementation that the MSI-X table can only be located at the beginning or the end of the BAR. If the MSI-table is located in the middle of a BAR then we will split the BAR into two and create two mappings - one before the table and one after the table - leaving a hole in place of the table so accesses to it can be trapped and emulated. Obtained from: NetApp	2013-02-01 03:49:09 +00:00
Neel Natu	07044a96d8	Increase the number of passthru devices supported by bhyve. The maximum length of an environment variable puts a limitation on the number of passthru devices that can be specified via a single variable. The workaround is to allow user to specify passthru devices via multiple environment variables instead of a single one. Obtained from: NetApp	2013-02-01 01:16:26 +00:00
Neel Natu	8faceb3292	Add emulation support for instruction "88/r: mov r/m8, r8". This instruction moves a byte from a register to a memory location. Tested by: tycho nightingale at pluribusnetworks com	2013-01-30 04:09:09 +00:00
John Baldwin	d825ce0a5d	Reduce duplication between i386/linux/linux.h and amd64/linux32/linux.h by moving bits that are MI out into headers in compat/linux. Reviewed by: Chagin Dmitry dmitry \| gmail MFC after: 2 weeks	2013-01-29 18:41:30 +00:00
Peter Grehan	1fb0ea3f1a	Always allow access to the sysenter cs/esp/eip MSRs since they are automatically saved and restored in the VMCS. Reviewed by: neel Obtained from: NetApp	2013-01-25 21:38:31 +00:00
John Baldwin	fb709557a3	Don't assume that all Linux TCP-level socket options are identical to FreeBSD TCP-level socket options (only the first two are). Instead, using a mapping function and fail unsupported options as we do for other socket option levels. MFC after: 2 weeks	2013-01-23 21:44:48 +00:00
Neel Natu	e3f0800bd1	Postpone vmm module initialization until after SMP is initialized - particularly that 'smp_started != 0'. This is required because the VT-x initialization calls smp_rendezvous() to set the CR4_VMXE bit on all the cpus. With this change we can preload vmm.ko from the loader. Reported by: alfred@, sbruno@ Obtained from: NetApp	2013-01-21 01:33:10 +00:00
Neel Natu	912a3e678a	Add svn properties to the recently merged bhyve source files. The pre-commit hook will not allow any commits without the svn:keywords property in head.	2013-01-20 03:42:49 +00:00
Neel Natu	c458fc1ed4	Merge projects/bhyve to head. 'bhyve' was developed by grehan@ and myself at NetApp (thanks!). Special thanks to Peter Snyder, Joe Caradonna and Michael Dexter for their support and encouragement. Obtained from: NetApp	2013-01-19 04:18:52 +00:00
John Baldwin	b5821c6f0e	Fix build with SMP disabled.` Reported by: bf	2013-01-19 01:18:22 +00:00
John Baldwin	f876ffeae3	Don't attempt to use clflush on the local APIC register window. Various CPUs exhibit bad behavior if this is done (Intel Errata AAJ3, hangs on Pentium-M, and trashing of the local APIC registers on a VIA C7). The local APIC is implicitly mapped UC already via MTRRs, so the clflush isn't necessary anyway. MFC after: 2 weeks	2013-01-17 21:32:25 +00:00
Neel Natu	c2217b9848	IFC @ r245509	2013-01-17 07:04:37 +00:00
Bryan Venteicher	ae366ffcbd	Add VirtIO to the i386 and amd64 GENERIC kernels This also removes the kludge from r239009 that covered only the network driver. Reviewed by: grehan Approved by: grehan (mentor) MFC after: 1 week	2013-01-13 07:14:16 +00:00
Neel Natu	8a60b77db8	IFC @ r245205	2013-01-09 03:32:23 +00:00
Neel Natu	1b54fbe69d	IFC @ r245178	2013-01-09 02:26:50 +00:00
Neel Natu	95102a8bcb	Add a "pause" to busy wait loops in the cpu reset path. This should not matter much when running on bare metal but it makes the guest more friendly when running inside a virtual machine. Discussed with: jhb Obtained from: NetApp	2013-01-09 02:11:16 +00:00
Neel Natu	03429e45a7	Revert changes for x2apic support from projects/bhyve. During the early days of bhyve it did not support instruction emulation which necessitated the use of x2apic to access the local apic. This is no longer the case and the dependency on x2apic has gone away. The x2apic patches can be considered independently of bhyve and will be merged into head via projects/x2apic. Discussed with: grehan	2013-01-06 05:37:26 +00:00
Neel Natu	2d28bff346	bhyve does not require a custom configuration file anymore so make the GENERIC identical to the one in HEAD. Obtained from: NetApp	2013-01-05 03:35:30 +00:00
Neel Natu	46b1c55d9e	IFC @ r244983.	2013-01-04 19:28:32 +00:00
Neel Natu	23ce7fedb4	There is no need for a special 'BHYVE' kernel configuration file anymore - 'GENERIC' works fine. Obtained from: NetApp	2013-01-04 03:02:43 +00:00
Neel Natu	014a52f3a6	There is no need for 'start_emulating()' and 'stop_emulating()' to be defined in <machine/cpufunc.h> so remove them from there. Obtained from: NetApp	2013-01-04 02:49:12 +00:00
Neel Natu	5f0677d392	The "unrestricted guest" capability is a feature of Intel VT-x that allows the guest to execute real or unpaged protected mode code - bhyve relies on this feature to execute the AP bootstrap code. Get rid of the hack that allowed bhyve to support SMP guests on processors that do not have the "unrestricted guest" capability. This hack was entirely FreeBSD-specific and would not work with any other guest OS. Instead, limit the number of vcpus to 1 when executing on processors without "unrestricted guest" capability. Suggested by: grehan Obtained from: NetApp	2013-01-04 02:04:41 +00:00
Konstantin Belousov	0dcbedfa61	Enable the UFS quotas for big-iron GENERIC kernels. Discussed with: mckusick MFC after: 2 weeks	2013-01-03 19:03:41 +00:00
Dag-Erling Smørgrav	36fca20f10	As discussed on -current last October, remove the firewire drivers from GENERIC.	2013-01-03 14:30:24 +00:00
Neel Natu	485f986ac9	Modify the default behavior of bhyve such that it no longer forces the use of x2apic mode on the guest. The guest can decide whether or not it wants to use legacy mmio or x2apic access to the APIC by writing to the MSR_APICBASE register. Obtained from: NetApp	2012-12-16 01:20:08 +00:00
Neel Natu	682b847ede	Prefer x2apic mode when running inside a virtual machine. Provide a tunable 'machdep.x2apic_desired' to let the administrator override the default behavior. Provide a read-only sysctl 'machdep.x2apic' to let the administrator know whether the kernel is using x2apic or legacy mmio to access local apic. Tested with Parallels Desktop 8 and bhyve hypervisors. Also tested running on bare metal Intel Xeon E5-2658. Obtained from: NetApp Discussed with: jhb, attilio, avg, grehan	2012-12-16 00:57:14 +00:00
Jim Harris	f2fcc434ee	Revert r243960 based on feedback regarding keeping x86 headers unified (mdf@, tijl@) and use of KASSERT/systm.h in bus.h (zeising@, bde@). Alternate implementation will be made in a separate commit.	2012-12-13 21:27:20 +00:00
Peter Grehan	2741efeca0	Implement an API to allow a hypervisor to save/restore guest floating point state without having to know the size of floating-point state. Unstaticize fpurestore to allow the hypervisor to save/restore guest state using fpusave/fpurestore on the allocated FPU state area. Reviewed by: kib Obtained from: NetApp/bhyve MFC after: 1 week	2012-12-12 08:35:32 +00:00
Konstantin Belousov	737d12b397	Add amd64-specific ddb command "show pte". The command displays the hierarchy of the page table entries which map the specified address. Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2012-12-10 05:14:34 +00:00
Jim Harris	71a30c4436	Add amd64 implementations for 8-byte bus_space routines. Submitted by: Carl Delsey <carl.r.delsey@intel.com> Discussed with: jhb, rwatson Reviewed by: jimharris MFC after: 1 week	2012-12-06 22:33:31 +00:00
Neel Natu	32531ccb84	IFC @r243836	2012-12-04 04:37:42 +00:00
Konstantin Belousov	349438a243	Print the frame addresses for the backtraces on i386 and amd64. It allows both to inspect the frame sizes and to manually peek into the frames from ddb, if needed. Reviewed by: dim MFC after: 2 weeks	2012-12-03 22:16:51 +00:00
Jung-uk Kim	7609e73ca0	Remove duplicate code. Reduce diff between amd64 and i386.	2012-12-01 00:56:19 +00:00
Jung-uk Kim	8c2b353ead	Use volatile keywords properly.	2012-11-30 20:15:01 +00:00
Peter Grehan	e6f1f347a1	Properly screen for the AND 0x81 instruction from the set of group1 0x81 instructions that use the reg bits as an extended opcode. Still todo: properly update rflags. Pointed out by: jilles@	2012-11-30 05:40:24 +00:00
Jung-uk Kim	231ac244f8	Tidy up inline assembly. No functional change.	2012-11-30 00:59:37 +00:00
Peter Grehan	b1f95796f0	Remove debug printf. Pointed out by: emaste	2012-11-29 15:08:13 +00:00
Peter Grehan	3b2b001107	Add support for the 0x81 AND instruction, now generated by clang in the local APIC code. 0x81 is a read-modify-write instruction - the EPT check that only allowed read or write and not both has been relaxed to allow read and write. Reviewed by: neel Obtained from: NetApp	2012-11-29 06:26:42 +00:00
Neel Natu	48a29f4e07	Cleanup the user-space paging exit handler now that the unified instruction emulation is in place. Obtained from: NetApp	2012-11-28 13:34:44 +00:00
Neel Natu	b42206f300	Change emulate_rdmsr() and emulate_wrmsr() to return 0 on sucess and errno on failure. The conversion from the return value to HANDLED or UNHANDLED can be done locally in vmx_exit_process(). Obtained from: NetApp	2012-11-28 13:10:18 +00:00
Neel Natu	ba9b7bf73a	Revamp the x86 instruction emulation in bhyve. On a nested page table fault the hypervisor will: - fetch the instruction using the guest %rip and %cr3 - decode the instruction in 'struct vie' - emulate the instruction in host kernel context for local apic accesses - any other type of mmio access is punted up to user-space (e.g. ioapic) The decoded instruction is passed as collateral to the user-space process that is handling the PAGING exit. The emulation code is fleshed out to include more addressing modes (e.g. SIB) and more types of operands (e.g. imm8). The source code is unified into a single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well as /usr/sbin/bhyve. Reviewed by: grehan Obtained from: NetApp	2012-11-28 00:02:17 +00:00
Neel Natu	920bc34090	Fix a bug in the MSI-X resource allocation for PCI passthrough devices. In the case where the underlying host had disabled MSI-X via the "hw.pci.enable_msix" tunable, the ppt_setup_msix() function would fail and return an error without properly cleaning up. This in turn would cause a page fault on the next boot of the guest. Fix this by calling ppt_teardown_msix() in all the error return paths. Obtained from: NetApp	2012-11-22 04:07:18 +00:00
Neel Natu	288aeb8561	Get rid of redundant comparision which is guaranteed to be "true" for unsigned integers. Obtained from: NetApp	2012-11-22 00:08:20 +00:00
Peter Grehan	a0cad47092	Handle CPUID leaf 0x7 now that FreeBSD is using it. Return 0's for now. Reviewed by: neel Obtained from: NetApp	2012-11-20 06:01:03 +00:00
Neel Natu	3248464555	IFC @ r243164	2012-11-17 02:55:47 +00:00
Konstantin Belousov	43f48b65c0	Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h to vm/vm_phys.h, where it belongs. Requested and reviewed by: alc MFC after: 2 weeks	2012-11-16 05:55:56 +00:00
Konstantin Belousov	b32ecf44bc	Flip the semantic of M_NOWAIT to only require the allocation to not sleep, and perform the page allocations with VM_ALLOC_SYSTEM class. Previously, the allocation was also allowed to completely drain the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT request class for vm_page_alloc() and similar functions. Allow the caller of malloc* to request the 'deep drain' semantic by providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM allocation class. Centralize the translation of the M_* malloc(9) flags in the single inline function malloc2vm_flags(). Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com> Reviewed by: alc, mdf (previous version) Tested by: pho (previous version) MFC after: 2 weeks	2012-11-14 20:01:40 +00:00
Neel Natu	7d3d462b09	IFC @ r242940	2012-11-13 07:39:05 +00:00
Neel Natu	a10c6f5544	IFC @ r242684	2012-11-11 03:26:14 +00:00
Konstantin Belousov	5a17538e22	Do not try to enable new features in the %cr4 if running under hypervisor. Apparently, hypervisors failed to filter out 'Standard Extended Features' report from CPUID, but deliver #gp when corresponding bit in %cr4 is toggled. This shall be reconsidered later, after hypervisors correct the bug. Reported and tested by: joel Reviewed by: avg MFC after: 2 weeks	2012-11-09 16:00:30 +00:00
Peter Grehan	0a5e9bfb72	Fix issue found with clang build. Avoid code insertion by the compiler between inline asm statements that would in turn modify the flags value set by the first asm, and used by the second. Solve by making the common error block a string that can be pulled into the first inline asm, and using symbolic labels for asm variables. bhyve can now build/run fine when compiled with clang. Reviewed by: neel Obtained from: NetApp	2012-11-06 02:43:41 +00:00
Attilio Rao	cfedf924d3	Rework the known rwlock to benefit about staying on their own cache line in order to avoid manual frobbing but using struct rwlock_padalign. Reviewed by: alc, jimharris	2012-11-03 23:03:14 +00:00
Konstantin Belousov	cd9e9d1bc2	Enable the new instructions for reading and writing bases for %fs, %gs, when supported. Note that WRFSBASE and WRGSBASE are not very useful on FreeBSD right now, because a return from the kernel mode to userspace reloads the bases specified by the sysarch(2) syscall, most likely. Enable the Supervisor Mode Execution Prevention (SMEP) when supported. Since the loader(8) performs hand-off to the kernel with the page tables which contradict the SMEP, postpone enabling the SMEP on BSP until pmap switched for the proper kernel tables. Debugged with the help from: avg Tested by: avg, Michael Moll <kvedulv@kvedulv.de> MFC after: 1 month	2012-11-01 15:17:43 +00:00
Konstantin Belousov	2773649d2f	Provide the reading and display of the Standard Extended Features, introduced with the IvyBridge CPUs. Provide the definitions for new bits in CR3 and CR4 registers. Tested by: avg, Michael Moll <kvedulv@kvedulv.de> MFC after: 2 weeks	2012-11-01 15:14:37 +00:00
Neel Natu	514393f565	Convert VMCS_ENTRY_INTR_INFO field into a vmcs identifier before passing it to vmcs_getreg(). Without this conversion vmcs_getreg() will return EINVAL. In particular this prevented injection of the breakpoint exception into the guest via the "-B" option to /usr/sbin/bhyve which is hugely useful when debugging guest hangs. This was broken in r241921. Pointy hat: me Obtained from: NetApp	2012-10-29 23:58:15 +00:00
Neel Natu	b01c203325	Corral all the host state associated with the virtual machine into its own file. This state is independent of the type of hardware assist used so there is really no need for it to be in Intel-specific code. Obtained from: NetApp	2012-10-29 01:51:24 +00:00
Peter Grehan	cda5bd7f19	Set the valid field of the newly allocated field as all other vm page allocators do. This fixes a panic when a virtio block device is mounted as root, with the host system dying in vm_page_dirty with invalid bits. Reviewed by: neel Obtained from: NetApp	2012-10-26 22:32:26 +00:00
Neel Natu	bd8572e0be	Unconditionally enable fpu emulation by setting CR0.TS in the host after the guest does a vm exit. This allows us to trap any fpu access in the host context while the fpu still has "dirty" state belonging to the guest. Reported by: "s vas" on freebsd-virtualization@ Obtained from: NetApp	2012-10-26 03:12:40 +00:00
Neel Natu	f76fc5d414	If the guest vcpu wants to idle then use that opportunity to relinquish the host cpu to the scheduler until the guest is ready to run again. This implies that the host cpu utilization will now closely mirror the actual load imposed by the guest vcpu. Also, the vcpu mutex now needs to be of type MTX_SPIN since we need to acquire it inside a critical section. Obtained from: NetApp	2012-10-25 04:29:21 +00:00
Neel Natu	ff6ec151e0	Hide the monitor/mwait instruction capability from the guest until we know how to properly intercept it. Obtained from: NetApp	2012-10-25 04:08:26 +00:00
Neel Natu	f352ff0ca8	Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner. Also add a stats counter to count the number of NMIs delivered per vcpu. Obtained from: NetApp	2012-10-24 02:54:21 +00:00
Neel Natu	eeefa4e4be	Test for AST pending with interrupts disabled right before entering the guest. If an IPI was delivered to this cpu before interrupts were disabled then return right away via vmx_setjmp() with a return value of VMX_RETURN_AST. Obtained from: NetApp	2012-10-23 02:20:42 +00:00
Eitan Adler	1611e2c0d1	The 'testing memory' patch gets printed too many times Approved by: cperciva (implicit)	2012-10-22 11:57:26 +00:00
Eitan Adler	267cc84937	Explain the upcoming delay by printing a message when the kernel is about to begin testing memory. Reviewed by: dteske, adri Approved by: cperciva MFC after: 1 week	2012-10-22 03:16:39 +00:00
Neel Natu	2e25737a49	Calculate the number of host ticks until the next guest timer interrupt. This information will be used in conjunction with guest "HLT exiting" to yield the thread hosting the virtual cpu. Obtained from: NetApp	2012-10-20 08:23:05 +00:00
Konstantin Belousov	802431fa2e	Print the %rip value for uprintf_signal. MFC after: 1 week	2012-10-14 17:08:46 +00:00
Andriy Gapon	851dbc07af	pciereg_cfg*: use assembly to access the mem-mapped cfg space AMD BKDG for CPU families 10h and later requires that the memory mapped config is always read into or written from al/ax/eax register. Discussed with: kib, alc Reviewed by: kib (earlier version) MFC after: 25 days	2012-10-14 10:13:50 +00:00
Peter Grehan	13ec93719a	Add the guest physical address and r/w/x bits to the paging exit in preparation for a rework of bhyve MMIO handling. Reviewed by: neel Obtained from: NetApp	2012-10-12 23:12:19 +00:00
Neel Natu	75dd336603	Provide per-vcpu locks instead of relying on a single big lock. This also gets rid of all the witness.watch warnings related to calling malloc(M_WAITOK) while holding a mutex. Reviewed by: grehan	2012-10-12 18:32:44 +00:00
Neel Natu	cdc5b9e7b1	Fix warnings generated by 'debug.witness.watch' during VM creation and destruction for calling malloc() with M_WAITOK while holding a mutex. Do not allow vmm.ko to be unloaded until all virtual machines are destroyed.	2012-10-11 19:39:54 +00:00
Neel Natu	f9d4f89e4d	Deliver the MSI to the correct guest virtual cpu. Prior to this change the MSI was being delivered unconditionally to vcpu 0 regardless of how the guest programmed the MSI delivery.	2012-10-11 19:28:07 +00:00
Kevin Lo	9823d52705	Revert previous commit... Pointyhat to: kevlo (myself)	2012-10-10 08:36:38 +00:00
Attilio Rao	3a4730256a	Add an unified macro to deny ability from the compiler to reorder instruction loads/stores at its will. The macro __compiler_membar() is currently supported for both gcc and clang, but kernel compilation will fail otherwise. Reviewed by: bde, kib Discussed with: dim, theraven MFC after: 2 weeks	2012-10-09 14:32:30 +00:00
Attilio Rao	af2bdacafb	Reverts r234074,234105,234564,234723,234989,235231-235232 and part of r234247. Use, instead, the static intializer introduced in r239923 for x86 and sparc64 intr_cpus, unwinding the code to the initial version. Reviewed by: marius	2012-10-09 12:22:43 +00:00
Kevin Lo	a10cee30c9	Prefer NULL over 0 for pointers	2012-10-09 08:27:40 +00:00
Neel Natu	7ce04d0ad9	Allocate memory pages for the guest from the host's free page queue. It is no longer necessary to hard-partition the memory between the host and guests at boot time.	2012-10-08 23:41:26 +00:00
Neel Natu	f7d51510f1	Change vm_malloc() to map pages in the guest physical address space in 4KB chunks. This breaks the assumption that the entire memory segment is contiguously allocated in the host physical address space. This also paves the way to satisfy the 4KB page allocations by requesting free pages from the VM subsystem as opposed to hard-partitioning host memory at boot time.	2012-10-04 02:27:14 +00:00
Neel Natu	4db4fb2c25	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. Add check to vm_gpa2hpa() that the range indicated by [gpa,gpa+len) is all contained within a single 4KB page.	2012-10-03 01:18:51 +00:00
Neel Natu	bda273f21e	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. Rewrite vm_gpa2hpa() to get the GPA to HPA mapping by querying the nested page tables.	2012-10-03 00:46:30 +00:00
Neel Natu	341f19c949	Get rid of assumptions in the hypervisor that the host physical memory associated with guest physical memory is contiguous. In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether or not the address range had already been allocated. Replace this instead with an explicit API 'vm_gpa_available()' that returns TRUE if a page is available for allocation in guest physical address space.	2012-09-29 01:15:45 +00:00
John Baldwin	960b5a7080	- Re-shuffle the <machine/pc/bios.h> headers to move all kernel-specific bits under #ifdef _KERNEL but leave definitions for various structures defined by standards ($PIR table, SMAP entries, etc.) available to userland. - Consolidate duplicate SMBIOS table structure definitions in ipmi(4) and smbios(4) in <machine/pc/bios.h> and make them available to userland. MFC after: 2 weeks	2012-09-28 11:59:32 +00:00
Alan Cox	e4b8a2fc5a	Eliminate a stale comment. It describes another use case for the pmap in Mach that doesn't exist in FreeBSD.	2012-09-28 05:30:59 +00:00
Neel Natu	70593114cd	Intel VT-x provides the length of the instruction at the time of the nested page table fault. Use this when fetching the instruction bytes from the guest memory. Also modify the lapic_mmio() API so that a decoded instruction is fed into it instead of having it fetch the instruction bytes from the guest. This is useful for hardware assists like SVM that provide the faulting instruction as part of the vmexit.	2012-09-27 00:27:58 +00:00
Neel Natu	73820fb0a4	Add an option "-a" to present the local apic in the XAPIC mode instead of the default X2APIC mode to the guest.	2012-09-26 00:06:17 +00:00
Neel Natu	a2da7af6bc	Add support for trapping MMIO writes to local apic registers and emulating them. The default behavior is still to present the local apic to the guest in the x2apic mode.	2012-09-25 22:31:35 +00:00
Neel Natu	e90273829b	Add ioctls to control the X2APIC capability exposed by the virtual machine to the guest. At the moment this simply sets the state in the 'vcpu' instance but there is no code that acts upon these settings.	2012-09-25 19:08:51 +00:00
Neel Natu	edf89256dd	Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an AP needs to be activated by spinning up an execution context for it. The local apic emulation is now completely done in the hypervisor and it will detect writes to the ICR_LO register that try to bring up the AP. In response to such writes it will return to userspace with an exit code of SPINUP_AP. Reviewed by: grehan	2012-09-25 02:33:25 +00:00
Neel Natu	98ed632c63	Stash the 'vm_exit' information in each 'struct vcpu'. There is no functional change at this time but this paves the way for vm exit handler functions to easily modify the exit reason going forward.	2012-09-24 19:32:24 +00:00
Dimitry Andric	f99157cced	After r205013, amd64 and i386 CPU family and model IDs were printed out in hexadecimal, but without any 0x prefix, which can be very misleading. MFC after: 3 days	2012-09-21 10:31:19 +00:00
Neel Natu	2d3a73ed6d	Restructure the x2apic access code in preparation for supporting memory mapped access to the local apic. The vlapic code is now aware of the mode that the guest is using to access the local apic. Reviewed by: grehan@	2012-09-21 03:09:23 +00:00
Jim Harris	eb85d44f06	Integrate nvme(4) and nvd(4) into the amd64 and i386 builds. Sponsored by: Intel	2012-09-17 19:26:33 +00:00
Konstantin Belousov	c5e3d0ab11	Rename the IVY_RNG option to RDRAND_RNG. Based on submission by: Arthur Mesh <arthurmesh@gmail.com> MFC after: 2 weeks	2012-09-13 10:12:16 +00:00
Alan Cox	7336315b0a	Simplify pmap_unmapdev(). Since kmem_free() eventually calls pmap_remove(), pmap_unmapdev()'s own direct efforts to destroy the page table entries are redundant, so eliminate them. Don't set PTE_W on the page table entry in pmap_kenter{,_attr}() on MIPS. Setting PTE_W on MIPS is inconsistent with the implementation of this function on other architectures. Moreover, PTE_W should not be set, unless the pmap's wired mapping count is incremented, which pmap_kenter{,_attr}() doesn't do. MFC after: 10 days	2012-09-10 16:11:29 +00:00
Attilio Rao	324e57150d	userret() already checks for td_locks when INVARIANTS is enabled, so there is no need to check if Giant is acquired after it. Reviewed by: kib MFC after: 1 week	2012-09-08 18:27:11 +00:00
Konstantin Belousov	ef9461ba0e	Add support for new Intel on-CPU Bull Mountain random number generator, found on IvyBridge and supposedly later CPUs, accessible with RDRAND instruction. From the Intel whitepapers and articles about Bull Mountain, it seems that we do not need to perform post-processing of RDRAND results, like AES-encryption of the data with random IV and keys, which was done for Padlock. Intel claims that sanitization is performed in hardware. Make both Padlock and Bull Mountain random generators support code covered by kernel config options, for the benefit of people who prefer minimal kernels. Also add the tunables to disable hardware generator even if detected. Reviewed by: markm, secteam (simon) Tested by: bapt, Michael Moll <kvedulv@kvedulv.de> MFC after: 3 weeks	2012-09-05 13:18:51 +00:00
Alan Cox	d8f9ed32c5	Rename {_,}pmap_unwire_pte_hold() to {_,}pmap_unwire_ptp() and update the comment describing them. Both the function names and the comment had grown stale. Quite some time has passed since these pmap implementations last used the page's hold count to track the number of valid mapping within a page table page. Also, returning TRUE from pmap_unwire_ptp() rather than _pmap_unwire_ptp() eliminates a few instructions from callers like pmap_enter_quick_locked() where pmap_unwire_ptp()'s return value is used directly by a conditional statement.	2012-09-05 06:02:54 +00:00
Xin LI	0807ad7422	Add hpt27xx to GENERIC kernel for amd64 and i386 systems. MFC after: 2 weeks	2012-09-04 21:02:57 +00:00
John Baldwin	778eefa40d	Fix duplicate entries for mwl(4): - Move mwlfw from {amd64,i386}/conf/NOTES to sys/conf/NOTES (mwl(4) is already present in sys/conf/NOTES). - Remove duplicate mwl(4) entries from {amd64,i386}/conf/NOTES. - While here, add a description to the sfxge line in amd64/conf/NOTES.	2012-09-04 19:19:36 +00:00
John Baldwin	4c1044491b	Fix misspelled "Infiniband". Submitted by: gcooper MFC after: 3 days	2012-08-28 11:34:09 +00:00
Peter Grehan	177fd53318	Add sysctls to display the total and free amount of hard-wired mem for VMs # sysctl hw.vmm hw.vmm.mem_free: 2145386496 hw.vmm.mem_total: 2145386496 Submitted by: Takeshi HASEGAWA hasegaw at gmail com	2012-08-26 01:41:41 +00:00
Glen Barber	67944c4572	Grammar fix: s/NIC's/NICs/ MFC after: 3 days	2012-08-26 01:21:02 +00:00
Dag-Erling Smørgrav	e2082935f0	As discussed on -current, remove the hardcoded default maxswzone. MFC after: 3 weeks	2012-08-14 17:01:21 +00:00
Konstantin Belousov	b6d3609050	Add a hackish debugging facility to provide a bit of information about reason for generated trap. The dump of basic signal information and 8 bytes of the faulting instruction are printed on the controlling terminal of the process, if the machdep.uprintf_signal syscal is enabled. The print is the only practical way to debug traps from a.out processes I am aware of. Because I have to reimplement it each time I debug an issue with a.out support on amd64, commit the hack to main tree. MFC after: 1 week	2012-08-14 12:15:01 +00:00
Konstantin Belousov	95fd15898b	Real hardware, as opposed to QEMU, does not allow to have a call gate in long mode which transfers control to 32bit code segment. Unbreak the lcall $7,$0 implementation on amd64 by putting the 64bit user code segment' selector into call gate, and execute the 64bit trampoline which converts the return frame into 32bit format and switches back to 32bit mode for executing int $0x80 trampoline. Note that all jumps over the hoops are performed in the user mode. MFC after: 1 week	2012-08-14 12:13:27 +00:00
John Baldwin	6e4ac34b07	Remove the deassert INIT IPI from the IPI startup sequence for APs. It is not listed in the boot sequence in the MP specification (1.4), and it is explicitly ignored on modern CPUs. It was only ever required when bootstrapping systems with external APICs (that is, SMP machines with 486s), which FreeBSD has never supported (and never will). While here, tidy some comments and remove some banal ones.	2012-08-13 18:52:51 +00:00
John Baldwin	a4284ef768	Add a 10 millisecond delay after sending the initial INIT IPI. This matches the algorithm in the MP specification (1.4). Previously we were sending out the deassert INIT IPI immediately after the initial INIT IPI was sent.	2012-08-13 16:33:22 +00:00
Colin Percival	347c7fd7bf	Build modules along with the XENHVM kernels. No objections from: freebsd-xen mailing list MFC after: 1 week	2012-08-13 07:36:57 +00:00
Alan Cox	663f8700d4	The assertion that I added in r238889 could legitimately fail when a debugger creates a breakpoint. Replace that assertion with a narrower one that still achieves my objective. Reported and tested by: kib	2012-08-08 05:28:30 +00:00
Konstantin Belousov	65211d02c4	Do not apply errata 721 workaround when under hypervisor, since typical hypervisor does not implement access to the required MSR, causing #GP on boot. Reported and tested by: olgeni PR: amd64/170388 MFC after: 3 days	2012-08-07 08:36:10 +00:00
Sergey Kandaurov	16ec457aeb	Remove duplicate header inclusion of <sys/sysent.h> Discussed with: bz	2012-08-07 05:46:36 +00:00
Alan Cox	59fa03faa3	Shave off a few more cycles from the average execution time of pmap_enter() by simplifying the control flow and reducing the live range of "om".	2012-08-05 16:59:02 +00:00
Neel Natu	8124debe13	Include 'device uart' in the guest kernel.	2012-08-04 04:30:26 +00:00
Neel Natu	39c21c2db2	Force certain bits in %cr4 to be hard-wired to '1' or '0' from a guest's perspective. If we don't do this some guest OSes (e.g. Linux) will reset the CR4_VMXE bit in %cr4 with disastrous consequences. Reported by: grehan	2012-08-04 02:06:55 +00:00
Konstantin Belousov	0220d04fe3	Add lfence(). MFC after: 1 week	2012-08-01 17:24:53 +00:00
Alan Cox	879eedbc7b	Revise pmap_enter()'s handling of mapping updates that change the PTE's PG_M and PG_RW bits but not the physical page frame. First, only perform vm_page_dirty() on a managed vm_page when the PG_M bit is being cleared. If the updated PTE continues to have PG_M set, then there is no requirement to perform vm_page_dirty(). Second, flush the mapping from the TLB when PG_M alone is cleared, not just when PG_M and PG_RW are cleared. Otherwise, a stale TLB entry may stop PG_M from being set again on the next store to the virtual page. However, since the vm_page's dirty field already shows the physical page as being dirty, no actual harm comes from the PG_M bit not being set. Nonetheless, it is potentially confusing to someone expecting to see the PTE change after a store to the virtual page.	2012-08-01 16:04:13 +00:00
Konstantin Belousov	a42fa0af44	Change (unused) prototype for stmxcsr() to match reality. Noted by: jhb MFC after: 1 week	2012-07-30 19:26:02 +00:00
Alan Cox	bc27d6c608	Shave off a few more cycles from pmap_enter()'s critical section. In particular, do a little less work with the PV list lock held.	2012-07-29 18:20:49 +00:00
Neel Natu	4bff7fad95	Verify that VMX operation has been enabled by BIOS before executing the VMXON instruction. Reported by "s vas" on freebsd-virtualization@	2012-07-25 00:21:16 +00:00
Konstantin Belousov	59c1a8315c	Forcibly shut up clang warning about NULL pointer dereference. MFC after: 3 weeks	2012-07-23 19:16:31 +00:00
Konstantin Belousov	7c80fcfdba	Constently use 2-space sentence breaks. Submitted by: bde MFC after: 1 week	2012-07-21 13:53:00 +00:00
Konstantin Belousov	1965c139f1	Stop caching curpcb in the local variable. Requested by: bde MFC after: 1 week	2012-07-21 13:47:37 +00:00
Konstantin Belousov	700de5109a	The PT_I386_{GET,SET}XMMREGS and PT_{GET,SET}XSTATE operate on the stopped threads. Implementation assumes that the thread's FPU context is spilled into the PCB due to stop. This is mostly true, except when FPU state for the thread is not initialized. Then the requests operate on the garbage state which is currently left in the PCB, causing confusion. The situation is indeed observed after a signal delivery and before #NM fault on execution of any FPU instruction in the signal handler, since sendsig(9) drops FPU state for current thread, clearing PCB_FPUINITDONE. When inspecting context state for the signal handler, debugger sees the FPU state of the main program context instead of the clear state supposed to be provided to handler. Fix this by forcing clean FPU state in PCB user FPU save area by performing getfpuregs(9) before accessing user FPU save area in ptrace_machdep.c. Note: this change will be merged to i386 kernel as well, where it is much more important, since e.g. gdb on i386 uses PT_I386_GETXMMREGS to inspect FPU context on CPUs that support SSE. Amd64 version of gdb uses PT_GETFPREGS to inspect both 64 and 32 bit processes, which does not exhibit the bug. Reported by: bde MFC after: 1 week	2012-07-21 13:06:37 +00:00
Konstantin Belousov	dfa8a51288	Stop clearing x87 exceptions in the #MF handler on amd64. If user code understands FPU hardware enough to catch SIGFPE and unmask exceptions in control word, then it may as well properly handle return from SIGFPE without causing an infinite loop of #MF exceptions due to faulting instruction restart, when needed. Clearing exceptions causes information loss for handlers which do understand FPU hardware, and struct siginfo si_code member cannot be considered adequate replacement for en_sw content due to translation. Supposed reason for clearing the exceptions, which is IRQ13 handling oddities, were never applicable to amd64. Note: this change will be merged to i386 kernel as well, since we do not support IRQ13 delivery of #MF notifications for some time. Requested by: bde MFC after: 1 week	2012-07-21 13:05:34 +00:00
Konstantin Belousov	83b22b05e6	Introduce curpcb magic variable, similar to curthread, which is MD amd64. It is implemented as __pure2 inline with non-volatile asm read from pcpu, which allows a compiler to cache its results. Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb. Note that __curthread() uses magic value 0 as an offsetof(struct pcpu, pc_curthread). It seems to be done this way due to machine/pcpu.h needs to be processed before sys/pcpu.h, because machine/pcpu.h contributes machine-depended fields to the struct pcpu definition. As result, machine/pcpu.h cannot use struct pcpu yet. The __curpcb() also uses a magic constant instead of offsetof(struct pcpu, pc_curpcb) for the same reason. The constants are now defined as symbols and CTASSERTs are added to ensure that future KBI changes do not break the code. Requested and reviewed by: bde MFC after: 3 weeks	2012-07-19 19:09:12 +00:00
Alan Cox	3088e08c4b	Don't unnecessarily set PGA_REFERENCED in pmap_enter().	2012-07-19 05:34:19 +00:00
Konstantin Belousov	bc84db6267	On AMD64, provide siginfo.si_code for floating point errors when error occurs using the SSE math processor. Update comments describing the handling of the exception status bits in coprocessors control words. Remove GET_FPU_CW and GET_FPU_SW macros which were used only once. Prefer to use curpcb to access pcb_save over the longer path of referencing pcb through the thread structure. Based on the submission by: Ed Alley <wea llnl gov> PR: amd64/169927 Reviewed by: bde MFC after: 3 weeks	2012-07-18 15:43:47 +00:00
Konstantin Belousov	a81f9fed5d	Add stmxcsr. Submitted by: Ed Alley <wea llnl gov> PR: amd64/169927 MFC after: 3 weeks	2012-07-18 15:36:03 +00:00
Konstantin Belousov	333d0c6060	Add support for the XSAVEOPT instruction use. Our XSAVE/XRSTOR usage mostly meets the guidelines set by the Intel SDM: 1. We use XRSTOR and XSAVE from the same CPL using the same linear address for the store area 2. Contrary to the recommendations, we cannot zero the FPU save area for a new thread, since fork semantic requires the copy of the previous state. This advice seemingly contradicts to the advice from the item 6. 3. We do use XSAVEOPT in the context switch code only, and the area for XSAVEOPT already always contains the data saved by XSAVE. 4. We do not modify the save area between XRSTOR, when the area is loaded into FPU context, and XSAVE. We always spit the fpu context into save area and start emulation when directly writing into FPU context. 5. We do not use segmented addressing to access save area, or rather, always address it using %ds basing. 6. XSAVEOPT can be only executed in the area which was previously loaded with XRSTOR, since context switch code checks for FPU use by outgoing thread before saving, and thread which stopped emulation forcibly get context loaded with XRSTOR. 7. The PCB cannot be paged out while FPU emulation is turned off, since stack of the executing thread is never swapped out. The context switch code is patched to issue XSAVEOPT instead of XSAVE if supported. This approach eliminates one conditional in the context switch code, which would be needed otherwise. For user-visible machine context to have proper data, fpugetregs() checks for unsaved extension blocks and manually copies pristine FPU state into them, according to the description provided by CPUID leaf 0xd. MFC after: 1 month	2012-07-14 15:48:30 +00:00
Alan Cox	b9592bdab3	Wring a few cycles out of pmap_enter(). In particular, on a user-space pmap, avoid walking the page table twice.	2012-07-13 04:10:41 +00:00
Peter Grehan	b652778e42	IFC @ r238370	2012-07-11 19:54:21 +00:00
John Baldwin	d706ec297a	Add a clts() wrapper around the 'clts' instruction to <machine/cpufunc.h> on x86 and use that to implement stop_emulating() in the fpu/npx code. Reimplement start_emulating() in the non-XEN case by using load_cr0() and rcr0() instead of the 'lmsw' and 'smsw' instructions. Intel explicitly discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in the description of these instructions in Volume 2 of the ADM. Reviewed by: kib MFC after: 1 month	2012-07-09 20:55:39 +00:00
John Baldwin	5355f65974	Partially revert r217515 so that the mem_range_softc variable is always present on x86 kernels. This fixes the build of kernels that include 'device acpi' but do not include 'device mem'. MFC after: 1 month	2012-07-09 20:42:08 +00:00
Konstantin Belousov	f18d5bf44b	Use assembler mnemonic instead of manually assembling, contination for r238142. Reviewed by: jhb MFC after: 1 month	2012-07-06 20:11:58 +00:00
John Baldwin	6632f45773	Several fixes to the amd64 disassembler: - Add generic support for opcodes that are escape bytes used for multi-byte opcodes (such as the 0x0f prefix). Use this to replace the hard-coded 0x0f special case and add support for three-byte opcodes that use the 0x0f38 prefix. - Decode all Intel VMX instructions. invept and invvpid in particular are three-byte opcodes that use the 0x0f38 escape prefix. - Rework how the special 'SDEP' size flag works such that the default instruction name (i_name) is the instruction when the data size prefix (0x66) is not specified, and the alternate name in i_extra is used when the prefix is included. - Add a new 'ADEP' size flag similar to 'SDEP' except that it chooses between i_name and i_extra based on the address size prefix (0x67). Use this to fix the decoding for jrcxz vs jecxz which is determined by the address size prefix, not the operand size prefix. Also, jcxz is not possible in 64-bit mode, but jrcxz is the default instruction for that opcode. - Add support for handling instructions that have a mandatory 'rep' prefix (this means not outputting the 'repe ' prefix until determining if it is used as part of an opcode). Make 'pause' less of a special case this way. - Decode 'cmpxchg16b' and 'cdqe' which are variants of other instructions but with a REX.W prefix. MFC after: 1 month	2012-07-06 14:25:59 +00:00
Alan Cox	cc861283f4	Make pmap_enter()'s management of PV entries consistent with the other pmap functions that manage PV entries. Specifically, remove the PV entry from the containing PV list only after the corresponding PTE is destroyed. Update the pmap's wired mapping count in pmap_enter() before the PV list lock is acquired.	2012-07-06 06:42:25 +00:00
John Baldwin	7574a595f2	Now that our assembler supports the xsave family of instructions, use them natively rather than hand-assembled versions. For xgetbv/xsetbv, add a wrapper API to deal with xcr* registers: rxcr() and load_xcr(). Reviewed by: kib MFC after: 1 month	2012-07-05 18:19:35 +00:00
Alan Cox	8f2994ce67	Calculate the new PTE value in pmap_enter() before acquiring any locks. Move an assertion to the beginning of pmap_enter().	2012-07-05 07:20:16 +00:00
Alan Cox	1bc8531c1e	Correct an error in r237513. The call to reserve_pv_entries() must come before pmap_demote_pde() updates the PDE. Otherwise, pmap_pv_demote_pde() can crash. Crash reported by: kib Patch tested by: kib	2012-07-05 00:08:47 +00:00
John Baldwin	66f9aec075	Decode the 'xsave', 'xrstor', 'xsaveopt', 'xgetbv', 'xsetbv', and 'rdtscp' instructions. MFC after: 1 month	2012-07-04 16:47:39 +00:00
Xin LI	309dca0171	tws(4) is interfaced with CAM so move it to the same section. Reported by: joel MFC after: 3 days	2012-07-01 08:10:49 +00:00
Alan Cox	2bde6e3518	Optimize reserve_pv_entries() using the popcnt instruction.	2012-06-30 20:25:12 +00:00
Alan Cox	92e2574577	In r237592, I forgot that pmap_enter() might already hold a PV list lock at the point that it calls get_pv_entry(). Thus, pmap_enter()'s PV list lock pointer must be passed to get_pv_entry() for those rare occasions when get_pv_entry() calls reclaim_pv_chunk(). Update some related comments.	2012-06-29 18:15:56 +00:00
Alan Cox	6c67613030	Avoid some unnecessary PV list locking in pmap_enter().	2012-06-28 22:03:59 +00:00
Alan Cox	23e59dfa8d	Optimize pmap_pv_demote_pde().	2012-06-28 05:42:04 +00:00
Alan Cox	e30df26e7b	Add new pmap layer locks to the predefined lock order. Change the names of a few existing VM locks to follow a consistent naming scheme.	2012-06-27 03:45:25 +00:00
Alan Cox	5b5b0ef34d	Introduce RELEASE_PV_LIST_LOCK().	2012-06-26 16:45:18 +00:00
Alan Cox	0d646df757	Add PV list locking to pmap_enter(). Its execution is no longer serialized by the pvh global lock. Add a needed atomic operation to pmap_object_init_pt().	2012-06-26 06:02:43 +00:00
Alan Cox	aaf3bc56fd	Add PV chunk and list locking to pmap_change_wiring(), pmap_protect(), and pmap_remove(). The execution of these functions is no longer serialized by the pvh global lock. Make some stylistic changes to the affected code for the sake of consistency with related code elsewhere in the pmap.	2012-06-25 07:13:25 +00:00
Alan Cox	f745b16359	Introduce reserve_pv_entry() and use it in pmap_pv_demote_pde(). In order to add PV list locking to pmap_pv_demote_pde(), it is necessary to change the way that pmap_pv_demote_pde() allocates PV entries. Specifically, once pmap_pv_demote_pde() begins modifying the PV lists, it can't allocate any new PV chunks, because that could require the PV list lock to be dropped. So, all necessary PV chunks must be allocated in advance. To my surprise, this new approach is a few percent faster than the old one.	2012-06-23 22:54:25 +00:00
Konstantin Belousov	aea810386d	Implement mechanism to export some kernel timekeeping data to usermode, using shared page. The structures and functions have vdso prefix, to indicate the intended location of the code in some future. The versioned per-algorithm data is exported in the format of struct vdso_timehands, which mostly repeats the content of in-kernel struct timehands. Usermode reading of the structure can be lockless. Compatibility export for 32bit processes on 64bit host is also provided. Kernel also provides usermode with indication about currently used timecounter, so that libc can fall back to syscall if configured timecounter is unknown to usermode code. The shared data updates are initiated both from the tc_windup(), where a fast task is queued to do the update, and from sysctl handlers which change timecounter. A manual override switch kern.timecounter.fast_gettime allows to turn off the mechanism. Only x86 architectures export the real algorithm data, and there, only for tsc timecounter. HPET counters page could be exported as well, but I prefer to not further glue the kernel and libc ABI there until proper vdso-based solution is developed. Minimal stubs neccessary for non-x86 architectures to still compile are provided. Discussed with: bde Reviewed by: jhb Tested by: flo MFC after: 1 month	2012-06-22 07:06:40 +00:00
Konstantin Belousov	232aa31fb9	Reserve AT_TIMEKEEP auxv entry for providing usermode the pointer to timekeeping information. MFC after: 1 week	2012-06-22 06:38:31 +00:00
Alan Cox	240cc83f55	Introduce CHANGE_PV_LIST_LOCK_TO_{PHYS,VM_PAGE}() to avoid duplication of code.	2012-06-22 05:01:36 +00:00
Alan Cox	290d3e6395	Update the PV stats in free_pv_entry() using atomics. After which, it is no longer necessary for free_pv_entry() to be serialized by the pvh global lock. Retire pmap_insert_entry() and pmap_remove_entry(). Once upon a time, these functions were called from multiple places within the pmap. Now, each has only one caller.	2012-06-21 16:37:36 +00:00
Alan Cox	7ed5b3afa2	Add PV list locking to pmap_copy(), pmap_enter_object(), and pmap_enter_quick(). These functions are no longer serialized by the pvh global lock. There is no need to release the PV list lock before calling free_pv_chunk() in pmap_remove_pages().	2012-06-20 07:25:20 +00:00
Alan Cox	2f49b6b831	Condition the implementation of pv_entry_count on PV_STATS. On amd64, pv_entry_count is purely informational. It does not serve any functional purpose. Add PV chunk locking to get_pv_entry().	2012-06-19 08:12:44 +00:00
Navdeep Parhar	09fe63205c	- Updated TOE support in the kernel. - Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs. These are available as t3_tom and t4_tom modules that augment cxgb(4) and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as usual with or without these extra features. - iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the works and will follow soon. Build-tested with make universe. 30s overview ============ What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the capabilities of an interface: # ifconfig -m \| grep TOE Enable/disable TCP offload on an interface (just like any other ifnet capability): # ifconfig cxgbe0 toe # ifconfig cxgbe0 -toe Which connections are offloaded? Look for toe4 and/or toe6 in the output of netstat and sockstat: # netstat -np tcp \| grep toe # sockstat -46c \| grep toe Reviewed by: bz, gnn Sponsored by: Chelsio communications. MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)	2012-06-19 07:34:13 +00:00
Konstantin Belousov	c59f3d4d22	Adjust the fix in r236953, by not generating the signal manually, but performing the return to usermode using full return path. This consolidates the handling of exceptional situations in less number of places, and is less code as well. Reviewed by: jhb MFC after: 1 week	2012-06-18 21:08:48 +00:00
Alan Cox	06de588446	Add PV chunk and list locking to pmap_page_exists_quick(), pmap_page_is_mapped(), and pmap_remove_pages(). These functions are no longer serialized by the pvh global lock.	2012-06-18 16:21:59 +00:00
Alan Cox	6031c68de4	The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap layer, but it is read directly by the MI VM layer. This change introduces pmap_page_is_write_mapped() in order to completely encapsulate all direct access to PGA_WRITEABLE in the pmap layer. Aesthetics aside, I am making this change because amd64 will likely begin using an alternative method to track write mappings, and having pmap_page_is_write_mapped() in place allows me to make such a change without further modification to the MI VM layer. As an added bonus, tidy up some nearby comments concerning page flags. Reviewed by: kib MFC after: 6 weeks	2012-06-16 18:56:19 +00:00
Adrian Chadd	83567110bd	Oops - use the actual 11n enable option.	2012-06-15 15:32:16 +00:00
Adrian Chadd	3342d83059	Ok, ok. 802.11n can be on by default in GENERIC in -HEAD. God help me.	2012-06-15 02:16:29 +00:00
Alan Cox	90407113a7	Update a couple comments to reflect r235598. X-MFC after: r235598	2012-06-14 17:47:54 +00:00
Alan Cox	62657c50df	Correctly identify the function in a KASSERT(). MFC after: 3 days	2012-06-14 17:40:49 +00:00

... 2 3 4 5 6 ...

6518 Commits