Commit Graph

6541 Commits

Author SHA1 Message Date
Konstantin Belousov
e9249a80f6 The ia32_get_mcontext() does not need to set PCB_FULL_IRET. The
usermode context state is not changed by the get operation, and
get_mcontext() does not require full iret as well.

Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:31:15 +00:00
Konstantin Belousov
80b5691a76 When reporting the fault details, also print %rsp.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:29:20 +00:00
Konstantin Belousov
6806ce6ec8 When handling an exception from the attempt from loading the faulting
context on return from the trap handler, re-enable the interrupts on
i386 and amd64.  The trap return path have to disable interrupts since
the sequence of loading the machine state is not atomic.  The trap()
function which transfers the control to the special handler would
enable the interrupt, but an iret loads the previous eflags with PSL_I
clear.  Then, the special handler calls trap() on its own, which now
sees the original eflags with PSL_I set and does not enable
interrupts.

The end result is that signal delivery and process exiting code could
be executed with interrupts disabled, which is generally wrong and
triggers several assertions.

For amd64, the interrupts are enabled conditionally based on PSL_I in
the eflags of the outer frame, as it is already done for
doreti_iret_fault.  For i386, the interrupts are enabled
unconditionally, the ast loop could have opened a window with
interrupts enabled just before the iret anyway.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:26:08 +00:00
Achim Leubner
dce93cd06d Driver 'aacraid' added. Supports Adaptec by PMC RAID controller families Series 6, 7, 8 and upcoming products. Older Adaptec RAID controller families are supported by the 'aac' driver.
Approved by:	scottl (mentor)
2013-05-24 09:22:43 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4f493a25f4 Fix the hardware watchpoints on SMP amd64. Load the updated %dr
registers also on other CPUs, besides the CPU which happens to execute
the ddb.  The debugging registers are stored in the pcpu area,
together with the command which is executed by the IPI stop handler
upon resume.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:24:32 +00:00
Konstantin Belousov
ea38ca6ea5 Add amd64-specific ddb command 'show phys2dmap', which calculates the
address in the direct map corresponding to the given physical address.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:07:12 +00:00
Marcel Moolenaar
cb34ed4434 Add basic support for FDT to i386 & amd64. This change includes:
1.  Common headers for fdt.h and ofw_machdep.h under x86/include
    with indirections under i386/include and amd64/include.
2.  New modinfo for loader provided FDT blob.
3.  Common x86_init_fdt() called from hammer_time() on amd64 and
    init386() on i386.
4.  Split-off FDT specific low-level console functions from FDT
    bus methods for the uart(4) driver. The low-level console
    logic has been moved to uart_cpu_fdt.c and is used for arm,
    mips & powerpc only. The FDT bus methods are shared across
    all architectures.
5.  Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the
    fdt_pic_table[] arrays. Both are empty right now.

FDT addresses are I/O ports on x86. Since the core FDT code does
not handle different address spaces, adding support for both I/O
ports and memory addresses requires some thought and discussion.
It may be better to use a compile-time option that controls this.

Obtained from:	Juniper Networks, Inc.
2013-05-21 03:05:49 +00:00
Ed Schouten
31ffd8039d Improve readability of static assertions for OFFSET_* macros.
Instead of doing all sorts of weird casting of constants to
pointer-pointers, simply use the standard C offsetof() macro to obtain
the offset of the respective fields in the structures.
2013-05-13 21:47:17 +00:00
Peter Wemm
dda759d344 Tidy up some CVS workarounds. 2013-05-12 01:53:47 +00:00
Rui Paulo
009b3a1c3f Fix several standard extended feature bits.
Submitted by:	Oliver Pinter <oliver.pntr at gmail.com>
2013-05-11 01:31:51 +00:00
Neel Natu
0acb0d84c5 Support array-type of stats in bhyve.
An array-type stat in vmm.ko is defined as follows:
VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu");

It is incremented as follows:
vmm_stat_array_incr(vm, vcpuid, IPIS_SENT, array_index, 1);

And output of 'bhyvectl --get-stats' looks like:
ipis sent to vcpu[0]     3114
ipis sent to vcpu[1]     0

Reviewed by:	grehan
Obtained from:	NetApp
2013-05-10 02:59:49 +00:00
Dmitry Chagin
d127f15308 Retire write-only PCB_GS32BIT pcb flag on amd64. 2013-05-09 21:42:43 +00:00
Konstantin Belousov
241b67bb47 Correct the type for the literal used on the left side of the shift up
to 63 bit positions.

Do not fill the save area and do not set the saved bit in the xstate
bit vector for the state which is not marked as enabled in xsave_mask.

Reported and tested by:	Jim Ohlstein <jim@ohlste.in>
MFC after:	3 days
2013-05-09 17:25:29 +00:00
Attilio Rao
941646f5ec Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept.  The change should also be useful
for consolidation and consistency.

Sponsored by:	EMC / Isilon storage division
Obtained from:	jeff
Reviewed by:	alc
2013-05-07 22:46:24 +00:00
Ed Maste
f5efbffe52 Switch to standard copyright license text
The initial version of this came from Sandvine but had "PROVIDED BY NETAPP,
INC" in the copyright text, presuambly because the license block was copied
from another file.  Replace it with standard "AUTHOR AND CONTRIBUTORS" form.

Approvided by: grehan@
2013-05-02 12:35:15 +00:00
Konstantin Belousov
14f525595c Partially saved extended state must be handled always, i.e. for both
fpu-owned context, and for pcb-saved one.  More, the XSAVE could do
partial save, same as XSAVEOPT, so qualifier for the handler should be
use_xsave and not use_xsaveopt.

Since xsave_area_desc is now needed regardless of the XSAVEOPT use,
remove the write-only use_xsaveopt variable.

In collaboration with:	jhb
MFC after:	1 week
2013-05-01 20:08:33 +00:00
Konstantin Belousov
6f2d9906a9 The check to ensure that xstate_bv always has XFEATURE_ENABLED_X87 and
XFEATURE_ENABLED_SSE bits set is not needed.  CPU correctly handles
any bitmask which is subset of the enabled bits in %XCR0.

More, CPU instructions XSAVE and XSAVEOPT could write the mask without
e.g. XFEATURE_ENABLED_SSE, after the VZEROALL.  The check prevents the
restoration of the otherwise valid FPU save area.

In collaboration with:	jhb
MFC after:	1 week
2013-05-01 20:03:50 +00:00
Carl Delsey
e47937d1b7 Add a new driver to support the Intel Non-Transparent Bridge(NTB).
The NTB allows you to connect two systems with this device using a PCI-e
link. The driver is made of two modules:
 - ntb_hw which is a basic hardware abstraction layer for the device.
 - if_ntb which implements the ntb network device and the communication
   protocol.

The driver is limited at the moment to CPU memcpy instead of using DMA, and
only Back-to-Back mode is supported. Also the network device isn't full
featured yet. These changes will be coming soon. The DMA change will also
bring in the ioat driver from the project branch it is on now.

This is an initial port of the GPL/BSD Linux driver contributed by Jon Mason
from Intel. Any bugs are my contributions.

Sponsored by: Intel
Reviewed by: jimharris, joel (man page only)
Approved by: jimharris (mentor)
2013-04-29 22:48:53 +00:00
Peter Grehan
d3c11f40a5 Add RIP-relative addressing to the instruction decoder.
Rework the guest register fetch code to allow the RIP to
be extracted from the VMCS while the kernel decoder is
functioning.

Hit by the OpenBSD local-apic code.

Submitted by:	neel
Reviewed by:	grehan
Obtained from:	NetApp
2013-04-25 04:56:43 +00:00
Rui Paulo
068e8f74e4 Print RDSEED, ADX, and SMAP.
Pointed out by:	kib
2013-04-18 01:21:44 +00:00
Gabor Kovesdan
a8b5c2a0aa - Correct spelling in comments
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:56:11 +00:00
Rui Paulo
d1dcd93145 Print more bits from the standard extended features CPUID which will be
available in the Haswell architecture (c.f. Intel Document #319433-012A).
2013-04-17 06:51:17 +00:00
Neel Natu
3565b59ec0 Create sysctl node 'hw.vmm.vmx' and populate it with oids that expose the VMX
hardware capabilities.

Obtained from:	NetApp
2013-04-13 21:41:51 +00:00
Konstantin Belousov
fcb29b9210 Fix the name of the pcb member in the comments.
Submitted by:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	3 days
2013-04-13 15:20:33 +00:00
Neel Natu
26d66b9d58 Use the MAKEDEV_CHECKNAME flag to check for an invalid device name and return
an error instead of panicking.

Obtained from:	NetApp
2013-04-13 05:11:21 +00:00
Edward Tomasz Napierala
8ed9860914 Remove ctl(4) from GENERIC. Also remove 'options CTL_DISABLE'
and kern.cam.ctl.disable tunable; those were introduced as a workaround
to make it possible to boot GENERIC on low memory machines.

With ctl(4) being built as a module and automatically loaded by ctladm(8),
this makes CTL work out of the box.

Reviewed by:	ken
Sponsored by:	FreeBSD Foundation
2013-04-12 16:25:03 +00:00
Neel Natu
d5408b1d26 If vmm.ko could not be initialized correctly then prevent the creation of
virtual machines subsequently.

Submitted by:	Chris Torek
2013-04-12 01:16:52 +00:00
Neel Natu
150369ab7c Make the code to check if VMX is enabled more readable by using macros
instead of magic numbers.

Discussed with:	Chris Torek
2013-04-11 04:29:45 +00:00
Neel Natu
1472b87f2f Unsynchronized TSCs on the host require special handling in bhyve:
- use clock_gettime(2) as the time base for the emulated ACPI timer instead
  of directly using rdtsc().

- don't advertise the invariant TSC capability to the guest to discourage it
  from using the TSC as its time base.

Discussed with:	jhb@ (about making 'smp_tsc' a global)
Reported by:	Dan Mack on freebsd-virtualization@
Obtained from:	NetApp
2013-04-10 05:59:07 +00:00
Gleb Smirnoff
4e76af6a41 Merge from projects/counters: counter(9).
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.

See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.

In collaboration with:	kib
Reviewed by:		luigi
Tested by:		ae, ray
Sponsored by:		Nginx, Inc.
2013-04-08 19:40:53 +00:00
Gleb Smirnoff
17dece86fe Merge from projects/counters:
Pad struct pcpu so that its size is denominator of PAGE_SIZE. This
is done to reduce memory waste in UMA_PCPU_ZONE zones.

Sponsored by:	Nginx, Inc.
2013-04-08 19:19:10 +00:00
Peter Grehan
117e8f378e Don't panic when a valid divisor of 1 has been requested.
Obtained from:	NetApp
2013-04-05 22:16:31 +00:00
Alexander Motin
45f6d66569 Remove all legacy ATA code parts, not used since options ATA_CAM enabled in
most kernels before FreeBSD 9.0.  Remove such modules and respective kernel
options: atadisk, ataraid, atapicd, atapifd, atapist, atapicam.  Remove the
atacontrol utility and some man pages.  Remove useless now options ATA_CAM.

No objections:	current@, stable@
MFC after:	never
2013-04-04 07:12:24 +00:00
Neel Natu
77d8fd9bb3 Add counter to keep track of the number of timer interrupts generated by
the local apic for each virtual cpu.
2013-03-31 03:56:48 +00:00
Neel Natu
b5aaf7b22b Add some more stats to keep track of all the reasons that a vcpu is exiting. 2013-03-30 17:46:03 +00:00
Neel Natu
66f71b7d24 Allow caller to skip 'guest linear address' validation when doing instruction
decode. This is to accomodate hardware assist implementations that do not
provide the 'guest linear address' as part of nested page fault collateral.

Submitted by:	Anish Gupta (akgupt3 at gmail dot com)
2013-03-28 21:26:19 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
Attilio Rao
774d251d99 Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
  resident/cached pages collections as the per-vm_page_t splay iterators
  are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations.  In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone.  This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves.  However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan.  It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by:	EMC / Isilon storage division
In collaboration with:	alc, jeff
Tested by:	flo, pho, jhb, davide
Tested by:	ian (arm)
Tested by:	andreast (powerpc)
2013-03-18 00:25:02 +00:00
Neel Natu
3f23d3ca9f Fix the '-Wtautological-compare' warning emitted by clang for comparing the
unsigned enum type with a negative value.

Obtained from:	NetApp
2013-03-16 22:53:05 +00:00
Neel Natu
61592433eb Allow vmm stats to be specific to the underlying hardware assist technology.
This can be done by using the new macros VMM_STAT_INTEL() and VMM_STAT_AMD().
Statistic counters that are common across the two are defined using VMM_STAT().

Suggested by:	Anish Gupta
Discussed with:	grehan
Obtained from:	NetApp
2013-03-16 22:40:20 +00:00
Konstantin Belousov
e8a4a618cf Add pmap function pmap_copy_pages(), which copies the content of the
pages around, taking array of vm_page_t both for source and
destination.  Starting offsets and total transfer size are specified.

The function implements optimal algorithm for copying using the
platform-specific optimizations.  For instance, on the architectures
were the direct map is available, no transient mappings are created,
for i386 the per-cpu ephemeral page frame is used.  The code was
typically borrowed from the pmap_copy_page() for the same
architecture.

Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.

For sparc64, the existing code has known issues and a stab is added
instead, to allow the kernel linking.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after:	2 weeks
2013-03-14 20:18:12 +00:00
Alan Cox
9f585991ba The kernel pmap is statically allocated, so there is really no need to
explicitly initialize its pm_root field to zero.

Sponsored by:	EMC / Isilon Storage Division
2013-03-10 21:07:44 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Bryan Venteicher
0cfbcf8c7b Remove the virtio dependency entry for the VirtIO device drivers. This
will prevent the kernel from linking if the device driver are included
without the virtio module. Remove pci and scbus for the same reason.

Also explain the relationship and necessity of the virtio and virtio_pci
modules. Currently in FreeBSD, we only support VirtIO PCI, but it could
be replaced with a different interface (like MMIO) and the device
(network, block, etc) will still function.

Requested by:	luigi
Approved by:	grehan (mentor)
MFC after:	3 days
2013-03-06 07:17:53 +00:00
Kenneth D. Merry
3a45b4781a Re-enable CTL in GENERIC on i386 and amd64, but turn on the CTL disable
tunable by default.

This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel.  They
can simply set kern.cam.ctl.disable=0 in loader.conf.

The eventual solution to the memory usage problem is to change the way
CTL allocates memory to be more configurable, but this should fix things
for small memory situations in the mean time.

UPDATING:		Explain the change in the CTL configuration, and
			how users can enable CTL if they would like to use
			it.

sys/conf/options:	Add a new option, CTL_DISABLE, that prevents CTL
			from initializing.

ctl.c:			If CTL_DISABLE is turned on, don't initialize.

i386/conf/GENERIC,
amd64/conf/GENERIC:	Re-enable device ctl, and add the CTL_DISABLE
			option.
2013-03-04 21:18:45 +00:00
Attilio Rao
b38d37f7b5 Merge from vmc-playground branch:
Rename the pv_entry_t iterator from pv_list to pv_next.
Besides being more correct technically (as the name seems to suggest
this is a list while it is an iterator), it will also be needed by
vm_radix work to avoid a nameclash on macro expansions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc, jeff
Tested by:	flo, pho, jhb, davide
2013-03-02 14:19:08 +00:00
Adrian Chadd
fe138cc2af Disable the ctl driver in GENERIC.
It unfortunately steals a fair chunk of RAM at startup even if it's not
actively used, which prevents FreeBSD VMs of 128MB from successfully
booting and running.
2013-03-02 08:12:41 +00:00
Davide Italiano
acccf7d8b4 MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.

The commmit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
Attilio Rao
dc1558d1cd Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-27 18:12:13 +00:00
Konstantin Belousov
31a53cd036 Convert machine/elf.h, machine/frame.h, machine/sigframe.h,
machine/signal.h and machine/ucontext.h into common x86 includes,
copying from amd64 and merging with i386.

Kernel-only compat definitions are kept in the i386/include/sigframe.h
and i386/include/signal.h, to reduce amd64 kernel namespace pollution.
The amd64 compat uses its own definitions so far.

The _MACHINE_ELF_WANT_32BIT definition is to allow the
sys/boot/userboot/userboot/elf32_freebsd.c to use i386 ELF definitions
on the amd64 compile host.  The same hack could be usefully abused by
other code too.
2013-02-20 17:39:52 +00:00
Jung-uk Kim
00a54dfb1c Consistently use round_page(x) rather than roundup(x, PAGE_SIZE). There is
no functional change.
2013-02-15 22:43:08 +00:00
Konstantin Belousov
bf94adb3e1 Print slightly more useful information on the 'bad pte' panic.
No objections from:	alc
MFC after:	1 week
2013-02-14 19:22:15 +00:00
Konstantin Belousov
252b1f6e22 Assert that user address is never qremoved.
No objections from:	alc
MFC after:	1 week
2013-02-14 19:21:20 +00:00
Neel Natu
25448de222 Requests for invalid CPUID leaves should map to the highest known leaf instead.
Reviewed by:	grehan
Obtained from:	NetApp
2013-02-13 23:22:17 +00:00
Neel Natu
485b3300cc Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'.
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.

The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().

Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.

Discussed with:	grehan
2013-02-11 20:36:07 +00:00
Neel Natu
6d62a48f47 Compute the number of initial kernel page table pages (NKPT) dynamically.
This eliminates the need to recompile the kernel when the default value
of NKPT is not big enough - for e.g. when loading large kernel modules
or memory disk images from the loader.

If NKPT is defined in the kernel configuration file then it overrides the
dynamic calculation.

Reviewed by:	alc, kib
2013-02-06 04:53:00 +00:00
Andriy Gapon
1a89ca4cf5 cpususpend_handler: mark AP as resumed only after fully setting up lapic
Reviewed by:	jhb
Tested by:	Sergey V. Dyatko <sergey.dyatko@gmail.com>,
		KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
MFC after:	12 days
2013-02-02 12:04:32 +00:00
Andriy Gapon
548b201607 x86 suspend/resume: suspend pics and pseudo-pics in reverse order
- change 'pics' from STAILQ to TAILQ
- ensure that Local APIC is always first in 'pics'

Reviewed by:	jhb
Tested by:	Sergey V. Dyatko <sergey.dyatko@gmail.com>,
		KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
MFC after:	12 days
2013-02-02 12:02:42 +00:00
Eitan Adler
4752ed3d7f Remove support for plip from the GENERIC kernel as no systems in the
last 10 years require this support.

Discussed with:	db
Discussed with:	kib
Reviewed by:	imp
Reviewed by:	jhb
Reviewed by:	-hackers
Approved by:	cperciva (mentor)
2013-02-01 20:17:11 +00:00
Neel Natu
2b89a04496 Fix a broken assumption in the passthru implementation that the MSI-X table
can only be located at the beginning or the end of the BAR.

If the MSI-table is located in the middle of a BAR then we will split the
BAR into two and create two mappings - one before the table and one after
the table - leaving a hole in place of the table so accesses to it can be
trapped and emulated.

Obtained from:	NetApp
2013-02-01 03:49:09 +00:00
Neel Natu
07044a96d8 Increase the number of passthru devices supported by bhyve.
The maximum length of an environment variable puts a limitation on the
number of passthru devices that can be specified via a single variable.
The workaround is to allow user to specify passthru devices via multiple
environment variables instead of a single one.

Obtained from:	NetApp
2013-02-01 01:16:26 +00:00
Neel Natu
8faceb3292 Add emulation support for instruction "88/r: mov r/m8, r8".
This instruction moves a byte from a register to a memory location.

Tested by: tycho nightingale at pluribusnetworks com
2013-01-30 04:09:09 +00:00
John Baldwin
d825ce0a5d Reduce duplication between i386/linux/linux.h and amd64/linux32/linux.h
by moving bits that are MI out into headers in compat/linux.

Reviewed by:	Chagin Dmitry  dmitry | gmail
MFC after:	2 weeks
2013-01-29 18:41:30 +00:00
Peter Grehan
1fb0ea3f1a Always allow access to the sysenter cs/esp/eip MSRs since they
are automatically saved and restored in the VMCS.

Reviewed by:	neel
Obtained from:	NetApp
2013-01-25 21:38:31 +00:00
John Baldwin
fb709557a3 Don't assume that all Linux TCP-level socket options are identical to
FreeBSD TCP-level socket options (only the first two are).  Instead,
using a mapping function and fail unsupported options as we do for other
socket option levels.

MFC after:	2 weeks
2013-01-23 21:44:48 +00:00
Neel Natu
e3f0800bd1 Postpone vmm module initialization until after SMP is initialized - particularly
that 'smp_started != 0'.

This is required because the VT-x initialization calls smp_rendezvous()
to set the CR4_VMXE bit on all the cpus.

With this change we can preload vmm.ko from the loader.

Reported by:	alfred@, sbruno@
Obtained from:	NetApp
2013-01-21 01:33:10 +00:00
Neel Natu
912a3e678a Add svn properties to the recently merged bhyve source files.
The pre-commit hook will not allow any commits without the svn:keywords
property in head.
2013-01-20 03:42:49 +00:00
Neel Natu
c458fc1ed4 Merge projects/bhyve to head.
'bhyve' was developed by grehan@ and myself at NetApp (thanks!).

Special thanks to Peter Snyder, Joe Caradonna and Michael Dexter for their
support and encouragement.

Obtained from:	NetApp
2013-01-19 04:18:52 +00:00
John Baldwin
b5821c6f0e Fix build with SMP disabled.`
Reported by:	bf
2013-01-19 01:18:22 +00:00
John Baldwin
f876ffeae3 Don't attempt to use clflush on the local APIC register window. Various
CPUs exhibit bad behavior if this is done (Intel Errata AAJ3, hangs on
Pentium-M, and trashing of the local APIC registers on a VIA C7).  The
local APIC is implicitly mapped UC already via MTRRs, so the clflush isn't
necessary anyway.

MFC after:	2 weeks
2013-01-17 21:32:25 +00:00
Neel Natu
c2217b9848 IFC @ r245509 2013-01-17 07:04:37 +00:00
Bryan Venteicher
ae366ffcbd Add VirtIO to the i386 and amd64 GENERIC kernels
This also removes the kludge from r239009 that covered only
the network driver.

Reviewed by:	grehan
Approved by:	grehan (mentor)
MFC after:	1 week
2013-01-13 07:14:16 +00:00
Neel Natu
8a60b77db8 IFC @ r245205 2013-01-09 03:32:23 +00:00
Neel Natu
1b54fbe69d IFC @ r245178 2013-01-09 02:26:50 +00:00
Neel Natu
95102a8bcb Add a "pause" to busy wait loops in the cpu reset path.
This should not matter much when running on bare metal but it makes the guest
more friendly when running inside a virtual machine.

Discussed with:	jhb
Obtained from:	NetApp
2013-01-09 02:11:16 +00:00
Neel Natu
03429e45a7 Revert changes for x2apic support from projects/bhyve.
During the early days of bhyve it did not support instruction emulation
which necessitated the use of x2apic to access the local apic. This is no
longer the case and the dependency on x2apic has gone away.

The x2apic patches can be considered independently of bhyve and will be
merged into head via projects/x2apic.

Discussed with:	grehan
2013-01-06 05:37:26 +00:00
Neel Natu
2d28bff346 bhyve does not require a custom configuration file anymore so make the GENERIC
identical to the one in HEAD.

Obtained from:	NetApp
2013-01-05 03:35:30 +00:00
Neel Natu
46b1c55d9e IFC @ r244983. 2013-01-04 19:28:32 +00:00
Neel Natu
23ce7fedb4 There is no need for a special 'BHYVE' kernel configuration file anymore -
'GENERIC' works fine.

Obtained from:	NetApp
2013-01-04 03:02:43 +00:00
Neel Natu
014a52f3a6 There is no need for 'start_emulating()' and 'stop_emulating()' to be defined
in <machine/cpufunc.h> so remove them from there.

Obtained from:	NetApp
2013-01-04 02:49:12 +00:00
Neel Natu
5f0677d392 The "unrestricted guest" capability is a feature of Intel VT-x that allows
the guest to execute real or unpaged protected mode code - bhyve relies on
this feature to execute the AP bootstrap code.

Get rid of the hack that allowed bhyve to support SMP guests on processors
that do not have the "unrestricted guest" capability. This hack was entirely
FreeBSD-specific and would not work with any other guest OS.

Instead, limit the number of vcpus to 1 when executing on processors without
"unrestricted guest" capability.

Suggested by:	grehan
Obtained from:	NetApp
2013-01-04 02:04:41 +00:00
Konstantin Belousov
0dcbedfa61 Enable the UFS quotas for big-iron GENERIC kernels.
Discussed with:	      mckusick
MFC after:	      2 weeks
2013-01-03 19:03:41 +00:00
Dag-Erling Smørgrav
36fca20f10 As discussed on -current last October, remove the firewire drivers from
GENERIC.
2013-01-03 14:30:24 +00:00
Neel Natu
485f986ac9 Modify the default behavior of bhyve such that it no longer forces the use of
x2apic mode on the guest.

The guest can decide whether or not it wants to use legacy mmio or x2apic
access to the APIC by writing to the MSR_APICBASE register.

Obtained from:	NetApp
2012-12-16 01:20:08 +00:00
Neel Natu
682b847ede Prefer x2apic mode when running inside a virtual machine.
Provide a tunable 'machdep.x2apic_desired' to let the administrator override
the default behavior.

Provide a read-only sysctl 'machdep.x2apic' to let the administrator know
whether the kernel is using x2apic or legacy mmio to access local apic.

Tested with Parallels Desktop 8 and bhyve hypervisors.
Also tested running on bare metal Intel Xeon E5-2658.

Obtained from:	NetApp
Discussed with:	jhb, attilio, avg, grehan
2012-12-16 00:57:14 +00:00
Jim Harris
f2fcc434ee Revert r243960 based on feedback regarding keeping x86 headers unified
(mdf@, tijl@) and use of KASSERT/systm.h in bus.h (zeising@, bde@).

Alternate implementation will be made in a separate commit.
2012-12-13 21:27:20 +00:00
Peter Grehan
2741efeca0 Implement an API to allow a hypervisor to save/restore
guest floating point state without having to know the
size of floating-point state.

Unstaticize fpurestore to allow the hypervisor to
save/restore guest state using fpusave/fpurestore
on the allocated FPU state area.

Reviewed by:	kib
Obtained from:	NetApp/bhyve
MFC after:	1 week
2012-12-12 08:35:32 +00:00
Konstantin Belousov
737d12b397 Add amd64-specific ddb command "show pte". The command displays the
hierarchy of the page table entries which map the specified address.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2012-12-10 05:14:34 +00:00
Jim Harris
71a30c4436 Add amd64 implementations for 8-byte bus_space routines.
Submitted by:	Carl Delsey <carl.r.delsey@intel.com>
Discussed with:	jhb, rwatson
Reviewed by:	jimharris
MFC after:	1 week
2012-12-06 22:33:31 +00:00
Neel Natu
32531ccb84 IFC @r243836 2012-12-04 04:37:42 +00:00
Konstantin Belousov
349438a243 Print the frame addresses for the backtraces on i386 and amd64. It
allows both to inspect the frame sizes and to manually peek into the
frames from ddb, if needed.

Reviewed by:	dim
MFC after:	2 weeks
2012-12-03 22:16:51 +00:00
Jung-uk Kim
7609e73ca0 Remove duplicate code. Reduce diff between amd64 and i386. 2012-12-01 00:56:19 +00:00
Jung-uk Kim
8c2b353ead Use volatile keywords properly. 2012-11-30 20:15:01 +00:00
Peter Grehan
e6f1f347a1 Properly screen for the AND 0x81 instruction from the set
of group1 0x81 instructions that use the reg bits as an
extended opcode.

Still todo: properly update rflags.

Pointed out by:	jilles@
2012-11-30 05:40:24 +00:00
Jung-uk Kim
231ac244f8 Tidy up inline assembly. No functional change. 2012-11-30 00:59:37 +00:00
Peter Grehan
b1f95796f0 Remove debug printf.
Pointed out by:	emaste
2012-11-29 15:08:13 +00:00
Peter Grehan
3b2b001107 Add support for the 0x81 AND instruction, now generated
by clang in the local APIC code.

0x81 is a read-modify-write instruction - the EPT check
that only allowed read or write and not both has been
relaxed to allow read and write.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-29 06:26:42 +00:00
Neel Natu
48a29f4e07 Cleanup the user-space paging exit handler now that the unified instruction
emulation is in place.

Obtained from:	NetApp
2012-11-28 13:34:44 +00:00
Neel Natu
b42206f300 Change emulate_rdmsr() and emulate_wrmsr() to return 0 on sucess and errno on
failure. The conversion from the return value to HANDLED or UNHANDLED can be
done locally in vmx_exit_process().

Obtained from: NetApp
2012-11-28 13:10:18 +00:00
Neel Natu
ba9b7bf73a Revamp the x86 instruction emulation in bhyve.
On a nested page table fault the hypervisor will:
- fetch the instruction using the guest %rip and %cr3
- decode the instruction in 'struct vie'
- emulate the instruction in host kernel context for local apic accesses
- any other type of mmio access is punted up to user-space (e.g. ioapic)

The decoded instruction is passed as collateral to the user-space process
that is handling the PAGING exit.

The emulation code is fleshed out to include more addressing modes (e.g. SIB)
and more types of operands (e.g. imm8). The source code is unified into a
single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well
as /usr/sbin/bhyve.

Reviewed by:	grehan
Obtained from:	NetApp
2012-11-28 00:02:17 +00:00
Neel Natu
920bc34090 Fix a bug in the MSI-X resource allocation for PCI passthrough devices.
In the case where the underlying host had disabled MSI-X via the
"hw.pci.enable_msix" tunable, the ppt_setup_msix() function would fail
and return an error without properly cleaning up. This in turn would
cause a page fault on the next boot of the guest.

Fix this by calling ppt_teardown_msix() in all the error return paths.

Obtained from:	NetApp
2012-11-22 04:07:18 +00:00
Neel Natu
288aeb8561 Get rid of redundant comparision which is guaranteed to be "true" for unsigned
integers.

Obtained from:	NetApp
2012-11-22 00:08:20 +00:00
Peter Grehan
a0cad47092 Handle CPUID leaf 0x7 now that FreeBSD is using it.
Return 0's for now.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-20 06:01:03 +00:00
Neel Natu
3248464555 IFC @ r243164 2012-11-17 02:55:47 +00:00
Konstantin Belousov
43f48b65c0 Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:55:56 +00:00
Konstantin Belousov
b32ecf44bc Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by:	"Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by:	alc, mdf (previous version)
Tested by:	pho (previous version)
MFC after:	2 weeks
2012-11-14 20:01:40 +00:00
Neel Natu
7d3d462b09 IFC @ r242940 2012-11-13 07:39:05 +00:00
Neel Natu
a10c6f5544 IFC @ r242684 2012-11-11 03:26:14 +00:00
Konstantin Belousov
5a17538e22 Do not try to enable new features in the %cr4 if running under
hypervisor.  Apparently, hypervisors failed to filter out 'Standard
Extended Features' report from CPUID, but deliver #gp when
corresponding bit in %cr4 is toggled.

This shall be reconsidered later, after hypervisors correct the bug.

Reported and tested by:	joel
Reviewed by:	avg
MFC after:	2 weeks
2012-11-09 16:00:30 +00:00
Peter Grehan
0a5e9bfb72 Fix issue found with clang build. Avoid code insertion by the compiler
between inline asm statements that would in turn modify the flags
value set by the first asm, and used by the second.

Solve by making the common error block a string that can be pulled
into the first inline asm, and using symbolic labels for asm variables.

bhyve can now build/run fine when compiled with clang.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-06 02:43:41 +00:00
Attilio Rao
cfedf924d3 Rework the known rwlock to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct rwlock_padalign.

Reviewed by:	alc, jimharris
2012-11-03 23:03:14 +00:00
Konstantin Belousov
cd9e9d1bc2 Enable the new instructions for reading and writing bases for %fs,
%gs, when supported.  Note that WRFSBASE and WRGSBASE are not very
useful on FreeBSD right now, because a return from the kernel mode to
userspace reloads the bases specified by the sysarch(2) syscall, most
likely.

Enable the Supervisor Mode Execution Prevention (SMEP) when
supported. Since the loader(8) performs hand-off to the kernel with
the page tables which contradict the SMEP, postpone enabling the SMEP
on BSP until pmap switched for the proper kernel tables.

Debugged with the help from:	avg
Tested by:	avg, Michael Moll <kvedulv@kvedulv.de>
MFC after:	1 month
2012-11-01 15:17:43 +00:00
Konstantin Belousov
2773649d2f Provide the reading and display of the Standard Extended Features,
introduced with the IvyBridge CPUs.  Provide the definitions for new
bits in CR3 and CR4 registers.

Tested by:	avg, Michael Moll <kvedulv@kvedulv.de>
MFC after:	2 weeks
2012-11-01 15:14:37 +00:00
Neel Natu
514393f565 Convert VMCS_ENTRY_INTR_INFO field into a vmcs identifier before passing it
to vmcs_getreg(). Without this conversion vmcs_getreg() will return EINVAL.

In particular this prevented injection of the breakpoint exception into the
guest via the "-B" option to /usr/sbin/bhyve which is hugely useful when
debugging guest hangs.

This was broken in r241921.

Pointy hat: me
Obtained from:	NetApp
2012-10-29 23:58:15 +00:00
Neel Natu
b01c203325 Corral all the host state associated with the virtual machine into its own file.
This state is independent of the type of hardware assist used so there is
really no need for it to be in Intel-specific code.

Obtained from:	NetApp
2012-10-29 01:51:24 +00:00
Peter Grehan
cda5bd7f19 Set the valid field of the newly allocated field as all other
vm page allocators do. This fixes a panic when a virtio block
device is mounted as root, with the host system dying in
vm_page_dirty with invalid bits.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-26 22:32:26 +00:00
Neel Natu
bd8572e0be Unconditionally enable fpu emulation by setting CR0.TS in the host after the
guest does a vm exit.

This allows us to trap any fpu access in the host context while the fpu still
has "dirty" state belonging to the guest.

Reported by: "s vas" on freebsd-virtualization@
Obtained from:	NetApp
2012-10-26 03:12:40 +00:00
Neel Natu
f76fc5d414 If the guest vcpu wants to idle then use that opportunity to relinquish the
host cpu to the scheduler until the guest is ready to run again.

This implies that the host cpu utilization will now closely mirror the actual
load imposed by the guest vcpu.

Also, the vcpu mutex now needs to be of type MTX_SPIN since we need to acquire
it inside a critical section.

Obtained from:	NetApp
2012-10-25 04:29:21 +00:00
Neel Natu
ff6ec151e0 Hide the monitor/mwait instruction capability from the guest until we know how
to properly intercept it.

Obtained from:	NetApp
2012-10-25 04:08:26 +00:00
Neel Natu
f352ff0ca8 Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner.
Also add a stats counter to count the number of NMIs delivered per vcpu.

Obtained from:	NetApp
2012-10-24 02:54:21 +00:00
Neel Natu
eeefa4e4be Test for AST pending with interrupts disabled right before entering the guest.
If an IPI was delivered to this cpu before interrupts were disabled
then return right away via vmx_setjmp() with a return value of VMX_RETURN_AST.

Obtained from:	NetApp
2012-10-23 02:20:42 +00:00
Eitan Adler
1611e2c0d1 The 'testing memory' patch gets printed too many times
Approved by: cperciva (implicit)
2012-10-22 11:57:26 +00:00
Eitan Adler
267cc84937 Explain the upcoming delay by printing a message when the kernel
is about to begin testing memory.

Reviewed by:	dteske, adri
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:16:39 +00:00
Neel Natu
2e25737a49 Calculate the number of host ticks until the next guest timer interrupt.
This information will be used in conjunction with guest "HLT exiting" to
yield the thread hosting the virtual cpu.

Obtained from:	NetApp
2012-10-20 08:23:05 +00:00
Konstantin Belousov
802431fa2e Print the %rip value for uprintf_signal.
MFC after:	1 week
2012-10-14 17:08:46 +00:00
Andriy Gapon
851dbc07af pciereg_cfg*: use assembly to access the mem-mapped cfg space
AMD BKDG for CPU families 10h and later requires that the memory
mapped config is always read into or written from al/ax/eax register.

Discussed with:	kib, alc
Reviewed by:	kib (earlier version)
MFC after:	25 days
2012-10-14 10:13:50 +00:00
Peter Grehan
13ec93719a Add the guest physical address and r/w/x bits to
the paging exit in preparation for a rework of
bhyve MMIO handling.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-12 23:12:19 +00:00
Neel Natu
75dd336603 Provide per-vcpu locks instead of relying on a single big lock.
This also gets rid of all the witness.watch warnings related to calling
malloc(M_WAITOK) while holding a mutex.

Reviewed by:	grehan
2012-10-12 18:32:44 +00:00
Neel Natu
cdc5b9e7b1 Fix warnings generated by 'debug.witness.watch' during VM creation and
destruction for calling malloc() with M_WAITOK while holding a mutex.

Do not allow vmm.ko to be unloaded until all virtual machines are destroyed.
2012-10-11 19:39:54 +00:00
Neel Natu
f9d4f89e4d Deliver the MSI to the correct guest virtual cpu.
Prior to this change the MSI was being delivered unconditionally to vcpu 0
regardless of how the guest programmed the MSI delivery.
2012-10-11 19:28:07 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Attilio Rao
3a4730256a Add an unified macro to deny ability from the compiler to reorder
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.

Reviewed by:	bde, kib
Discussed with:	dim, theraven
MFC after:	2 weeks
2012-10-09 14:32:30 +00:00
Attilio Rao
af2bdacafb Reverts r234074,234105,234564,234723,234989,235231-235232 and part of
r234247.
Use, instead, the static intializer introduced in r239923 for x86 and
sparc64 intr_cpus, unwinding the code to the initial version.

Reviewed by:	marius
2012-10-09 12:22:43 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Neel Natu
7ce04d0ad9 Allocate memory pages for the guest from the host's free page queue.
It is no longer necessary to hard-partition the memory between the host
and guests at boot time.
2012-10-08 23:41:26 +00:00
Neel Natu
f7d51510f1 Change vm_malloc() to map pages in the guest physical address space in 4KB
chunks. This breaks the assumption that the entire memory segment is
contiguously allocated in the host physical address space.

This also paves the way to satisfy the 4KB page allocations by requesting
free pages from the VM subsystem as opposed to hard-partitioning host memory
at boot time.
2012-10-04 02:27:14 +00:00
Neel Natu
4db4fb2c25 Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

Add check to vm_gpa2hpa() that the range indicated by [gpa,gpa+len) is all
contained within a single 4KB page.
2012-10-03 01:18:51 +00:00
Neel Natu
bda273f21e Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

Rewrite vm_gpa2hpa() to get the GPA to HPA mapping by querying the nested
page tables.
2012-10-03 00:46:30 +00:00
Neel Natu
341f19c949 Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether
or not the address range had already been allocated.

Replace this instead with an explicit API 'vm_gpa_available()' that returns
TRUE if a page is available for allocation in guest physical address space.
2012-09-29 01:15:45 +00:00
John Baldwin
960b5a7080 - Re-shuffle the <machine/pc/bios.h> headers to move all kernel-specific
bits under #ifdef _KERNEL but leave definitions for various structures
  defined by standards ($PIR table, SMAP entries, etc.) available to
  userland.
- Consolidate duplicate SMBIOS table structure definitions in ipmi(4)
  and smbios(4) in <machine/pc/bios.h> and make them available to
  userland.

MFC after:	2 weeks
2012-09-28 11:59:32 +00:00
Alan Cox
e4b8a2fc5a Eliminate a stale comment. It describes another use case for the pmap in
Mach that doesn't exist in FreeBSD.
2012-09-28 05:30:59 +00:00
Neel Natu
70593114cd Intel VT-x provides the length of the instruction at the time of the nested
page table fault. Use this when fetching the instruction bytes from the guest
memory.

Also modify the lapic_mmio() API so that a decoded instruction is fed into it
instead of having it fetch the instruction bytes from the guest. This is
useful for hardware assists like SVM that provide the faulting instruction
as part of the vmexit.
2012-09-27 00:27:58 +00:00
Neel Natu
73820fb0a4 Add an option "-a" to present the local apic in the XAPIC mode instead of the
default X2APIC mode to the guest.
2012-09-26 00:06:17 +00:00
Neel Natu
a2da7af6bc Add support for trapping MMIO writes to local apic registers and emulating them.
The default behavior is still to present the local apic to the guest in the
x2apic mode.
2012-09-25 22:31:35 +00:00
Neel Natu
e90273829b Add ioctls to control the X2APIC capability exposed by the virtual machine to
the guest.

At the moment this simply sets the state in the 'vcpu' instance but there is
no code that acts upon these settings.
2012-09-25 19:08:51 +00:00
Neel Natu
edf89256dd Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an
AP needs to be activated by spinning up an execution context for it.

The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.

Reviewed by: grehan
2012-09-25 02:33:25 +00:00
Neel Natu
98ed632c63 Stash the 'vm_exit' information in each 'struct vcpu'.
There is no functional change at this time but this paves the way for vm exit
handler functions to easily modify the exit reason going forward.
2012-09-24 19:32:24 +00:00
Dimitry Andric
f99157cced After r205013, amd64 and i386 CPU family and model IDs were printed out
in hexadecimal, but without any 0x prefix, which can be very misleading.

MFC after:	3 days
2012-09-21 10:31:19 +00:00
Neel Natu
2d3a73ed6d Restructure the x2apic access code in preparation for supporting memory mapped
access to the local apic.

The vlapic code is now aware of the mode that the guest is using to access the
local apic.

Reviewed by: grehan@
2012-09-21 03:09:23 +00:00
Jim Harris
eb85d44f06 Integrate nvme(4) and nvd(4) into the amd64 and i386 builds.
Sponsored by:	Intel
2012-09-17 19:26:33 +00:00
Konstantin Belousov
c5e3d0ab11 Rename the IVY_RNG option to RDRAND_RNG.
Based on submission by:	Arthur Mesh <arthurmesh@gmail.com>
MFC after:	2 weeks
2012-09-13 10:12:16 +00:00
Alan Cox
7336315b0a Simplify pmap_unmapdev(). Since kmem_free() eventually calls pmap_remove(),
pmap_unmapdev()'s own direct efforts to destroy the page table entries are
redundant, so eliminate them.

Don't set PTE_W on the page table entry in pmap_kenter{,_attr}() on MIPS.
Setting PTE_W on MIPS is inconsistent with the implementation of this
function on other architectures.  Moreover, PTE_W should not be set, unless
the pmap's wired mapping count is incremented, which pmap_kenter{,_attr}()
doesn't do.

MFC after:	10 days
2012-09-10 16:11:29 +00:00
Attilio Rao
324e57150d userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:27:11 +00:00
Konstantin Belousov
ef9461ba0e Add support for new Intel on-CPU Bull Mountain random number
generator, found on IvyBridge and supposedly later CPUs, accessible
with RDRAND instruction.

From the Intel whitepapers and articles about Bull Mountain, it seems
that we do not need to perform post-processing of RDRAND results, like
AES-encryption of the data with random IV and keys, which was done for
Padlock. Intel claims that sanitization is performed in hardware.

Make both Padlock and Bull Mountain random generators support code
covered by kernel config options, for the benefit of people who prefer
minimal kernels. Also add the tunables to disable hardware generator
even if detected.

Reviewed by:	markm, secteam (simon)
Tested by:	bapt, Michael Moll <kvedulv@kvedulv.de>
MFC after:	3 weeks
2012-09-05 13:18:51 +00:00
Alan Cox
d8f9ed32c5 Rename {_,}pmap_unwire_pte_hold() to {_,}pmap_unwire_ptp() and update the
comment describing them.  Both the function names and the comment had grown
stale.  Quite some time has passed since these pmap implementations last
used the page's hold count to track the number of valid mapping within a
page table page.  Also, returning TRUE from pmap_unwire_ptp() rather than
_pmap_unwire_ptp() eliminates a few instructions from callers like
pmap_enter_quick_locked() where pmap_unwire_ptp()'s return value is used
directly by a conditional statement.
2012-09-05 06:02:54 +00:00
Xin LI
0807ad7422 Add hpt27xx to GENERIC kernel for amd64 and i386 systems.
MFC after:	2 weeks
2012-09-04 21:02:57 +00:00
John Baldwin
778eefa40d Fix duplicate entries for mwl(4):
- Move mwlfw from {amd64,i386}/conf/NOTES to sys/conf/NOTES (mwl(4) is
  already present in sys/conf/NOTES).
- Remove duplicate mwl(4) entries from {amd64,i386}/conf/NOTES.
- While here, add a description to the sfxge line in amd64/conf/NOTES.
2012-09-04 19:19:36 +00:00
John Baldwin
4c1044491b Fix misspelled "Infiniband".
Submitted by:	gcooper
MFC after:	3 days
2012-08-28 11:34:09 +00:00
Peter Grehan
177fd53318 Add sysctls to display the total and free amount of hard-wired mem for VMs
# sysctl hw.vmm
   hw.vmm.mem_free: 2145386496
   hw.vmm.mem_total: 2145386496

Submitted by:	Takeshi HASEGAWA hasegaw at gmail com
2012-08-26 01:41:41 +00:00
Glen Barber
67944c4572 Grammar fix: s/NIC's/NICs/
MFC after:	3 days
2012-08-26 01:21:02 +00:00
Dag-Erling Smørgrav
e2082935f0 As discussed on -current, remove the hardcoded default maxswzone.
MFC after:	3 weeks
2012-08-14 17:01:21 +00:00
Konstantin Belousov
b6d3609050 Add a hackish debugging facility to provide a bit of information about
reason for generated trap. The dump of basic signal information and 8
bytes of the faulting instruction are printed on the controlling
terminal of the process, if the machdep.uprintf_signal syscal is
enabled.

The print is the only practical way to debug traps from a.out
processes I am aware of. Because I have to reimplement it each time I
debug an issue with a.out support on amd64, commit the hack to main
tree.

MFC after:	1 week
2012-08-14 12:15:01 +00:00
Konstantin Belousov
95fd15898b Real hardware, as opposed to QEMU, does not allow to have a call gate
in long mode which transfers control to 32bit code segment. Unbreak
the lcall $7,$0 implementation on amd64 by putting the 64bit user code
segment' selector into call gate, and execute the 64bit trampoline
which converts the return frame into 32bit format and switches back to
32bit mode for executing int $0x80 trampoline.

Note that all jumps over the hoops are performed in the user mode.

MFC after:	1 week
2012-08-14 12:13:27 +00:00
John Baldwin
6e4ac34b07 Remove the deassert INIT IPI from the IPI startup sequence for APs.
It is not listed in the boot sequence in the MP specification (1.4),
and it is explicitly ignored on modern CPUs.  It was only ever required
when bootstrapping systems with external APICs (that is, SMP machines
with 486s), which FreeBSD has never supported (and never will).

While here, tidy some comments and remove some banal ones.
2012-08-13 18:52:51 +00:00
John Baldwin
a4284ef768 Add a 10 millisecond delay after sending the initial INIT IPI. This
matches the algorithm in the MP specification (1.4).  Previously we
were sending out the deassert INIT IPI immediately after the initial
INIT IPI was sent.
2012-08-13 16:33:22 +00:00
Colin Percival
347c7fd7bf Build modules along with the XENHVM kernels.
No objections from:	freebsd-xen mailing list
MFC after:	1 week
2012-08-13 07:36:57 +00:00
Alan Cox
663f8700d4 The assertion that I added in r238889 could legitimately fail when a
debugger creates a breakpoint.  Replace that assertion with a narrower
one that still achieves my objective.

Reported and tested by:	kib
2012-08-08 05:28:30 +00:00
Konstantin Belousov
65211d02c4 Do not apply errata 721 workaround when under hypervisor, since
typical hypervisor does not implement access to the required MSR,
causing #GP on boot.

Reported and tested by:	olgeni
PR:	amd64/170388
MFC after:	3 days
2012-08-07 08:36:10 +00:00
Sergey Kandaurov
16ec457aeb Remove duplicate header inclusion of <sys/sysent.h>
Discussed with:	bz
2012-08-07 05:46:36 +00:00
Alan Cox
59fa03faa3 Shave off a few more cycles from the average execution time of pmap_enter()
by simplifying the control flow and reducing the live range of "om".
2012-08-05 16:59:02 +00:00
Neel Natu
8124debe13 Include 'device uart' in the guest kernel. 2012-08-04 04:30:26 +00:00
Neel Natu
39c21c2db2 Force certain bits in %cr4 to be hard-wired to '1' or '0' from a guest's
perspective. If we don't do this some guest OSes (e.g. Linux) will reset
the CR4_VMXE bit in %cr4 with disastrous consequences.

Reported by: grehan
2012-08-04 02:06:55 +00:00
Konstantin Belousov
0220d04fe3 Add lfence().
MFC after:	1 week
2012-08-01 17:24:53 +00:00
Alan Cox
879eedbc7b Revise pmap_enter()'s handling of mapping updates that change the
PTE's PG_M and PG_RW bits but not the physical page frame.  First,
only perform vm_page_dirty() on a managed vm_page when the PG_M bit is
being cleared.  If the updated PTE continues to have PG_M set, then
there is no requirement to perform vm_page_dirty().  Second, flush the
mapping from the TLB when PG_M alone is cleared, not just when PG_M
and PG_RW are cleared.  Otherwise, a stale TLB entry may stop PG_M
from being set again on the next store to the virtual page.  However,
since the vm_page's dirty field already shows the physical page as
being dirty, no actual harm comes from the PG_M bit not being set.
Nonetheless, it is potentially confusing to someone expecting to see
the PTE change after a store to the virtual page.
2012-08-01 16:04:13 +00:00
Konstantin Belousov
a42fa0af44 Change (unused) prototype for stmxcsr() to match reality.
Noted by:	jhb
MFC after:	1 week
2012-07-30 19:26:02 +00:00
Alan Cox
bc27d6c608 Shave off a few more cycles from pmap_enter()'s critical section. In
particular, do a little less work with the PV list lock held.
2012-07-29 18:20:49 +00:00
Neel Natu
4bff7fad95 Verify that VMX operation has been enabled by BIOS before executing the
VMXON instruction.

Reported by "s vas" on freebsd-virtualization@
2012-07-25 00:21:16 +00:00
Konstantin Belousov
59c1a8315c Forcibly shut up clang warning about NULL pointer dereference.
MFC after:	3 weeks
2012-07-23 19:16:31 +00:00
Konstantin Belousov
7c80fcfdba Constently use 2-space sentence breaks.
Submitted by:	 bde
MFC after:	 1 week
2012-07-21 13:53:00 +00:00
Konstantin Belousov
1965c139f1 Stop caching curpcb in the local variable.
Requested by:	    bde
MFC after:	    1 week
2012-07-21 13:47:37 +00:00
Konstantin Belousov
700de5109a The PT_I386_{GET,SET}XMMREGS and PT_{GET,SET}XSTATE operate on the
stopped threads. Implementation assumes that the thread's FPU context
is spilled into the PCB due to stop. This is mostly true, except when
FPU state for the thread is not initialized. Then the requests operate
on the garbage state which is currently left in the PCB, causing
confusion.

The situation is indeed observed after a signal delivery and before
#NM fault on execution of any FPU instruction in the signal handler,
since sendsig(9) drops FPU state for current thread, clearing
PCB_FPUINITDONE. When inspecting context state for the signal handler,
debugger sees the FPU state of the main program context instead of the
clear state supposed to be provided to handler.

Fix this by forcing clean FPU state in PCB user FPU save area by
performing getfpuregs(9) before accessing user FPU save area in
ptrace_machdep.c.

Note: this change will be merged to i386 kernel as well, where it is
much more important, since e.g. gdb on i386 uses PT_I386_GETXMMREGS to
inspect FPU context on CPUs that support SSE. Amd64 version of gdb
uses PT_GETFPREGS to inspect both 64 and 32 bit processes, which does
not exhibit the bug.

Reported by:	bde
MFC after:	1 week
2012-07-21 13:06:37 +00:00
Konstantin Belousov
dfa8a51288 Stop clearing x87 exceptions in the #MF handler on amd64. If user code
understands FPU hardware enough to catch SIGFPE and unmask exceptions
in control word, then it may as well properly handle return from
SIGFPE without causing an infinite loop of #MF exceptions due to
faulting instruction restart, when needed.

Clearing exceptions causes information loss for handlers which do
understand FPU hardware, and struct siginfo si_code member cannot be
considered adequate replacement for en_sw content due to translation.

Supposed reason for clearing the exceptions, which is IRQ13 handling
oddities, were never applicable to amd64.

Note: this change will be merged to i386 kernel as well, since we do
not support IRQ13 delivery of #MF notifications for some time.

Requested by:	bde
MFC after:	1 week
2012-07-21 13:05:34 +00:00
Konstantin Belousov
83b22b05e6 Introduce curpcb magic variable, similar to curthread, which is MD
amd64.  It is implemented as __pure2 inline with non-volatile asm read
from pcpu, which allows a compiler to cache its results.

Convert most PCPU_GET(pcb) and curthread->td_pcb accesses into curpcb.

Note that __curthread() uses magic value 0 as an offsetof(struct pcpu,
pc_curthread). It seems to be done this way due to machine/pcpu.h
needs to be processed before sys/pcpu.h, because machine/pcpu.h
contributes machine-depended fields to the struct pcpu definition. As
result, machine/pcpu.h cannot use struct pcpu yet.

The __curpcb() also uses a magic constant instead of offsetof(struct
pcpu, pc_curpcb) for the same reason. The constants are now defined as
symbols and CTASSERTs are added to ensure that future KBI changes do
not break the code.

Requested and reviewed by: bde
MFC after:    3 weeks
2012-07-19 19:09:12 +00:00
Alan Cox
3088e08c4b Don't unnecessarily set PGA_REFERENCED in pmap_enter(). 2012-07-19 05:34:19 +00:00
Konstantin Belousov
bc84db6267 On AMD64, provide siginfo.si_code for floating point errors when error
occurs using the SSE math processor.  Update comments describing the
handling of the exception status bits in coprocessors control words.

Remove GET_FPU_CW and GET_FPU_SW macros which were used only once.
Prefer to use curpcb to access pcb_save over the longer path of
referencing pcb through the thread structure.

Based on the submission by:	Ed Alley <wea llnl gov>
PR:	  amd64/169927
Reviewed by:	bde
MFC after:	3 weeks
2012-07-18 15:43:47 +00:00
Konstantin Belousov
a81f9fed5d Add stmxcsr.
Submitted by:	Ed Alley <wea llnl gov>
PR:	  amd64/169927
MFC after:	3 weeks
2012-07-18 15:36:03 +00:00
Konstantin Belousov
333d0c6060 Add support for the XSAVEOPT instruction use. Our XSAVE/XRSTOR usage
mostly meets the guidelines set by the Intel SDM:
1. We use XRSTOR and XSAVE from the same CPL using the same linear
   address for the store area
2. Contrary to the recommendations, we cannot zero the FPU save area
   for a new thread, since fork semantic requires the copy of the
   previous state. This advice seemingly contradicts to the advice
   from the item 6.
3. We do use XSAVEOPT in the context switch code only, and the area
   for XSAVEOPT already always contains the data saved by XSAVE.
4. We do not modify the save area between XRSTOR, when the area is
   loaded into FPU context, and XSAVE. We always spit the fpu context
   into save area and start emulation when directly writing into FPU
   context.
5. We do not use segmented addressing to access save area, or rather,
   always address it using %ds basing.
6. XSAVEOPT can be only executed in the area which was previously
   loaded with XRSTOR, since context switch code checks for FPU use by
   outgoing thread before saving, and thread which stopped emulation
   forcibly get context loaded with XRSTOR.
7. The PCB cannot be paged out while FPU emulation is turned off, since
   stack of the executing thread is never swapped out.

The context switch code is patched to issue XSAVEOPT instead of XSAVE
if supported. This approach eliminates one conditional in the context
switch code, which would be needed otherwise.

For user-visible machine context to have proper data, fpugetregs()
checks for unsaved extension blocks and manually copies pristine FPU
state into them, according to the description provided by CPUID leaf
0xd.

MFC after:  1 month
2012-07-14 15:48:30 +00:00
Alan Cox
b9592bdab3 Wring a few cycles out of pmap_enter(). In particular, on a user-space
pmap, avoid walking the page table twice.
2012-07-13 04:10:41 +00:00
Peter Grehan
b652778e42 IFC @ r238370 2012-07-11 19:54:21 +00:00
John Baldwin
d706ec297a Add a clts() wrapper around the 'clts' instruction to <machine/cpufunc.h>
on x86 and use that to implement stop_emulating() in the fpu/npx code.
Reimplement start_emulating() in the non-XEN case by using load_cr0() and
rcr0() instead of the 'lmsw' and 'smsw' instructions.  Intel explicitly
discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in
the description of these instructions in Volume 2 of the ADM.

Reviewed by:	kib
MFC after:	1 month
2012-07-09 20:55:39 +00:00
John Baldwin
5355f65974 Partially revert r217515 so that the mem_range_softc variable is always
present on x86 kernels.  This fixes the build of kernels that include
'device acpi' but do not include 'device mem'.

MFC after:	1 month
2012-07-09 20:42:08 +00:00
Konstantin Belousov
f18d5bf44b Use assembler mnemonic instead of manually assembling, contination for r238142.
Reviewed by:	jhb
MFC after:	1 month
2012-07-06 20:11:58 +00:00
John Baldwin
6632f45773 Several fixes to the amd64 disassembler:
- Add generic support for opcodes that are escape bytes used for
  multi-byte opcodes (such as the 0x0f prefix).  Use this to replace
  the hard-coded 0x0f special case and add support for three-byte
  opcodes that use the 0x0f38 prefix.
- Decode all Intel VMX instructions.  invept and invvpid in particular are
  three-byte opcodes that use the 0x0f38 escape prefix.
- Rework how the special 'SDEP' size flag works such that the default
  instruction name (i_name) is the instruction when the data size
  prefix (0x66) is not specified, and the alternate name in i_extra is
  used when the prefix is included.
- Add a new 'ADEP' size flag similar to 'SDEP' except that it chooses
  between i_name and i_extra based on the address size prefix (0x67).
  Use this to fix the decoding for jrcxz vs jecxz which is determined
  by the address size prefix, not the operand size prefix.  Also, jcxz
  is not possible in 64-bit mode, but jrcxz is the default instruction
  for that opcode.
- Add support for handling instructions that have a mandatory 'rep'
  prefix (this means not outputting the 'repe ' prefix until determining
  if it is used as part of an opcode).  Make 'pause' less of a special
  case this way.
- Decode 'cmpxchg16b' and 'cdqe' which are variants of other instructions
  but with a REX.W prefix.

MFC after:	1 month
2012-07-06 14:25:59 +00:00
Alan Cox
cc861283f4 Make pmap_enter()'s management of PV entries consistent with the other pmap
functions that manage PV entries.  Specifically, remove the PV entry from
the containing PV list only after the corresponding PTE is destroyed.

Update the pmap's wired mapping count in pmap_enter() before the PV list
lock is acquired.
2012-07-06 06:42:25 +00:00
John Baldwin
7574a595f2 Now that our assembler supports the xsave family of instructions, use them
natively rather than hand-assembled versions.  For xgetbv/xsetbv, add a
wrapper API to deal with xcr* registers: rxcr() and load_xcr().

Reviewed by:	kib
MFC after:	1 month
2012-07-05 18:19:35 +00:00
Alan Cox
8f2994ce67 Calculate the new PTE value in pmap_enter() before acquiring any locks.
Move an assertion to the beginning of pmap_enter().
2012-07-05 07:20:16 +00:00
Alan Cox
1bc8531c1e Correct an error in r237513. The call to reserve_pv_entries() must come
before pmap_demote_pde() updates the PDE.  Otherwise, pmap_pv_demote_pde()
can crash.

Crash reported by:	kib
Patch tested by:	kib
2012-07-05 00:08:47 +00:00
John Baldwin
66f9aec075 Decode the 'xsave', 'xrstor', 'xsaveopt', 'xgetbv', 'xsetbv', and
'rdtscp' instructions.

MFC after:	1 month
2012-07-04 16:47:39 +00:00
Xin LI
309dca0171 tws(4) is interfaced with CAM so move it to the same section.
Reported by:	joel
MFC after:	3 days
2012-07-01 08:10:49 +00:00