Commit Graph

6865 Commits

Author SHA1 Message Date
Neel Natu
2c52dcd9a8 Modify handling of write to the vlapic SVR register.
The handler is now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, mask all the LVT entries when the vlapic is software-disabled.
2013-12-27 07:01:42 +00:00
Neel Natu
3f0ddc7c5c Modify handling of writes to the vlapic ID, LDR and DFR registers.
The handlers are now called after the register value is updated in the virtual
APIC page. This will make it easier to handle APIC-write VM-exits with APIC
register virtualization turned on.

Additionally, we need to ensure that the value of these registers is always
correctly reflected in the virtual APIC page, because there is no VM exit
when the guest reads these registers with APIC register virtualization.
2013-12-26 19:58:30 +00:00
Neel Natu
de5ea6b65e vlapic code restructuring to make it easy to support hardware-assist for APIC
emulation.

The vlapic initialization and cleanup is done via processor specific vmm_ops.
This will allow the VT-x/SVM modules to layer any hardware-assist for APIC
emulation or virtual interrupt delivery on top of the vlapic device model.

Add a parameter to 'vcpu_notify_event()' to distinguish between vlapic
interrupts versus other events (e.g. NMI). This provides an opportunity to
use hardware-assists like Posted Interrupts (VT-x) or doorbell MSR (SVM)
to deliver an interrupt to a guest without causing a VM-exit.

Get rid of lapic_pending_intr() and lapic_intr_accepted() and use the
vlapic_xxx() counterparts directly.

Associate an 'Apic Page' with each vcpu and reference it from the 'vlapic'.
The 'Apic Page' is intended to be referenced from the Intel VMCS as the
'virtual APIC page' or from the AMD VMCB as the 'vAPIC backing page'.
2013-12-25 06:46:31 +00:00
John Baldwin
63e62d390d Add a resume hook for bhyve that runs a function on all CPUs during
resume.  For Intel CPUs, invoke vmxon for CPUs that were in VMX mode
at the time of suspend.

Reviewed by:	neel
2013-12-23 19:48:22 +00:00
John Baldwin
330baf58c6 Extend the support for local interrupts on the local APIC:
- Add a generic routine to trigger an LVT interrupt that supports both
  fixed and NMI delivery modes.
- Add an ioctl and bhyvectl command to trigger local interrupts inside a
  guest.  In particular, a global NMI similar to that raised by SERR# or
  PERR# can be simulated by asserting LINT1 on all vCPUs.
- Extend the LVT table in the vCPU local APIC to support CMCI.
- Flesh out the local APIC error reporting a bit to cache errors and
  report them via ESR when ESR is written to.  Add support for asserting
  the error LVT when an error occurs.  Raise illegal vector errors when
  attempting to signal an invalid vector for an interrupt or when sending
  an IPI.
- Ignore writes to reserved bits in LVT entries.
- Export table entries the MADT and MP Table advertising the stock x86
  config of LINT0 set to ExtInt and LINT1 wired to NMI.

Reviewed by:	neel (earlier version)
2013-12-23 19:29:07 +00:00
Neel Natu
f80330a820 Add a parameter to 'vcpu_set_state()' to enforce that the vcpu is in the IDLE
state before the requested state transition. This guarantees that there is
exactly one ioctl() operating on a vcpu at any point in time and prevents
unintended state transitions.

More details available here:
http://lists.freebsd.org/pipermail/freebsd-virtualization/2013-December/001825.html

Reviewed by:	grehan
Reported by:	Markiyan Kushnir (markiyan.kushnir at gmail.com)
MFC after:	3 days
2013-12-22 20:29:59 +00:00
Neel Natu
a783578566 Consolidate the virtual apic initialization in a single function: vlapic_reset() 2013-12-22 00:08:00 +00:00
Neel Natu
5515bb73e6 Re-arrange bits in the amd64/pmap 'pm_flags' field.
The least significant 8 bits of 'pm_flags' are now used for the IPI vector
to use for nested page table TLB shootdown.

Previously we used IPI_AST to interrupt the host cpu which is functionally
correct but could lead to misleading interrupt counts for AST handler. The
AST handler was also doing a lot more than what is required for the nested
page table TLB shootdown (EOI and IRET).
2013-12-20 05:50:22 +00:00
Peter Grehan
a0b78f096a Enable memory overcommit for AMD processors.
- No emulation of A/D bits is required since AMD-V RVI
supports A/D bits.
 - Enable pmap PT_RVI support(w/o PAT) which is required for
memory over-commit support.
 - Other minor fixes:
 * Make use of VMCB EXITINTINFO field. If a #VMEXIT happens while
delivering an interrupt, EXITINTINFO has all the details that bhyve
needs to inject the same interrupt.
 * SVM h/w decode assist code was incomplete - removed for now.
 * Some minor code clean-up (more coming).

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-12-18 23:39:42 +00:00
Peter Grehan
d8ced94511 MFC @ r256071
This is the change where the bhyve_npt_pmap branch was
merged in to head.

The SVM changes to work with this will be in a follow-on
submit.
2013-12-18 22:31:53 +00:00
Neel Natu
3de8386283 Use vmcs_read() and vmcs_write() in preference to vmread() and vmwrite()
respectively. The vmcs_xxx() functions provide inline error checking of
all accesses to the VMCS.
2013-12-18 06:24:21 +00:00
Neel Natu
4f8be175d5 Add an API to deliver message signalled interrupts to vcpus. This allows
callers treat the MSI 'addr' and 'data' fields as opaque and also lets
bhyve implement multiple destination modes: physical, flat and clustered.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
Reviewed by:	grehan@
2013-12-16 19:59:31 +00:00
Neel Natu
a83011d2e7 Fix typo when initializing the vlapic version register ('<<' instead of '<'). 2013-12-11 06:28:44 +00:00
Neel Natu
becd984900 Fix x2apic support in bhyve.
When the guest is bringing up the APs in the x2APIC mode a write to the
ICR register will now trigger a return to userspace with an exitcode of
VM_EXITCODE_SPINUP_AP. This gets SMP guests working again with x2APIC.

Change the vlapic timer lock to be a spinlock because the vlapic can be
accessed from within a critical section (vm run loop) when guest is using
x2apic mode.

Reviewed by:	grehan@
2013-12-10 22:56:51 +00:00
John Baldwin
316032ad20 Move constants for indices in the local APIC's local vector table from
apicvar.h to apicreg.h.
2013-12-09 21:08:52 +00:00
Neel Natu
fb03ca4e42 Use callout(9) to drive the vlapic timer instead of clocking it on each VM exit.
This decouples the guest's 'hz' from the host's 'hz' setting. For e.g. it is
now possible to have a guest run at 'hz=1000' while the host is at 'hz=100'.

Discussed with:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 23:11:12 +00:00
Neel Natu
1c05219285 If a vcpu disables its local apic and then executes a 'HLT' then spin down the
vcpu and destroy its thread context. Also modify the 'HLT' processing to ignore
pending interrupts in the IRR if interrupts have been disabled by the guest.
The interrupt cannot be injected into the guest in any case so resuming it
is futile.

With this change "halt" from a Linux guest works correctly.

Reviewed by:	grehan@
Tested by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-12-07 22:18:36 +00:00
John Baldwin
5c79f1f9df Fix a typo. 2013-12-05 21:58:02 +00:00
Neel Natu
7a3c80aa55 The 'protection' field in the VM exit collateral for the PAGING exit is not
used - get rid of it.
2013-12-03 01:21:21 +00:00
Neel Natu
2282187475 Rename 'vm_interrupt_hostcpu()' to 'vcpu_notify_event()' because the function
has outgrown its original name. Originally this function simply sent an IPI
to the host cpu that a vcpu was executing on but now it does a lot more than
just that.

Reviewed by:	grehan@
2013-12-03 00:43:31 +00:00
Eitan Adler
7a22215c53 Fix undefined behavior: (1 << 31) is not defined as 1 is an int and this
shifts into the sign bit.  Instead use (1U << 31) which gets the
expected result.

This fix is not ideal as it assumes a 32 bit int, but does fix the issue
for most cases.

A similar change was made in OpenBSD.

Discussed with:	-arch, rdivacky
Reviewed by:	cperciva
2013-11-30 22:17:27 +00:00
Pawel Jakub Dawidek
f2b525e6b9 Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after:	3 days
2013-11-30 15:08:35 +00:00
Neel Natu
b5b28fc9dc Add support for level triggered interrupt pins on the vioapic. Prior to this
commit level triggered interrupts would work as long as the pin was not shared
among multiple interrupt sources.

The vlapic now keeps track of level triggered interrupts in the trigger mode
register and will forward the EOI for a level triggered interrupt to the
vioapic. The vioapic in turn uses the EOI to sample the level on the pin and
re-inject the vector if the pin is still asserted.

The vhpet is the first consumer of level triggered interrupts and advertises
that it can generate interrupts on pins 20 through 23 of the vioapic.

Discussed with:	grehan@
2013-11-27 22:18:08 +00:00
Konstantin Belousov
291bfc8d24 Hide struct pcb definition by #ifdef __amd64__ braces. If cc -m32
compilation results in inclusion of the header, a confict arises due
to savefpu being union for i386, but used as struct in the pcb
definition.  The 32bit code should not need amd64 variant of the
struct pcb anyway.

For struct region_descriptor, use __uint64_t instead of unsigned long,
as the base type for bit-fields.  Unsigned long cannot have width 64
for -m32.

The changes allowed to use sys/sysctl.h for cc -m32.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-26 19:38:42 +00:00
Neel Natu
08e3ff329a Add HPET device emulation to bhyve.
bhyve supports a single timer block with 8 timers. The timers are all 32-bit
and capable of being operated in periodic mode. All timers support interrupt
delivery using MSI. Timers 0 and 1 also support legacy interrupt routing.

At the moment the timers are not connected to any ioapic pins but that will
be addressed in a subsequent commit.

This change is based on a patch from Tycho Nightingale (tycho.nightingale@pluribusnetworks.com).
2013-11-25 19:04:51 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Neel Natu
ac7304a758 Add an ioctl to assert and deassert an ioapic pin atomically. This will be used
to inject edge triggered legacy interrupts into the guest.

Start using the new API in device models that use edge triggered interrupts:
viz. the 8254 timer and the LPC/uart device emulation.

Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-23 03:56:03 +00:00
Neel Natu
af480303a9 Eliminate redundant information about the host cpu in bhyve's KTR trace points.
This is always tracked by ktr(4) and can be displayed using the "-c" option
of ktrdump(8).

Discussed with:	grehan
2013-11-22 18:57:22 +00:00
Ed Maste
7b7d8599fe Don't abort SMAP processing after an entry of length 0
Length 0 is not special and should just be skipped.  This is the same
behaviour as i386.

Discussed with:	jhb@
Sponsored by:	The FreeBSD Foundation
2013-11-22 14:56:10 +00:00
Andreas Tobler
d2ef321a59 Introduce a WEAK_REFERENCE() alias and use it. Get rid of the CNAME and the
CONCAT macros in SYS.h.

Reviewed by:	bde, kib
2013-11-21 21:25:58 +00:00
Ed Maste
ff89f4778a Refactor amd64 startup SMAP parsing
Extracted from the projects/uefi branch, this change is a reasonable
cleanup and will reduce the diffs to review when bringing in the
UEFI work.

Reviewed by:	kib@
Sponsored by:	The FreeBSD Foundation
2013-11-21 19:20:08 +00:00
Ed Maste
aff122d6aa Disable amd64 boot time memory test by default
The page presence memory test takes a long time on large memory systems
and has little value on contemporary amd64 hardware.

Sponsored by:	The FreeBSD Foundation
2013-11-21 18:37:11 +00:00
Justin T. Gibbs
4fd76feafd Fix accounting for hw.realmem on the i386 and amd64 platforms.
sys/i386/i386/machdep.c:
sys/amd64/amd64/machdep.c:
	The value reported by FreeBSD as "real memory" when booting
	doesn't match what is later reported by sysctl as hw.realmem.
	This is due to the fact that the value printed during the
	boot process is fetched from smbios data (when possible),
	and accounts for holes in physical memory. On the other
	hand, the value of hw.realmem is unconditionally set to be
	one larger than the highest page of the physical address
	space.

	Fix this by setting hw.realmem to the same value printed
	during boot, this makes hw.realmem honour it's name and
	account properly for physical memory present in the system.

Submitted by:	Roger Pau Monné
Reviewed by:	gibbs
2013-11-15 16:05:55 +00:00
Ed Maste
3d271aaab0 x86: Allow users to change PSL_RF via ptrace(PT_SETREGS...)
Debuggers may need to change PSL_RF.  Note that tf_eflags is already stored
in the signal context during signal handling and PSL_RF previously could be
modified via sigreturn, so this change should not provide any new ability
to userspace.

For background see the thread at:
http://lists.freebsd.org/pipermail/freebsd-i386/2007-September/005910.html

Reviewed by:	jhb, kib
Sponsored by:	DARPA, AFRL
2013-11-14 15:37:20 +00:00
Neel Natu
565bbb8698 Move the ioapic device model from userspace into vmm.ko. This is needed for
upcoming in-kernel device emulations like the HPET.

The ioctls VM_IOAPIC_ASSERT_IRQ and VM_IOAPIC_DEASSERT_IRQ are used to
manipulate the ioapic pin state.

Discussed with:	grehan@
Submitted by:	Tycho Nightingale (tycho.nightingale@pluribusnetworks.com)
2013-11-12 22:51:03 +00:00
Konstantin Belousov
6f8a44a5dd Add bits for the AMD features from CPUID function 0x80000001 ECX,
described in the rev. 3.0 of the Kabini BKDG, document 48751.pdf.

Partially based on the patch submitted by:	Dmitry Luhtionov <dmitryluhtionov@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-08 16:32:30 +00:00
Alan Cox
c70af4875e As of r257209, all architectures have defined VM_KMEM_SIZE_SCALE. In other
words, every architecture is now auto-sizing the kmem arena.  This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.

Replace or eliminate all existing definitions of VM_KMEM_SIZE.  With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures.  Use VM_KMEM_SIZE_MIN for
clarity.

Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size.  Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct.  In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not.  Remedy this inconsistency.  Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.

Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena.  Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.

Reviewed by:	kib (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-11-08 16:25:00 +00:00
Neel Natu
03cd05011f Remove the 'vdev' abstraction that was meant to sit on top of device models
in the kernel. This abstraction was redundant because the only device emulated
inside vmm.ko is the local apic and it is always at a fixed guest physical
address.

Discussed with:	grehan
2013-11-04 23:25:07 +00:00
Neel Natu
513c8d338d Rename the VMM_CTRx() family of macros to VCPU_CTRx() to highlight that these
tracepoints are vcpu-specific.

Add support for tracepoints that are global to the virtual machine - these
tracepoints are called VM_CTRx().
2013-10-31 05:20:11 +00:00
Mark Johnston
57170f49f2 Remove references to an unused fasttrap probe hook, and remove the
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).

Discussed with:	rpaulo
2013-10-31 02:35:00 +00:00
Peter Grehan
064bee341e MFC @ r256071
This is just prior to the bhyve_npt_pmap import so will allow
just the change to be merged for easier debug.
2013-10-30 00:05:02 +00:00
Neel Natu
e2f5d9a129 Remove unnecessary includes of <machine/pmap.h>
Requested by:	alc@
2013-10-29 02:25:18 +00:00
Gleb Smirnoff
69eb2b176c Include XEN and HyperV into amd64 LINT. 2013-10-28 21:11:28 +00:00
Konstantin Belousov
86be9f0dd5 Import the driver for VT-d DMAR hardware, as specified in the revision
1.3 of Intelб╝ Virtualization Technology for Directed I/O Architecture
Specification.  The Extended Context and PASIDs from the rev. 2.2 are
not supported, but I am not aware of any released hardware which
implements them.  Code does not use queued invalidation, see comments
for the reason, and does not provide interrupt remapping services.

Code implements the management of the guest address space per domain
and allows to establish and tear down arbitrary mappings, but not
partial unmapping.  The superpages are created as needed, but not
promoted.  Faults are recorded, fault records could be obtained
programmatically, and printed on the console.

Implement the busdma(9) using DMARs.  This busdma backend avoids
bouncing and provides security against misbehaving hardware and driver
bad programming, preventing leaks and corruption of the memory by wild
DMA accesses.

By default, the implementation is compiled into amd64 GENERIC kernel
but disabled; to enable, set hw.dmar.enable=1 loader tunable.  Code is
written to work on i386, but testing there was low priority, and
driver is not enabled in GENERIC.  Even with the DMAR turned on,
individual devices could be directed to use the bounce busdma with the
hw.busdma.pci<domain>:<bus>:<device>:<function>.bounce=1 tunable.  If
DMARs are capable of the pass-through translations, it is used,
otherwise, an identity-mapping page table is constructed.

The driver was tested on Xeon 5400/5500 chipset legacy machine,
Haswell desktop and E5 SandyBridge dual-socket boxes, with ahci(4),
ata(4), bce(4), ehci(4), mfi(4), uhci(4), xhci(4) devices.  It also
works with em(4) and igb(4), but there some fixes are needed for
drivers, which are not committed yet.  Intel GPUs do not work with
DMAR (yet).

Many thanks to John Baldwin, who explained me the newbus integration;
Peter Holm, who did all testing and helped me to discover and
understand several incredible bugs; and to Jim Harris for the access
to the EDS and BWG and for listening when I have to explain my
findings to somebody.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
2013-10-28 13:33:29 +00:00
Konstantin Belousov
e20f049b87 Several small fixes for the amd64 minidump code.
In report_progress(), use nitems(progress_track) instead of manually
hard-coding array size.  Wrap long line.

In blk_write(), code verifies that ptr and pa cannot be non-zero
simultaneously.  The later check for the page-alignment of the ptr
argument never triggers due to pa != 0 always implying ptr == NULL.  I
believe that the intent was to ensure that physicall address passed is
page-aligned, since the address is (temporary) mapped for the duration
of the page write.

Clear the progress_track.visited fields when starting minidump.  If
minidump is restarted or taken second time during the system lifetime,
progress is not printed otherwise, making operator suspectible to the
dump status.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-27 16:31:12 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Neel Natu
ab76fd5833 The ASID allocation in SVM is incorrect because it allocates a single ASID for
all vcpus belonging to a guest. This means that when different vcpus belonging
to the same guest are executing on the same host cpu there may be "leakage"
in the mappings created by one vcpu to another.

The proper fix for this is being worked on and will be committed shortly.

In the meantime workaround this bug by flushing the guest TLB entries on every
VM entry.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-10-21 23:46:37 +00:00
Neel Natu
49cc03da31 Add a new capability, VM_CAP_ENABLE_INVPCID, that can be enabled to expose
'invpcid' instruction to the guest. Currently bhyve will try to enable this
capability unconditionally if it is available.

Consolidate code in bhyve to set the capabilities so it is no longer
duplicated in BSP and AP bringup.

Add a sysctl 'vm.pmap.invpcid_works' to display whether the 'invpcid'
instruction is available.

Reviewed by:	grehan
MFC after:	3 days
2013-10-16 18:20:27 +00:00
Peter Grehan
4599af439f Fix SVM handling of ASTPENDING, which manifested as a
hang on console output (due to a missing interrupt).

SVM does exit processing and then handles ASTPENDING which
overwrites the already handled SVM exit cause and corrupts
virtual machine state. For example, if the SVM exit was due to
an I/O port access but the main loop detected an ASTPENDING,
the exit would be processed as ASTPENDING and leave the
device (e.g. emulated UART) for that I/O port in bad state.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
Reviewed by:	grehan
2013-10-16 05:43:03 +00:00
Neel Natu
d38cae4aad Fix the witness warning that warned against calling uiomove() while holding
the 'vmmdev_mtx' in vmmdev_rw().

Rely on the 'si_threadcount' accounting to ensure that we never destroy the
VM device node while it has operations in progress (e.g. ioctl, mmap etc).

Reported by:	grehan
Reviewed by:	grehan
2013-10-16 00:58:47 +00:00
Glen Barber
6b48eebec6 Document XENHVM and xenpci are mutually inclusive.
Submitted by:   gibbs
Approved by:    re (delphij)
Sponsored by:   The FreeBSD Foundation
2013-10-11 19:40:28 +00:00
Dimitry Andric
9cba9d0157 In sys/amd64/amd64/pmap.c, fix several gcc warnings about uninitialized
variables in reclaim_pv_chunk().

Approved by:	re (marius)
Reviewed by:	neel, kib
X-MFC-With:	r256072
2013-10-08 20:04:35 +00:00
Justin T. Gibbs
5fdd34ee20 Formalize the concept of virtual CPU ids by adding a per-cpu vcpu_id
field.  Perform vcpu enumeration for Xen PV and HVM environments
and convert all Xen drivers to use vcpu_id instead of a hard coded
assumption of the mapping algorithm (acpi or apic ID) in use.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)

amd64/include/pcpu.h:
i386/include/pcpu.h:
	Add vcpu_id to the amd64 and i386 pcpu structures.

dev/xen/timer/timer.c
x86/xen/xen_intr.c
	Use new vcpu_id instead of assuming acpi_id == vcpu_id.

i386/xen/mp_machdep.c:
i386/xen/mptable.c
x86/xen/hvm.c:
	Perform Xen HVM and Xen full PV vcpu_id mapping.

x86/xen/hvm.c:
x86/acpica/madt.c
	Change SYSINIT ordering of acpi CPU enumeration so that it
	is guaranteed to be available at the time of Xen HVM vcpu
	id mapping.
2013-10-05 23:11:01 +00:00
Neel Natu
318224bbe6 Merge projects/bhyve_npt_pmap into head.
Make the amd64/pmap code aware of nested page table mappings used by bhyve
guests. This allows bhyve to associate each guest with its own vmspace and
deal with nested page faults in the context of that vmspace. This also
enables features like accessed/dirty bit tracking, swapping to disk and
transparent superpage promotions of guest memory.

Guest vmspace:
Each bhyve guest has a unique vmspace to represent the physical memory
allocated to the guest. Each memory segment allocated by the guest is
mapped into the guest's address space via the 'vmspace->vm_map' and is
backed by an object of type OBJT_DEFAULT.

pmap types:
The amd64/pmap now understands two types of pmaps: PT_X86 and PT_EPT.

The PT_X86 pmap type is used by the vmspace associated with the host kernel
as well as user processes executing on the host. The PT_EPT pmap is used by
the vmspace associated with a bhyve guest.

Page Table Entries:
The EPT page table entries as mostly similar in functionality to regular
page table entries although there are some differences in terms of what
bits are used to express that functionality. For e.g. the dirty bit is
represented by bit 9 in the nested PTE as opposed to bit 6 in the regular
x86 PTE. Therefore the bitmask representing the dirty bit is now computed
at runtime based on the type of the pmap. Thus PG_M that was previously a
macro now becomes a local variable that is initialized at runtime using
'pmap_modified_bit(pmap)'.

An additional wrinkle associated with EPT mappings is that older Intel
processors don't have hardware support for tracking accessed/dirty bits in
the PTE. This means that the amd64/pmap code needs to emulate these bits to
provide proper accounting to the VM subsystem. This is achieved by using
the following mapping for EPT entries that need emulation of A/D bits:
               Bit Position           Interpreted By
PG_V               52                 software (accessed bit emulation handler)
PG_RW              53                 software (dirty bit emulation handler)
PG_A               0                  hardware (aka EPT_PG_RD)
PG_M               1                  hardware (aka EPT_PG_WR)

The idea to use the mapping listed above for A/D bit emulation came from
Alan Cox (alc@).

The final difference with respect to x86 PTEs is that some EPT implementations
do not support superpage mappings. This is recorded in the 'pm_flags' field
of the pmap.

TLB invalidation:
The amd64/pmap code has a number of ways to do invalidation of mappings
that may be cached in the TLB: single page, multiple pages in a range or the
entire TLB. All of these funnel into a single EPT invalidation routine called
'pmap_invalidate_ept()'. This routine bumps up the EPT generation number and
sends an IPI to the host cpus that are executing the guest's vcpus. On a
subsequent entry into the guest it will detect that the EPT has changed and
invalidate the mappings from the TLB.

Guest memory access:
Since the guest memory is no longer wired we need to hold the host physical
page that backs the guest physical page before we can access it. The helper
functions 'vm_gpa_hold()/vm_gpa_release()' are available for this purpose.

PCI passthru:
Guest's with PCI passthru devices will wire the entire guest physical address
space. The MMIO BAR associated with the passthru device is backed by a
vm_object of type OBJT_SG. An IOMMU domain is created only for guest's that
have one or more PCI passthru devices attached to them.

Limitations:
There isn't a way to map a guest physical page without execute permissions.
This is because the amd64/pmap code interprets the guest physical mappings as
user mappings since they are numerically below VM_MAXUSER_ADDRESS. Since PG_U
shares the same bit position as EPT_PG_EXECUTE all guest mappings become
automatically executable.

Thanks to Alan Cox and Konstantin Belousov for their rigorous code reviews
as well as their support and encouragement.

Thanks for John Baldwin for reviewing the use of OBJT_SG as the backing
object for pci passthru mmio regions.

Special thanks to Peter Holm for testing the patch on short notice.

Approved by:	re
Discussed with:	grehan
Reviewed by:	alc, kib
Tested by:	pho
2013-10-05 21:22:35 +00:00
John-Mark Gurney
29904f46d6 add aesni module to i386 and amd64 NOTES...
Approved by:	re (gjb)
2013-10-04 17:21:01 +00:00
Peter Grehan
e58d944482 Return 0 for a rdmsr of MSR_IA32_PLATFORM_ID. This
is enough to get Ubuntu 12.0.4/13.0.4 to boot.

Approved by:	re@ (blanket)
2013-09-27 14:55:59 +00:00
Konstantin Belousov
4cb8b041d1 In pmap_clear_modify(), initialize pvh even for fictitious managed
page, otherwise the small mappings loop would use uninitialized value.
Note that currently pmap_clear_modify() is not called for fictitious
pages.

Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-24 13:52:47 +00:00
Konstantin Belousov
fecfc089e4 Use the pv lists generation count to read-lock the pvh_global_lock in
pmap_clear_modify().

Noted and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-24 12:26:43 +00:00
Konstantin Belousov
75f50c53f1 Ensure that the ERESTART return from the syscall reloads the
registers, to make the restarted syscall instruction pass the correct
arguments.

PR:	kern/182161
Reported by:	Russ Cox <rsc@swtch.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Approved by:	re (marius)
2013-09-24 12:24:48 +00:00
Konstantin Belousov
ad43b98491 Free both KVA and backing pages when freeing TSS memory.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-23 20:14:15 +00:00
Glen Barber
91aff61084 Put 'device hyperv' back in amd64/GENERIC, incorrectly removed with
r255736.

Pointed out by:	gibbs
Approved by:	re (delphij)
Sponsored by:	The FreeBSD Foundation
2013-09-21 01:07:27 +00:00
Peter Grehan
36f23e3c20 Reorder/regroup the vmm ioctl api definitions to allow some
semblance of API stability and growth during the 10.* timeframe.

Userland/kernel bhyve will have to be recompiled after this.

Reviewed by:	neel
Approved by:	re@ (blanket)
2013-09-21 00:27:53 +00:00
Justin T. Gibbs
566a5f5020 Merge Xen PVHVM support into the GENERIC kernel config for both
amd64 and i386.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	- Introduce two new CPU hooks for initialization and resume
	  purposes. This allows us to get rid of the XENHVM ifdefs in
	  mp_machdep, and also sets some hooks into common code that can be
	  used by other hypervisor implementations.

sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
	- Remove these configs now that GENERIC has builtin support for Xen
	  HVM.

sys/kern/subr_smp.c:
	- Make sure there are no pending IPIs when suspending a system.

sys/x86/xen/hvm.c:
	- Add cpu init and resume vectors that are called from mp_machdep
	  using the new hooks.
	- Only clear the vcpu_info mapping data on resume.  It is already
	  clear for the BSP on a cold boot and is set correctly as APs
	  are started.
	- Gate xen_hvm_init_cpu only to systems running under Xen.

sys/x86/xen/xen_intr.c:
	 - Gate the setup of event channels only to systems running under Xen.
2013-09-20 22:59:22 +00:00
David Christensen
4e4007688c Substantial rewrite of bxe(4) to add support for the BCM57712 and
BCM578XX controllers.

Approved by:	re
MFC after:	4 weeks
2013-09-20 20:18:49 +00:00
Neel Natu
74d1d2b7cc Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
  default pmap initialization routine 'pmap_pinit()'.

These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.

Reviewed by:	kib@, alc@
Approved by:	re (glebius)
2013-09-20 17:06:49 +00:00
Justin T. Gibbs
428b7ca290 Add support for suspend/resume/migration operations when running as a
Xen PVHVM guest.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	- Make sure that are no MMU related IPIs pending on migration.
	- Reset pending IPI_BITMAP on resume.
	- Init vcpu_info on resume.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
	- Add a "suspend_cancelled" parameter to pic_resume().  For the
	  Xen PIC, restoration of interrupt services differs between
	  the aborted suspend and normal resume cases, so we must provide
	  this information.

sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
	- Don't swap out "suspend safe" timers across a suspend/resume
	  cycle.  This includes the Xen PV and ACPI timers.

sys/dev/xen/control/control.c:
	- Perform proper suspend/resume process for PVHVM:
		- Suspend all APs before going into suspension, this allows us
		  to reset the vcpu_info on resume for each AP.
		- Reset shared info page and callback on resume.

sys/dev/xen/timer/timer.c:
	- Implement suspend/resume support for the PV timer. Since FreeBSD
	  doesn't perform a per-cpu resume of the timer, we need to call
	  smp_rendezvous in order to correctly resume the timer on each CPU.

sys/dev/xen/xenpci/xenpci.c:
	- Don't reset the PCI interrupt on each suspend/resume.

sys/kern/subr_smp.c:
	- When suspending a PVHVM domain make sure there are no MMU IPIs
	  in-flight, or we will get a lockup on resume due to the fact that
	  pending event channels are not carried over on migration.
	- Implement a generic version of restart_cpus that can be used by
	  suspended and stopped cpus.

sys/x86/xen/hvm.c:
	- Implement resume support for the hypercall page and shared info.
	- Clear vcpu_info so it can be reset by APs when resuming from
	  suspension.

sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
	- Support UP kernel configurations.

sys/x86/xen/xen_intr.c:
	- Properly rebind per-cpus VIRQs and IPIs on resume.
2013-09-20 05:06:03 +00:00
Alan Cox
deb179bb4c The pmap function pmap_clear_reference() is no longer used. Remove it.
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8.  Now, that call no
longer exists.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon Storage Division
2013-09-20 04:30:18 +00:00
Peter Grehan
ef90af83a5 IFC @ r255692
Comment out IA32_MISC_ENABLE MSR access - this doesn't exist on AMD.
Need to sort out how arch-specific MSRs will be handled.
2013-09-20 00:46:29 +00:00
Peter Grehan
d83d73618f Reconnect the hyperv drivers back into GENERIC now that the
disengage driver issue has been resolved.

Approved by:	re@ (gjb)
2013-09-19 05:07:51 +00:00
Pawel Jakub Dawidek
3fded357af Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00
Roman Divacky
69d912af45 Regen.
Approved by:	re (delphij)
2013-09-18 18:49:26 +00:00
Roman Divacky
b12698e1a1 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
Roman Divacky
70ccaaf58e Regen.
Approved by:    re (delphij)
2013-09-18 17:58:03 +00:00
Roman Divacky
253c75c0de Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00
Peter Grehan
517e21d3e7 Hide TSC-deadline APIC timer support from guests. This mode
isn't yet implemented in bhyve's APIC emulation.

Reviewed by:	neel
Approved by:	re@ (blanket)
2013-09-17 17:56:53 +00:00
Neel Natu
0f9d5dc758 Fix a bug in decoding an instruction that has an SIB byte as well as an
immediate operand. The presence of an SIB byte in decoding the ModR/M field
would cause 'imm_bytes' to not be set to the correct value.

Fix this by initializing 'imm_bytes' independent of the ModR/M decoding.

Reported by: grehan@
Approved by: re@
2013-09-17 16:06:07 +00:00
Bryan Venteicher
03c6abfd1c Add vmx(4) to i386 and amd64 GENERIC
Approved by:	re (gjb)
2013-09-17 01:54:13 +00:00
Konstantin Belousov
70b9173019 In pmap_copy(), when the copied region is mapped with superpage but does
not cover entire superpage, avoid copying.  Doing partial copy would
require demotion, which is incompatible with the already held locks.

Reported by:    cperciva
Reviewed by:    alc
Sponsored by:	The FreeBSD Foundation
MFC after:      1 week
Approved by:	re (delphij)
2013-09-16 06:15:15 +00:00
Peter Grehan
b90fcf02f2 Pull the hyperv drivers from GENERIC until the fix to the disengage
driver to make it only probe when running on hyperv is reviewed and
tested.

Approved by:	re (rodrigc)
2013-09-14 20:38:22 +00:00
Peter Grehan
ab7fb3bca7 Import Hyper-V paravirtualized drivers from projects/hyperv
branch into head.

Approved by:	re@ (hrs)
Obtained from:	Microsoft, NetApp, and Citrix.
2013-09-13 18:47:58 +00:00
Neel Natu
0f1ef0ec80 Fix a limitation in bhyve that would limit the number of virtual machines to
the maximum number of VT-d domains (256 on a Sandybridge). We now allocate a
VT-d domain for a guest only if the administrator has explicitly configured
one or more PCI passthru device(s).

If there are no PCI passthru devices configured (the common case) then the
number of virtual machines is no longer limited by the maximum number of
VT-d domains.

Reviewed by: grehan@
Approved by: re@
2013-09-11 07:11:14 +00:00
Peter Grehan
47823319c3 IFC @ r255459 2013-09-11 00:19:16 +00:00
Peter Grehan
8d39ed16c2 Go way past 11 and bump bhyve's max vCPUs to 16.
This should be sufficient for 10.0 and will do
until forthcoming work to avoid limitations
in this area is complete.

Thanks to Bela Lubkin at tidalscale for the
headsup on the apic/cpu id/io apic ASL parameters
that are actually hex values and broke when
written as decimal when 11 vCPUs were configured.

Approved by:	re@
2013-09-10 03:48:18 +00:00
Alan Cox
70c4180f1c Prior to r254304, we only began scanning the active page queue when the
amount of free memory was close to the point at which we would begin
reclaiming pages.  Now, we continuously scan the active page queue,
regardless of the amount of free memory.  Consequently, we are continuously
calling pmap_ts_referenced() on active pages.

Prior to this change, pmap_ts_referenced() would always demote superpage
mappings in order to obtain finer-grained reference information.  This made
sense because we were coming under memory pressure and would soon have to
begin reclaiming pages.  Now, however, with continuous scanning of the
active page queue, these demotions are taking a toll on performance.  For
example, on one of my test machines, the running time for the HPCC Random
Access benchmark (also known as GUPS) has increased by 54%.  To address this
problem, I have replaced the demotion with a heuristic for periodically
clearing the reference flag on superpage mappings.

Reviewed by:	kib
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-09-08 21:30:53 +00:00
Neel Natu
45e51299b3 Allocate VPIDs by using the unit number allocator to keep do the bookkeeping.
Also deal with VPID exhaustion by allocating out of a reserved range as the
last resort.
2013-09-07 05:30:34 +00:00
Peter Grehan
8a02f69652 Mask off the vector from the MSI-x data word.
Some o/s's set the trigger-mode level bit which
results in an invalid vector and pass-thru interrupts
not being delivered.
2013-09-07 03:33:36 +00:00
Justin T. Gibbs
e44af46e4c Implement PV IPIs for PVHVM guests and further converge PV and HVM
IPI implmementations.

Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Submitted by: gibbs (misc cleanup, table driven config)
Reviewed by:  gibbs
MFC after: 2 weeks

sys/amd64/include/cpufunc.h:
sys/amd64/amd64/pmap.c:
	Move invltlb_globpcid() into cpufunc.h so that it can be
	used by the Xen HVM version of tlb shootdown IPI handlers.

sys/x86/xen/xen_intr.c:
sys/xen/xen_intr.h:
	Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(),
	and remove the ipi vector parameter.  This api allocates
	an event channel port that can be used for ipi services,
	but knows nothing of the actual ipi for which that port
	will be used.  Removing the unused argument and cleaning
	up the comments surrounding its declaration helps clarify
	its actual role.

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	Implement a generic framework for amd64 and i386 that allows
	the implementation of certain CPU management functions to
	be selected at runtime.  Currently this is only used for
	the ipi send function, which we optimize for Xen when running
	on a Xen hypervisor, but can easily be expanded to support
	more operations.

sys/x86/xen/hvm.c:
	Implement Xen PV IPI handlers and operations, replacing native
	send IPI.

sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
sys/i386/include/smp.h:
	Remove NR_VIRQS and NR_IPIS from FreeBSD headers.  NR_VIRQS
	is defined already for us in the xen interface files.
	NR_IPIS is only needed in one file per Xen platform and is
	easily inferred by the IPI vector table that is defined in
	those files.

sys/i386/xen/mp_machdep.c:
	Restructure to more closely match the HVM implementation by
	performing table driven IPI setup.
2013-09-06 22:17:02 +00:00
Bryan Venteicher
ddb4ffd0c6 Add vmx device to the i386 and amd64 NOTES files 2013-09-06 20:24:21 +00:00
Konstantin Belousov
9430f833ca Only lock pvh_global_lock read-only for pmap_page_wired_mappings(),
pmap_is_modified() and pmap_is_referenced(), same as it was done for
pmap_ts_referenced().

Consolidate identical code for pmap_is_modified() and
pmap_is_referenced() into helper pmap_page_test_mappings().

Reviewed by:	alc
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
2013-09-06 16:53:48 +00:00
Konstantin Belousov
3e4f32be7d In pmap_ts_referenced(), when restarting the loop due to pv list
generation changed, do not drop and immediately relock the pv list.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-09-06 16:48:34 +00:00
Gleb Smirnoff
e16477e8d9 On those machines, where sf_bufs do not represent any real object, make
sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely
empty functions.

Reviewed by:	alc, kib, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-09-06 05:37:49 +00:00
Peter Grehan
76c35ba80f Emulate reading of the IA32_MISC_ENABLE MSR, by returning
the host MSR and masking off features that aren't supported.
Linux reads this MSR to detect if NX has been disabled via
BIOS.
2013-09-06 05:20:11 +00:00
Peter Grehan
8b7e3e3022 Allow CPUID leaf 0xD to be read as zeroes.
Linux reads this even though extended features
aren't exposed.

Support for 0xD will be expanded once AVX[2]
is exposed to the guest in upcoming work.
2013-09-06 05:16:10 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Konstantin Belousov
6aceaa3e17 Tidy up some loose ends in the PCID code:
- Restore the pre-PCID TLB shootdown handlers for whole address space
  and single page invalidation asm code, and assign the IPI handler to
  them when PCID is not supported or disabled.  Old handlers have
  linear control flow.  But, still use the common return sequence.

- Stop using pcpu for INVPCID descriptors in the invlrg handler.  It
  is enough to allocate descriptors on the stack.  As result, two
  SWAPGS instructions are shaved off from the code for Haswell+.

- Fix the reverted condition in invlrng for checking of the PCID
  support [1], also in invlrng check that pmap is kernel pmap before
  performing other tests.  For the kernel pmap, which provides global
  mappings, the INVLPG must be used for invalidation always.

- Save the pre-computed pmap' %CR3 register in the struct pmap.  This
  allows to remove several checks for pm_pcid validity when %CR3 is
  reloaded [2].

Noted by:   gibbs [1]
Discussed with:	alc [2]
Tested by:	pho, flo
Sponsored by:	The FreeBSD Foundation
2013-09-04 23:31:29 +00:00
Peter Grehan
46ed9e4908 IFC @ r255209 2013-09-04 20:55:56 +00:00
John Baldwin
dffe0dc4d2 Add support for the 'invpcid' instruction to binutils and DDB's
disassembler on amd64.

MFC after:	1 month
2013-09-03 21:21:47 +00:00
Konstantin Belousov
f27d53b8f2 Fix two build failures for non-tb configurations, UP [2] and when using gas [1].
Reported by:	andreast [1], bf [2]
Sponsored by:	The FreeBSD Foundation
2013-08-31 19:13:21 +00:00
Konstantin Belousov
1099068118 The pm_save should be cleared on the pmap initialization, and not on
the activation.

Noted by:	alc
2013-08-30 20:10:01 +00:00
Konstantin Belousov
37eed8419c Implement support for the process-context identifiers ('PCID') on
Intel CPUs.  The feature tags TLB entries with the Id of the address
space and allows to avoid TLB invalidation on the context switch, it
is available only in the long mode.  In the microbenchmarks, using the
PCID decreased latency of the context switches by ~30% on SandyBridge
class desktop CPUs, measured with the lat_ctx program from lmbench.

If available, use INVPCID instruction when a TLB entry in non-current
address space needs to be invalidated.  The instruction is typically
available on the Haswell.

If needed, the use of PCID can be turned off with the
vm.pmap.pcid_enabled loader tunable set to 0.  The state of the
feature is reported by the vm.pmap.pcid_enabled sysctl.  The sysctl
vm.pmap.pcid_save_cnt reports the number of context switches which
avoided invalidating the TLB; compare with the total number of context
switches, available as sysctl vm.stats.sys.v_swtch.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
Tested by:	pho, bf
2013-08-30 07:59:49 +00:00
Konstantin Belousov
5f5703ef52 Provide a wrapper for the INVPCID instruction, definition of the
descriptor and symbolic names for the operation types.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc
Tested by:	pho, bf
2013-08-30 07:42:38 +00:00
Justin T. Gibbs
76acc41fb7 Implement vector callback for PVHVM and unify event channel implementations
Re-structure Xen HVM support so that:
	- Xen is detected and hypercalls can be performed very
	  early in system startup.
	- Xen interrupt services are implemented using FreeBSD's native
	  interrupt delivery infrastructure.
	- the Xen interrupt service implementation is shared between PV
	  and HVM guests.
	- Xen interrupt handlers can optionally use a filter handler
	  in order to avoid the overhead of dispatch to an interrupt
	  thread.
	- interrupt load can be distributed among all available CPUs.
	- the overhead of accessing the emulated local and I/O apics
	  on HVM is removed for event channel port events.
	- a similar optimization can eventually, and fairly easily,
	  be used to optimize MSI.

Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure,
and misc Xen cleanups:

Sponsored by: Spectra Logic Corporation

Unification of PV & HVM interrupt infrastructure, bug fixes,
and misc Xen cleanups:

Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D

sys/x86/x86/local_apic.c:
sys/amd64/include/apicvar.h:
sys/i386/include/apicvar.h:
sys/amd64/amd64/apic_vector.S:
sys/i386/i386/apic_vector.s:
sys/amd64/amd64/machdep.c:
sys/i386/i386/machdep.c:
sys/i386/xen/exception.s:
sys/x86/include/segments.h:
	Reserve IDT vector 0x93 for the Xen event channel upcall
	interrupt handler.  On Hypervisors that support the direct
	vector callback feature, we can request that this vector be
	called directly by an injected HVM interrupt event, instead
	of a simulated PCI interrupt on the Xen platform PCI device.
	This avoids all of the overhead of dealing with the emulated
	I/O APIC and local APIC.  It also means that the Hypervisor
	can inject these events on any CPU, allowing upcalls for
	different ports to be handled in parallel.

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	Map Xen per-vcpu area during AP startup.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
	Increase the FreeBSD IRQ vector table to include space
	for event channel interrupt sources.

sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
	Remove Xen HVM per-cpu variable data.  These fields are now
	allocated via the dynamic per-cpu scheme.  See xen_intr.c
	for details.

sys/amd64/include/xen/hypercall.h:
sys/dev/xen/blkback/blkback.c:
sys/i386/include/xen/xenvar.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/xen/gnttab.c:
	Prefer FreeBSD primatives to Linux ones in Xen support code.

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/console/xencons_ring.c:
sys/dev/xen/control/control.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/xenpci/xenpci.c:
sys/i386/i386/machdep.c:
sys/i386/include/pmap.h:
sys/i386/include/xen/xenfunc.h:
sys/i386/isa/npx.c:
sys/i386/xen/clock.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/mptable.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/xen_rtc.c:
sys/xen/evtchn/evtchn_dev.c:
sys/xen/features.c:
sys/xen/gnttab.c:
sys/xen/gnttab.h:
sys/xen/hvm.h:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_if.m:
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenstore/xenstore.c:
sys/xen/xenstore/xenstore_dev.c:
sys/xen/xenstore/xenstorevar.h:
	Pull common Xen OS support functions/settings into xen/xen-os.h.

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
	Remove constants, macros, and functions unused in FreeBSD's Xen
	support.

sys/xen/xen-os.h:
sys/i386/xen/xen_machdep.c:
sys/x86/xen/hvm.c:
	Introduce new functions xen_domain(), xen_pv_domain(), and
	xen_hvm_domain().  These are used in favor of #ifdefs so that
	FreeBSD can dynamically detect and adapt to the presence of
	a hypervisor.  The goal is to have an HVM optimized GENERIC,
	but more is necessary before this is possible.

sys/amd64/amd64/machdep.c:
sys/dev/xen/xenpci/xenpcivar.h:
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/sys/kernel.h:
	Refactor magic ioport, Hypercall table and Hypervisor shared
	information page setup, and move it to a dedicated HVM support
	module.

	HVM mode initialization is now triggered during the
	SI_SUB_HYPERVISOR phase of system startup.  This currently
	occurs just after the kernel VM is fully setup which is
	just enough infrastructure to allow the hypercall table
	and shared info page to be properly mapped.

sys/xen/hvm.h:
sys/x86/xen/hvm.c:
	Add definitions and a method for configuring Hypervisor event
	delievery via a direct vector callback.

sys/amd64/include/xen/xen-os.h:
sys/x86/xen/hvm.c:

sys/conf/files:
sys/conf/files.amd64:
sys/conf/files.i386:
	Adjust kernel build to reflect the refactoring of early
	Xen startup code and Xen interrupt services.

sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
sys/dev/xen/control/control.c:
sys/dev/xen/evtchn/evtchn_dev.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/xen/xenstore/xenstore.c:
sys/xen/evtchn/evtchn_dev.c:
sys/dev/xen/console/console.c:
sys/dev/xen/console/xencons_ring.c
	Adjust drivers to use new xen_intr_*() API.

sys/dev/xen/blkback/blkback.c:
	Since blkback defers all event handling to a taskqueue,
	convert this task queue to a "fast" taskqueue, and schedule
	it via an interrupt filter.  This avoids an unnecessary
	ithread context switch.

sys/xen/xenstore/xenstore.c:
	The xenstore driver is MPSAFE.  Indicate as much when
	registering its interrupt handler.

sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbusvar.h:
	Remove unused event channel APIs.

sys/xen/evtchn.h:
	Remove all kernel Xen interrupt service API definitions
	from this file.  It is now only used for structure and
	ioctl definitions related to the event channel userland
	device driver.

	Update the definitions in this file to match those from
	NetBSD.  Implementing this interface will be necessary for
	Dom0 support.

sys/xen/evtchn/evtchnvar.h:
	Add a header file for implemenation internal APIs related
	to managing event channels event delivery.  This is used
	to allow, for example, the event channel userland device
	driver to access low-level routines that typical kernel
	consumers of event channel services should never access.

sys/xen/interface/event_channel.h:
sys/xen/xen_intr.h:
	Standardize on the evtchn_port_t type for referring to
	an event channel port id.  In order to prevent low-level
	event channel APIs from leaking to kernel consumers who
	should not have access to this data, the type is defined
	twice: Once in the Xen provided event_channel.h, and again
	in xen/xen_intr.h.  The double declaration is protected by
	__XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared
	twice within a given compilation unit.

sys/xen/xen_intr.h:
sys/xen/evtchn/evtchn.c:
sys/x86/xen/xen_intr.c:
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/xenpci/xenpcivar.h:
	New implementation of Xen interrupt services.  This is
	similar in many respects to the i386 PV implementation with
	the exception that events for bound to event channel ports
	(i.e. not IPI, virtual IRQ, or physical IRQ) are further
	optimized to avoid mask/unmask operations that aren't
	necessary for these edge triggered events.

	Stubs exist for supporting physical IRQ binding, but will
	need additional work before this implementation can be
	fully shared between PV and HVM.

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
sys/i386/xen/mp_machdep.c
sys/x86/xen/hvm.c:
	Add support for placing vcpu_info into an arbritary memory
	page instead of using HYPERVISOR_shared_info->vcpu_info.
	This allows the creation of domains with more than 32 vcpus.

sys/i386/i386/machdep.c:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/exception.s:
	Add support for new event channle implementation.
2013-08-29 19:52:18 +00:00
Alan Cox
51321f7c31 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
Neel Natu
6f6ebf3c3f Add support for emulating the byte move instruction "mov r/m8, r8".
This emulation is required when dumping MMIO space via the ddb "examine"
command.
2013-08-27 16:49:20 +00:00
Peter Grehan
df5e6de3e3 Add in last remaining files to get AMD-SVM operational.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-23 00:37:26 +00:00
Peter Grehan
0bddaa8d25 HLT_IGNORED stat is used by both vmx and svm - move to common stats.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-22 22:29:27 +00:00
Peter Grehan
8b62d4719a Handle VM_PROT_NONE in nested page table code.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-08-22 22:26:46 +00:00
Konstantin Belousov
e68c64f0ba Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone.  Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:12:24 +00:00
Konstantin Belousov
b544368a22 Use the generation count of the pv list to work around LOR between
pmap lock and pv list lock, and use the shared locking on
pvh_global_lock in pmap_remove_write(), same as it was done for
pmap_ts_referenced().

Noted and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:05:31 +00:00
David E. O'Brien
46be218dce The PADLOCK_RNG and RDRAND_RNG kernel options are now devices.
Thus "device padlock_rng" and "device rdrand_rng" should be
used instead of "options PADLOCK_RNG" & "options RDRAND_RNG".

Requested by:	so@ (des)
Submitted by:	obrien, arthurmesh@gmail.com
Obtained from:	Juniper Networks
2013-08-21 22:43:29 +00:00
Jung-uk Kim
1533b9f714 Reimplement atomic operations on PDEs and PTEs in pmap.h. This change
significantly reduces duplicate code and make it easier to read.

Reviewed by:	alc, bde
2013-08-21 22:40:29 +00:00
Jung-uk Kim
d36eb3f1c4 Remove empty lines before return statements for style consistency. 2013-08-21 22:05:58 +00:00
Jung-uk Kim
8a1ee2d346 Implement atomic_swap() and atomic_testandset().
Reviewed by:	arch, bde, jilles, kib
2013-08-21 22:03:06 +00:00
Jung-uk Kim
da255e4c7f - Remove the "a" constraint from main output operand for atomic_cmpset().
- Use "+" modifier for the "expect" because it is also an output (unused).
2013-08-21 21:30:06 +00:00
Jung-uk Kim
fe94be3da7 Use '+' modifier for a memory operand that is both an input and an output.
It was actually done in r86301 but reverted in r150182 because GCC 3.x was
not able to handle it for a memory operand.  Apparently, this problem was
fixed in GCC 4.1+ and several contrib sources already rely on this feature.
2013-08-21 21:14:16 +00:00
Jung-uk Kim
c1c84ce1bf Remove bogus labels. No functional change. 2013-08-21 20:49:46 +00:00
Jung-uk Kim
ee93d1173a Use consistent style. No functional change. 2013-08-21 20:43:50 +00:00
Neel Natu
b98940e5eb Do not create superpage mappings in the iommu.
This is a workaround to hide the fact that we do not have any code to
demote a superpage mapping before we unmap a single page that is part
of the superpage.
2013-08-20 06:46:40 +00:00
Neel Natu
f77e982952 Extract the location of the remapping hardware units from the ACPI DMAR table.
Submitted by:	Gopakumar T (gopakumar_thekkedath@yahoo.co.in)
2013-08-20 06:20:05 +00:00
Neel Natu
15e683837c Fix breakage caused by r254466 in minidumpsys().
r254466 increased the KVA from 512GB to 2TB which requires 4 PDP pages as
opposed to a single one before the change. This broke minidumpsys() since
it assumed that the entire KVA could be addressed via a single PDP page.

Fix this by obtaining the address of the PDP page from the PML4 entry
associated with the KVA being dumped.

Reported by:	pho
Submitted by:	kib
Pointy hat to:	neel
2013-08-20 02:09:26 +00:00
Konstantin Belousov
d91f339823 When code from r254064 in pmap_ts_referenced() drops pv lock and
blocks on a pmap lock, pmap_release() might proceed in parallel and
destroy the pmap mutex, since unlocked pv lock allows to remove pv
entry owned by the pmap.

For now, gate the pmap_release() on write-locked pvh_global_lock.
Since pmap_ts_release() does not unlock the global lock,
pmap_release() would not destroy pmap mutex until the
pmap_ts_referenced() finished.  We cannot enter pmap_ts_referenced()
and encounter a pv entry for the destroyed pmap if pmap_release()
passed the global lock gate, since pmap_remove_pages() would finish
earlier.

Reported by:	jeff, pho
Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-18 21:36:22 +00:00
Pawel Jakub Dawidek
417ffc66fa Add process descriptors support to the GENERIC kernel. It is already being
used by the tools in base systems and with sandboxing more and more tools
the usage should only increase.

Submitted by:	Mariusz Zaborski <oshogbo@FreeBSD.org>
Sponsored by:	Google Summer of Code 2013
MFC after:	1 month
2013-08-18 10:21:29 +00:00
Neel Natu
0ef2ab3ab8 Bump up the maximum addressable memory on amd64 systems from 1TB to 4TB.
Bump up the KVA size proportionally from 512GB to 2TB.

The number of page table pages used by the direct map is now calculated at
run time based on 'Maxmem'. This means the small memory systems will not
see any additional tax in terms of page table pages for the direct map.

However all amd64 systems, regardless of the memory size, will use 3 more
pages to accomodate the bump in the KVA size.

More details available here:
http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.html
http://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html

Tested with the following configurations:
- Sandybridge server with 64GB of memory.
- bhyve VM with 64MB of memory.
- bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling
  right up against the 4TB maximum memory limit.

Discussed on:	hackers@, current@
Submitted by:	Chris Torek (torek@torek.net)
2013-08-17 19:49:08 +00:00
Jilles Tjoelker
0f3a4d8051 libc: Access _logname_valid more efficiently.
The variable _logname_valid is not exported via the version script;
therefore, change C and i386/amd64 assembler code to remove indirection
(which allowed interposition). This makes the code slightly smaller and
faster.

Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC,
there is no place containing the address of each variable, so there is no
possible definition for PIC_GOT.
2013-08-17 19:24:58 +00:00
Brooks Davis
cd234300d3 Use an ANSI C definition of initializecpucache() to match the declaration
and the rest of the file.
2013-08-15 17:44:44 +00:00
Jung-uk Kim
38da30b419 Merge acpica_machdep.h for amd64 and i386 and move to x86. In fact, these
two files were functionally identical.
2013-08-13 22:05:10 +00:00
Jung-uk Kim
3bd12ca8f1 Tidy up global locks for ACPICA. There is no functional change. 2013-08-13 21:34:03 +00:00
Konstantin Belousov
c325e866f4 Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue.  Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked.  See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members.  Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-10 17:36:42 +00:00
Attilio Rao
e946b94934 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Andriy Gapon
9ba0691bdd follow up to r254051
- update powerpc/GENERIC64 as well, suggested by mdf
- update comments so that they make sense after the change, suggested by
  jhb

X-MFC after:	never (change specific to head)
2013-08-09 08:11:09 +00:00
Neel Natu
f263e391a3 Use local variables with the appropriate types and eliminate a bunch of casts.
This is a cosmetic change but it does help with a proposed change to increase
the maximum size of physical memory supported on amd64 platforms.

Submitted by:	Chris Torek (torek@torek.net)
2013-08-08 03:17:39 +00:00
Konstantin Belousov
449c2e92c9 Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain.  The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass.  This is not optimal, since it
could cause excessive page deactivation and freeing.  The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues.  Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
Konstantin Belousov
872d995f76 Change the pmap_ts_referenced() method of amd64 pmap to use shared
pvh_global_lock.  This allows the method to be executed in parallel,
avoiding undue contention on the pvh_global_lock for the multithreaded
pagedaemon.

The pmap_ts_referenced() function has to inspect the page mappings for
several pmaps, which need to be locked while pv list lock is owned.
This contradicts to the lock order, where pmap lock is before pv list
lock.  Introduce the generation count for the pv list of the page or
superpage, which indicate any change in the pv list, and, as usual,
perform restart of the iteration if generation changed while pv lock
was dropped for blocking acquire of a pmap lock.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:33:15 +00:00
Andriy Gapon
818d282e7b enable KDB_TRACE in GENERICs
KDB_TRACE is not an alternative to DDB/etc, they are complementary.
So I do not see any reason to not enable KDB_TRACE by default.

X-MFC after:	never (change specific to head)
2013-08-07 08:03:50 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Peter Grehan
40f65a4df5 IFC @ r254014 2013-08-07 00:09:49 +00:00
Jeff Roberson
2c0b86b48f - Introduce a specific function, pmap_remove_kernel_pde, for removing
huge pages in the kernel's address space.  This works around several
   asserts from pmap_demote_pde_locked that did not apply and gave false
   warnings.

Discovered by:	pho
Reviewed by:	alc
Sponsored by:	EMC / Isilon Storage Division
2013-08-05 00:28:03 +00:00
Peter Grehan
80a902ef7d Follow-up commit to fix CR0 issues. Maintain
architectural state on CR vmexits by guaranteeing
that EFER, CR0 and the VMCS entry controls are
all in sync when transitioning to IA-32e mode.

Submitted by:	Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
2013-08-03 03:16:42 +00:00
Peter Grehan
672ed870a7 IFC @ r253862
- change the SI_SUB_RUN_SCHEDULER sysinits in hv_utilc and
hv_netvsc_drv_freebsd.c to SI_SUB_KTHREAD_IDLE, since the
former is no longer in FreeBSD.
  The use of these SYSINITs can probably be removed.
2013-08-01 22:09:57 +00:00
Peter Grehan
81ef6611ed Moved clearing of vmm_initialized to avoid the case
of unloading the module while VMs existed. This would
result in EBUSY, but would prevent further operations
on VMs resulting in the module being impossible to
unload.

Submitted by:   Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
Reviewed by:	grehan, neel
2013-08-01 05:59:28 +00:00
Peter Grehan
aaaa065629 Correctly maintain the CR0/CR4 shadow registers.
This was exposed with AP spinup of Linux, and
booting OpenBSD, where the CR0 register is unconditionally
written to prior to the longjump to enter protected
mode. The CR-vmexit handling was not updating CPU state which
resulted in a vmentry failure with invalid guest state.

A follow-on submit will fix the CPU state issue, but this
fix prevents the CR-vmexit prior to entering protected
mode by properly initializing and maintaining CR* state.

Reviewed by:	neel
Reported by:	Gopakumar.T @ netapp
2013-08-01 01:18:51 +00:00
David E. O'Brien
0e6a0799a9 Back out r253779 & r253786. 2013-07-31 17:21:18 +00:00
David E. O'Brien
99ff83da74 Decouple yarrow from random(4) device.
* Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option.
  The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow.

* random(4) device doesn't really depend on rijndael-*.  Yarrow, however, does.

* Add random_adaptors.[ch] which is basically a store of random_adaptor's.
  random_adaptor is basically an adapter that plugs in to random(4).
  random_adaptor can only be plugged in to random(4) very early in bootup.
  Unplugging random_adaptor from random(4) is not supported, and is probably a
  bad idea anyway, due to potential loss of entropy pools.
  We currently have 3 random_adaptors:
  + yarrow
  + rdrand (ivy.c)
  + nehemeiah

* Remove platform dependent logic from probe.c, and move it into
  corresponding registration routines of each random_adaptor provider.
  probe.c doesn't do anything other than picking a specific random_adaptor
  from a list of registered ones.

* If the kernel doesn't have any random_adaptor adapters present then the
  creation of /dev/random is postponed until next random_adaptor is kldload'ed.

* Fix randomdev_soft.c to refer to its own random_adaptor, instead of a
  system wide one.

Submitted by: arthurmesh@gmail.com, obrien
Obtained from: Juniper Networks
Reviewed by: obrien
2013-07-29 20:26:27 +00:00
Andriy Gapon
a29cc9a34b Revert r253748,253749
This WIP should not have been committed yet.

Pointyhat to:	avg
2013-07-28 18:44:17 +00:00
Andriy Gapon
366d8bfb7b put contents of cpu.h under _KERNEL
no userland-serviceable parts inside

MFC after:	20 days
2013-07-28 18:32:27 +00:00
Andriy Gapon
a69e8d609e x86: detect mwait capabilities and extensions, when present
Reviewed by:	kib (earlier amd64-only version)
MFC after:	2 weeks
2013-07-28 17:54:42 +00:00
Jeff Roberson
2f84c08eee - Use kmem_malloc rather than kmem_alloc() for GDT/LDT/tss allocations etc.
This eliminates some unusual uses of that API in favor of more typical
   uses of kmem_malloc().

Discussed with:	kib/alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-07-26 19:06:14 +00:00
Neel Natu
84e169c6c3 Add support for emulation of the "or r/m, imm8" instruction.
Submitted by:	Zhixiang Yu (zxyu.core@gmail.com)
Obtained from:	GSoC 2013 (AHCI device emulation for bhyve)
2013-07-23 23:43:00 +00:00
Neel Natu
113326a772 Fix a bug introduced in r252646 that causes a page with the PG_PTE_PAT bit set
to be interpreted as a superpage. This is because PG_PTE_PAT is at the same
bit position in PTE as PG_PS is in a PDE.

This caused a number of regressions on amd64 systems: panic when starting
X applications, freeze during shutdown etc.

Pointy hat to:	me
Tested by: gperez@entel.upc.edu, joel, dumbbell
Reviewed by: kib
2013-07-23 22:17:00 +00:00
Peter Grehan
15b996d742 First cut at adding the hyperv drivers to GENERIC.
The files inventory should probably have the modules split
out into net/storage/common etc as the modules build is,
but this will do for now.
2013-07-19 05:32:58 +00:00
Konstantin Belousov
0f6bcda4cd MFi386: add ddb "show sysregs" command.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-15 06:30:57 +00:00
Konstantin Belousov
0cdd261571 Clear m->object for the page taken from the delayed free list for
reuse as the pv chink page in reclaim_pv_chunk().  Having non-NULL
m->object is wrong for page not owned by an object and confuses both
vm_page_free_toq() and vm_page_remove() when the page is freed later.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2013-07-10 09:24:03 +00:00
Xin LI
1fdeb1651c Import HighPoint DC Series Data Center HBA (DC7280 and R750) driver.
This driver works for FreeBSD/i386 and FreeBSD/amd64 platforms.

Many thanks to HighPoint for providing this driver.

MFC after:	1 day
2013-07-06 07:49:41 +00:00
Neel Natu
be28275d00 If a superpage mapping is being removed then we need to ignore the PG_PDE_PAT
bit when looking up the vm_page associated with the superpage's physical
address.

If the caching attribute for the mapping is write combining or write protected
then the PG_PDE_PAT bit will be set and thus cause an 'off-by-one' error
when looking up the vm_page.

Fix this by using the PG_PS_FRAME mask to compute the physical address for
a superpage mapping instead of PG_FRAME.

This is a theoretical issue at this point since non-writeback attributes are
currently used only for fictitious mappings and fictitious mappings are not
subject to promotion.

Discussed with:	alc, kib
MFC after:	2 weeks
2013-07-03 23:21:25 +00:00
Neel Natu
de16308c48 Verify that all bytes in the instruction buffer are consumed during decoding.
Suggested by:	grehan
2013-07-03 23:05:17 +00:00
Peter Grehan
e60f5d779e Ignore guest PAT settings by default in EPT mappings.
From experimentation, other hypervisors also do this.

Diagnosed by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
2013-07-01 20:05:43 +00:00
Konstantin Belousov
70a7dd5d5b Fix issues with zeroing and fetching the counters, on x86 and ppc64.
Issues were noted by Bruce Evans and are present on all architectures.

On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32.  If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.

On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot.  The effect would be a counter going off by
arbitrary value after zeroing.  Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive.  On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.

PowerPC64 is treated the same as amd64.

For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching.  It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm).  If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.

Noted by:  bde
Sponsored by:	The FreeBSD Foundation
2013-07-01 02:48:27 +00:00
Peter Grehan
560d5eda2c Make sure all CPUID values are handled, instead of exiting the
bhyve process when an unhandled one is encountered.

Hide some additional capabilities from the guest (e.g. debug store).

This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on
AP spinup (where CPUID is used when sync'ing the TSCs) and the
issue with the Java build where CPUIDs are issued from a guest
userspace.

Submitted by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
Reported by:	many
2013-06-28 06:05:33 +00:00
Jung-uk Kim
b1ddd13145 Move definitions required by userland applications out of acpica_machdep.h. 2013-06-27 00:22:40 +00:00
Konstantin Belousov
9dbb63fe03 Allow immediate operand.
Sponsored by:	The FreeBSD Foundation
2013-06-20 14:30:04 +00:00
Konstantin Belousov
c788f92509 Some clarifications and updates for the comments, mostly retrieved
from Bruce Evans.  Trim the trailing spaces.

MFC after:	1 week
2013-06-19 05:05:16 +00:00
Sergey Kandaurov
1e2751ddeb Fix a gcc warning uncovered after r251745.
Reported by:	Sergey V. Dyatko
Reviewed by:	neel
2013-06-18 23:31:09 +00:00
Justin T. Gibbs
a8f6ac0573 Upgrade Xen interface headers to Xen 4.2.1.
Move FreeBSD from interface version 0x00030204 to 0x00030208.
Updates are required to our grant table implementation before we
can bump this further.

sys/xen/hvm.h:
	Replace the implementation of hvm_get_parameter(), formerly located
	in sys/xen/interface/hvm/params.h.  Linux has a similar file which
	primarily stores this function.

sys/xen/xenstore/xenstore.c:
	Include new xen/hvm.h header file to get hvm_get_parameter().

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
	Correctly protect function definition and variables from being
	included into assembly files in xen-os.h

	Xen memory barriers are now prefixed with "xen_" to avoid conflicts
	with OS native primatives.  Define Xen memory barriers in terms of
	the native FreeBSD primatives.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	Roger Pau Monné
Tested by:	Roger Pau Monné
Obtained from:	Roger Pau Monné (bug fixes)
2013-06-14 23:43:44 +00:00
Sergey Kandaurov
82f2974a69 Replace cpusetffs_obj with CPU_FFS, missed in r251703.
Reported by:	bdrewery, O. Hartmann
2013-06-14 10:26:38 +00:00
Neel Natu
8f1664b724 Remove unused macros PTESHIFT, PDESHIFT, PDPESHIFT and PML4ESHIFT.
Reviewed by:	alc
2013-06-14 00:03:43 +00:00
Jeff Roberson
17a2737732 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
Konstantin Belousov
9138579845 Assert that interrupts are enabled in the trap handlers on x86 before
calling generic code to deliver signals.

Discussed with:	bde
Tested by:	pho
MFC after:	1 week
2013-06-03 17:40:05 +00:00
Konstantin Belousov
cb5bfd1240 Use slightly more idiomatic expression to get the address of array.
Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:39:39 +00:00
Konstantin Belousov
87b94d9a92 The _MC_HASFPXSTATE and _MC_IA32_HASFPXSTATE flags have the same bit
value on purpose, but the ia32 context handling code is logically more
correct to use the _MC_IA32_HASFPXSTATE name for the flag.

Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:36:46 +00:00
Konstantin Belousov
e9249a80f6 The ia32_get_mcontext() does not need to set PCB_FULL_IRET. The
usermode context state is not changed by the get operation, and
get_mcontext() does not require full iret as well.

Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:31:15 +00:00
Konstantin Belousov
80b5691a76 When reporting the fault details, also print %rsp.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:29:20 +00:00
Konstantin Belousov
6806ce6ec8 When handling an exception from the attempt from loading the faulting
context on return from the trap handler, re-enable the interrupts on
i386 and amd64.  The trap return path have to disable interrupts since
the sequence of loading the machine state is not atomic.  The trap()
function which transfers the control to the special handler would
enable the interrupt, but an iret loads the previous eflags with PSL_I
clear.  Then, the special handler calls trap() on its own, which now
sees the original eflags with PSL_I set and does not enable
interrupts.

The end result is that signal delivery and process exiting code could
be executed with interrupts disabled, which is generally wrong and
triggers several assertions.

For amd64, the interrupts are enabled conditionally based on PSL_I in
the eflags of the outer frame, as it is already done for
doreti_iret_fault.  For i386, the interrupts are enabled
unconditionally, the ast loop could have opened a window with
interrupts enabled just before the iret anyway.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:26:08 +00:00
Achim Leubner
dce93cd06d Driver 'aacraid' added. Supports Adaptec by PMC RAID controller families Series 6, 7, 8 and upcoming products. Older Adaptec RAID controller families are supported by the 'aac' driver.
Approved by:	scottl (mentor)
2013-05-24 09:22:43 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4f493a25f4 Fix the hardware watchpoints on SMP amd64. Load the updated %dr
registers also on other CPUs, besides the CPU which happens to execute
the ddb.  The debugging registers are stored in the pcpu area,
together with the command which is executed by the IPI stop handler
upon resume.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:24:32 +00:00
Konstantin Belousov
ea38ca6ea5 Add amd64-specific ddb command 'show phys2dmap', which calculates the
address in the direct map corresponding to the given physical address.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:07:12 +00:00
Marcel Moolenaar
cb34ed4434 Add basic support for FDT to i386 & amd64. This change includes:
1.  Common headers for fdt.h and ofw_machdep.h under x86/include
    with indirections under i386/include and amd64/include.
2.  New modinfo for loader provided FDT blob.
3.  Common x86_init_fdt() called from hammer_time() on amd64 and
    init386() on i386.
4.  Split-off FDT specific low-level console functions from FDT
    bus methods for the uart(4) driver. The low-level console
    logic has been moved to uart_cpu_fdt.c and is used for arm,
    mips & powerpc only. The FDT bus methods are shared across
    all architectures.
5.  Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the
    fdt_pic_table[] arrays. Both are empty right now.

FDT addresses are I/O ports on x86. Since the core FDT code does
not handle different address spaces, adding support for both I/O
ports and memory addresses requires some thought and discussion.
It may be better to use a compile-time option that controls this.

Obtained from:	Juniper Networks, Inc.
2013-05-21 03:05:49 +00:00
Ed Schouten
31ffd8039d Improve readability of static assertions for OFFSET_* macros.
Instead of doing all sorts of weird casting of constants to
pointer-pointers, simply use the standard C offsetof() macro to obtain
the offset of the respective fields in the structures.
2013-05-13 21:47:17 +00:00
Peter Wemm
dda759d344 Tidy up some CVS workarounds. 2013-05-12 01:53:47 +00:00
Rui Paulo
009b3a1c3f Fix several standard extended feature bits.
Submitted by:	Oliver Pinter <oliver.pntr at gmail.com>
2013-05-11 01:31:51 +00:00
Neel Natu
0acb0d84c5 Support array-type of stats in bhyve.
An array-type stat in vmm.ko is defined as follows:
VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu");

It is incremented as follows:
vmm_stat_array_incr(vm, vcpuid, IPIS_SENT, array_index, 1);

And output of 'bhyvectl --get-stats' looks like:
ipis sent to vcpu[0]     3114
ipis sent to vcpu[1]     0

Reviewed by:	grehan
Obtained from:	NetApp
2013-05-10 02:59:49 +00:00
Dmitry Chagin
d127f15308 Retire write-only PCB_GS32BIT pcb flag on amd64. 2013-05-09 21:42:43 +00:00
Konstantin Belousov
241b67bb47 Correct the type for the literal used on the left side of the shift up
to 63 bit positions.

Do not fill the save area and do not set the saved bit in the xstate
bit vector for the state which is not marked as enabled in xsave_mask.

Reported and tested by:	Jim Ohlstein <jim@ohlste.in>
MFC after:	3 days
2013-05-09 17:25:29 +00:00
Attilio Rao
941646f5ec Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept.  The change should also be useful
for consolidation and consistency.

Sponsored by:	EMC / Isilon storage division
Obtained from:	jeff
Reviewed by:	alc
2013-05-07 22:46:24 +00:00
Ed Maste
f5efbffe52 Switch to standard copyright license text
The initial version of this came from Sandvine but had "PROVIDED BY NETAPP,
INC" in the copyright text, presuambly because the license block was copied
from another file.  Replace it with standard "AUTHOR AND CONTRIBUTORS" form.

Approvided by: grehan@
2013-05-02 12:35:15 +00:00
Konstantin Belousov
14f525595c Partially saved extended state must be handled always, i.e. for both
fpu-owned context, and for pcb-saved one.  More, the XSAVE could do
partial save, same as XSAVEOPT, so qualifier for the handler should be
use_xsave and not use_xsaveopt.

Since xsave_area_desc is now needed regardless of the XSAVEOPT use,
remove the write-only use_xsaveopt variable.

In collaboration with:	jhb
MFC after:	1 week
2013-05-01 20:08:33 +00:00
Konstantin Belousov
6f2d9906a9 The check to ensure that xstate_bv always has XFEATURE_ENABLED_X87 and
XFEATURE_ENABLED_SSE bits set is not needed.  CPU correctly handles
any bitmask which is subset of the enabled bits in %XCR0.

More, CPU instructions XSAVE and XSAVEOPT could write the mask without
e.g. XFEATURE_ENABLED_SSE, after the VZEROALL.  The check prevents the
restoration of the otherwise valid FPU save area.

In collaboration with:	jhb
MFC after:	1 week
2013-05-01 20:03:50 +00:00
Carl Delsey
e47937d1b7 Add a new driver to support the Intel Non-Transparent Bridge(NTB).
The NTB allows you to connect two systems with this device using a PCI-e
link. The driver is made of two modules:
 - ntb_hw which is a basic hardware abstraction layer for the device.
 - if_ntb which implements the ntb network device and the communication
   protocol.

The driver is limited at the moment to CPU memcpy instead of using DMA, and
only Back-to-Back mode is supported. Also the network device isn't full
featured yet. These changes will be coming soon. The DMA change will also
bring in the ioat driver from the project branch it is on now.

This is an initial port of the GPL/BSD Linux driver contributed by Jon Mason
from Intel. Any bugs are my contributions.

Sponsored by: Intel
Reviewed by: jimharris, joel (man page only)
Approved by: jimharris (mentor)
2013-04-29 22:48:53 +00:00
Neel Natu
b18ac2d833 - SVM nested paging support
- Define data structures to contain the SVM vcpu context
- Define data structures to contain guest and host software context
- Change license in vmcb.h and vmcb.c to remove references to NetApp that
  inadvertently sneaked in when the license text was copied from amdv.c.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-04-27 04:49:51 +00:00
Peter Grehan
d3c11f40a5 Add RIP-relative addressing to the instruction decoder.
Rework the guest register fetch code to allow the RIP to
be extracted from the VMCS while the kernel decoder is
functioning.

Hit by the OpenBSD local-apic code.

Submitted by:	neel
Reviewed by:	grehan
Obtained from:	NetApp
2013-04-25 04:56:43 +00:00
Rui Paulo
068e8f74e4 Print RDSEED, ADX, and SMAP.
Pointed out by:	kib
2013-04-18 01:21:44 +00:00
Gabor Kovesdan
a8b5c2a0aa - Correct spelling in comments
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:56:11 +00:00
Rui Paulo
d1dcd93145 Print more bits from the standard extended features CPUID which will be
available in the Haswell architecture (c.f. Intel Document #319433-012A).
2013-04-17 06:51:17 +00:00
Neel Natu
92ec8c801d Add a Cliff Notes version of the purpose and contents of the VMCB.
Requested by:	julian
2013-04-15 04:16:12 +00:00
Neel Natu
3565b59ec0 Create sysctl node 'hw.vmm.vmx' and populate it with oids that expose the VMX
hardware capabilities.

Obtained from:	NetApp
2013-04-13 21:41:51 +00:00
Konstantin Belousov
fcb29b9210 Fix the name of the pcb member in the comments.
Submitted by:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	3 days
2013-04-13 15:20:33 +00:00
Neel Natu
26d66b9d58 Use the MAKEDEV_CHECKNAME flag to check for an invalid device name and return
an error instead of panicking.

Obtained from:	NetApp
2013-04-13 05:11:21 +00:00
Edward Tomasz Napierala
8ed9860914 Remove ctl(4) from GENERIC. Also remove 'options CTL_DISABLE'
and kern.cam.ctl.disable tunable; those were introduced as a workaround
to make it possible to boot GENERIC on low memory machines.

With ctl(4) being built as a module and automatically loaded by ctladm(8),
this makes CTL work out of the box.

Reviewed by:	ken
Sponsored by:	FreeBSD Foundation
2013-04-12 16:25:03 +00:00
Neel Natu
d5408b1d26 If vmm.ko could not be initialized correctly then prevent the creation of
virtual machines subsequently.

Submitted by:	Chris Torek
2013-04-12 01:16:52 +00:00
Neel Natu
934d71fcf4 Provide functions to manipulate the guest state in the VMCB.
Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-04-11 06:52:19 +00:00
Neel Natu
150369ab7c Make the code to check if VMX is enabled more readable by using macros
instead of magic numbers.

Discussed with:	Chris Torek
2013-04-11 04:29:45 +00:00
Neel Natu
1472b87f2f Unsynchronized TSCs on the host require special handling in bhyve:
- use clock_gettime(2) as the time base for the emulated ACPI timer instead
  of directly using rdtsc().

- don't advertise the invariant TSC capability to the guest to discourage it
  from using the TSC as its time base.

Discussed with:	jhb@ (about making 'smp_tsc' a global)
Reported by:	Dan Mack on freebsd-virtualization@
Obtained from:	NetApp
2013-04-10 05:59:07 +00:00
Gleb Smirnoff
4e76af6a41 Merge from projects/counters: counter(9).
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.

See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.

In collaboration with:	kib
Reviewed by:		luigi
Tested by:		ae, ray
Sponsored by:		Nginx, Inc.
2013-04-08 19:40:53 +00:00
Gleb Smirnoff
17dece86fe Merge from projects/counters:
Pad struct pcpu so that its size is denominator of PAGE_SIZE. This
is done to reduce memory waste in UMA_PCPU_ZONE zones.

Sponsored by:	Nginx, Inc.
2013-04-08 19:19:10 +00:00
Peter Grehan
117e8f378e Don't panic when a valid divisor of 1 has been requested.
Obtained from:	NetApp
2013-04-05 22:16:31 +00:00
Neel Natu
a3d60ba178 Macros, bitmasks and structs that describe the SVM virtual machine control
block aka VMCB.

Submitted by:	Anish Gupta (akgupt3@gmail.com)
2013-04-05 06:55:19 +00:00
Alexander Motin
45f6d66569 Remove all legacy ATA code parts, not used since options ATA_CAM enabled in
most kernels before FreeBSD 9.0.  Remove such modules and respective kernel
options: atadisk, ataraid, atapicd, atapifd, atapist, atapicam.  Remove the
atacontrol utility and some man pages.  Remove useless now options ATA_CAM.

No objections:	current@, stable@
MFC after:	never
2013-04-04 07:12:24 +00:00
Neel Natu
77d8fd9bb3 Add counter to keep track of the number of timer interrupts generated by
the local apic for each virtual cpu.
2013-03-31 03:56:48 +00:00
Neel Natu
b5aaf7b22b Add some more stats to keep track of all the reasons that a vcpu is exiting. 2013-03-30 17:46:03 +00:00
Neel Natu
66f71b7d24 Allow caller to skip 'guest linear address' validation when doing instruction
decode. This is to accomodate hardware assist implementations that do not
provide the 'guest linear address' as part of nested page fault collateral.

Submitted by:	Anish Gupta (akgupt3 at gmail dot com)
2013-03-28 21:26:19 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
Attilio Rao
774d251d99 Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
  resident/cached pages collections as the per-vm_page_t splay iterators
  are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations.  In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone.  This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves.  However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan.  It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by:	EMC / Isilon storage division
In collaboration with:	alc, jeff
Tested by:	flo, pho, jhb, davide
Tested by:	ian (arm)
Tested by:	andreast (powerpc)
2013-03-18 00:25:02 +00:00
Neel Natu
3f23d3ca9f Fix the '-Wtautological-compare' warning emitted by clang for comparing the
unsigned enum type with a negative value.

Obtained from:	NetApp
2013-03-16 22:53:05 +00:00
Neel Natu
61592433eb Allow vmm stats to be specific to the underlying hardware assist technology.
This can be done by using the new macros VMM_STAT_INTEL() and VMM_STAT_AMD().
Statistic counters that are common across the two are defined using VMM_STAT().

Suggested by:	Anish Gupta
Discussed with:	grehan
Obtained from:	NetApp
2013-03-16 22:40:20 +00:00
Konstantin Belousov
e8a4a618cf Add pmap function pmap_copy_pages(), which copies the content of the
pages around, taking array of vm_page_t both for source and
destination.  Starting offsets and total transfer size are specified.

The function implements optimal algorithm for copying using the
platform-specific optimizations.  For instance, on the architectures
were the direct map is available, no transient mappings are created,
for i386 the per-cpu ephemeral page frame is used.  The code was
typically borrowed from the pmap_copy_page() for the same
architecture.

Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.

For sparc64, the existing code has known issues and a stab is added
instead, to allow the kernel linking.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after:	2 weeks
2013-03-14 20:18:12 +00:00
Alan Cox
9f585991ba The kernel pmap is statically allocated, so there is really no need to
explicitly initialize its pm_root field to zero.

Sponsored by:	EMC / Isilon Storage Division
2013-03-10 21:07:44 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Bryan Venteicher
0cfbcf8c7b Remove the virtio dependency entry for the VirtIO device drivers. This
will prevent the kernel from linking if the device driver are included
without the virtio module. Remove pci and scbus for the same reason.

Also explain the relationship and necessity of the virtio and virtio_pci
modules. Currently in FreeBSD, we only support VirtIO PCI, but it could
be replaced with a different interface (like MMIO) and the device
(network, block, etc) will still function.

Requested by:	luigi
Approved by:	grehan (mentor)
MFC after:	3 days
2013-03-06 07:17:53 +00:00
Kenneth D. Merry
3a45b4781a Re-enable CTL in GENERIC on i386 and amd64, but turn on the CTL disable
tunable by default.

This will allow GENERIC configurations to boot on small memory boxes, but
not require end users who want to use CTL to recompile their kernel.  They
can simply set kern.cam.ctl.disable=0 in loader.conf.

The eventual solution to the memory usage problem is to change the way
CTL allocates memory to be more configurable, but this should fix things
for small memory situations in the mean time.

UPDATING:		Explain the change in the CTL configuration, and
			how users can enable CTL if they would like to use
			it.

sys/conf/options:	Add a new option, CTL_DISABLE, that prevents CTL
			from initializing.

ctl.c:			If CTL_DISABLE is turned on, don't initialize.

i386/conf/GENERIC,
amd64/conf/GENERIC:	Re-enable device ctl, and add the CTL_DISABLE
			option.
2013-03-04 21:18:45 +00:00
Attilio Rao
b38d37f7b5 Merge from vmc-playground branch:
Rename the pv_entry_t iterator from pv_list to pv_next.
Besides being more correct technically (as the name seems to suggest
this is a list while it is an iterator), it will also be needed by
vm_radix work to avoid a nameclash on macro expansions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc, jeff
Tested by:	flo, pho, jhb, davide
2013-03-02 14:19:08 +00:00
Adrian Chadd
fe138cc2af Disable the ctl driver in GENERIC.
It unfortunately steals a fair chunk of RAM at startup even if it's not
actively used, which prevents FreeBSD VMs of 128MB from successfully
booting and running.
2013-03-02 08:12:41 +00:00
Davide Italiano
acccf7d8b4 MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.

The commmit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
Attilio Rao
dc1558d1cd Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-02-27 18:12:13 +00:00
Konstantin Belousov
31a53cd036 Convert machine/elf.h, machine/frame.h, machine/sigframe.h,
machine/signal.h and machine/ucontext.h into common x86 includes,
copying from amd64 and merging with i386.

Kernel-only compat definitions are kept in the i386/include/sigframe.h
and i386/include/signal.h, to reduce amd64 kernel namespace pollution.
The amd64 compat uses its own definitions so far.

The _MACHINE_ELF_WANT_32BIT definition is to allow the
sys/boot/userboot/userboot/elf32_freebsd.c to use i386 ELF definitions
on the amd64 compile host.  The same hack could be usefully abused by
other code too.
2013-02-20 17:39:52 +00:00
Jung-uk Kim
00a54dfb1c Consistently use round_page(x) rather than roundup(x, PAGE_SIZE). There is
no functional change.
2013-02-15 22:43:08 +00:00
Konstantin Belousov
bf94adb3e1 Print slightly more useful information on the 'bad pte' panic.
No objections from:	alc
MFC after:	1 week
2013-02-14 19:22:15 +00:00
Konstantin Belousov
252b1f6e22 Assert that user address is never qremoved.
No objections from:	alc
MFC after:	1 week
2013-02-14 19:21:20 +00:00
Neel Natu
25448de222 Requests for invalid CPUID leaves should map to the highest known leaf instead.
Reviewed by:	grehan
Obtained from:	NetApp
2013-02-13 23:22:17 +00:00
Neel Natu
485b3300cc Implement guest vcpu pinning using 'pthread_setaffinity_np(3)'.
Prior to this change pinning was implemented via an ioctl (VM_SET_PINNING)
that called 'sched_bind()' on behalf of the user thread.

The ULE implementation of 'sched_bind()' bumps up 'td_pinned' which in turn
runs afoul of the assertion '(td_pinned == 0)' in userret().

Using the cpuset affinity to implement pinning of the vcpu threads works with
both 4BSD and ULE schedulers and has the happy side-effect of getting rid
of a bunch of code in vmm.ko.

Discussed with:	grehan
2013-02-11 20:36:07 +00:00
Neel Natu
6d62a48f47 Compute the number of initial kernel page table pages (NKPT) dynamically.
This eliminates the need to recompile the kernel when the default value
of NKPT is not big enough - for e.g. when loading large kernel modules
or memory disk images from the loader.

If NKPT is defined in the kernel configuration file then it overrides the
dynamic calculation.

Reviewed by:	alc, kib
2013-02-06 04:53:00 +00:00
Andriy Gapon
1a89ca4cf5 cpususpend_handler: mark AP as resumed only after fully setting up lapic
Reviewed by:	jhb
Tested by:	Sergey V. Dyatko <sergey.dyatko@gmail.com>,
		KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
MFC after:	12 days
2013-02-02 12:04:32 +00:00
Andriy Gapon
548b201607 x86 suspend/resume: suspend pics and pseudo-pics in reverse order
- change 'pics' from STAILQ to TAILQ
- ensure that Local APIC is always first in 'pics'

Reviewed by:	jhb
Tested by:	Sergey V. Dyatko <sergey.dyatko@gmail.com>,
		KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
MFC after:	12 days
2013-02-02 12:02:42 +00:00
Eitan Adler
4752ed3d7f Remove support for plip from the GENERIC kernel as no systems in the
last 10 years require this support.

Discussed with:	db
Discussed with:	kib
Reviewed by:	imp
Reviewed by:	jhb
Reviewed by:	-hackers
Approved by:	cperciva (mentor)
2013-02-01 20:17:11 +00:00
Neel Natu
2b89a04496 Fix a broken assumption in the passthru implementation that the MSI-X table
can only be located at the beginning or the end of the BAR.

If the MSI-table is located in the middle of a BAR then we will split the
BAR into two and create two mappings - one before the table and one after
the table - leaving a hole in place of the table so accesses to it can be
trapped and emulated.

Obtained from:	NetApp
2013-02-01 03:49:09 +00:00
Neel Natu
07044a96d8 Increase the number of passthru devices supported by bhyve.
The maximum length of an environment variable puts a limitation on the
number of passthru devices that can be specified via a single variable.
The workaround is to allow user to specify passthru devices via multiple
environment variables instead of a single one.

Obtained from:	NetApp
2013-02-01 01:16:26 +00:00
Neel Natu
8faceb3292 Add emulation support for instruction "88/r: mov r/m8, r8".
This instruction moves a byte from a register to a memory location.

Tested by: tycho nightingale at pluribusnetworks com
2013-01-30 04:09:09 +00:00
John Baldwin
d825ce0a5d Reduce duplication between i386/linux/linux.h and amd64/linux32/linux.h
by moving bits that are MI out into headers in compat/linux.

Reviewed by:	Chagin Dmitry  dmitry | gmail
MFC after:	2 weeks
2013-01-29 18:41:30 +00:00
Peter Grehan
1fb0ea3f1a Always allow access to the sysenter cs/esp/eip MSRs since they
are automatically saved and restored in the VMCS.

Reviewed by:	neel
Obtained from:	NetApp
2013-01-25 21:38:31 +00:00
John Baldwin
fb709557a3 Don't assume that all Linux TCP-level socket options are identical to
FreeBSD TCP-level socket options (only the first two are).  Instead,
using a mapping function and fail unsupported options as we do for other
socket option levels.

MFC after:	2 weeks
2013-01-23 21:44:48 +00:00
Neel Natu
e3f0800bd1 Postpone vmm module initialization until after SMP is initialized - particularly
that 'smp_started != 0'.

This is required because the VT-x initialization calls smp_rendezvous()
to set the CR4_VMXE bit on all the cpus.

With this change we can preload vmm.ko from the loader.

Reported by:	alfred@, sbruno@
Obtained from:	NetApp
2013-01-21 01:33:10 +00:00
Neel Natu
912a3e678a Add svn properties to the recently merged bhyve source files.
The pre-commit hook will not allow any commits without the svn:keywords
property in head.
2013-01-20 03:42:49 +00:00
Neel Natu
c458fc1ed4 Merge projects/bhyve to head.
'bhyve' was developed by grehan@ and myself at NetApp (thanks!).

Special thanks to Peter Snyder, Joe Caradonna and Michael Dexter for their
support and encouragement.

Obtained from:	NetApp
2013-01-19 04:18:52 +00:00
John Baldwin
b5821c6f0e Fix build with SMP disabled.`
Reported by:	bf
2013-01-19 01:18:22 +00:00
John Baldwin
f876ffeae3 Don't attempt to use clflush on the local APIC register window. Various
CPUs exhibit bad behavior if this is done (Intel Errata AAJ3, hangs on
Pentium-M, and trashing of the local APIC registers on a VIA C7).  The
local APIC is implicitly mapped UC already via MTRRs, so the clflush isn't
necessary anyway.

MFC after:	2 weeks
2013-01-17 21:32:25 +00:00
Neel Natu
c2217b9848 IFC @ r245509 2013-01-17 07:04:37 +00:00
Bryan Venteicher
ae366ffcbd Add VirtIO to the i386 and amd64 GENERIC kernels
This also removes the kludge from r239009 that covered only
the network driver.

Reviewed by:	grehan
Approved by:	grehan (mentor)
MFC after:	1 week
2013-01-13 07:14:16 +00:00
Neel Natu
8a60b77db8 IFC @ r245205 2013-01-09 03:32:23 +00:00
Neel Natu
1b54fbe69d IFC @ r245178 2013-01-09 02:26:50 +00:00
Neel Natu
95102a8bcb Add a "pause" to busy wait loops in the cpu reset path.
This should not matter much when running on bare metal but it makes the guest
more friendly when running inside a virtual machine.

Discussed with:	jhb
Obtained from:	NetApp
2013-01-09 02:11:16 +00:00
Neel Natu
03429e45a7 Revert changes for x2apic support from projects/bhyve.
During the early days of bhyve it did not support instruction emulation
which necessitated the use of x2apic to access the local apic. This is no
longer the case and the dependency on x2apic has gone away.

The x2apic patches can be considered independently of bhyve and will be
merged into head via projects/x2apic.

Discussed with:	grehan
2013-01-06 05:37:26 +00:00
Neel Natu
2d28bff346 bhyve does not require a custom configuration file anymore so make the GENERIC
identical to the one in HEAD.

Obtained from:	NetApp
2013-01-05 03:35:30 +00:00
Neel Natu
46b1c55d9e IFC @ r244983. 2013-01-04 19:28:32 +00:00
Neel Natu
23ce7fedb4 There is no need for a special 'BHYVE' kernel configuration file anymore -
'GENERIC' works fine.

Obtained from:	NetApp
2013-01-04 03:02:43 +00:00
Neel Natu
014a52f3a6 There is no need for 'start_emulating()' and 'stop_emulating()' to be defined
in <machine/cpufunc.h> so remove them from there.

Obtained from:	NetApp
2013-01-04 02:49:12 +00:00
Neel Natu
5f0677d392 The "unrestricted guest" capability is a feature of Intel VT-x that allows
the guest to execute real or unpaged protected mode code - bhyve relies on
this feature to execute the AP bootstrap code.

Get rid of the hack that allowed bhyve to support SMP guests on processors
that do not have the "unrestricted guest" capability. This hack was entirely
FreeBSD-specific and would not work with any other guest OS.

Instead, limit the number of vcpus to 1 when executing on processors without
"unrestricted guest" capability.

Suggested by:	grehan
Obtained from:	NetApp
2013-01-04 02:04:41 +00:00
Konstantin Belousov
0dcbedfa61 Enable the UFS quotas for big-iron GENERIC kernels.
Discussed with:	      mckusick
MFC after:	      2 weeks
2013-01-03 19:03:41 +00:00
Dag-Erling Smørgrav
36fca20f10 As discussed on -current last October, remove the firewire drivers from
GENERIC.
2013-01-03 14:30:24 +00:00
Neel Natu
485f986ac9 Modify the default behavior of bhyve such that it no longer forces the use of
x2apic mode on the guest.

The guest can decide whether or not it wants to use legacy mmio or x2apic
access to the APIC by writing to the MSR_APICBASE register.

Obtained from:	NetApp
2012-12-16 01:20:08 +00:00
Neel Natu
682b847ede Prefer x2apic mode when running inside a virtual machine.
Provide a tunable 'machdep.x2apic_desired' to let the administrator override
the default behavior.

Provide a read-only sysctl 'machdep.x2apic' to let the administrator know
whether the kernel is using x2apic or legacy mmio to access local apic.

Tested with Parallels Desktop 8 and bhyve hypervisors.
Also tested running on bare metal Intel Xeon E5-2658.

Obtained from:	NetApp
Discussed with:	jhb, attilio, avg, grehan
2012-12-16 00:57:14 +00:00
Jim Harris
f2fcc434ee Revert r243960 based on feedback regarding keeping x86 headers unified
(mdf@, tijl@) and use of KASSERT/systm.h in bus.h (zeising@, bde@).

Alternate implementation will be made in a separate commit.
2012-12-13 21:27:20 +00:00
Peter Grehan
2741efeca0 Implement an API to allow a hypervisor to save/restore
guest floating point state without having to know the
size of floating-point state.

Unstaticize fpurestore to allow the hypervisor to
save/restore guest state using fpusave/fpurestore
on the allocated FPU state area.

Reviewed by:	kib
Obtained from:	NetApp/bhyve
MFC after:	1 week
2012-12-12 08:35:32 +00:00
Konstantin Belousov
737d12b397 Add amd64-specific ddb command "show pte". The command displays the
hierarchy of the page table entries which map the specified address.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2012-12-10 05:14:34 +00:00
Jim Harris
71a30c4436 Add amd64 implementations for 8-byte bus_space routines.
Submitted by:	Carl Delsey <carl.r.delsey@intel.com>
Discussed with:	jhb, rwatson
Reviewed by:	jimharris
MFC after:	1 week
2012-12-06 22:33:31 +00:00
Neel Natu
32531ccb84 IFC @r243836 2012-12-04 04:37:42 +00:00
Konstantin Belousov
349438a243 Print the frame addresses for the backtraces on i386 and amd64. It
allows both to inspect the frame sizes and to manually peek into the
frames from ddb, if needed.

Reviewed by:	dim
MFC after:	2 weeks
2012-12-03 22:16:51 +00:00
Jung-uk Kim
7609e73ca0 Remove duplicate code. Reduce diff between amd64 and i386. 2012-12-01 00:56:19 +00:00
Jung-uk Kim
8c2b353ead Use volatile keywords properly. 2012-11-30 20:15:01 +00:00
Peter Grehan
e6f1f347a1 Properly screen for the AND 0x81 instruction from the set
of group1 0x81 instructions that use the reg bits as an
extended opcode.

Still todo: properly update rflags.

Pointed out by:	jilles@
2012-11-30 05:40:24 +00:00
Jung-uk Kim
231ac244f8 Tidy up inline assembly. No functional change. 2012-11-30 00:59:37 +00:00
Peter Grehan
b1f95796f0 Remove debug printf.
Pointed out by:	emaste
2012-11-29 15:08:13 +00:00
Peter Grehan
3b2b001107 Add support for the 0x81 AND instruction, now generated
by clang in the local APIC code.

0x81 is a read-modify-write instruction - the EPT check
that only allowed read or write and not both has been
relaxed to allow read and write.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-29 06:26:42 +00:00
Neel Natu
48a29f4e07 Cleanup the user-space paging exit handler now that the unified instruction
emulation is in place.

Obtained from:	NetApp
2012-11-28 13:34:44 +00:00
Neel Natu
b42206f300 Change emulate_rdmsr() and emulate_wrmsr() to return 0 on sucess and errno on
failure. The conversion from the return value to HANDLED or UNHANDLED can be
done locally in vmx_exit_process().

Obtained from: NetApp
2012-11-28 13:10:18 +00:00
Neel Natu
ba9b7bf73a Revamp the x86 instruction emulation in bhyve.
On a nested page table fault the hypervisor will:
- fetch the instruction using the guest %rip and %cr3
- decode the instruction in 'struct vie'
- emulate the instruction in host kernel context for local apic accesses
- any other type of mmio access is punted up to user-space (e.g. ioapic)

The decoded instruction is passed as collateral to the user-space process
that is handling the PAGING exit.

The emulation code is fleshed out to include more addressing modes (e.g. SIB)
and more types of operands (e.g. imm8). The source code is unified into a
single file (vmm_instruction_emul.c) that is compiled into vmm.ko as well
as /usr/sbin/bhyve.

Reviewed by:	grehan
Obtained from:	NetApp
2012-11-28 00:02:17 +00:00
Neel Natu
920bc34090 Fix a bug in the MSI-X resource allocation for PCI passthrough devices.
In the case where the underlying host had disabled MSI-X via the
"hw.pci.enable_msix" tunable, the ppt_setup_msix() function would fail
and return an error without properly cleaning up. This in turn would
cause a page fault on the next boot of the guest.

Fix this by calling ppt_teardown_msix() in all the error return paths.

Obtained from:	NetApp
2012-11-22 04:07:18 +00:00
Neel Natu
288aeb8561 Get rid of redundant comparision which is guaranteed to be "true" for unsigned
integers.

Obtained from:	NetApp
2012-11-22 00:08:20 +00:00
Peter Grehan
a0cad47092 Handle CPUID leaf 0x7 now that FreeBSD is using it.
Return 0's for now.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-20 06:01:03 +00:00
Neel Natu
3248464555 IFC @ r243164 2012-11-17 02:55:47 +00:00
Konstantin Belousov
43f48b65c0 Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by:	alc
MFC after:	2 weeks
2012-11-16 05:55:56 +00:00
Konstantin Belousov
b32ecf44bc Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by:	"Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by:	alc, mdf (previous version)
Tested by:	pho (previous version)
MFC after:	2 weeks
2012-11-14 20:01:40 +00:00
Neel Natu
7d3d462b09 IFC @ r242940 2012-11-13 07:39:05 +00:00
Neel Natu
a10c6f5544 IFC @ r242684 2012-11-11 03:26:14 +00:00
Konstantin Belousov
5a17538e22 Do not try to enable new features in the %cr4 if running under
hypervisor.  Apparently, hypervisors failed to filter out 'Standard
Extended Features' report from CPUID, but deliver #gp when
corresponding bit in %cr4 is toggled.

This shall be reconsidered later, after hypervisors correct the bug.

Reported and tested by:	joel
Reviewed by:	avg
MFC after:	2 weeks
2012-11-09 16:00:30 +00:00
Peter Grehan
0a5e9bfb72 Fix issue found with clang build. Avoid code insertion by the compiler
between inline asm statements that would in turn modify the flags
value set by the first asm, and used by the second.

Solve by making the common error block a string that can be pulled
into the first inline asm, and using symbolic labels for asm variables.

bhyve can now build/run fine when compiled with clang.

Reviewed by:	neel
Obtained from:	NetApp
2012-11-06 02:43:41 +00:00
Attilio Rao
cfedf924d3 Rework the known rwlock to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct rwlock_padalign.

Reviewed by:	alc, jimharris
2012-11-03 23:03:14 +00:00
Konstantin Belousov
cd9e9d1bc2 Enable the new instructions for reading and writing bases for %fs,
%gs, when supported.  Note that WRFSBASE and WRGSBASE are not very
useful on FreeBSD right now, because a return from the kernel mode to
userspace reloads the bases specified by the sysarch(2) syscall, most
likely.

Enable the Supervisor Mode Execution Prevention (SMEP) when
supported. Since the loader(8) performs hand-off to the kernel with
the page tables which contradict the SMEP, postpone enabling the SMEP
on BSP until pmap switched for the proper kernel tables.

Debugged with the help from:	avg
Tested by:	avg, Michael Moll <kvedulv@kvedulv.de>
MFC after:	1 month
2012-11-01 15:17:43 +00:00
Konstantin Belousov
2773649d2f Provide the reading and display of the Standard Extended Features,
introduced with the IvyBridge CPUs.  Provide the definitions for new
bits in CR3 and CR4 registers.

Tested by:	avg, Michael Moll <kvedulv@kvedulv.de>
MFC after:	2 weeks
2012-11-01 15:14:37 +00:00
Neel Natu
514393f565 Convert VMCS_ENTRY_INTR_INFO field into a vmcs identifier before passing it
to vmcs_getreg(). Without this conversion vmcs_getreg() will return EINVAL.

In particular this prevented injection of the breakpoint exception into the
guest via the "-B" option to /usr/sbin/bhyve which is hugely useful when
debugging guest hangs.

This was broken in r241921.

Pointy hat: me
Obtained from:	NetApp
2012-10-29 23:58:15 +00:00
Neel Natu
b01c203325 Corral all the host state associated with the virtual machine into its own file.
This state is independent of the type of hardware assist used so there is
really no need for it to be in Intel-specific code.

Obtained from:	NetApp
2012-10-29 01:51:24 +00:00
Peter Grehan
cda5bd7f19 Set the valid field of the newly allocated field as all other
vm page allocators do. This fixes a panic when a virtio block
device is mounted as root, with the host system dying in
vm_page_dirty with invalid bits.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-26 22:32:26 +00:00
Neel Natu
bd8572e0be Unconditionally enable fpu emulation by setting CR0.TS in the host after the
guest does a vm exit.

This allows us to trap any fpu access in the host context while the fpu still
has "dirty" state belonging to the guest.

Reported by: "s vas" on freebsd-virtualization@
Obtained from:	NetApp
2012-10-26 03:12:40 +00:00
Neel Natu
f76fc5d414 If the guest vcpu wants to idle then use that opportunity to relinquish the
host cpu to the scheduler until the guest is ready to run again.

This implies that the host cpu utilization will now closely mirror the actual
load imposed by the guest vcpu.

Also, the vcpu mutex now needs to be of type MTX_SPIN since we need to acquire
it inside a critical section.

Obtained from:	NetApp
2012-10-25 04:29:21 +00:00
Neel Natu
ff6ec151e0 Hide the monitor/mwait instruction capability from the guest until we know how
to properly intercept it.

Obtained from:	NetApp
2012-10-25 04:08:26 +00:00
Neel Natu
f352ff0ca8 Maintain state regarding NMI delivery to guest vcpu in VT-x independent manner.
Also add a stats counter to count the number of NMIs delivered per vcpu.

Obtained from:	NetApp
2012-10-24 02:54:21 +00:00
Neel Natu
eeefa4e4be Test for AST pending with interrupts disabled right before entering the guest.
If an IPI was delivered to this cpu before interrupts were disabled
then return right away via vmx_setjmp() with a return value of VMX_RETURN_AST.

Obtained from:	NetApp
2012-10-23 02:20:42 +00:00
Eitan Adler
1611e2c0d1 The 'testing memory' patch gets printed too many times
Approved by: cperciva (implicit)
2012-10-22 11:57:26 +00:00
Eitan Adler
267cc84937 Explain the upcoming delay by printing a message when the kernel
is about to begin testing memory.

Reviewed by:	dteske, adri
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:16:39 +00:00
Neel Natu
2e25737a49 Calculate the number of host ticks until the next guest timer interrupt.
This information will be used in conjunction with guest "HLT exiting" to
yield the thread hosting the virtual cpu.

Obtained from:	NetApp
2012-10-20 08:23:05 +00:00
Konstantin Belousov
802431fa2e Print the %rip value for uprintf_signal.
MFC after:	1 week
2012-10-14 17:08:46 +00:00
Andriy Gapon
851dbc07af pciereg_cfg*: use assembly to access the mem-mapped cfg space
AMD BKDG for CPU families 10h and later requires that the memory
mapped config is always read into or written from al/ax/eax register.

Discussed with:	kib, alc
Reviewed by:	kib (earlier version)
MFC after:	25 days
2012-10-14 10:13:50 +00:00
Peter Grehan
13ec93719a Add the guest physical address and r/w/x bits to
the paging exit in preparation for a rework of
bhyve MMIO handling.

Reviewed by:	neel
Obtained from:	NetApp
2012-10-12 23:12:19 +00:00
Neel Natu
75dd336603 Provide per-vcpu locks instead of relying on a single big lock.
This also gets rid of all the witness.watch warnings related to calling
malloc(M_WAITOK) while holding a mutex.

Reviewed by:	grehan
2012-10-12 18:32:44 +00:00
Neel Natu
cdc5b9e7b1 Fix warnings generated by 'debug.witness.watch' during VM creation and
destruction for calling malloc() with M_WAITOK while holding a mutex.

Do not allow vmm.ko to be unloaded until all virtual machines are destroyed.
2012-10-11 19:39:54 +00:00
Neel Natu
f9d4f89e4d Deliver the MSI to the correct guest virtual cpu.
Prior to this change the MSI was being delivered unconditionally to vcpu 0
regardless of how the guest programmed the MSI delivery.
2012-10-11 19:28:07 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Attilio Rao
3a4730256a Add an unified macro to deny ability from the compiler to reorder
instruction loads/stores at its will.
The macro __compiler_membar() is currently supported for both gcc and
clang, but kernel compilation will fail otherwise.

Reviewed by:	bde, kib
Discussed with:	dim, theraven
MFC after:	2 weeks
2012-10-09 14:32:30 +00:00
Attilio Rao
af2bdacafb Reverts r234074,234105,234564,234723,234989,235231-235232 and part of
r234247.
Use, instead, the static intializer introduced in r239923 for x86 and
sparc64 intr_cpus, unwinding the code to the initial version.

Reviewed by:	marius
2012-10-09 12:22:43 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Neel Natu
7ce04d0ad9 Allocate memory pages for the guest from the host's free page queue.
It is no longer necessary to hard-partition the memory between the host
and guests at boot time.
2012-10-08 23:41:26 +00:00
Neel Natu
f7d51510f1 Change vm_malloc() to map pages in the guest physical address space in 4KB
chunks. This breaks the assumption that the entire memory segment is
contiguously allocated in the host physical address space.

This also paves the way to satisfy the 4KB page allocations by requesting
free pages from the VM subsystem as opposed to hard-partitioning host memory
at boot time.
2012-10-04 02:27:14 +00:00
Neel Natu
4db4fb2c25 Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

Add check to vm_gpa2hpa() that the range indicated by [gpa,gpa+len) is all
contained within a single 4KB page.
2012-10-03 01:18:51 +00:00
Neel Natu
bda273f21e Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

Rewrite vm_gpa2hpa() to get the GPA to HPA mapping by querying the nested
page tables.
2012-10-03 00:46:30 +00:00
Neel Natu
341f19c949 Get rid of assumptions in the hypervisor that the host physical memory
associated with guest physical memory is contiguous.

In this case vm_malloc() was using vm_gpa2hpa() to indirectly infer whether
or not the address range had already been allocated.

Replace this instead with an explicit API 'vm_gpa_available()' that returns
TRUE if a page is available for allocation in guest physical address space.
2012-09-29 01:15:45 +00:00
John Baldwin
960b5a7080 - Re-shuffle the <machine/pc/bios.h> headers to move all kernel-specific
bits under #ifdef _KERNEL but leave definitions for various structures
  defined by standards ($PIR table, SMAP entries, etc.) available to
  userland.
- Consolidate duplicate SMBIOS table structure definitions in ipmi(4)
  and smbios(4) in <machine/pc/bios.h> and make them available to
  userland.

MFC after:	2 weeks
2012-09-28 11:59:32 +00:00
Alan Cox
e4b8a2fc5a Eliminate a stale comment. It describes another use case for the pmap in
Mach that doesn't exist in FreeBSD.
2012-09-28 05:30:59 +00:00
Neel Natu
70593114cd Intel VT-x provides the length of the instruction at the time of the nested
page table fault. Use this when fetching the instruction bytes from the guest
memory.

Also modify the lapic_mmio() API so that a decoded instruction is fed into it
instead of having it fetch the instruction bytes from the guest. This is
useful for hardware assists like SVM that provide the faulting instruction
as part of the vmexit.
2012-09-27 00:27:58 +00:00
Neel Natu
73820fb0a4 Add an option "-a" to present the local apic in the XAPIC mode instead of the
default X2APIC mode to the guest.
2012-09-26 00:06:17 +00:00
Neel Natu
a2da7af6bc Add support for trapping MMIO writes to local apic registers and emulating them.
The default behavior is still to present the local apic to the guest in the
x2apic mode.
2012-09-25 22:31:35 +00:00
Neel Natu
e90273829b Add ioctls to control the X2APIC capability exposed by the virtual machine to
the guest.

At the moment this simply sets the state in the 'vcpu' instance but there is
no code that acts upon these settings.
2012-09-25 19:08:51 +00:00
Neel Natu
edf89256dd Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an
AP needs to be activated by spinning up an execution context for it.

The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.

Reviewed by: grehan
2012-09-25 02:33:25 +00:00
Neel Natu
98ed632c63 Stash the 'vm_exit' information in each 'struct vcpu'.
There is no functional change at this time but this paves the way for vm exit
handler functions to easily modify the exit reason going forward.
2012-09-24 19:32:24 +00:00
Dimitry Andric
f99157cced After r205013, amd64 and i386 CPU family and model IDs were printed out
in hexadecimal, but without any 0x prefix, which can be very misleading.

MFC after:	3 days
2012-09-21 10:31:19 +00:00
Neel Natu
2d3a73ed6d Restructure the x2apic access code in preparation for supporting memory mapped
access to the local apic.

The vlapic code is now aware of the mode that the guest is using to access the
local apic.

Reviewed by: grehan@
2012-09-21 03:09:23 +00:00
Jim Harris
eb85d44f06 Integrate nvme(4) and nvd(4) into the amd64 and i386 builds.
Sponsored by:	Intel
2012-09-17 19:26:33 +00:00
Konstantin Belousov
c5e3d0ab11 Rename the IVY_RNG option to RDRAND_RNG.
Based on submission by:	Arthur Mesh <arthurmesh@gmail.com>
MFC after:	2 weeks
2012-09-13 10:12:16 +00:00
Alan Cox
7336315b0a Simplify pmap_unmapdev(). Since kmem_free() eventually calls pmap_remove(),
pmap_unmapdev()'s own direct efforts to destroy the page table entries are
redundant, so eliminate them.

Don't set PTE_W on the page table entry in pmap_kenter{,_attr}() on MIPS.
Setting PTE_W on MIPS is inconsistent with the implementation of this
function on other architectures.  Moreover, PTE_W should not be set, unless
the pmap's wired mapping count is incremented, which pmap_kenter{,_attr}()
doesn't do.

MFC after:	10 days
2012-09-10 16:11:29 +00:00
Attilio Rao
324e57150d userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:27:11 +00:00
Konstantin Belousov
ef9461ba0e Add support for new Intel on-CPU Bull Mountain random number
generator, found on IvyBridge and supposedly later CPUs, accessible
with RDRAND instruction.

From the Intel whitepapers and articles about Bull Mountain, it seems
that we do not need to perform post-processing of RDRAND results, like
AES-encryption of the data with random IV and keys, which was done for
Padlock. Intel claims that sanitization is performed in hardware.

Make both Padlock and Bull Mountain random generators support code
covered by kernel config options, for the benefit of people who prefer
minimal kernels. Also add the tunables to disable hardware generator
even if detected.

Reviewed by:	markm, secteam (simon)
Tested by:	bapt, Michael Moll <kvedulv@kvedulv.de>
MFC after:	3 weeks
2012-09-05 13:18:51 +00:00
Alan Cox
d8f9ed32c5 Rename {_,}pmap_unwire_pte_hold() to {_,}pmap_unwire_ptp() and update the
comment describing them.  Both the function names and the comment had grown
stale.  Quite some time has passed since these pmap implementations last
used the page's hold count to track the number of valid mapping within a
page table page.  Also, returning TRUE from pmap_unwire_ptp() rather than
_pmap_unwire_ptp() eliminates a few instructions from callers like
pmap_enter_quick_locked() where pmap_unwire_ptp()'s return value is used
directly by a conditional statement.
2012-09-05 06:02:54 +00:00
Xin LI
0807ad7422 Add hpt27xx to GENERIC kernel for amd64 and i386 systems.
MFC after:	2 weeks
2012-09-04 21:02:57 +00:00
John Baldwin
778eefa40d Fix duplicate entries for mwl(4):
- Move mwlfw from {amd64,i386}/conf/NOTES to sys/conf/NOTES (mwl(4) is
  already present in sys/conf/NOTES).
- Remove duplicate mwl(4) entries from {amd64,i386}/conf/NOTES.
- While here, add a description to the sfxge line in amd64/conf/NOTES.
2012-09-04 19:19:36 +00:00
John Baldwin
4c1044491b Fix misspelled "Infiniband".
Submitted by:	gcooper
MFC after:	3 days
2012-08-28 11:34:09 +00:00
Peter Grehan
177fd53318 Add sysctls to display the total and free amount of hard-wired mem for VMs
# sysctl hw.vmm
   hw.vmm.mem_free: 2145386496
   hw.vmm.mem_total: 2145386496

Submitted by:	Takeshi HASEGAWA hasegaw at gmail com
2012-08-26 01:41:41 +00:00
Glen Barber
67944c4572 Grammar fix: s/NIC's/NICs/
MFC after:	3 days
2012-08-26 01:21:02 +00:00
Dag-Erling Smørgrav
e2082935f0 As discussed on -current, remove the hardcoded default maxswzone.
MFC after:	3 weeks
2012-08-14 17:01:21 +00:00
Konstantin Belousov
b6d3609050 Add a hackish debugging facility to provide a bit of information about
reason for generated trap. The dump of basic signal information and 8
bytes of the faulting instruction are printed on the controlling
terminal of the process, if the machdep.uprintf_signal syscal is
enabled.

The print is the only practical way to debug traps from a.out
processes I am aware of. Because I have to reimplement it each time I
debug an issue with a.out support on amd64, commit the hack to main
tree.

MFC after:	1 week
2012-08-14 12:15:01 +00:00
Konstantin Belousov
95fd15898b Real hardware, as opposed to QEMU, does not allow to have a call gate
in long mode which transfers control to 32bit code segment. Unbreak
the lcall $7,$0 implementation on amd64 by putting the 64bit user code
segment' selector into call gate, and execute the 64bit trampoline
which converts the return frame into 32bit format and switches back to
32bit mode for executing int $0x80 trampoline.

Note that all jumps over the hoops are performed in the user mode.

MFC after:	1 week
2012-08-14 12:13:27 +00:00
John Baldwin
6e4ac34b07 Remove the deassert INIT IPI from the IPI startup sequence for APs.
It is not listed in the boot sequence in the MP specification (1.4),
and it is explicitly ignored on modern CPUs.  It was only ever required
when bootstrapping systems with external APICs (that is, SMP machines
with 486s), which FreeBSD has never supported (and never will).

While here, tidy some comments and remove some banal ones.
2012-08-13 18:52:51 +00:00
John Baldwin
a4284ef768 Add a 10 millisecond delay after sending the initial INIT IPI. This
matches the algorithm in the MP specification (1.4).  Previously we
were sending out the deassert INIT IPI immediately after the initial
INIT IPI was sent.
2012-08-13 16:33:22 +00:00
Colin Percival
347c7fd7bf Build modules along with the XENHVM kernels.
No objections from:	freebsd-xen mailing list
MFC after:	1 week
2012-08-13 07:36:57 +00:00
Alan Cox
663f8700d4 The assertion that I added in r238889 could legitimately fail when a
debugger creates a breakpoint.  Replace that assertion with a narrower
one that still achieves my objective.

Reported and tested by:	kib
2012-08-08 05:28:30 +00:00
Konstantin Belousov
65211d02c4 Do not apply errata 721 workaround when under hypervisor, since
typical hypervisor does not implement access to the required MSR,
causing #GP on boot.

Reported and tested by:	olgeni
PR:	amd64/170388
MFC after:	3 days
2012-08-07 08:36:10 +00:00
Sergey Kandaurov
16ec457aeb Remove duplicate header inclusion of <sys/sysent.h>
Discussed with:	bz
2012-08-07 05:46:36 +00:00
Alan Cox
59fa03faa3 Shave off a few more cycles from the average execution time of pmap_enter()
by simplifying the control flow and reducing the live range of "om".
2012-08-05 16:59:02 +00:00
Neel Natu
8124debe13 Include 'device uart' in the guest kernel. 2012-08-04 04:30:26 +00:00
Neel Natu
39c21c2db2 Force certain bits in %cr4 to be hard-wired to '1' or '0' from a guest's
perspective. If we don't do this some guest OSes (e.g. Linux) will reset
the CR4_VMXE bit in %cr4 with disastrous consequences.

Reported by: grehan
2012-08-04 02:06:55 +00:00
Konstantin Belousov
0220d04fe3 Add lfence().
MFC after:	1 week
2012-08-01 17:24:53 +00:00
Alan Cox
879eedbc7b Revise pmap_enter()'s handling of mapping updates that change the
PTE's PG_M and PG_RW bits but not the physical page frame.  First,
only perform vm_page_dirty() on a managed vm_page when the PG_M bit is
being cleared.  If the updated PTE continues to have PG_M set, then
there is no requirement to perform vm_page_dirty().  Second, flush the
mapping from the TLB when PG_M alone is cleared, not just when PG_M
and PG_RW are cleared.  Otherwise, a stale TLB entry may stop PG_M
from being set again on the next store to the virtual page.  However,
since the vm_page's dirty field already shows the physical page as
being dirty, no actual harm comes from the PG_M bit not being set.
Nonetheless, it is potentially confusing to someone expecting to see
the PTE change after a store to the virtual page.
2012-08-01 16:04:13 +00:00
Konstantin Belousov
a42fa0af44 Change (unused) prototype for stmxcsr() to match reality.
Noted by:	jhb
MFC after:	1 week
2012-07-30 19:26:02 +00:00