Commit Graph

7301 Commits

Author SHA1 Message Date
kib
6006bf3a7d If x86 CPU implementation of the MWAIT instruction reasonably
interacts with interrupts, query ACPI and use MWAIT for entrance into
Cx sleep states.  Support C1 "I/O then halt" mode.  See Intel'
document 302223-007 "Intelб╝ Processor Vendor-Specific ACPI Interface
Specification" for description.

Move the acpi_cpu_c1() function into x86/cpu_machdep.c and use
it instead of inlining "sti; hlt" sequence in several places.

In the acpi(4) man page, besides documenting the dev.cpu.N.cx_methods
sysctl, correct the names for dev.cpu.N.{cx_usage,cx_lowest,cx_supported}
sysctls.

Both jkim and avg have some other patches implementing the mwait
functionality; this work is unrelated.  Linux does not rely on the
ACPI to provide correct tables describing Cx modes.  Instead, the
driver has pre-defined knowledge of the CPU models, it was supplied by
Intel.

Tested by:    pho (previous versions)
Sponsored by:	The FreeBSD Foundation
2015-05-09 12:28:48 +00:00
neel
390ea95709 Check 'td_owepreempt' and yield the vcpu thread if it is set.
This is done explicitly because a vcpu thread can be in a critical section
for the entire time slice alloted to it. This in turn can delay the handling
of the 'td_owepreempt'.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D2430
2015-05-06 23:40:24 +00:00
neel
7776059e98 Deprecate the 3-way return values from vm_gla2gpa() and vm_copy_setup().
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.

The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.

Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.

Reviewed by:	tychon
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D2428
2015-05-06 16:25:20 +00:00
neel
e5110c0076 Do a proper emulation of guest writes to MSR_EFER.
- Must-Be-Zero bits cannot be set.
- EFER_LME and EFER_LMA should respect the long mode consistency checks.
- EFER_NXE, EFER_FFXSR, EFER_TCE can be set if allowed by CPUID capabilities.
- Flag an error if guest tries to set EFER_LMSLE since bhyve doesn't enforce
  segment limits in 64-bit mode.

MFC after:	2 weeks
2015-05-06 05:40:20 +00:00
neel
46cb5cb412 Emulate the 'CMP r/m8, imm8' instruction encountered when booting a Windows
Vista guest.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	1 week
2015-05-04 04:27:23 +00:00
neel
ed97fa3273 Don't advertise the Intel SMX capability to the guest.
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	1 week
2015-05-02 19:07:49 +00:00
neel
cd20ad9aa3 Emulate machine check related MSRs to allow guest OSes like Windows to boot.
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-05-02 04:19:11 +00:00
neel
755b785420 r281630 relaxed the limits on the vectors that can be asserted in the IRRs.
Do the same when transitioning a vector from the IRR to the ISR and also
when extinguishing it from the ISR in response to an EOI.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-05-01 16:00:29 +00:00
neel
c782f2195b Emulate MSR_SYSCFG which is accessed by Linux on AMD cpus when MTRRs are
enabled.

MFC after:	2 weeks
2015-05-01 05:11:14 +00:00
neel
9cbb7c919e Don't require <sys/cpuset.h> to be always included before <machine/vmm.h>.
Only a subset of source files that include <machine/vmm.h> need to use the
APIs that require the inclusion of <sys/cpuset.h>.

MFC after:	1 week
2015-04-30 22:23:22 +00:00
neel
f57c0156d3 When an instruction cannot be decoded just return to userspace so bhyve(8)
can dump the instruction bytes.

Requested by:	grehan
MFC after:	1 week
2015-04-30 21:00:47 +00:00
neel
6641e0d4f2 Advertise the MTRR feature via CPUID and emulate the minimal set of MTRR MSRs.
This is required for booting Windows guests.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-30 19:23:50 +00:00
jhb
9c4c8b62fb Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains.  Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead.  Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
  components for PV kernels.  These components are now standard as they
  are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
  etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision:	https://reviews.freebsd.org/D2362
Reviewed by:	royger
Tested by:	royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes:	yes
2015-04-30 15:48:48 +00:00
neel
acc3b0bbd7 Re-implement RTC current time calculation to eliminate the possibility of
losing time.

The problem with the earlier implementation was that the uptime value
used by 'vrtc_curtime()' could be different than the uptime value when
'vrtc_time_update()' actually updated 'base_uptime'.

Fix this by calculating and updating the (rtctime, uptime) tuple together.

MFC after:	2 weeks
2015-04-29 23:44:28 +00:00
whu
0b81a49657 Microsoft vmbus, storage and other related driver enhancements for HyperV.
- Vmbus multi channel support.
    - Vector interrupt support.
    - Signal optimization.
    - Storvsc driver performance improvement.
    - Scatter and gather support for storvsc driver.
    - Minor bug fix for KVP driver.
Thanks royger, jhb and delphij from FreeBSD community for the reviews
and comments. Also thanks Hovy Xu from NetApp for the contributions to
the storvsc driver.

PR:     195238
Submitted by:   whu
Reviewed by:    royger, jhb, delphij
Approved by:    royger
MFC after:      2 weeks
Relnotes:       yes
Sponsored by:   Microsoft OSTC
2015-04-29 10:12:34 +00:00
neel
42e24b0f80 Emulate the 'bit test' instruction. Windows 7 uses 'bit test' to check the
'Delivery Status' bit in APIC ICR register.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-29 02:01:46 +00:00
neel
0151349b05 Implement the century byte in the RTC. Some guests require this field to be
properly set.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-28 23:44:47 +00:00
tychon
da94200043 STOS/STOSB/STOSW/STOSD/STOSQ instruction emulation.
Reviewed by:	neel
2015-04-25 19:02:06 +00:00
kib
cbdd03ad96 Move common code from sys/i386/i386/mp_machdep.c and
sys/amd64/amd64/mp_machdep.c, to the new common x86 source
sys/x86/x86/mp_x86.c.

Proposed and reviewed by:	jhb
Review:	https://reviews.freebsd.org/D2347
Sponsored by:	The FreeBSD Foundation
2015-04-24 16:20:56 +00:00
jhb
e4683250d1 Reassign copyright statements on several files from Advanced
Computing Technologies LLC to Hudson River Trading LLC.

Approved by:	Hudson River Trading LLC (who owns ACT LLC)
MFC after:	1 week
2015-04-23 14:22:20 +00:00
araujo
41717d52d1 Missing break in switch case.
Differential Revision:	D2342
Reviewed by:		neel
2015-04-23 02:50:06 +00:00
kib
9fb28191d6 Move some common code from sys/amd64/amd64/machdep.c and
sys/i386/i386/machdep.c to new file sys/x86/x86/cpu_machdep.c.  Most
of the code is related to the idle handling.

Discussed with:	pluknet
Sponsored by:	The FreeBSD Foundation
2015-04-22 12:32:14 +00:00
kib
2bf89b258b Remove duplicate definitions of MWAIT_CX hints. Identical defines in
specialreg.h are enough.

Discussed with:	mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-20 08:25:55 +00:00
kib
32cf8052a6 Remove lazy pmap switch code from i386. Naive benchmark with md(4)
shows no difference with the code removed.

On both amd64 and i386, assert that a released pmap is not active.

Proposed and reviewed by:	alc
Discussed with:	Svatopluk Kraus <onwahe@gmail.com>, peter
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-04-18 21:23:16 +00:00
neel
4fe75f1860 Relax the check on which vectors can be delivered through the APIC. According
to the Intel SDM vectors 16 through 255 are allowed to be delivered via the
local APIC.

Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-16 22:44:51 +00:00
neel
6a373c232e Prefer 'vcpu_should_yield()' over checking 'curthread->td_flags' directly.
MFC after:	1 week
2015-04-16 20:15:47 +00:00
emaste
f81c3bc31e Use explicitly sized types in EFI module metadata
This will allow the same metadata struct to be used on all platforms.

Differential Revision:	https://reviews.freebsd.org/D2275
Reviewed by:	jhb
2015-04-10 19:26:45 +00:00
tychon
9079cdda46 Enhance the support for Group 1 Extended opcodes:
* Implemement the 0x81 and 0x83 CMP instructions.
  * Implemement the 0x83 AND instruction.
  * Implemement the 0x81 OR instruction.

Reviewed by:	neel
2015-04-06 12:22:41 +00:00
eadler
cea14a6cb5 adrian asked me to revert and get more testing 2015-04-05 05:18:14 +00:00
eadler
41fc5228f6 head/sys/amd64/amd64/support.S: unroll loop
unroll the loop in ENTRY(pagezero)
	acc' to the submitter this results in a reproducible 1% perf
	improvement under buildworld like workload

	I validated correctness and run-testing, but not performance impact

Submitted by:	lidl@pix.net
Reviewed by:	adrian
PR:		199151
MFC After:	1 month
2015-04-05 05:07:24 +00:00
rstone
57feb6fb43 Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
tychon
b3c521e852 Fix "MOVS" instruction memory to MMIO emulation. Currently updates to
%rdi, %rsi, etc are inadvertently bypassed along with the check to
see if the instruction needs to be repeated per the 'rep' prefix.

Add "MOVS" instruction support for the 'MMIO to MMIO' case.

Reviewed by:	neel
2015-04-01 00:15:31 +00:00
kib
82927bc930 Provide workaround for a performance issue with the popcnt instruction
on Intel processors.  Clear spurious dependency by explicitely xoring
the destination register of popcnt.

Use bitcount64() instead of re-implementing SWAR locally, for
processors without popcnt instruction.

Reviewed by:	jhb
Discussed with:	jilles (previous version)
Sponsored by:	The FreeBSD Foundation
2015-03-31 01:44:07 +00:00
jhb
4e5400d652 Wait 100 microseconds for a local APIC to dispatch each startup-related IPI
rather than 20.  The MP 1.4 specification states in Appendix B.2:

  "A period of 20 microseconds should be sufficient for IPI dispatch to
   complete under normal operating conditions".

(Note that this appears to be separate from the 10 millisecond (INIT) and
200 microsecond (STARTUP) waits after the IPIs are dispatched.)  The
Intel SDM is silent on this issue as far as I can tell.

At least some hardware requires 60 microseconds as noted in the PR, so
bump this to 100 to be on the safe side.

PR:		197756
Reported by:	zaphod@berentweb.com
MFC after:	1 week
2015-03-30 20:13:22 +00:00
kib
1a6b57192c Make it possible for the signal handler to act on #ss. Load the
canonical user data segment' selector into %ss when calling the
handler.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-03-28 09:03:54 +00:00
kib
dff23c6926 The #ss fault handler erronously does not check for the fault
originated from the return to usermode. #ss must be handled same as
#np.

Reported by:	Andrew Lutomirski through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-03-28 09:02:19 +00:00
neel
4eebcc1fbb Fix the RTC device model to operate correctly in 12-hour mode. The following
table documents the values in the RTC 'hour' field in the two modes:

Hour-of-the-day		12-hour mode	24-hour mode
12	AM		12		0
[1-11]	AM		[1-11]		[1-11]
12	PM		0x80 | 12	12
[1-11]	PM		0x80 | [1-11]	[13-23]

Reported by:	Julian Hsiao (madoka@nyanisore.net)
MFC after:	1 week
2015-03-28 02:55:16 +00:00
tychon
b925086de0 When fetching an instruction in non-64bit mode, consider the value of the
code segment base address.

Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.

Reviewed by:	neel
2015-03-24 17:12:36 +00:00
kib
b178649de8 Use VT-d interrupt remapping block (IR) to perform FSB messages
translation.  In particular, despite IO-APICs only take 8bit apic id,
IR translation structures accept 32bit APIC Id, which allows x2APIC
mode to function properly.  Extend msi_cpu of struct msi_intrsrc and
io_cpu of ioapic_intsrc to full int from one byte.

KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid
bringing all dmar headers into interrupt code. The non-PCI(e) devices
which generate message interrupts on FSB require special handling. The
HPET FSB interrupts are remapped, while DMAR interrupts are not.

For each msi and ioapic interrupt source, the iommu cookie is added,
which is in fact index of the IRE (interrupt remap entry) in the IR
table. Cookie is made at the source allocation time, and then used at
the map time to fill both IRE and device registers. The MSI
address/data registers and IO-APIC redirection registers are
programmed with the special values which are recognized by IR and used
to restore the IRE index, to find proper delivery mode and target.
Map all MSI interrupts in the block when msi_map() is called.

Since an interrupt source setup and dismantle code are done in the
non-sleepable context, flushing interrupt entries cache in the IR
hardware, which is done async and ideally waits for the interrupt,
requires busy-wait for queue to drain.  The dmar_qi_wait_for_seq() is
modified to take a boolean argument requesting busy-wait for the
written sequence number instead of waiting for interrupt.

Some interrupts are configured before IR is initialized, e.g. ACPI
SCI.  Add intr_reprogram() function to reprogram all already
configured interrupts, and call it immediately before an IR unit is
enabled.  There is still a small window after the IO-APIC redirection
entry is reprogrammed with cookie but before the unit is enabled, but
to fix this properly, IR must be started much earlier.

Add workarounds for 5500 and X58 northbridges, some revisions of which
have severe flaws in handling IR.  Use the same identification methods
as employed by Linux.

Review:	https://reviews.freebsd.org/D1892
Reviewed by:	neel
Discussed with:	jhb
Tested by:	glebius, pho (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-03-19 13:57:47 +00:00
jfv
06710b0884 Update to the Intel ixgbe driver:
- Split the driver into independent pf and vf loadables. This is
	  in preparation for SRIOV support which will be following shortly.
	  This also allows us to keep a seperate revision control over the
	  two parts, making for easier sustaining.
	- Make the TX/RX code a shared/seperated file, in the old code base
	  the ixv code would miss fixes that went into ixgbe, this model
	  will eliminate that problem.
	- The driver loadables will now match the device names, something that
	  has been requested for some time.
	- Rather than a modules/ixgbe there is now modules/ix and modules/ixv
	- It will also be possible to make your static kernel with only one
	  or the other for streamlined installs, or both.

Enjoy!

Submitted by: jfv and erj
2015-03-17 18:32:28 +00:00
mav
af4c17529a Report ARAT (APIC-Timer-always-running) feature for virtual CPU.
This makes FreeBSD guest to not avoid using LAPIC timer, preferring HPET
due to worries about non-existing for virtual CPUs deep sleep states.

Benchmarks of usleep(1) on guest and host show such extra latencies:
 - 51us for virtual HPET,
 - 22us for virtual LAPIC timer,
 - 22us for host HPET and
 - 3us for host LAPIC timer.

MFC after:	2 weeks
2015-03-16 11:57:03 +00:00
neel
e16d18eeb8 Use lapic_ipi_alloc() to dynamically allocate IPI slots needed by bhyve when
vmm.ko is loaded.

Also relocate the 'justreturn' IPI handler to be alongside all other handlers.

Requested by:	kib
2015-03-14 02:32:08 +00:00
jhb
c1c1fd8e25 Only schedule interrupts on a single hyperthread of a modern Intel CPU core
by default.  Previously we used a single hyperthread on Pentium4-era
cores but used both hyperthreads on more recent CPUs.

MFC after:	2 weeks
2015-03-06 20:34:28 +00:00
tychon
eedd9cfb16 When ICW1 is issued the edge sense circuit is reset which means that
following an initialization a low-to-high transistion is necesary to
generate an interrupt.

Reviewed by:	neel
2015-03-06 02:05:45 +00:00
neel
e549774cf3 Fix warnings/errors when building vmm.ko with gcc:
- fix warning about comparison of 'uint8_t v_tpr >= 0' always being true.

- fix error triggered by an empty clobber list in the inline assembly for
  "clgi" and "stgi"

- fix error when compiling "vmload %rax", "vmrun %rax" and "vmsave %rax". The
  gcc assembler does not like the explicit operand "%rax" while the clang
  assembler requires specifying the operand "%rax". Fix this by encoding the
  instructions using the ".byte" directive.

Reported by:	julian
MFC after:	1 week
2015-03-02 20:13:49 +00:00
rstone
e40d09375f Implement interface to create SR-IOV Virtual Functions
Implement the interace to create SR-IOV Virtual Functions (VFs).
When a driver registers that they support SR-IOV by calling
pci_setup_iov(), the SR-IOV code creates a new node in /dev/iov
for that device.  An ioctl can be invoked on that device to
create VFs and have the driver initialize them.

At this point, allocating memory I/O windows (BARs) is not
supported.

Differential Revision:	https://reviews.freebsd.org/D76
Reviewed by:		jhb
MFC after: 		1 month
Sponsored by:		Sandvine Inc.
2015-03-01 00:40:09 +00:00
rstone
29fedc4c2c Allow passthrough devices to be hinted.
Allow the ppt driver to attach to devices that were hinted to be
passthrough devices by the PCI code creating them with a driver
name of "ppt".

Add a tunable that allows the IOMMU to be forced to be used.  With
SR-IOV passthrough devices the VFs may be created after vmm.ko is
loaded.  The current code will not initialize the IOMMU in that
case, meaning that the passthrough devices can't actually be used.

Differential Revision:	https://reviews.freebsd.org/D73
Reviewed by:		neel
MFC after: 		1 month
Sponsored by:		Sandvine Inc.
2015-03-01 00:39:48 +00:00
kib
59184505da Supposed fix for some SandyBridge mobile CPUs hang on AP startup when
x2APIC mode is detected and enabled.  Current theory is that switching
the APIC mode while an IPI is in flight might be the issue.

Postpone switching to x2APIC mode until we are guaranteed that all
starting IPIs are already send and aknowledged.  Use aps_ready signal
as an indication that the BSP is done with us.

Tested by:	adrian
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2015-02-28 20:37:38 +00:00
neel
f458d8769b Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore
capability of VT-x. This lets bhyve run nested in older VMware versions that
don't support the PAT save/restore capability.

Note that the actual value programmed by the guest in MSR_PAT is irrelevant
because bhyve sets the 'Ignore PAT' bit in the nested PTE.

Reported by:	marcel
Tested by:	Leon Dang (ldang@nahannisys.com)
Sponsored by:	Nahanni Systems
MFC after:	2 weeks
2015-02-24 05:35:15 +00:00
jhb
4145efe339 Ensure that the supplied data length is large enough to hold the base
FPU state to avoid passing a negative length to fpusetregs() / npxsetregs().

Differential Revision:	https://reviews.freebsd.org/D1861
Reviewed by:	kib, emaste
2015-02-18 23:34:03 +00:00
kib
7a89c3df60 Initialize x2APIC mode on the resume path before accessing LAPIC.
Remove unneeded disable of LAPIC in the native_lapic_xapic_mode().  We
attempt to send wakeup IPI on the resume path right after BSP wakeup,
so disabling is wrong.

Reported and tested by:	glebius, "Ranjan1018 ." <214748mv@gmail.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2015-02-16 21:56:19 +00:00
markj
fc1485dfdc Add support for decoding multibyte NOPs.
Differential Revision:	https://reviews.freebsd.org/D1830
Reviewed by:	jhb, kib
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Divison
2015-02-13 01:35:53 +00:00
kib
0754f0eac9 Add x2APIC support. Enable it by default if CPU is capable. The
hw.x2apic_enable tunable allows disabling it from the loader prompt.

To closely repeat effects of the uncached memory ops when accessing
registers in the xAPIC mode, the x2APIC writes to MSRs are preceeded
by mfence, except for the EOI notifications.  This is probably too
strict, only ICR writes to send IPI require serialization to ensure
that other CPUs see the previous actions when IPI is delivered.  This
may be changed later.

In vmm justreturn IPI handler, call doreti_iret instead of doing iretd
inline, to handle corner conditions.

Note that the patch only switches LAPICs into x2APIC mode. It does not
enables FreeBSD to support > 255 CPUs, which requires parsing x2APIC
MADT entries and doing interrupts remapping, but is the required step
on the way.

Reviewed by:	neel
Tested by:	pho (real hardware), neel (on bhyve)
Discussed with:	jhb, grehan
Sponsored by:	The FreeBSD Foundation
MFC after:	2 months
2015-02-09 21:00:56 +00:00
jhb
9dfc36d985 Revert the IPI startup sequence to match what is described in the
Intel Multiprocessor Specification v1.4.  The Intel SDM claims that
the INIT IPIs here are invalid, but other systems follow the MP
spec instead.

While here, fix the IPI wait routine to accept a timeout in microseconds
instead of a raw spin count, and don't spin forever during AP startup.
Instead, panic if a STARTUP IPI is not delivered after 20 us.

PR:		196542
Differential Revision:	https://reviews.freebsd.org/D1719
MFC after:	2 weeks
2015-02-06 18:19:59 +00:00
bryanv
25ce9181cf Generalized parts of the XEN timer code into a generic pvclock
KVM clock shares the same data structures between the guest and the host
as Xen so it makes sense to just have a single copy of this code.

Differential Revision: https://reviews.freebsd.org/D1429
Reviewed by:	royger (eariler version)
MFC after:	1 month
2015-02-04 08:26:43 +00:00
kib
3bbc91d138 Do not qualify the mcontext_t *mcp argument for set_mcontext(9) as
const.  On x86, even after the machine context is supposedly read into
the struct ucontext, lazy FPU state save code might only mark the FPU
data as hardware-owned.  Later, set_fpcontext() needs to fetch the
state from hardware, modifying the *mcp.

The set_mcontext(9) is called from sigreturn(2) and setcontext(2)
implementations and old create_thread(2) interface, which throw the
*mcp out after the set_mcontext() call.

Reported by:	dim
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-31 21:43:46 +00:00
royger
217732b47d amd64: allow base memory segment to start at address different than 0
Current code requires that the first physical memory segment starts at 0,
but this is not really needed. We only need to make sure the bootstrap code
and page tables for APs are allocated below 4GB.

This patch removes this requirement and allows booting a Dell R710 from
UEFI, where the first physical memory segment starts at 0x10000.

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D1417
2015-01-26 08:42:47 +00:00
jhb
03984c1e9c If the boot-time memory test is enabled, output a dot ('.') for
each GB of RAM tested so people watching the console can see that
the machine is making progress and not hung.

PR:		196650
Submitted by:	Ravi Pokala <rpokala@panasas.com>
Suggestions from:	Eric van Gyzen <eric@vangyzen.net>
MFC after:	2 weeks
2015-01-25 20:16:45 +00:00
des
ffa2c78ca2 Remove ISA NICs. Anyone still using these on amd64 can build their
own kernel.
2015-01-25 12:02:38 +00:00
neel
5c373339e4 Add macro to identify AVIC capability (advanced virtual interrupt controller)
in AMD processors.

Submitted by:	Dmitry Luhtionov (dmitryluhtionov@gmail.com)
2015-01-24 00:35:49 +00:00
neel
e901ab485a MOVS instruction emulation.
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.

Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.

Tested by:	tychon
MFC after:	1 week
2015-01-19 06:53:31 +00:00
neel
d9f07f9841 Simplify instruction restart logic in bhyve.
Keep track of the next instruction to be executed by the vcpu as 'nextrip'.
As a result the VM_RUN ioctl no longer takes the %rip where a vcpu should
start execution.

Also, instruction restart happens implicitly via 'vm_inject_exception()' or
explicitly via 'vm_restart_instruction()'. The APIs behave identically in
both kernel and userspace contexts. The main beneficiary is the instruction
emulation code that executes in both contexts.

bhyve(8) VM exit handlers now treat 'vmexit->rip' and 'vmexit->inst_length'
as readonly:
- Restarting an instruction is now done by calling 'vm_restart_instruction()'
  as opposed to setting 'vmexit->inst_length' to 0 (e.g. emulate_inout())
- Resuming vcpu at an arbitrary %rip is now done by setting VM_REG_GUEST_RIP
  as opposed to changing 'vmexit->rip' (e.g. vmexit_task_switch())

Differential Revision:	https://reviews.freebsd.org/D1526
Reviewed by:		grehan
MFC after:		2 weeks
2015-01-18 03:08:30 +00:00
np
337910a604 Plug cxgbe(4) back into !powerpc && !arm builds, instead of building it
on amd64 only.
2015-01-16 01:39:24 +00:00
royger
0c5b62d3d2 loader: implement multiboot support for Xen Dom0
Implement a subset of the multiboot specification in order to boot Xen
and a FreeBSD Dom0 from the FreeBSD bootloader. This multiboot
implementation is tailored to boot Xen and FreeBSD Dom0, and it will
most surely fail to boot any other multiboot compilant kernel.

In order to detect and boot the Xen microkernel, two new file formats
are added to the bootloader, multiboot and multiboot_obj. Multiboot
support must be tested before regular ELF support, since Xen is a
multiboot kernel that also uses ELF. After a multiboot kernel is
detected, all the other loaded kernels/modules are parsed by the
multiboot_obj format.

The layout of the loaded objects in memory is the following; first the
Xen kernel is loaded as a 32bit ELF into memory (Xen will switch to
long mode by itself), after that the FreeBSD kernel is loaded as a RAW
file (Xen will parse and load it using it's internal ELF loader), and
finally the metadata and the modules are loaded using the native
FreeBSD way. After everything is loaded we jump into Xen's entry point
using a small trampoline. The order of the multiboot modules passed to
Xen is the following, the first module is the RAW FreeBSD kernel, and
the second module is the metadata and the FreeBSD modules.

Since Xen will relocate the memory position of the second
multiboot module (the one that contains the metadata and native
FreeBSD modules), we need to stash the original modulep address inside
of the metadata itself in order to recalculate its position once
booted. This also means the metadata must come before the loaded
modules, so after loading the FreeBSD kernel a portion of memory is
reserved in order to place the metadata before booting.

In order to tell the loader to boot Xen and then the FreeBSD kernel the
following has to be added to the /boot/loader.conf file:

xen_cmdline="dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga"
xen_kernel="/boot/xen"

The first argument contains the command line that will be passed to the Xen
kernel, while the second argument is the path to the Xen kernel itself. This
can also be done manually from the loader command line, by for example
typing the following set of commands:

OK unload
OK load /boot/xen dom0_mem=1024M dom0_max_vcpus=2 dom0pvh=1 console=com1,vga
OK load kernel
OK load zfs
OK load if_tap
OK load ...
OK boot

Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D517

For the Forth bits:
Submitted by: Julien Grall <julien.grall AT citrix.com>
2015-01-15 16:27:20 +00:00
imp
66acb8032e New MINIMAL kernel config. The goal with this configuration is to
only compile in those options in GENERIC that cannot be loaded as
modules. ufs is still included because many of its options aren't
present in the kernel module. There's some other exceptions documented
in the file. This is part of some work to get more things
automatically loading in the hopes of obsoleting GENERIC one day.
2015-01-15 00:42:06 +00:00
neel
4091be74c6 Fix typo (missing comma).
MFC after:	3 days
2015-01-14 07:18:51 +00:00
neel
5c965bc583 'struct vm_exception' was intended to be used only as the collateral for the
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.

Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.

Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.

MFC after:      1 week
2015-01-13 22:00:47 +00:00
kib
79db3369f9 Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 08:58:07 +00:00
kib
5e39e2c47c Revert r276600: PHYS_TO_DMAP_RAW() and DMAP_TO_PHYS_RAW() are no
longer used, remove them.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:50:55 +00:00
kib
6db369589b Fix several issues with /dev/mem and /dev/kmem devices on amd64.
For /dev/mem, when requested physical address is not accessible by the
direct map, do temporal remaping with the caching attribute
'uncached'.  Limit the accessible addresses by MAXPHYADDR, since the
architecture disallowes writing non-zero into reserved bits of ptes
(or setting garbage into NX).

For /dev/kmem, only access existing kernel mappings for direct map
region.  For all other addresses, obtain a physical address of the
mapping and fall back to the /dev/mem mechanism.  This ensures that
/dev/kmem i/o does not fault even if the accessed region is changed in
parallel, by using either direct map or temporal mapping.

For both devices, operate on one page by iteration.  Do not return
error if any bytes were moved around, return the (partial) bytes count
to userspace.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:48:22 +00:00
kib
11969484c8 For x86, read MAXPHYADDR, defined in SDM vol 3 4.1.4 Enumeration of Paging
Features by CPUID as CPUID.80000008H:EAX[7:0], into variable cpu_maxphyaddr.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-12 07:36:25 +00:00
markj
7e7e145818 Factor out duplicated code from dumpsys() on each architecture into generic
code in sys/kern/kern_dump.c. Most dumpsys() implementations are nearly
identical and simply redefine a number of constants and helper subroutines;
a generic implementation will make it easier to implement features around
kernel core dumps. This change does not alter any minidump code and should
have no functional impact.

PR:		193873
Differential Revision:	https://reviews.freebsd.org/D904
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Reviewed by:	jhibbits (earlier version)
Sponsored by:	EMC / Isilon Storage Division
2015-01-07 01:01:39 +00:00
neel
e72e75f02d Clear blocking due to STI or MOV SS in the hypervisor when an instruction is
emulated or when the vcpu incurs an exception. This matches the CPU behavior.

Remove special case code in HLT processing that was clearing the interrupt
shadow. This is now redundant because the interrupt shadow is always cleared
when the vcpu is resumed after an instruction is emulated.

Reported by:	David Reed (david.reed@tidalscale.com)
MFC after:	2 weeks
2015-01-06 19:04:02 +00:00
jhb
c3d1954342 Remove "New" label from NFSCL/NFSD now that they are the only NFS
client/server.  While here, remove duplicate NFSCL from sys/conf/NOTES.

Approved by:	rmacklem
2015-01-06 16:15:57 +00:00
jhb
55d0376a65 On some Intel CPUs with a P-state but not C-state invariant TSC the TSC
may also halt in C2 and not just C3 (it seems that in some cases the BIOS
advertises its C3 state as a C2 state in _CST).  Just play it safe and
disable both C2 and C3 states if a user forces the use of the TSC as the
timecounter on such CPUs.

PR:		192316
Differential Revision:	https://reviews.freebsd.org/D1441
No objection from:	jkim
MFC after:	1 week
2015-01-05 20:44:44 +00:00
kib
771035aa6c For /dev/mem and /dev/kmem accesses, avoid asserting that addresses
are within direct map.  We want to return error instead of panicing.

PR:	194995
Sponsored by:	The FreeBSD Foundation
2015-01-03 01:28:58 +00:00
scottl
9f03d3d21b Fix a missed comment from r276526. 2015-01-02 15:46:54 +00:00
kib
35b8c82991 Callers of pmap_kextract() cannot distinguish between failure and
physical address zero.  Assume that the lowest page is always mapped
by direct map.

This restores access to the page at zero through /dev/mem after
r263475.

Reported and tested by:	neel
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-02 01:05:08 +00:00
kib
a9168eed4e Actually remove GIANT_REQUIRED, declared but not done in r263475.
Style.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-02 01:00:38 +00:00
dchagin
4026f410fd Regen after r276508, r276509. 2015-01-01 18:43:31 +00:00
dchagin
4f605a7bfc Correct an argument status of wait4 syscall for Linuxulator.
MFC after:	1 week
2015-01-01 18:37:03 +00:00
np
6c2366689f Temporarily unplug cxgbe(4) from !amd64 builds. 2014-12-31 20:34:12 +00:00
alc
369e66acd7 The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges.  Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS).  However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time.  Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory.  This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.

On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB.  This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.

PR:		185727
Differential Revision:	https://reviews.freebsd.org/D1274
Reviewed by:	jhb, kib
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
neel
b9de0c114c Initialize all fields of 'struct vm_exception exception' before passing it to
vm_inject_exception(). This fixes the issue that 'exception.cpuid' is
uninitialized when calling 'vm_inject_exception()'.

However, in practice this change is a no-op because vm_inject_exception()
does not use 'exception.cpuid' for anything.

Reported by:    Coverity Scan
CID:            1261297
MFC after:      3 days
2014-12-30 23:38:31 +00:00
neel
7aa6460c48 Replace bhyve's minimal RTC emulation with a fully featured one in vmm.ko.
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.

Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.

The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>

Reviewed by:	tychon
Discussed with:	grehan
Differential Revision:	https://reviews.freebsd.org/D1385
MFC after:	2 weeks
2014-12-30 22:19:34 +00:00
neel
2908c65b8a Inject #UD into the guest when it executes either 'MONITOR' or 'MWAIT' on
an AMD/SVM host.

MFC after:	1 week
2014-12-30 02:44:33 +00:00
neel
baa3b938e5 Implement "special mask mode" in vatpic.
OpenBSD guests always enable "special mask mode" during boot. As a result of
r275952 this is flagged as an error and the guest cannot boot.

Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D1384
MFC after:	1 week
2014-12-28 00:53:52 +00:00
kib
d0d6e7c817 Change the way the lcall $7,$0 is reflected to usermode. Instead of
setting call gate, which must be 64 bit, put a code segment descriptor
into ldt slot 0.

This way, syscall shim does not switch temporary to 64bit trampoline,
and does not create a window where signal delivery interrupts 64 bit
mode (signal handler cannot return).  The cost is shim running with
non-zero based segment in %cs, which requires vfork() handling make
more assumptions.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-27 23:19:08 +00:00
phk
17ba4d442c Use compiled in default keymaps which are available both in syscons and vt. 2014-12-25 17:50:04 +00:00
markj
7ea63e4fb4 Restore the trap type argument to the DTrace trap hook, removed in r268600.
It's redundant at the moment since it can be obtained from the trapframe
on the architectures where DTrace is supported, but this won't be the case
with ARM.
2014-12-23 15:38:19 +00:00
neel
4d52b44708 Allow ktr(4) tracing of all guest exceptions via the tunable
"hw.vmm.trace_guest_exceptions".  To enable this feature set the tunable
to "1" before loading vmm.ko.

Tracing the guest exceptions can be useful when debugging guest triple faults.

Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.

Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".

Discussed with:	grehan
MFC after:	2 weeks
2014-12-23 02:14:49 +00:00
neel
9abef2383d Emulate writes to the IA32_MISC_ENABLE MSR.
PR:		196093
Reported by:	db
Tested by:	db
Discussed with:	grehan
MFC after:	1 week
2014-12-20 19:47:51 +00:00
neel
e4f07b01b4 Various 8259 device model improvements:
- implement 8259 "polled" mode.
- set 'atpic->sfn' if bit 4 in ICW4 is set during master initialization.
- report error if guest tries to enable the "special mask" mode.

Differential Revision:	https://reviews.freebsd.org/D1328
Reviewed by:		tychon
Reported by:		grehan
Tested by:		grehan
MFC after:		1 week
2014-12-20 04:57:45 +00:00
neel
36aadd747e Fix 8259 IRQ priority resolver.
Initialize the 8259 such that IRQ7 is the lowest priority.

Reviewed by:		tychon
Differential Revision:	https://reviews.freebsd.org/D1322
MFC after:		1 week
2014-12-17 03:04:43 +00:00
kib
42d5fa98d3 The iret instruction may generate #np and #ss fault, besides #gp.
When returning to usermode, the handler for that exceptions is also
executed with wrong gs base.  Handle all three possible faults in the
same way, checking for iret fault, and performing full iret.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-12-16 18:28:33 +00:00
neel
bdae65463e For level triggered interrupts clear the PIC IRR bit when the interrupt pin
is deasserted. Prior to this change each assertion on a level triggered irq
pin resulted in two interrupts being delivered to the CPU.

Differential Revision:	https://reviews.freebsd.org/D1310
Reviewed by:	tychon
MFC after:	1 week
2014-12-16 06:33:57 +00:00
gnn
d5b2a401e7 This configuration file removes several debugging options, including
WITNESS and INVARIANTS checking, which are known to have significant
performance impact on running systems.  When benchmarking new features
this kernel should be used instead of the standard GENERIC.
This kernel configuration should never appear outside of the HEAD
of the FreeBSD tree.
2014-12-02 19:55:43 +00:00
emaste
fda27c9937 Revert r274772: it is not valid on MIPS
Reported by:	sbruno
2014-11-25 03:50:31 +00:00
grehan
205a8351f4 Change the lower bound for guest vmspace allocation to 0 instead of
using the VM_MIN_ADDRESS constant.

HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.

Reported by:	Shawn Webb, lattera at gmail com
Reviewed by:	neel, grehan
Obtained from:	Oliver Pinter via HardenedBSD
23bd719ce1
MFC after:	1 week
2014-11-23 23:07:21 +00:00
jhb
1671ac9155 Improve support for XSAVE with debuggers.
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
  to match what Linux does in that 1) it dumps the entire XSAVE area
  including the fxsave state, and 2) it stashes a copy of the current
  xsave mask in the unused padding between the fxsave state and the
  xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
  only the extra portion. This avoids having to always make two
  ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
  XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
  and the current XSAVE mask (%xcr0).

Differential Revision:	https://reviews.freebsd.org/D1193
Reviewed by:	kib
MFC after:	2 weeks
2014-11-21 20:53:17 +00:00