6211 Commits

Author SHA1 Message Date
Neel Natu
a2da7af6bc Add support for trapping MMIO writes to local apic registers and emulating them.
The default behavior is still to present the local apic to the guest in the
x2apic mode.
2012-09-25 22:31:35 +00:00
Neel Natu
e90273829b Add ioctls to control the X2APIC capability exposed by the virtual machine to
the guest.

At the moment this simply sets the state in the 'vcpu' instance but there is
no code that acts upon these settings.
2012-09-25 19:08:51 +00:00
Neel Natu
edf89256dd Add an explicit exit code 'SPINUP_AP' to tell the controlling process that an
AP needs to be activated by spinning up an execution context for it.

The local apic emulation is now completely done in the hypervisor and it will
detect writes to the ICR_LO register that try to bring up the AP. In response
to such writes it will return to userspace with an exit code of SPINUP_AP.

Reviewed by: grehan
2012-09-25 02:33:25 +00:00
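The ICR_LO check described in edf89256dd can be sketched roughly as follows. This is a minimal illustration, not the hypervisor's code: the delivery-mode encoding (bits 8 to 10, value 6 for a startup IPI) and the vector-to-address rule come from the x86 local APIC architecture, while the names sketch_exit, sketch_exitcode and sketch_icrlo_write are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Delivery-mode field of the local APIC ICR low word (bits 8..10). */
#define ICR_DELMODE_MASK	0x00000700u
#define ICR_DELMODE_STARTUP	0x00000600u	/* Startup IPI (SIPI) */
#define ICR_VECTOR_MASK		0x000000ffu

/* Hypothetical exit codes; the names are illustrative only. */
enum sketch_exitcode { SKETCH_EXIT_NONE, SKETCH_EXIT_SPINUP_AP };

struct sketch_exit {
	enum sketch_exitcode code;
	uint64_t ap_start_addr;	/* address at which the AP begins executing */
};

/*
 * Inspect a guest write to ICR_LO.  If it is a startup IPI, record an
 * exit that asks userspace to spin up an execution context for the AP.
 */
static bool
sketch_icrlo_write(uint32_t icr_lo, struct sketch_exit *exit)
{
	assert(exit != NULL);
	exit->code = SKETCH_EXIT_NONE;
	exit->ap_start_addr = 0;
	if ((icr_lo & ICR_DELMODE_MASK) != ICR_DELMODE_STARTUP)
		return (false);
	/* A SIPI with vector VV starts the AP at physical address 0xVV000. */
	exit->code = SKETCH_EXIT_SPINUP_AP;
	exit->ap_start_addr = (uint64_t)(icr_lo & ICR_VECTOR_MASK) << 12;
	return (true);
}
```

A controlling process would presumably react to the SPINUP_AP exit by creating a new thread for the target vcpu and starting it at the recorded address.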
Neel Natu
98ed632c63 Stash the 'vm_exit' information in each 'struct vcpu'.
There is no functional change at this time but this paves the way for vm exit
handler functions to easily modify the exit reason going forward.
2012-09-24 19:32:24 +00:00
Neel Natu
2d3a73ed6d Restructure the x2apic access code in preparation for supporting memory mapped
access to the local apic.

The vlapic code is now aware of the mode that the guest is using to access the
local apic.

Reviewed by: grehan@
2012-09-21 03:09:23 +00:00
Peter Grehan
177fd53318 Add sysctls to display the total and free amount of hard-wired mem for VMs
# sysctl hw.vmm
   hw.vmm.mem_free: 2145386496
   hw.vmm.mem_total: 2145386496

Submitted by:	Takeshi HASEGAWA hasegaw at gmail com
2012-08-26 01:41:41 +00:00
Neel Natu
8124debe13 Include 'device uart' in the guest kernel.
2012-08-04 04:30:26 +00:00
Neel Natu
39c21c2db2 Force certain bits in %cr4 to be hard-wired to '1' or '0' from a guest's
perspective. If we don't do this some guest OSes (e.g. Linux) will reset
the CR4_VMXE bit in %cr4 with disastrous consequences.

Reported by: grehan
2012-08-04 02:06:55 +00:00
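The idea in 39c21c2db2 of forcing %cr4 bits to fixed values can be illustrated with a small masking helper. This is only a sketch under assumptions: the helper name and the two mask parameters are hypothetical, and the commit's actual mechanism is not shown in this log. The CR4_VMXE bit position (bit 13) is architectural.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_VMXE	0x00002000u	/* bit 13: VMX enable */

/*
 * Compute the %cr4 value that is actually installed for a guest.
 * ones_mask holds the bits forced to 1 and zeros_mask the bits forced
 * to 0; both names are hypothetical.  All other bits follow the guest.
 */
static uint64_t
sketch_fix_cr4(uint64_t guest_val, uint64_t ones_mask, uint64_t zeros_mask)
{
	/* A bit cannot be forced to both 1 and 0. */
	assert((ones_mask & zeros_mask) == 0);
	return ((guest_val | ones_mask) & ~zeros_mask);
}
```

With CR4_VMXE in ones_mask, a guest write that clears the bit leaves it set in the installed value, which is the failure mode the commit message describes for Linux guests.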
Neel Natu
4bff7fad95 Verify that VMX operation has been enabled by BIOS before executing the
VMXON instruction.

Reported by "s vas" on freebsd-virtualization@
2012-07-25 00:21:16 +00:00
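A check like the one in 4bff7fad95 typically looks at the IA32_FEATURE_CONTROL MSR (0x3a), where bit 0 is the lock bit and bit 2 enables VMX outside SMX. The sketch below works on an already-read MSR value; the function names are hypothetical, and the handling of the unlocked case is an assumption rather than something this log confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_FEATURE_CONTROL	0x3a
#define FEATURE_CONTROL_LOCK	(UINT64_C(1) << 0)	/* MSR is locked */
#define FEATURE_CONTROL_VMX_EN	(UINT64_C(1) << 2)	/* VMX outside SMX */

/* True if firmware locked the MSR with VMX left disabled. */
static bool
sketch_vmx_disabled_by_bios(uint64_t fc)
{
	return ((fc & FEATURE_CONTROL_LOCK) != 0 &&
	    (fc & FEATURE_CONTROL_VMX_EN) == 0);
}

/*
 * Value the MSR should hold before VMXON.  Assumption: when the MSR is
 * still unlocked, software may set the enable and lock bits itself.
 */
static uint64_t
sketch_feature_control_for_vmxon(uint64_t fc)
{
	assert(!sketch_vmx_disabled_by_bios(fc));
	if ((fc & FEATURE_CONTROL_LOCK) != 0)
		return (fc);	/* already locked with VMX enabled */
	return (fc | FEATURE_CONTROL_VMX_EN | FEATURE_CONTROL_LOCK);
}
```

In a kernel the input would come from rdmsr(MSR_IA32_FEATURE_CONTROL); executing VMXON while sketch_vmx_disabled_by_bios() is true would fault.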
Peter Grehan
b652778e42 IFC @ r238370
2012-07-11 19:54:21 +00:00
John Baldwin
d706ec297a Add a clts() wrapper around the 'clts' instruction to <machine/cpufunc.h>
on x86 and use that to implement stop_emulating() in the fpu/npx code.
Reimplement start_emulating() in the non-XEN case by using load_cr0() and
rcr0() instead of the 'lmsw' and 'smsw' instructions.  Intel explicitly
discourages the use of 'lmsw' and 'smsw' on 80386 and later processors in
the description of these instructions in Volume 2 of the ADM.

Reviewed by:	kib
MFC after:	1 month
2012-07-09 20:55:39 +00:00
John Baldwin
5355f65974 Partially revert r217515 so that the mem_range_softc variable is always
present on x86 kernels.  This fixes the build of kernels that include
'device acpi' but do not include 'device mem'.

MFC after:	1 month
2012-07-09 20:42:08 +00:00
Konstantin Belousov
f18d5bf44b Use assembler mnemonic instead of manually assembling; a continuation of r238142.
Reviewed by:	jhb
MFC after:	1 month
2012-07-06 20:11:58 +00:00
John Baldwin
6632f45773 Several fixes to the amd64 disassembler:
- Add generic support for opcodes that are escape bytes used for
  multi-byte opcodes (such as the 0x0f prefix).  Use this to replace
  the hard-coded 0x0f special case and add support for three-byte
  opcodes that use the 0x0f38 prefix.
- Decode all Intel VMX instructions.  invept and invvpid in particular are
  three-byte opcodes that use the 0x0f38 escape prefix.
- Rework how the special 'SDEP' size flag works such that the default
  instruction name (i_name) is the instruction when the data size
  prefix (0x66) is not specified, and the alternate name in i_extra is
  used when the prefix is included.
- Add a new 'ADEP' size flag similar to 'SDEP' except that it chooses
  between i_name and i_extra based on the address size prefix (0x67).
  Use this to fix the decoding for jrcxz vs jecxz which is determined
  by the address size prefix, not the operand size prefix.  Also, jcxz
  is not possible in 64-bit mode, but jrcxz is the default instruction
  for that opcode.
- Add support for handling instructions that have a mandatory 'rep'
  prefix (this means not outputting the 'repe ' prefix until determining
  if it is used as part of an opcode).  Make 'pause' less of a special
  case this way.
- Decode 'cmpxchg16b' and 'cdqe' which are variants of other instructions
  but with a REX.W prefix.

MFC after:	1 month
2012-07-06 14:25:59 +00:00
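The 'ADEP' selection described in 6632f45773 can be pictured as follows. This is a simplified sketch: only the i_name and i_extra field names come from the commit message, while the struct and function names are hypothetical and the real disassembler table carries more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PREFIX_ADDR_SIZE	0x67	/* address size override prefix */

/* Hypothetical, trimmed-down instruction table entry. */
struct sketch_inst {
	const char *i_name;	/* name used without the prefix */
	const char *i_extra;	/* alternate name used with the prefix */
};

/*
 * ADEP-style selection: the address size prefix (0x67), not the
 * operand size prefix (0x66), decides which name is printed.
 */
static const char *
sketch_adep_name(const struct sketch_inst *ip, bool addr_size_prefix)
{
	assert(ip != NULL && ip->i_name != NULL && ip->i_extra != NULL);
	return (addr_size_prefix ? ip->i_extra : ip->i_name);
}

/* In 64-bit mode the default is jrcxz; with 0x67 it becomes jecxz. */
static const struct sketch_inst sketch_jcxz = { "jrcxz", "jecxz" };
```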
Alan Cox
cc861283f4 Make pmap_enter()'s management of PV entries consistent with the other pmap
functions that manage PV entries.  Specifically, remove the PV entry from
the containing PV list only after the corresponding PTE is destroyed.

Update the pmap's wired mapping count in pmap_enter() before the PV list
lock is acquired.
2012-07-06 06:42:25 +00:00
John Baldwin
7574a595f2 Now that our assembler supports the xsave family of instructions, use them
natively rather than hand-assembled versions.  For xgetbv/xsetbv, add a
wrapper API to deal with xcr* registers: rxcr() and load_xcr().

Reviewed by:	kib
MFC after:	1 month
2012-07-05 18:19:35 +00:00
Alan Cox
8f2994ce67 Calculate the new PTE value in pmap_enter() before acquiring any locks.
Move an assertion to the beginning of pmap_enter().
2012-07-05 07:20:16 +00:00
Alan Cox
1bc8531c1e Correct an error in r237513. The call to reserve_pv_entries() must come
before pmap_demote_pde() updates the PDE.  Otherwise, pmap_pv_demote_pde()
can crash.

Crash reported by:	kib
Patch tested by:	kib
2012-07-05 00:08:47 +00:00
John Baldwin
66f9aec075 Decode the 'xsave', 'xrstor', 'xsaveopt', 'xgetbv', 'xsetbv', and
'rdtscp' instructions.

MFC after:	1 month
2012-07-04 16:47:39 +00:00
Xin LI
309dca0171 tws(4) is interfaced with CAM so move it to the same section.
Reported by:	joel
MFC after:	3 days
2012-07-01 08:10:49 +00:00
Alan Cox
2bde6e3518 Optimize reserve_pv_entries() using the popcnt instruction.
2012-06-30 20:25:12 +00:00
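As an illustration of the popcnt optimization in 2bde6e3518, the sketch below counts free entries in a chunk's bitmap with a population count instead of testing bits one at a time. It assumes a chunk tracks free PV entries as set bits in a small array of 64-bit words; the array length, the names, and the use of the GCC/Clang builtin __builtin_popcountll() are assumptions for the example, not the commit's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NPCM	3	/* bitmap words per chunk (assumed) */

/*
 * Count the free entries in one chunk.  A set bit marks a free entry,
 * so the count is the number of set bits across all bitmap words.
 * With hardware support the builtin compiles to a popcnt instruction.
 */
static int
sketch_count_free(const uint64_t map[SKETCH_NPCM])
{
	int free_entries = 0;

	assert(map != NULL);
	for (size_t i = 0; i < SKETCH_NPCM; i++)
		free_entries += __builtin_popcountll(map[i]);
	return (free_entries);
}
```

A reservation routine could sum this count over its chunks and stop as soon as the total covers the number of entries it needs.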
Alan Cox
92e2574577 In r237592, I forgot that pmap_enter() might already hold a PV list lock
at the point that it calls get_pv_entry().  Thus, pmap_enter()'s PV list
lock pointer must be passed to get_pv_entry() for those rare occasions
when get_pv_entry() calls reclaim_pv_chunk().

Update some related comments.
2012-06-29 18:15:56 +00:00
Alan Cox
6c67613030 Avoid some unnecessary PV list locking in pmap_enter().
2012-06-28 22:03:59 +00:00
Alan Cox
23e59dfa8d Optimize pmap_pv_demote_pde().
2012-06-28 05:42:04 +00:00
Alan Cox
e30df26e7b Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.
2012-06-27 03:45:25 +00:00
Alan Cox
5b5b0ef34d Introduce RELEASE_PV_LIST_LOCK().
2012-06-26 16:45:18 +00:00
Alan Cox
0d646df757 Add PV list locking to pmap_enter(). Its execution is no longer serialized
by the pvh global lock.

Add a needed atomic operation to pmap_object_init_pt().
2012-06-26 06:02:43 +00:00
Alan Cox
aaf3bc56fd Add PV chunk and list locking to pmap_change_wiring(), pmap_protect(), and
pmap_remove().  The execution of these functions is no longer serialized
by the pvh global lock.

Make some stylistic changes to the affected code for the sake of
consistency with related code elsewhere in the pmap.
2012-06-25 07:13:25 +00:00
Alan Cox
f745b16359 Introduce reserve_pv_entry() and use it in pmap_pv_demote_pde(). In order
to add PV list locking to pmap_pv_demote_pde(), it is necessary to change
the way that pmap_pv_demote_pde() allocates PV entries.  Specifically,
once pmap_pv_demote_pde() begins modifying the PV lists, it can't allocate
any new PV chunks, because that could require the PV list lock to be
dropped.  So, all necessary PV chunks must be allocated in advance.  To my
surprise, this new approach is a few percent faster than the old one.
2012-06-23 22:54:25 +00:00
Konstantin Belousov
aea810386d Implement a mechanism to export some kernel timekeeping data to
usermode, using the shared page.  The structures and functions have a vdso
prefix, to indicate the intended future location of the code.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows the mechanism to be turned off.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs necessary for non-x86 architectures to still compile
are provided.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:06:40 +00:00
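The lockless usermode read mentioned in aea810386d can be pictured with a generation-counter scheme. The sketch below is an assumption-laden illustration using C11 atomics: the struct layout, field names and both functions are hypothetical and do not reproduce struct vdso_timehands. A generation of 0 means an update is in progress, and a reader retries until it sees the same non-zero generation before and after copying the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared timekeeping record. */
struct sketch_timehands {
	atomic_uint gen;		/* 0 while an update is in progress */
	_Atomic uint64_t offset;
	_Atomic uint64_t scale;
};

struct sketch_snapshot {
	uint64_t offset;
	uint64_t scale;
};

/* Writer side: invalidate, update the data, then publish new_gen. */
static void
sketch_publish(struct sketch_timehands *th, unsigned new_gen,
    uint64_t offset, uint64_t scale)
{
	assert(th != NULL && new_gen != 0);
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, offset, memory_order_relaxed);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&th->gen, new_gen, memory_order_release);
}

/* Reader side: no lock is taken; retry if an update raced with us. */
static struct sketch_snapshot
sketch_read(struct sketch_timehands *th)
{
	struct sketch_snapshot s;
	unsigned gen;

	assert(th != NULL);
	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		s.offset = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		s.scale = atomic_load_explicit(&th->scale,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (s);
}
```

A reader that finds an unfamiliar algorithm or a disabled mechanism would fall back to the system call, as the commit message describes for libc.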
Konstantin Belousov
232aa31fb9 Reserve AT_TIMEKEEP auxv entry for providing usermode the pointer to
timekeeping information.

MFC after:  1 week
2012-06-22 06:38:31 +00:00
Alan Cox
240cc83f55 Introduce CHANGE_PV_LIST_LOCK_TO_{PHYS,VM_PAGE}() to avoid duplication of
code.
2012-06-22 05:01:36 +00:00
Alan Cox
290d3e6395 Update the PV stats in free_pv_entry() using atomics. After which, it is
no longer necessary for free_pv_entry() to be serialized by the pvh global
lock.

Retire pmap_insert_entry() and pmap_remove_entry().  Once upon a time,
these functions were called from multiple places within the pmap.  Now,
each has only one caller.
2012-06-21 16:37:36 +00:00
Alan Cox
7ed5b3afa2 Add PV list locking to pmap_copy(), pmap_enter_object(), and
pmap_enter_quick().  These functions are no longer serialized by the pvh
global lock.

There is no need to release the PV list lock before calling free_pv_chunk()
in pmap_remove_pages().
2012-06-20 07:25:20 +00:00
Alan Cox
2f49b6b831 Condition the implementation of pv_entry_count on PV_STATS. On amd64,
pv_entry_count is purely informational.  It does not serve any functional
purpose.

Add PV chunk locking to get_pv_entry().
2012-06-19 08:12:44 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP is in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Konstantin Belousov
c59f3d4d22 Adjust the fix in r236953, by not generating the signal manually, but
performing the return to usermode using full return path.  This
consolidates the handling of exceptional situations in fewer places,
and is less code as well.

Reviewed by:   jhb
MFC after:     1 week
2012-06-18 21:08:48 +00:00
Alan Cox
06de588446 Add PV chunk and list locking to pmap_page_exists_quick(),
pmap_page_is_mapped(), and pmap_remove_pages().  These functions
are no longer serialized by the pvh global lock.
2012-06-18 16:21:59 +00:00
Alan Cox
6031c68de4 The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer.  This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by:	kib
MFC after:	6 weeks
2012-06-16 18:56:19 +00:00
Adrian Chadd
83567110bd Oops - use the actual 11n enable option.
2012-06-15 15:32:16 +00:00
Adrian Chadd
3342d83059 Ok, ok. 802.11n can be on by default in GENERIC in -HEAD.
God help me.
2012-06-15 02:16:29 +00:00
Alan Cox
90407113a7 Update a couple comments to reflect r235598.
X-MFC after:	r235598
2012-06-14 17:47:54 +00:00
Alan Cox
62657c50df Correctly identify the function in a KASSERT().
MFC after:	3 days
2012-06-14 17:40:49 +00:00
Jung-uk Kim
6ad799103d - Remove unused code for CR3 and CR4.
- Fix a few style(9) nits while I am here.
2012-06-13 22:53:56 +00:00
Jung-uk Kim
acd7df97cc - Fix resumectx() prototypes to reflect reality.
- For i386, simply jump to resumectx() with PCB in %ecx.
- Fix a style(9) nit while I am here.
2012-06-13 21:03:01 +00:00
Bjoern A. Zeeb
b0a576ce8e Fix a problem where zero-length RDATA fields can cause named(8) to crash.
[12:03]

Correct a privilege escalation when returning from kernel if
running FreeBSD/amd64 on non-AMD processors. [12:04]

Fix reference count errors in IPv6 code. [EN-12:02]

Security:	CVE-2012-1667
Security:	FreeBSD-SA-12:03.bind
Security:	CVE-2012-0217
Security:	FreeBSD-SA-12:04.sysret
Security:	FreeBSD-EN-12:02.ipv6refcount
Approved by:	so (simon, bz)
2012-06-12 12:10:10 +00:00
Mitsuru IWASAKI
77c80e2e5b Share IPI init and startup code of mp_machdep.c with acpi_wakeup.c
as ipi_startup().
2012-06-12 00:14:54 +00:00
Alan Cox
efab609272 Avoid unnecessary atomic operations for clearing PGA_WRITEABLE in
pmap_remove_pages().  This reduces pmap_remove_pages()'s running time by
4 to 11% in my tests.

MFC after:	1 week
2012-06-11 21:41:16 +00:00
Mitsuru IWASAKI
8a6c6fadc7 Some fixes for r236772.
- Remove cpuset stopped_cpus which is no longer used.
- Add a short comment for cpuset suspended_cpus clearing.
- Fix the un-ordered x86/acpica/acpi_wakeup.c in conf/files.amd64 and i386.

Pointed-out by:	attilio@
2012-06-10 02:38:51 +00:00
Mitsuru IWASAKI
fb864578af Add x86/acpica/acpi_wakeup.c for amd64 and i386. Differences in the
suspend/resume procedures between the two are minimized.

common:
- Add global cpuset suspended_cpus to indicate APs are suspended/resumed.
- Remove acpi_waketag and acpi_wakemap from acpivar.h (no longer used).
- Add some variables in acpi_wakecode.S in order to minimize the difference
  between amd64 and i386.
- Disable load_cr3() because now CR3 is restored in resumectx().

amd64:
- Add suspend/resume related members (such as MSR) in PCB.
- Modify savectx() for above new PCB members.
- Merge acpi_switch.S into cpu_switch.S as resumectx().

i386:
- Merge (and remove) suspendctx() into savectx() in order to match the
  amd64 code.

Reviewed by:	attilio@, acpi@
2012-06-09 00:37:26 +00:00