Commit Graph

6429 Commits

Author SHA1 Message Date
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Andriy Gapon
9ba0691bdd follow up to r254051
- update powerpc/GENERIC64 as well, suggested by mdf
- update comments so that they make sense after the change, suggested by
  jhb

X-MFC after:	never (change specific to head)
2013-08-09 08:11:09 +00:00
Neel Natu
f263e391a3 Use local variables with the appropriate types and eliminate a bunch of casts.
This is a cosmetic change but it does help with a proposed change to increase
the maximum size of physical memory supported on amd64 platforms.

Submitted by:	Chris Torek (torek@torek.net)
2013-08-08 03:17:39 +00:00
Konstantin Belousov
449c2e92c9 Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain.  The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass.  This is not optimal, since it
could cause excessive page deactivation and freeing.  The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues.  Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
Konstantin Belousov
872d995f76 Change the pmap_ts_referenced() method of amd64 pmap to use shared
pvh_global_lock.  This allows the method to be executed in parallel,
avoiding undue contention on the pvh_global_lock for the multithreaded
pagedaemon.

The pmap_ts_referenced() function has to inspect the page mappings for
several pmaps, which need to be locked while pv list lock is owned.
This contradicts to the lock order, where pmap lock is before pv list
lock.  Introduce the generation count for the pv list of the page or
superpage, which indicate any change in the pv list, and, as usual,
perform restart of the iteration if generation changed while pv lock
was dropped for blocking acquire of a pmap lock.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:33:15 +00:00
Andriy Gapon
818d282e7b enable KDB_TRACE in GENERICs
KDB_TRACE is not an alternative to DDB/etc, they are complementary.
So I do not see any reason to not enable KDB_TRACE by default.

X-MFC after:	never (change specific to head)
2013-08-07 08:03:50 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Jeff Roberson
2c0b86b48f - Introduce a specific function, pmap_remove_kernel_pde, for removing
huge pages in the kernel's address space.  This works around several
   asserts from pmap_demote_pde_locked that did not apply and gave false
   warnings.

Discovered by:	pho
Reviewed by:	alc
Sponsored by:	EMC / Isilon Storage Division
2013-08-05 00:28:03 +00:00
Peter Grehan
80a902ef7d Follow-up commit to fix CR0 issues. Maintain
architectural state on CR vmexits by guaranteeing
that EFER, CR0 and the VMCS entry controls are
all in sync when transitioning to IA-32e mode.

Submitted by:	Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
2013-08-03 03:16:42 +00:00
Peter Grehan
81ef6611ed Moved clearing of vmm_initialized to avoid the case
of unloading the module while VMs existed. This would
result in EBUSY, but would prevent further operations
on VMs resulting in the module being impossible to
unload.

Submitted by:   Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
Reviewed by:	grehan, neel
2013-08-01 05:59:28 +00:00
Peter Grehan
aaaa065629 Correctly maintain the CR0/CR4 shadow registers.
This was exposed with AP spinup of Linux, and
booting OpenBSD, where the CR0 register is unconditionally
written to prior to the longjump to enter protected
mode. The CR-vmexit handling was not updating CPU state which
resulted in a vmentry failure with invalid guest state.

A follow-on submit will fix the CPU state issue, but this
fix prevents the CR-vmexit prior to entering protected
mode by properly initializing and maintaining CR* state.

Reviewed by:	neel
Reported by:	Gopakumar.T @ netapp
2013-08-01 01:18:51 +00:00
David E. O'Brien
0e6a0799a9 Back out r253779 & r253786. 2013-07-31 17:21:18 +00:00
David E. O'Brien
99ff83da74 Decouple yarrow from random(4) device.
* Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option.
  The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow.

* random(4) device doesn't really depend on rijndael-*.  Yarrow, however, does.

* Add random_adaptors.[ch] which is basically a store of random_adaptor's.
  random_adaptor is basically an adapter that plugs in to random(4).
  random_adaptor can only be plugged in to random(4) very early in bootup.
  Unplugging random_adaptor from random(4) is not supported, and is probably a
  bad idea anyway, due to potential loss of entropy pools.
  We currently have 3 random_adaptors:
  + yarrow
  + rdrand (ivy.c)
  + nehemeiah

* Remove platform dependent logic from probe.c, and move it into
  corresponding registration routines of each random_adaptor provider.
  probe.c doesn't do anything other than picking a specific random_adaptor
  from a list of registered ones.

* If the kernel doesn't have any random_adaptor adapters present then the
  creation of /dev/random is postponed until next random_adaptor is kldload'ed.

* Fix randomdev_soft.c to refer to its own random_adaptor, instead of a
  system wide one.

Submitted by: arthurmesh@gmail.com, obrien
Obtained from: Juniper Networks
Reviewed by: obrien
2013-07-29 20:26:27 +00:00
Andriy Gapon
a29cc9a34b Revert r253748,253749
This WIP should not have been committed yet.

Pointyhat to:	avg
2013-07-28 18:44:17 +00:00
Andriy Gapon
366d8bfb7b put contents of cpu.h under _KERNEL
no userland-serviceable parts inside

MFC after:	20 days
2013-07-28 18:32:27 +00:00
Andriy Gapon
a69e8d609e x86: detect mwait capabilities and extensions, when present
Reviewed by:	kib (earlier amd64-only version)
MFC after:	2 weeks
2013-07-28 17:54:42 +00:00
Jeff Roberson
2f84c08eee - Use kmem_malloc rather than kmem_alloc() for GDT/LDT/tss allocations etc.
This eliminates some unusual uses of that API in favor of more typical
   uses of kmem_malloc().

Discussed with:	kib/alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-07-26 19:06:14 +00:00
Neel Natu
84e169c6c3 Add support for emulation of the "or r/m, imm8" instruction.
Submitted by:	Zhixiang Yu (zxyu.core@gmail.com)
Obtained from:	GSoC 2013 (AHCI device emulation for bhyve)
2013-07-23 23:43:00 +00:00
Neel Natu
113326a772 Fix a bug introduced in r252646 that causes a page with the PG_PTE_PAT bit set
to be interpreted as a superpage. This is because PG_PTE_PAT is at the same
bit position in PTE as PG_PS is in a PDE.

This caused a number of regressions on amd64 systems: panic when starting
X applications, freeze during shutdown etc.

Pointy hat to:	me
Tested by: gperez@entel.upc.edu, joel, dumbbell
Reviewed by: kib
2013-07-23 22:17:00 +00:00
Konstantin Belousov
0f6bcda4cd MFi386: add ddb "show sysregs" command.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-15 06:30:57 +00:00
Konstantin Belousov
0cdd261571 Clear m->object for the page taken from the delayed free list for
reuse as the pv chink page in reclaim_pv_chunk().  Having non-NULL
m->object is wrong for page not owned by an object and confuses both
vm_page_free_toq() and vm_page_remove() when the page is freed later.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2013-07-10 09:24:03 +00:00
Xin LI
1fdeb1651c Import HighPoint DC Series Data Center HBA (DC7280 and R750) driver.
This driver works for FreeBSD/i386 and FreeBSD/amd64 platforms.

Many thanks to HighPoint for providing this driver.

MFC after:	1 day
2013-07-06 07:49:41 +00:00
Neel Natu
be28275d00 If a superpage mapping is being removed then we need to ignore the PG_PDE_PAT
bit when looking up the vm_page associated with the superpage's physical
address.

If the caching attribute for the mapping is write combining or write protected
then the PG_PDE_PAT bit will be set and thus cause an 'off-by-one' error
when looking up the vm_page.

Fix this by using the PG_PS_FRAME mask to compute the physical address for
a superpage mapping instead of PG_FRAME.

This is a theoretical issue at this point since non-writeback attributes are
currently used only for fictitious mappings and fictitious mappings are not
subject to promotion.

Discussed with:	alc, kib
MFC after:	2 weeks
2013-07-03 23:21:25 +00:00
Neel Natu
de16308c48 Verify that all bytes in the instruction buffer are consumed during decoding.
Suggested by:	grehan
2013-07-03 23:05:17 +00:00
Peter Grehan
e60f5d779e Ignore guest PAT settings by default in EPT mappings.
From experimentation, other hypervisors also do this.

Diagnosed by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
2013-07-01 20:05:43 +00:00
Konstantin Belousov
70a7dd5d5b Fix issues with zeroing and fetching the counters, on x86 and ppc64.
Issues were noted by Bruce Evans and are present on all architectures.

On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32.  If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.

On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot.  The effect would be a counter going off by
arbitrary value after zeroing.  Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive.  On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.

PowerPC64 is treated the same as amd64.

For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching.  It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm).  If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.

Noted by:  bde
Sponsored by:	The FreeBSD Foundation
2013-07-01 02:48:27 +00:00
Peter Grehan
560d5eda2c Make sure all CPUID values are handled, instead of exiting the
bhyve process when an unhandled one is encountered.

Hide some additional capabilities from the guest (e.g. debug store).

This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on
AP spinup (where CPUID is used when sync'ing the TSCs) and the
issue with the Java build where CPUIDs are issued from a guest
userspace.

Submitted by:	tycho nightingale at pluribusnetworks com
Reviewed by:	neel
Reported by:	many
2013-06-28 06:05:33 +00:00
Jung-uk Kim
b1ddd13145 Move definitions required by userland applications out of acpica_machdep.h. 2013-06-27 00:22:40 +00:00
Konstantin Belousov
9dbb63fe03 Allow immediate operand.
Sponsored by:	The FreeBSD Foundation
2013-06-20 14:30:04 +00:00
Konstantin Belousov
c788f92509 Some clarifications and updates for the comments, mostly retrieved
from Bruce Evans.  Trim the trailing spaces.

MFC after:	1 week
2013-06-19 05:05:16 +00:00
Sergey Kandaurov
1e2751ddeb Fix a gcc warning uncovered after r251745.
Reported by:	Sergey V. Dyatko
Reviewed by:	neel
2013-06-18 23:31:09 +00:00
Justin T. Gibbs
a8f6ac0573 Upgrade Xen interface headers to Xen 4.2.1.
Move FreeBSD from interface version 0x00030204 to 0x00030208.
Updates are required to our grant table implementation before we
can bump this further.

sys/xen/hvm.h:
	Replace the implementation of hvm_get_parameter(), formerly located
	in sys/xen/interface/hvm/params.h.  Linux has a similar file which
	primarily stores this function.

sys/xen/xenstore/xenstore.c:
	Include new xen/hvm.h header file to get hvm_get_parameter().

sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
	Correctly protect function definition and variables from being
	included into assembly files in xen-os.h

	Xen memory barriers are now prefixed with "xen_" to avoid conflicts
	with OS native primatives.  Define Xen memory barriers in terms of
	the native FreeBSD primatives.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	Roger Pau Monné
Tested by:	Roger Pau Monné
Obtained from:	Roger Pau Monné (bug fixes)
2013-06-14 23:43:44 +00:00
Sergey Kandaurov
82f2974a69 Replace cpusetffs_obj with CPU_FFS, missed in r251703.
Reported by:	bdrewery, O. Hartmann
2013-06-14 10:26:38 +00:00
Neel Natu
8f1664b724 Remove unused macros PTESHIFT, PDESHIFT, PDPESHIFT and PML4ESHIFT.
Reviewed by:	alc
2013-06-14 00:03:43 +00:00
Jeff Roberson
17a2737732 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
Konstantin Belousov
9138579845 Assert that interrupts are enabled in the trap handlers on x86 before
calling generic code to deliver signals.

Discussed with:	bde
Tested by:	pho
MFC after:	1 week
2013-06-03 17:40:05 +00:00
Konstantin Belousov
cb5bfd1240 Use slightly more idiomatic expression to get the address of array.
Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:39:39 +00:00
Konstantin Belousov
87b94d9a92 The _MC_HASFPXSTATE and _MC_IA32_HASFPXSTATE flags have the same bit
value on purpose, but the ia32 context handling code is logically more
correct to use the _MC_IA32_HASFPXSTATE name for the flag.

Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:36:46 +00:00
Konstantin Belousov
e9249a80f6 The ia32_get_mcontext() does not need to set PCB_FULL_IRET. The
usermode context state is not changed by the get operation, and
get_mcontext() does not require full iret as well.

Tested by:	dim, pgj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:31:15 +00:00
Konstantin Belousov
80b5691a76 When reporting the fault details, also print %rsp.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:29:20 +00:00
Konstantin Belousov
6806ce6ec8 When handling an exception from the attempt from loading the faulting
context on return from the trap handler, re-enable the interrupts on
i386 and amd64.  The trap return path have to disable interrupts since
the sequence of loading the machine state is not atomic.  The trap()
function which transfers the control to the special handler would
enable the interrupt, but an iret loads the previous eflags with PSL_I
clear.  Then, the special handler calls trap() on its own, which now
sees the original eflags with PSL_I set and does not enable
interrupts.

The end result is that signal delivery and process exiting code could
be executed with interrupts disabled, which is generally wrong and
triggers several assertions.

For amd64, the interrupts are enabled conditionally based on PSL_I in
the eflags of the outer frame, as it is already done for
doreti_iret_fault.  For i386, the interrupts are enabled
unconditionally, the ast loop could have opened a window with
interrupts enabled just before the iret anyway.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-27 18:26:08 +00:00
Achim Leubner
dce93cd06d Driver 'aacraid' added. Supports Adaptec by PMC RAID controller families Series 6, 7, 8 and upcoming products. Older Adaptec RAID controller families are supported by the 'aac' driver.
Approved by:	scottl (mentor)
2013-05-24 09:22:43 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4f493a25f4 Fix the hardware watchpoints on SMP amd64. Load the updated %dr
registers also on other CPUs, besides the CPU which happens to execute
the ddb.  The debugging registers are stored in the pcpu area,
together with the command which is executed by the IPI stop handler
upon resume.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:24:32 +00:00
Konstantin Belousov
ea38ca6ea5 Add amd64-specific ddb command 'show phys2dmap', which calculates the
address in the direct map corresponding to the given physical address.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:07:12 +00:00
Marcel Moolenaar
cb34ed4434 Add basic support for FDT to i386 & amd64. This change includes:
1.  Common headers for fdt.h and ofw_machdep.h under x86/include
    with indirections under i386/include and amd64/include.
2.  New modinfo for loader provided FDT blob.
3.  Common x86_init_fdt() called from hammer_time() on amd64 and
    init386() on i386.
4.  Split-off FDT specific low-level console functions from FDT
    bus methods for the uart(4) driver. The low-level console
    logic has been moved to uart_cpu_fdt.c and is used for arm,
    mips & powerpc only. The FDT bus methods are shared across
    all architectures.
5.  Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the
    fdt_pic_table[] arrays. Both are empty right now.

FDT addresses are I/O ports on x86. Since the core FDT code does
not handle different address spaces, adding support for both I/O
ports and memory addresses requires some thought and discussion.
It may be better to use a compile-time option that controls this.

Obtained from:	Juniper Networks, Inc.
2013-05-21 03:05:49 +00:00
Ed Schouten
31ffd8039d Improve readability of static assertions for OFFSET_* macros.
Instead of doing all sorts of weird casting of constants to
pointer-pointers, simply use the standard C offsetof() macro to obtain
the offset of the respective fields in the structures.
2013-05-13 21:47:17 +00:00
Peter Wemm
dda759d344 Tidy up some CVS workarounds. 2013-05-12 01:53:47 +00:00
Rui Paulo
009b3a1c3f Fix several standard extended feature bits.
Submitted by:	Oliver Pinter <oliver.pntr at gmail.com>
2013-05-11 01:31:51 +00:00
Neel Natu
0acb0d84c5 Support array-type of stats in bhyve.
An array-type stat in vmm.ko is defined as follows:
VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu");

It is incremented as follows:
vmm_stat_array_incr(vm, vcpuid, IPIS_SENT, array_index, 1);

And output of 'bhyvectl --get-stats' looks like:
ipis sent to vcpu[0]     3114
ipis sent to vcpu[1]     0

Reviewed by:	grehan
Obtained from:	NetApp
2013-05-10 02:59:49 +00:00