This is a workaround to hide the fact that we do not have any code to
demote a superpage mapping before we unmap a single page that is part
of the superpage.
r254466 increased the KVA from 512GB to 2TB which requires 4 PDP pages as
opposed to a single one before the change. This broke minidumpsys() since
it assumed that the entire KVA could be addressed via a single PDP page.
Fix this by obtaining the address of the PDP page from the PML4 entry
associated with the KVA being dumped.
Reported by: pho
Submitted by: kib
Pointy hat to: neel
blocks on a pmap lock, pmap_release() might proceed in parallel and
destroy the pmap mutex, since unlocked pv lock allows to remove pv
entry owned by the pmap.
For now, gate the pmap_release() on write-locked pvh_global_lock.
Since pmap_ts_release() does not unlock the global lock,
pmap_release() would not destroy pmap mutex until the
pmap_ts_referenced() finished. We cannot enter pmap_ts_referenced()
and encounter a pv entry for the destroyed pmap if pmap_release()
passed the global lock gate, since pmap_remove_pages() would finish
earlier.
Reported by: jeff, pho
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
used by the tools in base systems and with sandboxing more and more tools
the usage should only increase.
Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Sponsored by: Google Summer of Code 2013
MFC after: 1 month
Bump up the KVA size proportionally from 512GB to 2TB.
The number of page table pages used by the direct map is now calculated at
run time based on 'Maxmem'. This means the small memory systems will not
see any additional tax in terms of page table pages for the direct map.
However all amd64 systems, regardless of the memory size, will use 3 more
pages to accomodate the bump in the KVA size.
More details available here:
http://lists.freebsd.org/pipermail/freebsd-hackers/2013-June/043015.htmlhttp://lists.freebsd.org/pipermail/freebsd-current/2013-July/043143.html
Tested with the following configurations:
- Sandybridge server with 64GB of memory.
- bhyve VM with 64MB of memory.
- bhyve VM with a 8GB of memory with the memory segment above 4GB cuddling
right up against the 4TB maximum memory limit.
Discussed on: hackers@, current@
Submitted by: Chris Torek (torek@torek.net)
The variable _logname_valid is not exported via the version script;
therefore, change C and i386/amd64 assembler code to remove indirection
(which allowed interposition). This makes the code slightly smaller and
faster.
Also, remove #define PIC_GOT from i386/amd64 in !PIC mode. Without PIC,
there is no place containing the address of each variable, so there is no
possible definition for PIC_GOT.
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.
Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.
Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().
Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.
In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.
vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
- update powerpc/GENERIC64 as well, suggested by mdf
- update comments so that they make sense after the change, suggested by
jhb
X-MFC after: never (change specific to head)
This is a cosmetic change but it does help with a proposed change to increase
the maximum size of physical memory supported on amd64 platforms.
Submitted by: Chris Torek (torek@torek.net)
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
pvh_global_lock. This allows the method to be executed in parallel,
avoiding undue contention on the pvh_global_lock for the multithreaded
pagedaemon.
The pmap_ts_referenced() function has to inspect the page mappings for
several pmaps, which need to be locked while pv list lock is owned.
This contradicts to the lock order, where pmap lock is before pv list
lock. Introduce the generation count for the pv list of the page or
superpage, which indicate any change in the pv list, and, as usual,
perform restart of the iteration if generation changed while pv lock
was dropped for blocking acquire of a pmap lock.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
KDB_TRACE is not an alternative to DDB/etc, they are complementary.
So I do not see any reason to not enable KDB_TRACE by default.
X-MFC after: never (change specific to head)
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
huge pages in the kernel's address space. This works around several
asserts from pmap_demote_pde_locked that did not apply and gave false
warnings.
Discovered by: pho
Reviewed by: alc
Sponsored by: EMC / Isilon Storage Division
architectural state on CR vmexits by guaranteeing
that EFER, CR0 and the VMCS entry controls are
all in sync when transitioning to IA-32e mode.
Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
of unloading the module while VMs existed. This would
result in EBUSY, but would prevent further operations
on VMs resulting in the module being impossible to
unload.
Submitted by: Tycho Nightingale (tycho.nightingale <at> plurisbusnetworks.com)
Reviewed by: grehan, neel
This was exposed with AP spinup of Linux, and
booting OpenBSD, where the CR0 register is unconditionally
written to prior to the longjump to enter protected
mode. The CR-vmexit handling was not updating CPU state which
resulted in a vmentry failure with invalid guest state.
A follow-on submit will fix the CPU state issue, but this
fix prevents the CR-vmexit prior to entering protected
mode by properly initializing and maintaining CR* state.
Reviewed by: neel
Reported by: Gopakumar.T @ netapp
* Make Yarrow an optional kernel component -- enabled by "YARROW_RNG" option.
The files sha2.c, hash.c, randomdev_soft.c and yarrow.c comprise yarrow.
* random(4) device doesn't really depend on rijndael-*. Yarrow, however, does.
* Add random_adaptors.[ch] which is basically a store of random_adaptor's.
random_adaptor is basically an adapter that plugs in to random(4).
random_adaptor can only be plugged in to random(4) very early in bootup.
Unplugging random_adaptor from random(4) is not supported, and is probably a
bad idea anyway, due to potential loss of entropy pools.
We currently have 3 random_adaptors:
+ yarrow
+ rdrand (ivy.c)
+ nehemeiah
* Remove platform dependent logic from probe.c, and move it into
corresponding registration routines of each random_adaptor provider.
probe.c doesn't do anything other than picking a specific random_adaptor
from a list of registered ones.
* If the kernel doesn't have any random_adaptor adapters present then the
creation of /dev/random is postponed until next random_adaptor is kldload'ed.
* Fix randomdev_soft.c to refer to its own random_adaptor, instead of a
system wide one.
Submitted by: arthurmesh@gmail.com, obrien
Obtained from: Juniper Networks
Reviewed by: obrien
This eliminates some unusual uses of that API in favor of more typical
uses of kmem_malloc().
Discussed with: kib/alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
to be interpreted as a superpage. This is because PG_PTE_PAT is at the same
bit position in PTE as PG_PS is in a PDE.
This caused a number of regressions on amd64 systems: panic when starting
X applications, freeze during shutdown etc.
Pointy hat to: me
Tested by: gperez@entel.upc.edu, joel, dumbbell
Reviewed by: kib
reuse as the pv chink page in reclaim_pv_chunk(). Having non-NULL
m->object is wrong for page not owned by an object and confuses both
vm_page_free_toq() and vm_page_remove() when the page is freed later.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
bit when looking up the vm_page associated with the superpage's physical
address.
If the caching attribute for the mapping is write combining or write protected
then the PG_PDE_PAT bit will be set and thus cause an 'off-by-one' error
when looking up the vm_page.
Fix this by using the PG_PS_FRAME mask to compute the physical address for
a superpage mapping instead of PG_FRAME.
This is a theoretical issue at this point since non-writeback attributes are
currently used only for fictitious mappings and fictitious mappings are not
subject to promotion.
Discussed with: alc, kib
MFC after: 2 weeks
Issues were noted by Bruce Evans and are present on all architectures.
On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32. If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.
On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot. The effect would be a counter going off by
arbitrary value after zeroing. Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive. On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.
PowerPC64 is treated the same as amd64.
For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching. It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm). If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.
Noted by: bde
Sponsored by: The FreeBSD Foundation
bhyve process when an unhandled one is encountered.
Hide some additional capabilities from the guest (e.g. debug store).
This fixes the issue with FreeBSD 9.1 MP guests exiting the VM on
AP spinup (where CPUID is used when sync'ing the TSCs) and the
issue with the Java build where CPUIDs are issued from a guest
userspace.
Submitted by: tycho nightingale at pluribusnetworks com
Reviewed by: neel
Reported by: many
Move FreeBSD from interface version 0x00030204 to 0x00030208.
Updates are required to our grant table implementation before we
can bump this further.
sys/xen/hvm.h:
Replace the implementation of hvm_get_parameter(), formerly located
in sys/xen/interface/hvm/params.h. Linux has a similar file which
primarily stores this function.
sys/xen/xenstore/xenstore.c:
Include new xen/hvm.h header file to get hvm_get_parameter().
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
Correctly protect function definition and variables from being
included into assembly files in xen-os.h
Xen memory barriers are now prefixed with "xen_" to avoid conflicts
with OS native primatives. Define Xen memory barriers in terms of
the native FreeBSD primatives.
Sponsored by: Spectra Logic Corporation
Reviewed by: Roger Pau Monné
Tested by: Roger Pau Monné
Obtained from: Roger Pau Monné (bug fixes)
value on purpose, but the ia32 context handling code is logically more
correct to use the _MC_IA32_HASFPXSTATE name for the flag.
Tested by: dim, pgj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week