Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*-
|
|
|
|
* Copyright (c) 2002-2006 Rice University
|
|
|
|
* Copyright (c) 2007 Alan L. Cox <alc@cs.rice.edu>
|
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* This software was developed for the FreeBSD Project by Alan L. Cox,
|
|
|
|
* Olivier Crameri, Peter Druschel, Sitaram Iyer, and Juan Navarro.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
|
|
|
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
|
|
|
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
|
|
|
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
|
|
|
* HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
|
|
|
|
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
|
|
|
|
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
|
|
|
|
* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
|
|
|
|
* AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
|
|
|
|
* WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
|
|
|
* POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.
From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.
Reviewed by: kib
Discussed with: jhb
2011-11-16 16:46:09 +00:00
|
|
|
/*
|
|
|
|
* Physical memory system implementation
|
|
|
|
*
|
|
|
|
* Any external functions defined by this module are only to be used by the
|
|
|
|
* virtual memory system.
|
|
|
|
*/
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
|
|
|
#include "opt_ddb.h"
|
2013-02-14 19:38:04 +00:00
|
|
|
#include "opt_vm.h"
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
|
|
|
#include <sys/lock.h>
|
|
|
|
#include <sys/kernel.h>
|
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mutex.h>
|
2013-05-13 15:40:51 +00:00
|
|
|
#include <sys/proc.h>
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#include <sys/queue.h>
|
2014-07-09 08:12:58 +00:00
|
|
|
#include <sys/rwlock.h>
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#include <sys/sbuf.h>
|
|
|
|
#include <sys/sysctl.h>
|
2014-07-09 08:12:58 +00:00
|
|
|
#include <sys/tree.h>
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#include <sys/vmmeter.h>
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
#include <sys/seq.h>
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
|
|
|
|
#include <vm/vm.h>
|
|
|
|
#include <vm/vm_param.h>
|
|
|
|
#include <vm/vm_kern.h>
|
|
|
|
#include <vm/vm_object.h>
|
|
|
|
#include <vm/vm_page.h>
|
|
|
|
#include <vm/vm_phys.h>
|
|
|
|
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
#include <vm/vm_domain.h>
|
|
|
|
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
_Static_assert(sizeof(long) * NBBY >= VM_PHYSSEG_MAX,
|
|
|
|
"Too many physsegs.");
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2010-07-27 20:33:50 +00:00
|
|
|
struct mem_affinity *mem_affinity;
|
2015-05-08 00:56:56 +00:00
|
|
|
int *mem_locality;
|
2016-04-09 13:58:04 +00:00
|
|
|
#endif
|
2010-07-27 20:33:50 +00:00
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
int vm_ndomains = 1;
|
|
|
|
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
struct vm_phys_seg vm_phys_segs[VM_PHYSSEG_MAX];
|
|
|
|
int vm_phys_nsegs;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2014-07-09 08:12:58 +00:00
|
|
|
struct vm_phys_fictitious_seg;
|
|
|
|
static int vm_phys_fictitious_cmp(struct vm_phys_fictitious_seg *,
|
|
|
|
struct vm_phys_fictitious_seg *);
|
|
|
|
|
|
|
|
RB_HEAD(fict_tree, vm_phys_fictitious_seg) vm_phys_fictitious_tree =
|
|
|
|
RB_INITIALIZER(_vm_phys_fictitious_tree);
|
|
|
|
|
|
|
|
struct vm_phys_fictitious_seg {
|
|
|
|
RB_ENTRY(vm_phys_fictitious_seg) node;
|
|
|
|
/* Memory region data */
|
2012-05-12 20:42:56 +00:00
|
|
|
vm_paddr_t start;
|
|
|
|
vm_paddr_t end;
|
|
|
|
vm_page_t first_page;
|
2014-07-09 08:12:58 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
RB_GENERATE_STATIC(fict_tree, vm_phys_fictitious_seg, node,
|
|
|
|
vm_phys_fictitious_cmp);
|
|
|
|
|
|
|
|
static struct rwlock vm_phys_fictitious_reg_lock;
|
2013-08-07 00:20:30 +00:00
|
|
|
MALLOC_DEFINE(M_FICT_PAGES, "vm_fictitious", "Fictitious VM pages");
|
2012-05-12 20:42:56 +00:00
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
static struct vm_freelist
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_phys_free_queues[MAXMEMDOM][VM_NFREELIST][VM_NFREEPOOL][VM_NFREEORDER];
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
static int vm_nfreelists;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Provides the mapping from VM_FREELIST_* to free list indices (flind).
|
|
|
|
*/
|
|
|
|
static int vm_freelist_to_flind[VM_NFREELIST];
|
|
|
|
|
|
|
|
CTASSERT(VM_FREELIST_DEFAULT == 0);
|
|
|
|
|
|
|
|
#ifdef VM_FREELIST_ISADMA
|
|
|
|
#define VM_ISADMA_BOUNDARY 16777216
|
|
|
|
#endif
|
|
|
|
#ifdef VM_FREELIST_DMA32
|
|
|
|
#define VM_DMA32_BOUNDARY ((vm_paddr_t)1 << 32)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Enforce the assumptions made by vm_phys_add_seg() and vm_phys_init() about
|
|
|
|
* the ordering of the free list boundaries.
|
|
|
|
*/
|
|
|
|
#if defined(VM_ISADMA_BOUNDARY) && defined(VM_LOWMEM_BOUNDARY)
|
|
|
|
CTASSERT(VM_ISADMA_BOUNDARY < VM_LOWMEM_BOUNDARY);
|
|
|
|
#endif
|
|
|
|
#if defined(VM_LOWMEM_BOUNDARY) && defined(VM_DMA32_BOUNDARY)
|
|
|
|
CTASSERT(VM_LOWMEM_BOUNDARY < VM_DMA32_BOUNDARY);
|
|
|
|
#endif
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
|
|
|
static int cnt_prezero;
|
|
|
|
SYSCTL_INT(_vm_stats_misc, OID_AUTO, cnt_prezero, CTLFLAG_RD,
|
|
|
|
&cnt_prezero, 0, "The number of physical pages prezeroed at idle time");
|
|
|
|
|
|
|
|
static int sysctl_vm_phys_free(SYSCTL_HANDLER_ARGS);
|
|
|
|
SYSCTL_OID(_vm, OID_AUTO, phys_free, CTLTYPE_STRING | CTLFLAG_RD,
|
|
|
|
NULL, 0, sysctl_vm_phys_free, "A", "Phys Free Info");
|
|
|
|
|
|
|
|
static int sysctl_vm_phys_segs(SYSCTL_HANDLER_ARGS);
|
|
|
|
SYSCTL_OID(_vm, OID_AUTO, phys_segs, CTLTYPE_STRING | CTLFLAG_RD,
|
|
|
|
NULL, 0, sysctl_vm_phys_segs, "A", "Phys Seg Info");
|
|
|
|
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2015-05-08 00:56:56 +00:00
|
|
|
static int sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS);
|
|
|
|
SYSCTL_OID(_vm, OID_AUTO, phys_locality, CTLTYPE_STRING | CTLFLAG_RD,
|
|
|
|
NULL, 0, sysctl_vm_phys_locality, "A", "Phys Locality Info");
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
#endif
|
2015-05-08 00:56:56 +00:00
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
SYSCTL_INT(_vm, OID_AUTO, ndomains, CTLFLAG_RD,
|
|
|
|
&vm_ndomains, 0, "Number of physical memory domains available.");
|
2010-07-27 20:33:50 +00:00
|
|
|
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
/*
|
|
|
|
* Default to first-touch + round-robin.
|
|
|
|
*/
|
|
|
|
static struct mtx vm_default_policy_mtx;
|
|
|
|
MTX_SYSINIT(vm_default_policy, &vm_default_policy_mtx, "default policy mutex",
|
|
|
|
MTX_DEF);
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
static struct vm_domain_policy vm_default_policy =
|
|
|
|
VM_DOMAIN_POLICY_STATIC_INITIALISER(VM_POLICY_FIRST_TOUCH_ROUND_ROBIN, 0);
|
|
|
|
#else
|
|
|
|
/* Use round-robin so the domain policy code will only try once per allocation */
|
|
|
|
static struct vm_domain_policy vm_default_policy =
|
|
|
|
VM_DOMAIN_POLICY_STATIC_INITIALISER(VM_POLICY_ROUND_ROBIN, 0);
|
|
|
|
#endif
|
|
|
|
|
2013-05-03 18:58:37 +00:00
|
|
|
static vm_page_t vm_phys_alloc_domain_pages(int domain, int flind, int pool,
|
|
|
|
int order);
|
2015-12-19 18:42:50 +00:00
|
|
|
static vm_page_t vm_phys_alloc_seg_contig(struct vm_phys_seg *seg,
|
|
|
|
u_long npages, vm_paddr_t low, vm_paddr_t high, u_long alignment,
|
|
|
|
vm_paddr_t boundary);
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
static void _vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end, int domain);
|
|
|
|
static void vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
static int vm_phys_paddr_to_segind(vm_paddr_t pa);
|
|
|
|
static void vm_phys_split_pages(vm_page_t m, int oind, struct vm_freelist *fl,
|
|
|
|
int order);
|
|
|
|
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
static int
|
|
|
|
sysctl_vm_default_policy(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
char policy_name[32];
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mtx_lock(&vm_default_policy_mtx);
|
|
|
|
|
|
|
|
/* Map policy to output string */
|
|
|
|
switch (vm_default_policy.p.policy) {
|
|
|
|
case VM_POLICY_FIRST_TOUCH:
|
|
|
|
strcpy(policy_name, "first-touch");
|
|
|
|
break;
|
|
|
|
case VM_POLICY_FIRST_TOUCH_ROUND_ROBIN:
|
|
|
|
strcpy(policy_name, "first-touch-rr");
|
|
|
|
break;
|
|
|
|
case VM_POLICY_ROUND_ROBIN:
|
|
|
|
default:
|
|
|
|
strcpy(policy_name, "rr");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mtx_unlock(&vm_default_policy_mtx);
|
|
|
|
|
|
|
|
error = sysctl_handle_string(oidp, &policy_name[0],
|
|
|
|
sizeof(policy_name), req);
|
|
|
|
if (error != 0 || req->newptr == NULL)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
mtx_lock(&vm_default_policy_mtx);
|
|
|
|
/* Set: match on the subset of policies that make sense as a default */
|
|
|
|
if (strcmp("first-touch-rr", policy_name) == 0) {
|
|
|
|
vm_domain_policy_set(&vm_default_policy,
|
|
|
|
VM_POLICY_FIRST_TOUCH_ROUND_ROBIN, 0);
|
|
|
|
} else if (strcmp("first-touch", policy_name) == 0) {
|
|
|
|
vm_domain_policy_set(&vm_default_policy,
|
|
|
|
VM_POLICY_FIRST_TOUCH, 0);
|
|
|
|
} else if (strcmp("rr", policy_name) == 0) {
|
|
|
|
vm_domain_policy_set(&vm_default_policy,
|
|
|
|
VM_POLICY_ROUND_ROBIN, 0);
|
|
|
|
} else {
|
|
|
|
error = EINVAL;
|
|
|
|
goto finish;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = 0;
|
|
|
|
finish:
|
|
|
|
mtx_unlock(&vm_default_policy_mtx);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
SYSCTL_PROC(_vm, OID_AUTO, default_policy, CTLTYPE_STRING | CTLFLAG_RW,
|
|
|
|
0, 0, sysctl_vm_default_policy, "A",
|
|
|
|
"Default policy (rr, first-touch, first-touch-rr");
|
|
|
|
|
2014-07-09 08:12:58 +00:00
|
|
|
/*
|
|
|
|
* Red-black tree helpers for vm fictitious range management.
|
|
|
|
*/
|
|
|
|
static inline int
|
|
|
|
vm_phys_fictitious_in_range(struct vm_phys_fictitious_seg *p,
|
|
|
|
struct vm_phys_fictitious_seg *range)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(range->start != 0 && range->end != 0,
|
|
|
|
("Invalid range passed on search for vm_fictitious page"));
|
|
|
|
if (p->start >= range->end)
|
|
|
|
return (1);
|
|
|
|
if (p->start < range->start)
|
|
|
|
return (-1);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
vm_phys_fictitious_cmp(struct vm_phys_fictitious_seg *p1,
|
|
|
|
struct vm_phys_fictitious_seg *p2)
|
|
|
|
{
|
|
|
|
|
|
|
|
/* Check if this is a search for a page */
|
|
|
|
if (p1->end == 0)
|
|
|
|
return (vm_phys_fictitious_in_range(p1, p2));
|
|
|
|
|
|
|
|
KASSERT(p2->end != 0,
|
|
|
|
("Invalid range passed as second parameter to vm fictitious comparison"));
|
|
|
|
|
|
|
|
/* Searching to add a new range */
|
|
|
|
if (p1->end <= p2->start)
|
|
|
|
return (-1);
|
|
|
|
if (p1->start >= p2->end)
|
|
|
|
return (1);
|
|
|
|
|
|
|
|
panic("Trying to add overlapping vm fictitious ranges:\n"
|
|
|
|
"[%#jx:%#jx] and [%#jx:%#jx]", (uintmax_t)p1->start,
|
|
|
|
(uintmax_t)p1->end, (uintmax_t)p2->start, (uintmax_t)p2->end);
|
|
|
|
}
|
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
static __inline int
|
|
|
|
vm_rr_selectdomain(void)
|
|
|
|
{
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2013-05-13 15:40:51 +00:00
|
|
|
struct thread *td;
|
|
|
|
|
|
|
|
td = curthread;
|
|
|
|
|
|
|
|
td->td_dom_rr_idx++;
|
|
|
|
td->td_dom_rr_idx %= vm_ndomains;
|
|
|
|
return (td->td_dom_rr_idx);
|
|
|
|
#else
|
|
|
|
return (0);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
/*
|
|
|
|
* Initialise a VM domain iterator.
|
|
|
|
*
|
|
|
|
* Check the thread policy, then the proc policy,
|
|
|
|
* then default to the system policy.
|
|
|
|
*
|
|
|
|
* Later on the various layers will have this logic
|
|
|
|
* plumbed into them and the phys code will be explicitly
|
|
|
|
* handed a VM domain policy to use.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vm_policy_iterator_init(struct vm_domain_iterator *vi)
|
|
|
|
{
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
struct vm_domain_policy lcl;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
vm_domain_iterator_init(vi);
|
|
|
|
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
/* Copy out the thread policy */
|
|
|
|
vm_domain_policy_localcopy(&lcl, &curthread->td_vm_dom_policy);
|
|
|
|
if (lcl.p.policy != VM_POLICY_NONE) {
|
|
|
|
/* Thread policy is present; use it */
|
|
|
|
vm_domain_iterator_set_policy(vi, &lcl);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
vm_domain_policy_localcopy(&lcl,
|
|
|
|
&curthread->td_proc->p_vm_dom_policy);
|
|
|
|
if (lcl.p.policy != VM_POLICY_NONE) {
|
|
|
|
/* Process policy is present; use it */
|
|
|
|
vm_domain_iterator_set_policy(vi, &lcl);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
/* Use system default policy */
|
|
|
|
vm_domain_iterator_set_policy(vi, &vm_default_policy);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
vm_policy_iterator_finish(struct vm_domain_iterator *vi)
|
|
|
|
{
|
|
|
|
|
|
|
|
vm_domain_iterator_cleanup(vi);
|
|
|
|
}
|
|
|
|
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
boolean_t
|
|
|
|
vm_phys_domain_intersects(long mask, vm_paddr_t low, vm_paddr_t high)
|
|
|
|
{
|
|
|
|
struct vm_phys_seg *s;
|
|
|
|
int idx;
|
|
|
|
|
|
|
|
while ((idx = ffsl(mask)) != 0) {
|
|
|
|
idx--; /* ffsl counts from 1 */
|
|
|
|
mask &= ~(1UL << idx);
|
|
|
|
s = &vm_phys_segs[idx];
|
|
|
|
if (low < s->end && high > s->start)
|
|
|
|
return (TRUE);
|
|
|
|
}
|
|
|
|
return (FALSE);
|
|
|
|
}
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
|
|
|
* Outputs the state of the physical memory allocator, specifically,
|
|
|
|
* the amount of physical memory in each free list.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sysctl_vm_phys_free(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
struct sbuf sbuf;
|
|
|
|
struct vm_freelist *fl;
|
2013-05-13 15:40:51 +00:00
|
|
|
int dom, error, flind, oind, pind;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2011-01-27 00:34:12 +00:00
|
|
|
error = sysctl_wire_old_buffer(req, 0);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2013-05-13 15:40:51 +00:00
|
|
|
sbuf_new_for_sysctl(&sbuf, NULL, 128 * vm_ndomains, req);
|
|
|
|
for (dom = 0; dom < vm_ndomains; dom++) {
|
2013-10-10 16:11:45 +00:00
|
|
|
sbuf_printf(&sbuf,"\nDOMAIN %d:\n", dom);
|
2013-05-13 15:40:51 +00:00
|
|
|
for (flind = 0; flind < vm_nfreelists; flind++) {
|
2013-10-10 16:11:45 +00:00
|
|
|
sbuf_printf(&sbuf, "\nFREE LIST %d:\n"
|
2013-05-13 15:40:51 +00:00
|
|
|
"\n ORDER (SIZE) | NUMBER"
|
|
|
|
"\n ", flind);
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++)
|
|
|
|
sbuf_printf(&sbuf, " | POOL %d", pind);
|
|
|
|
sbuf_printf(&sbuf, "\n-- ");
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++)
|
|
|
|
sbuf_printf(&sbuf, "-- -- ");
|
|
|
|
sbuf_printf(&sbuf, "--\n");
|
|
|
|
for (oind = VM_NFREEORDER - 1; oind >= 0; oind--) {
|
|
|
|
sbuf_printf(&sbuf, " %2d (%6dK)", oind,
|
|
|
|
1 << (PAGE_SHIFT - 10 + oind));
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++) {
|
|
|
|
fl = vm_phys_free_queues[dom][flind][pind];
|
2013-10-10 16:11:45 +00:00
|
|
|
sbuf_printf(&sbuf, " | %6d",
|
2013-05-13 15:40:51 +00:00
|
|
|
fl[oind].lcnt);
|
|
|
|
}
|
|
|
|
sbuf_printf(&sbuf, "\n");
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2010-09-16 16:13:12 +00:00
|
|
|
error = sbuf_finish(&sbuf);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
sbuf_delete(&sbuf);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Outputs the set of physical memory segments.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sysctl_vm_phys_segs(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
struct sbuf sbuf;
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
int error, segind;
|
|
|
|
|
2011-01-27 00:34:12 +00:00
|
|
|
error = sysctl_wire_old_buffer(req, 0);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2010-09-16 16:13:12 +00:00
|
|
|
sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
for (segind = 0; segind < vm_phys_nsegs; segind++) {
|
|
|
|
sbuf_printf(&sbuf, "\nSEGMENT %d:\n\n", segind);
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
sbuf_printf(&sbuf, "start: %#jx\n",
|
|
|
|
(uintmax_t)seg->start);
|
|
|
|
sbuf_printf(&sbuf, "end: %#jx\n",
|
|
|
|
(uintmax_t)seg->end);
|
2010-07-27 20:33:50 +00:00
|
|
|
sbuf_printf(&sbuf, "domain: %d\n", seg->domain);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
sbuf_printf(&sbuf, "free list: %p\n", seg->free_queues);
|
|
|
|
}
|
2010-09-16 16:13:12 +00:00
|
|
|
error = sbuf_finish(&sbuf);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
sbuf_delete(&sbuf);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2015-05-08 00:56:56 +00:00
|
|
|
/*
|
|
|
|
* Return affinity, or -1 if there's no affinity information.
|
|
|
|
*/
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
int
|
2015-05-08 00:56:56 +00:00
|
|
|
vm_phys_mem_affinity(int f, int t)
|
|
|
|
{
|
|
|
|
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2015-05-08 00:56:56 +00:00
|
|
|
if (mem_locality == NULL)
|
|
|
|
return (-1);
|
|
|
|
if (f >= vm_ndomains || t >= vm_ndomains)
|
|
|
|
return (-1);
|
|
|
|
return (mem_locality[f * vm_ndomains + t]);
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
#else
|
|
|
|
return (-1);
|
|
|
|
#endif
|
2015-05-08 00:56:56 +00:00
|
|
|
}
|
|
|
|
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2015-05-08 00:56:56 +00:00
|
|
|
/*
|
|
|
|
* Outputs the VM locality table.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
struct sbuf sbuf;
|
|
|
|
int error, i, j;
|
|
|
|
|
|
|
|
error = sysctl_wire_old_buffer(req, 0);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
|
|
|
|
|
|
|
|
sbuf_printf(&sbuf, "\n");
|
|
|
|
|
|
|
|
for (i = 0; i < vm_ndomains; i++) {
|
|
|
|
sbuf_printf(&sbuf, "%d: ", i);
|
|
|
|
for (j = 0; j < vm_ndomains; j++) {
|
|
|
|
sbuf_printf(&sbuf, "%d ", vm_phys_mem_affinity(i, j));
|
|
|
|
}
|
|
|
|
sbuf_printf(&sbuf, "\n");
|
|
|
|
}
|
|
|
|
error = sbuf_finish(&sbuf);
|
|
|
|
sbuf_delete(&sbuf);
|
|
|
|
return (error);
|
|
|
|
}
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
#endif
|
2015-05-08 00:56:56 +00:00
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
static void
|
|
|
|
vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int tail)
|
2010-07-27 20:33:50 +00:00
|
|
|
{
|
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
m->order = order;
|
|
|
|
if (tail)
|
2013-08-10 17:36:42 +00:00
|
|
|
TAILQ_INSERT_TAIL(&fl[order].pl, m, plinks.q);
|
2013-05-13 15:40:51 +00:00
|
|
|
else
|
2013-08-10 17:36:42 +00:00
|
|
|
TAILQ_INSERT_HEAD(&fl[order].pl, m, plinks.q);
|
2013-05-13 15:40:51 +00:00
|
|
|
fl[order].lcnt++;
|
2010-07-27 20:33:50 +00:00
|
|
|
}
|
2013-05-13 15:40:51 +00:00
|
|
|
|
|
|
|
static void
|
|
|
|
vm_freelist_rem(struct vm_freelist *fl, vm_page_t m, int order)
|
|
|
|
{
|
|
|
|
|
2013-08-10 17:36:42 +00:00
|
|
|
TAILQ_REMOVE(&fl[order].pl, m, plinks.q);
|
2013-05-13 15:40:51 +00:00
|
|
|
fl[order].lcnt--;
|
|
|
|
m->order = VM_NFREEORDER;
|
|
|
|
}
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
|
|
|
* Create a physical memory segment.
|
|
|
|
*/
|
|
|
|
static void
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
_vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end, int domain)
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
{
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
|
|
|
|
KASSERT(vm_phys_nsegs < VM_PHYSSEG_MAX,
|
|
|
|
("vm_phys_create_seg: increase VM_PHYSSEG_MAX"));
|
2013-05-13 15:40:51 +00:00
|
|
|
KASSERT(domain < vm_ndomains,
|
|
|
|
("vm_phys_create_seg: invalid domain provided"));
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
seg = &vm_phys_segs[vm_phys_nsegs++];
|
2014-11-15 23:40:44 +00:00
|
|
|
while (seg > vm_phys_segs && (seg - 1)->start >= end) {
|
|
|
|
*seg = *(seg - 1);
|
|
|
|
seg--;
|
|
|
|
}
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
seg->start = start;
|
|
|
|
seg->end = end;
|
2010-07-27 20:33:50 +00:00
|
|
|
seg->domain = domain;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
|
2010-07-27 20:33:50 +00:00
|
|
|
static void
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end)
|
2010-07-27 20:33:50 +00:00
|
|
|
{
|
2016-04-09 13:58:04 +00:00
|
|
|
#ifdef VM_NUMA_ALLOC
|
2010-07-27 20:33:50 +00:00
|
|
|
int i;
|
|
|
|
|
|
|
|
if (mem_affinity == NULL) {
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
_vm_phys_create_seg(start, end, 0);
|
2010-07-27 20:33:50 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0;; i++) {
|
|
|
|
if (mem_affinity[i].end == 0)
|
|
|
|
panic("Reached end of affinity info");
|
|
|
|
if (mem_affinity[i].end <= start)
|
|
|
|
continue;
|
|
|
|
if (mem_affinity[i].start > start)
|
|
|
|
panic("No affinity info for start %jx",
|
|
|
|
(uintmax_t)start);
|
|
|
|
if (mem_affinity[i].end >= end) {
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
_vm_phys_create_seg(start, end,
|
2010-07-27 20:33:50 +00:00
|
|
|
mem_affinity[i].domain);
|
|
|
|
break;
|
|
|
|
}
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
_vm_phys_create_seg(start, mem_affinity[i].end,
|
2010-07-27 20:33:50 +00:00
|
|
|
mem_affinity[i].domain);
|
|
|
|
start = mem_affinity[i].end;
|
|
|
|
}
|
2016-04-09 13:58:04 +00:00
|
|
|
#else
|
|
|
|
_vm_phys_create_seg(start, end, 0);
|
|
|
|
#endif
|
2010-07-27 20:33:50 +00:00
|
|
|
}
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
2014-11-15 23:40:44 +00:00
|
|
|
* Add a physical memory segment.
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
*/
|
|
|
|
void
|
2014-11-15 23:40:44 +00:00
|
|
|
vm_phys_add_seg(vm_paddr_t start, vm_paddr_t end)
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
{
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
vm_paddr_t paddr;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2014-11-15 23:40:44 +00:00
|
|
|
KASSERT((start & PAGE_MASK) == 0,
|
|
|
|
("vm_phys_define_seg: start is not page aligned"));
|
|
|
|
KASSERT((end & PAGE_MASK) == 0,
|
|
|
|
("vm_phys_define_seg: end is not page aligned"));
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Split the physical memory segment if it spans two or more free
|
|
|
|
* list boundaries.
|
|
|
|
*/
|
|
|
|
paddr = start;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#ifdef VM_FREELIST_ISADMA
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
if (paddr < VM_ISADMA_BOUNDARY && end > VM_ISADMA_BOUNDARY) {
|
|
|
|
vm_phys_create_seg(paddr, VM_ISADMA_BOUNDARY);
|
|
|
|
paddr = VM_ISADMA_BOUNDARY;
|
|
|
|
}
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#endif
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
#ifdef VM_FREELIST_LOWMEM
|
|
|
|
if (paddr < VM_LOWMEM_BOUNDARY && end > VM_LOWMEM_BOUNDARY) {
|
|
|
|
vm_phys_create_seg(paddr, VM_LOWMEM_BOUNDARY);
|
|
|
|
paddr = VM_LOWMEM_BOUNDARY;
|
|
|
|
}
|
2014-11-15 23:40:44 +00:00
|
|
|
#endif
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
#ifdef VM_FREELIST_DMA32
|
|
|
|
if (paddr < VM_DMA32_BOUNDARY && end > VM_DMA32_BOUNDARY) {
|
|
|
|
vm_phys_create_seg(paddr, VM_DMA32_BOUNDARY);
|
|
|
|
paddr = VM_DMA32_BOUNDARY;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
vm_phys_create_seg(paddr, end);
|
2014-11-15 23:40:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the physical memory allocator.
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
*
|
|
|
|
* Requires that vm_page_array is initialized!
|
2014-11-15 23:40:44 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
vm_phys_init(void)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
|
|
|
struct vm_phys_seg *seg;
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
u_long npages;
|
|
|
|
int dom, flind, freelist, oind, pind, segind;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the number of free lists, and generate the mapping from the
|
|
|
|
* manifest constants VM_FREELIST_* to the free list indices.
|
|
|
|
*
|
|
|
|
* Initially, the entries of vm_freelist_to_flind[] are set to either
|
|
|
|
* 0 or 1 to indicate which free lists should be created.
|
|
|
|
*/
|
|
|
|
npages = 0;
|
|
|
|
for (segind = vm_phys_nsegs - 1; segind >= 0; segind--) {
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
#ifdef VM_FREELIST_ISADMA
|
|
|
|
if (seg->end <= VM_ISADMA_BOUNDARY)
|
|
|
|
vm_freelist_to_flind[VM_FREELIST_ISADMA] = 1;
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef VM_FREELIST_LOWMEM
|
|
|
|
if (seg->end <= VM_LOWMEM_BOUNDARY)
|
|
|
|
vm_freelist_to_flind[VM_FREELIST_LOWMEM] = 1;
|
|
|
|
else
|
2014-11-15 23:40:44 +00:00
|
|
|
#endif
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
#ifdef VM_FREELIST_DMA32
|
|
|
|
if (
|
|
|
|
#ifdef VM_DMA32_NPAGES_THRESHOLD
|
|
|
|
/*
|
|
|
|
* Create the DMA32 free list only if the amount of
|
|
|
|
* physical memory above physical address 4G exceeds the
|
|
|
|
* given threshold.
|
|
|
|
*/
|
|
|
|
npages > VM_DMA32_NPAGES_THRESHOLD &&
|
|
|
|
#endif
|
|
|
|
seg->end <= VM_DMA32_BOUNDARY)
|
|
|
|
vm_freelist_to_flind[VM_FREELIST_DMA32] = 1;
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
npages += atop(seg->end - seg->start);
|
|
|
|
vm_freelist_to_flind[VM_FREELIST_DEFAULT] = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/* Change each entry into a running total of the free lists. */
|
|
|
|
for (freelist = 1; freelist < VM_NFREELIST; freelist++) {
|
|
|
|
vm_freelist_to_flind[freelist] +=
|
|
|
|
vm_freelist_to_flind[freelist - 1];
|
|
|
|
}
|
|
|
|
vm_nfreelists = vm_freelist_to_flind[VM_NFREELIST - 1];
|
|
|
|
KASSERT(vm_nfreelists > 0, ("vm_phys_init: no free lists"));
|
|
|
|
/* Change each entry into a free list index. */
|
|
|
|
for (freelist = 0; freelist < VM_NFREELIST; freelist++)
|
|
|
|
vm_freelist_to_flind[freelist]--;
|
2014-11-15 23:40:44 +00:00
|
|
|
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
/*
|
|
|
|
* Initialize the first_page and free_queues fields of each physical
|
|
|
|
* memory segment.
|
|
|
|
*/
|
2014-11-15 23:40:44 +00:00
|
|
|
#ifdef VM_PHYSSEG_SPARSE
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
npages = 0;
|
2014-11-15 23:40:44 +00:00
|
|
|
#endif
|
|
|
|
for (segind = 0; segind < vm_phys_nsegs; segind++) {
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
#ifdef VM_PHYSSEG_SPARSE
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
seg->first_page = &vm_page_array[npages];
|
|
|
|
npages += atop(seg->end - seg->start);
|
2014-11-15 23:40:44 +00:00
|
|
|
#else
|
|
|
|
seg->first_page = PHYS_TO_VM_PAGE(seg->start);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#endif
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
#ifdef VM_FREELIST_ISADMA
|
|
|
|
if (seg->end <= VM_ISADMA_BOUNDARY) {
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_ISADMA];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: ISADMA flind < 0"));
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
#ifdef VM_FREELIST_LOWMEM
|
|
|
|
if (seg->end <= VM_LOWMEM_BOUNDARY) {
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_LOWMEM];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: LOWMEM flind < 0"));
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
#ifdef VM_FREELIST_DMA32
|
|
|
|
if (seg->end <= VM_DMA32_BOUNDARY) {
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_DMA32];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: DMA32 flind < 0"));
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_DEFAULT];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: DEFAULT flind < 0"));
|
|
|
|
}
|
|
|
|
seg->free_queues = &vm_phys_free_queues[seg->domain][flind];
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize the free queues.
|
|
|
|
*/
|
2013-05-13 15:40:51 +00:00
|
|
|
for (dom = 0; dom < vm_ndomains; dom++) {
|
|
|
|
for (flind = 0; flind < vm_nfreelists; flind++) {
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++) {
|
|
|
|
fl = vm_phys_free_queues[dom][flind][pind];
|
|
|
|
for (oind = 0; oind < VM_NFREEORDER; oind++)
|
|
|
|
TAILQ_INIT(&fl[oind].pl);
|
|
|
|
}
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
|
2014-07-09 08:12:58 +00:00
|
|
|
rw_init(&vm_phys_fictitious_reg_lock, "vmfctr");
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Split a contiguous, power of two-sized set of physical pages.
|
|
|
|
*/
|
|
|
|
static __inline void
|
|
|
|
vm_phys_split_pages(vm_page_t m, int oind, struct vm_freelist *fl, int order)
|
|
|
|
{
|
|
|
|
vm_page_t m_buddy;
|
|
|
|
|
|
|
|
while (oind > order) {
|
|
|
|
oind--;
|
|
|
|
m_buddy = &m[1 << oind];
|
|
|
|
KASSERT(m_buddy->order == VM_NFREEORDER,
|
|
|
|
("vm_phys_split_pages: page %p has unexpected order %d",
|
|
|
|
m_buddy, m_buddy->order));
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_add(fl, m_buddy, oind, 0);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize a physical page and add it to the free lists.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
vm_phys_add_page(vm_paddr_t pa)
|
|
|
|
{
|
|
|
|
vm_page_t m;
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
struct vm_domain *vmd;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2014-03-22 10:26:09 +00:00
|
|
|
vm_cnt.v_page_count++;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
m = vm_phys_paddr_to_vm_page(pa);
|
|
|
|
m->phys_addr = pa;
|
2011-01-17 19:17:26 +00:00
|
|
|
m->queue = PQ_NONE;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
m->segind = vm_phys_paddr_to_segind(pa);
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
vmd = vm_phys_domain(m);
|
|
|
|
vmd->vmd_page_count++;
|
|
|
|
vmd->vmd_segs |= 1UL << m->segind;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
KASSERT(m->order == VM_NFREEORDER,
|
|
|
|
("vm_phys_add_page: page %p has unexpected order %d",
|
|
|
|
m, m->order));
|
|
|
|
m->pool = VM_FREEPOOL_DEFAULT;
|
|
|
|
pmap_page_init(m);
|
|
|
|
mtx_lock(&vm_page_queue_free_mtx);
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
vm_phys_freecnt_adj(m, 1);
|
2007-07-14 21:21:17 +00:00
|
|
|
vm_phys_free_pages(m, 0);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
mtx_unlock(&vm_page_queue_free_mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a contiguous, power of two-sized set of physical pages
|
|
|
|
* from the free lists.
|
2007-07-14 21:21:17 +00:00
|
|
|
*
|
|
|
|
* The free page queues must be locked.
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
*/
|
|
|
|
vm_page_t
|
2007-07-14 21:21:17 +00:00
|
|
|
vm_phys_alloc_pages(int pool, int order)
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
{
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
vm_page_t m;
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
int domain, flind;
|
|
|
|
struct vm_domain_iterator vi;
|
2013-05-03 18:58:37 +00:00
|
|
|
|
|
|
|
KASSERT(pool < VM_NFREEPOOL,
|
|
|
|
("vm_phys_alloc_pages: pool %d is out of range", pool));
|
|
|
|
KASSERT(order < VM_NFREEORDER,
|
|
|
|
("vm_phys_alloc_pages: order %d is out of range", order));
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
vm_policy_iterator_init(&vi);
|
|
|
|
|
|
|
|
while ((vm_domain_iterator_run(&vi, &domain)) == 0) {
|
2013-05-13 15:40:51 +00:00
|
|
|
for (flind = 0; flind < vm_nfreelists; flind++) {
|
|
|
|
m = vm_phys_alloc_domain_pages(domain, flind, pool,
|
|
|
|
order);
|
|
|
|
if (m != NULL)
|
|
|
|
return (m);
|
|
|
|
}
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
}
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
|
|
|
|
vm_policy_iterator_finish(&vi);
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
* Allocate a contiguous, power of two-sized set of physical pages from the
|
|
|
|
* specified free list. The free list must be specified using one of the
|
|
|
|
* manifest constants VM_FREELIST_*.
|
|
|
|
*
|
|
|
|
* The free page queues must be locked.
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
*/
|
|
|
|
vm_page_t
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
vm_phys_alloc_freelist_pages(int freelist, int pool, int order)
|
2013-05-03 18:58:37 +00:00
|
|
|
{
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
vm_page_t m;
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
struct vm_domain_iterator vi;
|
|
|
|
int domain;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
KASSERT(freelist < VM_NFREELIST,
|
|
|
|
("vm_phys_alloc_freelist_pages: freelist %d is out of range",
|
|
|
|
freelist));
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
KASSERT(pool < VM_NFREEPOOL,
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
("vm_phys_alloc_freelist_pages: pool %d is out of range", pool));
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
KASSERT(order < VM_NFREEORDER,
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
("vm_phys_alloc_freelist_pages: order %d is out of range", order));
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
|
|
|
|
vm_policy_iterator_init(&vi);
|
|
|
|
|
|
|
|
while ((vm_domain_iterator_run(&vi, &domain)) == 0) {
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
m = vm_phys_alloc_domain_pages(domain,
|
|
|
|
vm_freelist_to_flind[freelist], pool, order);
|
2013-05-13 15:40:51 +00:00
|
|
|
if (m != NULL)
|
|
|
|
return (m);
|
|
|
|
}
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
|
|
|
|
vm_policy_iterator_finish(&vi);
|
2013-05-13 15:40:51 +00:00
|
|
|
return (NULL);
|
2013-05-03 18:58:37 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static vm_page_t
|
|
|
|
vm_phys_alloc_domain_pages(int domain, int flind, int pool, int order)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
|
|
|
struct vm_freelist *alt;
|
|
|
|
int oind, pind;
|
|
|
|
vm_page_t m;
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
2013-05-13 15:40:51 +00:00
|
|
|
fl = &vm_phys_free_queues[domain][flind][pool][0];
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
for (oind = order; oind < VM_NFREEORDER; oind++) {
|
|
|
|
m = TAILQ_FIRST(&fl[oind].pl);
|
|
|
|
if (m != NULL) {
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m, oind);
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
vm_phys_split_pages(m, oind, fl, order);
|
|
|
|
return (m);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The given pool was empty. Find the largest
|
|
|
|
* contiguous, power-of-two-sized set of pages in any
|
|
|
|
* pool. Transfer these pages to the given pool, and
|
|
|
|
* use them to satisfy the allocation.
|
|
|
|
*/
|
|
|
|
for (oind = VM_NFREEORDER - 1; oind >= order; oind--) {
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++) {
|
2013-05-13 15:40:51 +00:00
|
|
|
alt = &vm_phys_free_queues[domain][flind][pind][0];
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
m = TAILQ_FIRST(&alt[oind].pl);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
if (m != NULL) {
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(alt, m, oind);
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
vm_phys_set_pool(pool, m, oind);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
vm_phys_split_pages(m, oind, fl, order);
|
|
|
|
return (m);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the vm_page corresponding to the given physical address.
|
|
|
|
*/
|
|
|
|
vm_page_t
|
|
|
|
vm_phys_paddr_to_vm_page(vm_paddr_t pa)
|
|
|
|
{
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
int segind;
|
|
|
|
|
|
|
|
for (segind = 0; segind < vm_phys_nsegs; segind++) {
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
if (pa >= seg->start && pa < seg->end)
|
|
|
|
return (&seg->first_page[atop(pa - seg->start)]);
|
|
|
|
}
|
2009-06-18 20:42:37 +00:00
|
|
|
return (NULL);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
|
2012-05-12 20:42:56 +00:00
|
|
|
vm_page_t
|
|
|
|
vm_phys_fictitious_to_vm_page(vm_paddr_t pa)
|
|
|
|
{
|
2014-07-09 08:12:58 +00:00
|
|
|
struct vm_phys_fictitious_seg tmp, *seg;
|
2012-05-12 20:42:56 +00:00
|
|
|
vm_page_t m;
|
|
|
|
|
|
|
|
m = NULL;
|
2014-07-09 08:12:58 +00:00
|
|
|
tmp.start = pa;
|
|
|
|
tmp.end = 0;
|
|
|
|
|
|
|
|
rw_rlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
seg = RB_FIND(fict_tree, &vm_phys_fictitious_tree, &tmp);
|
|
|
|
rw_runlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
if (seg == NULL)
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
m = &seg->first_page[atop(pa - seg->start)];
|
|
|
|
KASSERT((m->flags & PG_FICTITIOUS) != 0, ("%p not fictitious", m));
|
|
|
|
|
2012-05-12 20:42:56 +00:00
|
|
|
return (m);
|
|
|
|
}
|
|
|
|
|
2014-08-05 10:29:01 +00:00
|
|
|
static inline void
|
|
|
|
vm_phys_fictitious_init_range(vm_page_t range, vm_paddr_t start,
|
|
|
|
long page_count, vm_memattr_t memattr)
|
|
|
|
{
|
|
|
|
long i;
|
|
|
|
|
|
|
|
for (i = 0; i < page_count; i++) {
|
|
|
|
vm_page_initfake(&range[i], start + PAGE_SIZE * i, memattr);
|
|
|
|
range[i].oflags &= ~VPO_UNMANAGED;
|
|
|
|
range[i].busy_lock = VPB_UNBUSIED;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-05-12 20:42:56 +00:00
|
|
|
int
|
|
|
|
vm_phys_fictitious_reg_range(vm_paddr_t start, vm_paddr_t end,
|
|
|
|
vm_memattr_t memattr)
|
|
|
|
{
|
|
|
|
struct vm_phys_fictitious_seg *seg;
|
|
|
|
vm_page_t fp;
|
2014-08-05 10:29:01 +00:00
|
|
|
long page_count;
|
2012-05-12 20:42:56 +00:00
|
|
|
#ifdef VM_PHYSSEG_DENSE
|
2014-08-05 10:29:01 +00:00
|
|
|
long pi, pe;
|
|
|
|
long dpage_count;
|
2012-05-12 20:42:56 +00:00
|
|
|
#endif
|
|
|
|
|
2014-08-05 10:29:01 +00:00
|
|
|
KASSERT(start < end,
|
|
|
|
("Start of segment isn't less than end (start: %jx end: %jx)",
|
|
|
|
(uintmax_t)start, (uintmax_t)end));
|
|
|
|
|
2012-05-12 20:42:56 +00:00
|
|
|
page_count = (end - start) / PAGE_SIZE;
|
|
|
|
|
|
|
|
#ifdef VM_PHYSSEG_DENSE
|
|
|
|
pi = atop(start);
|
2014-08-05 10:29:01 +00:00
|
|
|
pe = atop(end);
|
|
|
|
if (pi >= first_page && (pi - first_page) < vm_page_array_size) {
|
2012-05-12 20:42:56 +00:00
|
|
|
fp = &vm_page_array[pi - first_page];
|
2014-08-05 10:29:01 +00:00
|
|
|
if ((pe - first_page) > vm_page_array_size) {
|
|
|
|
/*
|
|
|
|
* We have a segment that starts inside
|
|
|
|
* of vm_page_array, but ends outside of it.
|
|
|
|
*
|
|
|
|
* Use vm_page_array pages for those that are
|
|
|
|
* inside of the vm_page_array range, and
|
|
|
|
* allocate the remaining ones.
|
|
|
|
*/
|
|
|
|
dpage_count = vm_page_array_size - (pi - first_page);
|
|
|
|
vm_phys_fictitious_init_range(fp, start, dpage_count,
|
|
|
|
memattr);
|
|
|
|
page_count -= dpage_count;
|
|
|
|
start += ptoa(dpage_count);
|
|
|
|
goto alloc;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We can allocate the full range from vm_page_array,
|
|
|
|
* so there's no need to register the range in the tree.
|
|
|
|
*/
|
|
|
|
vm_phys_fictitious_init_range(fp, start, page_count, memattr);
|
|
|
|
return (0);
|
|
|
|
} else if (pe > first_page && (pe - first_page) < vm_page_array_size) {
|
|
|
|
/*
|
|
|
|
* We have a segment that ends inside of vm_page_array,
|
|
|
|
* but starts outside of it.
|
|
|
|
*/
|
|
|
|
fp = &vm_page_array[0];
|
|
|
|
dpage_count = pe - first_page;
|
|
|
|
vm_phys_fictitious_init_range(fp, ptoa(first_page), dpage_count,
|
|
|
|
memattr);
|
|
|
|
end -= ptoa(dpage_count);
|
|
|
|
page_count -= dpage_count;
|
|
|
|
goto alloc;
|
|
|
|
} else if (pi < first_page && pe > (first_page + vm_page_array_size)) {
|
|
|
|
/*
|
|
|
|
* Trying to register a fictitious range that expands before
|
|
|
|
* and after vm_page_array.
|
|
|
|
*/
|
|
|
|
return (EINVAL);
|
|
|
|
} else {
|
|
|
|
alloc:
|
2012-05-12 20:42:56 +00:00
|
|
|
#endif
|
|
|
|
fp = malloc(page_count * sizeof(struct vm_page), M_FICT_PAGES,
|
|
|
|
M_WAITOK | M_ZERO);
|
2014-08-05 10:29:01 +00:00
|
|
|
#ifdef VM_PHYSSEG_DENSE
|
2012-05-12 20:42:56 +00:00
|
|
|
}
|
2014-08-05 10:29:01 +00:00
|
|
|
#endif
|
|
|
|
vm_phys_fictitious_init_range(fp, start, page_count, memattr);
|
2014-07-09 08:12:58 +00:00
|
|
|
|
|
|
|
seg = malloc(sizeof(*seg), M_FICT_PAGES, M_WAITOK | M_ZERO);
|
|
|
|
seg->start = start;
|
|
|
|
seg->end = end;
|
|
|
|
seg->first_page = fp;
|
|
|
|
|
|
|
|
rw_wlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
RB_INSERT(fict_tree, &vm_phys_fictitious_tree, seg);
|
|
|
|
rw_wunlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
|
|
|
|
return (0);
|
2012-05-12 20:42:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vm_phys_fictitious_unreg_range(vm_paddr_t start, vm_paddr_t end)
|
|
|
|
{
|
2014-07-09 08:12:58 +00:00
|
|
|
struct vm_phys_fictitious_seg *seg, tmp;
|
2012-05-12 20:42:56 +00:00
|
|
|
#ifdef VM_PHYSSEG_DENSE
|
2014-08-05 10:29:01 +00:00
|
|
|
long pi, pe;
|
2012-05-12 20:42:56 +00:00
|
|
|
#endif
|
|
|
|
|
2014-08-05 10:29:01 +00:00
|
|
|
KASSERT(start < end,
|
|
|
|
("Start of segment isn't less than end (start: %jx end: %jx)",
|
|
|
|
(uintmax_t)start, (uintmax_t)end));
|
|
|
|
|
2012-05-12 20:42:56 +00:00
|
|
|
#ifdef VM_PHYSSEG_DENSE
|
|
|
|
pi = atop(start);
|
2014-08-05 10:29:01 +00:00
|
|
|
pe = atop(end);
|
|
|
|
if (pi >= first_page && (pi - first_page) < vm_page_array_size) {
|
|
|
|
if ((pe - first_page) <= vm_page_array_size) {
|
|
|
|
/*
|
|
|
|
* This segment was allocated using vm_page_array
|
|
|
|
* only, there's nothing to do since those pages
|
|
|
|
* were never added to the tree.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We have a segment that starts inside
|
|
|
|
* of vm_page_array, but ends outside of it.
|
|
|
|
*
|
|
|
|
* Calculate how many pages were added to the
|
|
|
|
* tree and free them.
|
|
|
|
*/
|
|
|
|
start = ptoa(first_page + vm_page_array_size);
|
|
|
|
} else if (pe > first_page && (pe - first_page) < vm_page_array_size) {
|
|
|
|
/*
|
|
|
|
* We have a segment that ends inside of vm_page_array,
|
|
|
|
* but starts outside of it.
|
|
|
|
*/
|
|
|
|
end = ptoa(first_page);
|
|
|
|
} else if (pi < first_page && pe > (first_page + vm_page_array_size)) {
|
|
|
|
/* Since it's not possible to register such a range, panic. */
|
|
|
|
panic(
|
|
|
|
"Unregistering not registered fictitious range [%#jx:%#jx]",
|
|
|
|
(uintmax_t)start, (uintmax_t)end);
|
|
|
|
}
|
2012-05-12 20:42:56 +00:00
|
|
|
#endif
|
2014-07-09 08:12:58 +00:00
|
|
|
tmp.start = start;
|
|
|
|
tmp.end = 0;
|
|
|
|
|
|
|
|
rw_wlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
seg = RB_FIND(fict_tree, &vm_phys_fictitious_tree, &tmp);
|
|
|
|
if (seg->start != start || seg->end != end) {
|
|
|
|
rw_wunlock(&vm_phys_fictitious_reg_lock);
|
|
|
|
panic(
|
|
|
|
"Unregistering not registered fictitious range [%#jx:%#jx]",
|
|
|
|
(uintmax_t)start, (uintmax_t)end);
|
|
|
|
}
|
|
|
|
RB_REMOVE(fict_tree, &vm_phys_fictitious_tree, seg);
|
|
|
|
rw_wunlock(&vm_phys_fictitious_reg_lock);
|
2014-08-05 10:29:01 +00:00
|
|
|
free(seg->first_page, M_FICT_PAGES);
|
2014-07-09 08:12:58 +00:00
|
|
|
free(seg, M_FICT_PAGES);
|
2012-05-12 20:42:56 +00:00
|
|
|
}
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
|
|
|
* Find the segment containing the given physical address.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vm_phys_paddr_to_segind(vm_paddr_t pa)
|
|
|
|
{
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
int segind;
|
|
|
|
|
|
|
|
for (segind = 0; segind < vm_phys_nsegs; segind++) {
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
if (pa >= seg->start && pa < seg->end)
|
|
|
|
return (segind);
|
|
|
|
}
|
|
|
|
panic("vm_phys_paddr_to_segind: paddr %#jx is not in any segment" ,
|
|
|
|
(uintmax_t)pa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free a contiguous, power of two-sized set of physical pages.
|
2007-07-14 21:21:17 +00:00
|
|
|
*
|
|
|
|
* The free page queues must be locked.
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
vm_phys_free_pages(vm_page_t m, int order)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
|
|
|
struct vm_phys_seg *seg;
|
2011-10-30 05:06:14 +00:00
|
|
|
vm_paddr_t pa;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
vm_page_t m_buddy;
|
|
|
|
|
|
|
|
KASSERT(m->order == VM_NFREEORDER,
|
2007-07-14 21:21:17 +00:00
|
|
|
("vm_phys_free_pages: page %p has unexpected order %d",
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
m, m->order));
|
|
|
|
KASSERT(m->pool < VM_NFREEPOOL,
|
2007-07-14 21:21:17 +00:00
|
|
|
("vm_phys_free_pages: page %p has unexpected pool %d",
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
m, m->pool));
|
|
|
|
KASSERT(order < VM_NFREEORDER,
|
2007-07-14 21:21:17 +00:00
|
|
|
("vm_phys_free_pages: order %d is out of range", order));
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
|
|
|
seg = &vm_phys_segs[m->segind];
|
2011-10-30 05:06:14 +00:00
|
|
|
if (order < VM_NFREEORDER - 1) {
|
|
|
|
pa = VM_PAGE_TO_PHYS(m);
|
|
|
|
do {
|
|
|
|
pa ^= ((vm_paddr_t)1 << (PAGE_SHIFT + order));
|
|
|
|
if (pa < seg->start || pa >= seg->end)
|
|
|
|
break;
|
|
|
|
m_buddy = &seg->first_page[atop(pa - seg->start)];
|
|
|
|
if (m_buddy->order != order)
|
|
|
|
break;
|
|
|
|
fl = (*seg->free_queues)[m_buddy->pool];
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m_buddy, order);
|
2011-10-30 05:06:14 +00:00
|
|
|
if (m_buddy->pool != m->pool)
|
|
|
|
vm_phys_set_pool(m->pool, m_buddy, order);
|
|
|
|
order++;
|
|
|
|
pa &= ~(((vm_paddr_t)1 << (PAGE_SHIFT + order)) - 1);
|
|
|
|
m = &seg->first_page[atop(pa - seg->start)];
|
|
|
|
} while (order < VM_NFREEORDER - 1);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
fl = (*seg->free_queues)[m->pool];
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_add(fl, m, order, 1);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
|
2011-10-30 05:06:14 +00:00
|
|
|
/*
|
|
|
|
* Free a contiguous, arbitrarily sized set of physical pages.
|
|
|
|
*
|
|
|
|
* The free page queues must be locked.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
vm_phys_free_contig(vm_page_t m, u_long npages)
|
|
|
|
{
|
|
|
|
u_int n;
|
|
|
|
int order;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Avoid unnecessary coalescing by freeing the pages in the largest
|
|
|
|
* possible power-of-two-sized subsets.
|
|
|
|
*/
|
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
|
|
|
for (;; npages -= n) {
|
|
|
|
/*
|
|
|
|
* Unsigned "min" is used here so that "order" is assigned
|
|
|
|
* "VM_NFREEORDER - 1" when "m"'s physical address is zero
|
|
|
|
* or the low-order bits of its physical address are zero
|
|
|
|
* because the size of a physical address exceeds the size of
|
|
|
|
* a long.
|
|
|
|
*/
|
|
|
|
order = min(ffsl(VM_PAGE_TO_PHYS(m) >> PAGE_SHIFT) - 1,
|
|
|
|
VM_NFREEORDER - 1);
|
|
|
|
n = 1 << order;
|
|
|
|
if (npages < n)
|
|
|
|
break;
|
|
|
|
vm_phys_free_pages(m, order);
|
|
|
|
m += n;
|
|
|
|
}
|
|
|
|
/* The residual "npages" is less than "1 << (VM_NFREEORDER - 1)". */
|
|
|
|
for (; npages > 0; npages -= n) {
|
|
|
|
order = flsl(npages) - 1;
|
|
|
|
n = 1 << order;
|
|
|
|
vm_phys_free_pages(m, order);
|
|
|
|
m += n;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-12-19 18:42:50 +00:00
|
|
|
/*
|
|
|
|
* Scan physical memory between the specified addresses "low" and "high" for a
|
|
|
|
* run of contiguous physical pages that satisfy the specified conditions, and
|
|
|
|
* return the lowest page in the run. The specified "alignment" determines
|
|
|
|
* the alignment of the lowest physical page in the run. If the specified
|
|
|
|
* "boundary" is non-zero, then the run of physical pages cannot span a
|
|
|
|
* physical address that is a multiple of "boundary".
|
|
|
|
*
|
|
|
|
* "npages" must be greater than zero. Both "alignment" and "boundary" must
|
|
|
|
* be a power of two.
|
|
|
|
*/
|
|
|
|
vm_page_t
|
|
|
|
vm_phys_scan_contig(u_long npages, vm_paddr_t low, vm_paddr_t high,
|
|
|
|
u_long alignment, vm_paddr_t boundary, int options)
|
|
|
|
{
|
|
|
|
vm_paddr_t pa_end;
|
|
|
|
vm_page_t m_end, m_run, m_start;
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
int segind;
|
|
|
|
|
|
|
|
KASSERT(npages > 0, ("npages is 0"));
|
|
|
|
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
|
|
|
|
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
|
|
|
|
if (low >= high)
|
|
|
|
return (NULL);
|
|
|
|
for (segind = 0; segind < vm_phys_nsegs; segind++) {
|
|
|
|
seg = &vm_phys_segs[segind];
|
|
|
|
if (seg->start >= high)
|
|
|
|
break;
|
|
|
|
if (low >= seg->end)
|
|
|
|
continue;
|
|
|
|
if (low <= seg->start)
|
|
|
|
m_start = seg->first_page;
|
|
|
|
else
|
|
|
|
m_start = &seg->first_page[atop(low - seg->start)];
|
|
|
|
if (high < seg->end)
|
|
|
|
pa_end = high;
|
|
|
|
else
|
|
|
|
pa_end = seg->end;
|
|
|
|
if (pa_end - VM_PAGE_TO_PHYS(m_start) < ptoa(npages))
|
|
|
|
continue;
|
|
|
|
m_end = &seg->first_page[atop(pa_end - seg->start)];
|
|
|
|
m_run = vm_page_scan_contig(npages, m_start, m_end,
|
|
|
|
alignment, boundary, options);
|
|
|
|
if (m_run != NULL)
|
|
|
|
return (m_run);
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
|
|
|
* Set the pool for a contiguous, power of two-sized set of physical pages.
|
|
|
|
*/
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
void
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
vm_phys_set_pool(int pool, vm_page_t m, int order)
|
|
|
|
{
|
|
|
|
vm_page_t m_tmp;
|
|
|
|
|
|
|
|
for (m_tmp = m; m_tmp < &m[1 << order]; m_tmp++)
|
|
|
|
m_tmp->pool = pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2007-12-21 02:44:31 +00:00
|
|
|
* Search for the given physical page "m" in the free lists. If the search
|
|
|
|
* succeeds, remove "m" from the free lists and return TRUE. Otherwise, return
|
|
|
|
* FALSE, indicating that "m" is not in the free lists.
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
*
|
|
|
|
* The free page queues must be locked.
|
|
|
|
*/
|
2007-12-20 22:45:54 +00:00
|
|
|
boolean_t
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
vm_phys_unfree_page(vm_page_t m)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
vm_paddr_t pa, pa_half;
|
|
|
|
vm_page_t m_set, m_tmp;
|
|
|
|
int order;
|
|
|
|
|
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, find the contiguous, power of two-sized set of free
|
|
|
|
* physical pages containing the given physical page "m" and
|
|
|
|
* assign it to "m_set".
|
|
|
|
*/
|
|
|
|
seg = &vm_phys_segs[m->segind];
|
|
|
|
for (m_set = m, order = 0; m_set->order == VM_NFREEORDER &&
|
2007-12-19 23:09:45 +00:00
|
|
|
order < VM_NFREEORDER - 1; ) {
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
order++;
|
|
|
|
pa = m->phys_addr & (~(vm_paddr_t)0 << (PAGE_SHIFT + order));
|
2008-04-05 05:02:53 +00:00
|
|
|
if (pa >= seg->start)
|
2007-12-20 22:45:54 +00:00
|
|
|
m_set = &seg->first_page[atop(pa - seg->start)];
|
|
|
|
else
|
|
|
|
return (FALSE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
2007-12-20 22:45:54 +00:00
|
|
|
if (m_set->order < order)
|
|
|
|
return (FALSE);
|
|
|
|
if (m_set->order == VM_NFREEORDER)
|
|
|
|
return (FALSE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
KASSERT(m_set->order < VM_NFREEORDER,
|
|
|
|
("vm_phys_unfree_page: page %p has unexpected order %d",
|
|
|
|
m_set, m_set->order));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Next, remove "m_set" from the free lists. Finally, extract
|
|
|
|
* "m" from "m_set" using an iterative algorithm: While "m_set"
|
|
|
|
* is larger than a page, shrink "m_set" by returning the half
|
|
|
|
* of "m_set" that does not contain "m" to the free lists.
|
|
|
|
*/
|
|
|
|
fl = (*seg->free_queues)[m_set->pool];
|
|
|
|
order = m_set->order;
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m_set, order);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
while (order > 0) {
|
|
|
|
order--;
|
|
|
|
pa_half = m_set->phys_addr ^ (1 << (PAGE_SHIFT + order));
|
|
|
|
if (m->phys_addr < pa_half)
|
|
|
|
m_tmp = &seg->first_page[atop(pa_half - seg->start)];
|
|
|
|
else {
|
|
|
|
m_tmp = m_set;
|
|
|
|
m_set = &seg->first_page[atop(pa_half - seg->start)];
|
|
|
|
}
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_add(fl, m_tmp, order, 0);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
|
|
|
KASSERT(m_set == m, ("vm_phys_unfree_page: fatal inconsistency"));
|
2007-12-20 22:45:54 +00:00
|
|
|
return (TRUE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to zero one physical page. Used by an idle priority thread.
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
vm_phys_zero_pages_idle(void)
|
|
|
|
{
|
2013-05-13 15:40:51 +00:00
|
|
|
static struct vm_freelist *fl;
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
static int flind, oind, pind;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
vm_page_t m, m_tmp;
|
2013-05-13 15:40:51 +00:00
|
|
|
int domain;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2013-05-13 15:40:51 +00:00
|
|
|
domain = vm_rr_selectdomain();
|
|
|
|
fl = vm_phys_free_queues[domain][0][0];
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
for (;;) {
|
2013-08-10 17:36:42 +00:00
|
|
|
TAILQ_FOREACH_REVERSE(m, &fl[oind].pl, pglist, plinks.q) {
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
for (m_tmp = m; m_tmp < &m[1 << oind]; m_tmp++) {
|
|
|
|
if ((m_tmp->flags & (PG_CACHED | PG_ZERO)) == 0) {
|
|
|
|
vm_phys_unfree_page(m_tmp);
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
vm_phys_freecnt_adj(m, -1);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
mtx_unlock(&vm_page_queue_free_mtx);
|
|
|
|
pmap_zero_page_idle(m_tmp);
|
|
|
|
m_tmp->flags |= PG_ZERO;
|
|
|
|
mtx_lock(&vm_page_queue_free_mtx);
|
Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
|
|
|
vm_phys_freecnt_adj(m, 1);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
vm_phys_free_pages(m_tmp, 0);
|
|
|
|
vm_page_zero_count++;
|
|
|
|
cnt_prezero++;
|
|
|
|
return (TRUE);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
oind++;
|
|
|
|
if (oind == VM_NFREEORDER) {
|
|
|
|
oind = 0;
|
|
|
|
pind++;
|
|
|
|
if (pind == VM_NFREEPOOL) {
|
|
|
|
pind = 0;
|
|
|
|
flind++;
|
|
|
|
if (flind == vm_nfreelists)
|
|
|
|
flind = 0;
|
|
|
|
}
|
2013-05-13 15:40:51 +00:00
|
|
|
fl = vm_phys_free_queues[domain][flind][pind];
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2007-06-16 05:25:53 +00:00
|
|
|
* Allocate a contiguous set of physical pages of the given size
|
|
|
|
* "npages" from the free lists. All of the physical pages must be at
|
|
|
|
* or above the given physical address "low" and below the given
|
|
|
|
* physical address "high". The given value "alignment" determines the
|
|
|
|
* alignment of the first physical page in the set. If the given value
|
|
|
|
* "boundary" is non-zero, then the set of physical pages cannot cross
|
|
|
|
* any physical address boundary that is a multiple of that value. Both
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
* "alignment" and "boundary" must be a power of two.
|
|
|
|
*/
|
|
|
|
vm_page_t
|
2011-10-30 05:06:14 +00:00
|
|
|
vm_phys_alloc_contig(u_long npages, vm_paddr_t low, vm_paddr_t high,
|
|
|
|
u_long alignment, vm_paddr_t boundary)
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
{
|
2015-12-19 18:42:50 +00:00
|
|
|
vm_paddr_t pa_end, pa_start;
|
|
|
|
vm_page_t m_run;
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
struct vm_domain_iterator vi;
|
2015-12-19 18:42:50 +00:00
|
|
|
struct vm_phys_seg *seg;
|
|
|
|
int domain, segind;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2015-12-19 18:42:50 +00:00
|
|
|
KASSERT(npages > 0, ("npages is 0"));
|
|
|
|
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
|
|
|
|
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
|
Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.
From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.
Reviewed by: kib
Discussed with: jhb
2011-11-16 16:46:09 +00:00
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
2015-12-19 18:42:50 +00:00
|
|
|
if (low >= high)
|
|
|
|
return (NULL);
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
vm_policy_iterator_init(&vi);
|
2013-05-13 15:40:51 +00:00
|
|
|
restartdom:
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
if (vm_domain_iterator_run(&vi, &domain) != 0) {
|
|
|
|
vm_policy_iterator_finish(&vi);
|
|
|
|
return (NULL);
|
|
|
|
}
|
2015-12-19 18:42:50 +00:00
|
|
|
m_run = NULL;
|
2016-01-16 04:41:40 +00:00
|
|
|
for (segind = vm_phys_nsegs - 1; segind >= 0; segind--) {
|
2015-12-19 18:42:50 +00:00
|
|
|
seg = &vm_phys_segs[segind];
|
2016-01-16 04:41:40 +00:00
|
|
|
if (seg->start >= high || seg->domain != domain)
|
2015-12-19 18:42:50 +00:00
|
|
|
continue;
|
2016-01-16 04:41:40 +00:00
|
|
|
if (low >= seg->end)
|
|
|
|
break;
|
2015-12-19 18:42:50 +00:00
|
|
|
if (low <= seg->start)
|
|
|
|
pa_start = seg->start;
|
|
|
|
else
|
|
|
|
pa_start = low;
|
|
|
|
if (high < seg->end)
|
|
|
|
pa_end = high;
|
|
|
|
else
|
|
|
|
pa_end = seg->end;
|
|
|
|
if (pa_end - pa_start < ptoa(npages))
|
|
|
|
continue;
|
|
|
|
m_run = vm_phys_alloc_seg_contig(seg, npages, low, high,
|
|
|
|
alignment, boundary);
|
|
|
|
if (m_run != NULL)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (m_run == NULL && !vm_domain_iterator_isdone(&vi))
|
|
|
|
goto restartdom;
|
|
|
|
vm_policy_iterator_finish(&vi);
|
|
|
|
return (m_run);
|
|
|
|
}
|
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
|
|
|
|
2015-12-19 18:42:50 +00:00
|
|
|
/*
|
|
|
|
* Allocate a run of contiguous physical pages from the free list for the
|
|
|
|
* specified segment.
|
|
|
|
*/
|
|
|
|
static vm_page_t
|
|
|
|
vm_phys_alloc_seg_contig(struct vm_phys_seg *seg, u_long npages,
|
|
|
|
vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
|
|
|
vm_paddr_t pa, pa_end, size;
|
|
|
|
vm_page_t m, m_ret;
|
|
|
|
u_long npages_end;
|
|
|
|
int oind, order, pind;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
|
2015-12-19 18:42:50 +00:00
|
|
|
KASSERT(npages > 0, ("npages is 0"));
|
|
|
|
KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
|
|
|
|
KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
|
|
|
|
mtx_assert(&vm_page_queue_free_mtx, MA_OWNED);
|
|
|
|
/* Compute the queue that is the best fit for npages. */
|
|
|
|
for (order = 0; (1 << order) < npages; order++);
|
|
|
|
/* Search for a run satisfying the specified conditions. */
|
|
|
|
size = npages << PAGE_SHIFT;
|
|
|
|
for (oind = min(order, VM_NFREEORDER - 1); oind < VM_NFREEORDER;
|
|
|
|
oind++) {
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++) {
|
|
|
|
fl = (*seg->free_queues)[pind];
|
|
|
|
TAILQ_FOREACH(m_ret, &fl[oind].pl, plinks.q) {
|
|
|
|
/*
|
|
|
|
* Is the size of this allocation request
|
|
|
|
* larger than the largest block size?
|
|
|
|
*/
|
|
|
|
if (order >= VM_NFREEORDER) {
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
/*
|
2015-12-19 18:42:50 +00:00
|
|
|
* Determine if a sufficient number of
|
|
|
|
* subsequent blocks to satisfy the
|
|
|
|
* allocation request are free.
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
*/
|
|
|
|
pa = VM_PAGE_TO_PHYS(m_ret);
|
2015-12-19 18:42:50 +00:00
|
|
|
pa_end = pa + size;
|
|
|
|
for (;;) {
|
|
|
|
pa += 1 << (PAGE_SHIFT +
|
|
|
|
VM_NFREEORDER - 1);
|
|
|
|
if (pa >= pa_end ||
|
|
|
|
pa < seg->start ||
|
|
|
|
pa >= seg->end)
|
|
|
|
break;
|
|
|
|
m = &seg->first_page[atop(pa -
|
|
|
|
seg->start)];
|
|
|
|
if (m->order != VM_NFREEORDER -
|
|
|
|
1)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* If not, go to the next block. */
|
|
|
|
if (pa < pa_end)
|
|
|
|
continue;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
2015-12-19 18:42:50 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine if the blocks are within the
|
|
|
|
* given range, satisfy the given alignment,
|
|
|
|
* and do not cross the given boundary.
|
|
|
|
*/
|
|
|
|
pa = VM_PAGE_TO_PHYS(m_ret);
|
|
|
|
pa_end = pa + size;
|
2016-04-21 19:57:40 +00:00
|
|
|
if (pa >= low && pa_end <= high &&
|
|
|
|
(pa & (alignment - 1)) == 0 &&
|
|
|
|
rounddown2(pa ^ (pa_end - 1), boundary) == 0)
|
2015-12-19 18:42:50 +00:00
|
|
|
goto done;
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (NULL);
|
|
|
|
done:
|
|
|
|
for (m = m_ret; m < &m_ret[npages]; m = &m[1 << oind]) {
|
|
|
|
fl = (*seg->free_queues)[m->pool];
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m, m->order);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
if (m_ret->pool != VM_FREEPOOL_DEFAULT)
|
|
|
|
vm_phys_set_pool(VM_FREEPOOL_DEFAULT, m_ret, oind);
|
|
|
|
fl = (*seg->free_queues)[m_ret->pool];
|
|
|
|
vm_phys_split_pages(m_ret, oind, fl, order);
|
2011-10-30 05:06:14 +00:00
|
|
|
/* Return excess pages to the free lists. */
|
|
|
|
npages_end = roundup2(npages, 1 << imin(oind, order));
|
|
|
|
if (npages < npages_end)
|
|
|
|
vm_phys_free_contig(&m_ret[npages], npages_end - npages);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
return (m_ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef DDB
|
|
|
|
/*
|
|
|
|
* Show the number of physical pages in each of the free lists.
|
|
|
|
*/
|
|
|
|
DB_SHOW_COMMAND(freepages, db_show_freepages)
|
|
|
|
{
|
|
|
|
struct vm_freelist *fl;
|
2013-05-13 15:40:51 +00:00
|
|
|
int flind, oind, pind, dom;
|
|
|
|
|
|
|
|
for (dom = 0; dom < vm_ndomains; dom++) {
|
|
|
|
db_printf("DOMAIN: %d\n", dom);
|
|
|
|
for (flind = 0; flind < vm_nfreelists; flind++) {
|
|
|
|
db_printf("FREE LIST %d:\n"
|
|
|
|
"\n ORDER (SIZE) | NUMBER"
|
|
|
|
"\n ", flind);
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++)
|
|
|
|
db_printf(" | POOL %d", pind);
|
|
|
|
db_printf("\n-- ");
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++)
|
|
|
|
db_printf("-- -- ");
|
|
|
|
db_printf("--\n");
|
|
|
|
for (oind = VM_NFREEORDER - 1; oind >= 0; oind--) {
|
|
|
|
db_printf(" %2.2d (%6.6dK)", oind,
|
|
|
|
1 << (PAGE_SHIFT - 10 + oind));
|
|
|
|
for (pind = 0; pind < VM_NFREEPOOL; pind++) {
|
|
|
|
fl = vm_phys_free_queues[dom][flind][pind];
|
|
|
|
db_printf(" | %6.6d", fl[oind].lcnt);
|
|
|
|
}
|
|
|
|
db_printf("\n");
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|