/*-
 * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
 *
 * Copyright (c) 2002-2006 Rice University
 * Copyright (c) 2007 Alan L. Cox <alc@cs.rice.edu>
 * All rights reserved.
 *
 * This software was developed for the FreeBSD Project by Alan L. Cox,
 * Olivier Crameri, Peter Druschel, Sitaram Iyer, and Juan Navarro.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT
 * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 * OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
 * WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

/*
 *	Physical memory system implementation
 *
 * Any external functions defined by this module are only to be used by the
 * virtual memory system.
 */

#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");

#include "opt_ddb.h"
#include "opt_vm.h"

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/domainset.h>
#include <sys/lock.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/queue.h>
#include <sys/rwlock.h>
#include <sys/sbuf.h>
#include <sys/sysctl.h>
#include <sys/tree.h>
#include <sys/vmmeter.h>

#include <ddb/ddb.h>

#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/vm_kern.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>
#include <vm/vm_phys.h>
#include <vm/vm_pagequeue.h>

_Static_assert(sizeof(long) * NBBY >= VM_PHYSSEG_MAX,
    "Too many physsegs.");

#ifdef NUMA
struct mem_affinity __read_mostly *mem_affinity;
int __read_mostly *mem_locality;
#endif

int __read_mostly vm_ndomains = 1;
domainset_t __read_mostly all_domains = DOMAINSET_T_INITIALIZER(0x1);

struct vm_phys_seg __read_mostly vm_phys_segs[VM_PHYSSEG_MAX];
int __read_mostly vm_phys_nsegs;
static struct vm_phys_seg vm_phys_early_segs[8];
static int vm_phys_early_nsegs;

struct vm_phys_fictitious_seg;
static int vm_phys_fictitious_cmp(struct vm_phys_fictitious_seg *,
    struct vm_phys_fictitious_seg *);

RB_HEAD(fict_tree, vm_phys_fictitious_seg) vm_phys_fictitious_tree =
    RB_INITIALIZER(&vm_phys_fictitious_tree);

struct vm_phys_fictitious_seg {
	RB_ENTRY(vm_phys_fictitious_seg) node;
	/* Memory region data */
	vm_paddr_t	start;
	vm_paddr_t	end;
	vm_page_t	first_page;
};

RB_GENERATE_STATIC(fict_tree, vm_phys_fictitious_seg, node,
    vm_phys_fictitious_cmp);

static struct rwlock_padalign vm_phys_fictitious_reg_lock;
MALLOC_DEFINE(M_FICT_PAGES, "vm_fictitious", "Fictitious VM pages");

static struct vm_freelist __aligned(CACHE_LINE_SIZE)
    vm_phys_free_queues[MAXMEMDOM][VM_NFREELIST][VM_NFREEPOOL]
    [VM_NFREEORDER_MAX];
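
/*
 * The free queues above are indexed by NUMA domain, free list, free pool,
 * and buddy queue order, in that order; sysctl_vm_phys_free() below walks
 * them the same way.
 */
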
static int __read_mostly vm_nfreelists;

/*
 * These "avail lists" are globals used to communicate boot-time physical
 * memory layout to other parts of the kernel.  Each physically contiguous
 * region of memory is defined by a start address at an even index and an
 * end address at the following odd index.  Each list is terminated by a
 * pair of zero entries.
 *
 * dump_avail tells the dump code what regions to include in a crash dump, and
 * phys_avail is all of the remaining physical memory that is available for
 * the vm system.
 *
 * Initially dump_avail and phys_avail are identical.  Boot time memory
 * allocations remove extents from phys_avail that may still be included
 * in dumps.
 */
vm_paddr_t phys_avail[PHYS_AVAIL_COUNT];
vm_paddr_t dump_avail[PHYS_AVAIL_COUNT];
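
/*
 * For example (hypothetical layout): a machine with usable memory at
 * [0x1000, 0x9f000) and [0x100000, 0x7fe00000) would be described by
 *
 *	phys_avail[0] = 0x1000;    phys_avail[1] = 0x9f000;
 *	phys_avail[2] = 0x100000;  phys_avail[3] = 0x7fe00000;
 *	phys_avail[4] = 0;         phys_avail[5] = 0;
 */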

/*
 * Provides the mapping from VM_FREELIST_* to free list indices (flind).
 */
static int __read_mostly vm_freelist_to_flind[VM_NFREELIST];

CTASSERT(VM_FREELIST_DEFAULT == 0);

#ifdef VM_FREELIST_DMA32
#define	VM_DMA32_BOUNDARY	((vm_paddr_t)1 << 32)
#endif

/*
 * Enforce the assumptions made by vm_phys_add_seg() and vm_phys_init() about
 * the ordering of the free list boundaries.
 */
#if defined(VM_LOWMEM_BOUNDARY) && defined(VM_DMA32_BOUNDARY)
CTASSERT(VM_LOWMEM_BOUNDARY < VM_DMA32_BOUNDARY);
#endif

static int sysctl_vm_phys_free(SYSCTL_HANDLER_ARGS);
SYSCTL_OID(_vm, OID_AUTO, phys_free,
    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
    sysctl_vm_phys_free, "A",
    "Phys Free Info");

static int sysctl_vm_phys_segs(SYSCTL_HANDLER_ARGS);
SYSCTL_OID(_vm, OID_AUTO, phys_segs,
    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
    sysctl_vm_phys_segs, "A",
    "Phys Seg Info");

#ifdef NUMA
static int sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS);
SYSCTL_OID(_vm, OID_AUTO, phys_locality,
    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
    sysctl_vm_phys_locality, "A",
    "Phys Locality Info");
#endif

SYSCTL_INT(_vm, OID_AUTO, ndomains, CTLFLAG_RD,
    &vm_ndomains, 0, "Number of physical memory domains available.");

static vm_page_t vm_phys_alloc_seg_contig(struct vm_phys_seg *seg,
    u_long npages, vm_paddr_t low, vm_paddr_t high, u_long alignment,
    vm_paddr_t boundary);
static void _vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end, int domain);
static void vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end);
static void vm_phys_split_pages(vm_page_t m, int oind, struct vm_freelist *fl,
    int order, int tail);

/*
 * Red-black tree helpers for vm fictitious range management.
 */
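/*
 * A lookup for a single page is encoded as a degenerate segment whose end
 * address is zero; vm_phys_fictitious_cmp() detects that case and defers to
 * vm_phys_fictitious_in_range(), so a tree search returns the registered
 * range that contains the page address.
 */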
static inline int
vm_phys_fictitious_in_range(struct vm_phys_fictitious_seg *p,
    struct vm_phys_fictitious_seg *range)
{

	KASSERT(range->start != 0 && range->end != 0,
	    ("Invalid range passed on search for vm_fictitious page"));
	if (p->start >= range->end)
		return (1);
	if (p->start < range->start)
		return (-1);

	return (0);
}

static int
vm_phys_fictitious_cmp(struct vm_phys_fictitious_seg *p1,
    struct vm_phys_fictitious_seg *p2)
{

	/* Check if this is a search for a page */
	if (p1->end == 0)
		return (vm_phys_fictitious_in_range(p1, p2));

	KASSERT(p2->end != 0,
	    ("Invalid range passed as second parameter to vm fictitious comparison"));

	/* Searching to add a new range */
	if (p1->end <= p2->start)
		return (-1);
	if (p1->start >= p2->end)
		return (1);

	panic("Trying to add overlapping vm fictitious ranges:\n"
	    "[%#jx:%#jx] and [%#jx:%#jx]", (uintmax_t)p1->start,
	    (uintmax_t)p1->end, (uintmax_t)p2->start, (uintmax_t)p2->end);
}
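
/*
 * Return a NUMA domain whose memory overlaps the physical address range
 * [low, high], preferring "prefer" when that domain qualifies.  For example
 * (illustrative), vm_phys_domain_match(-1, 0, (vm_paddr_t)1 << 32) yields
 * the lowest-numbered domain with memory in the first 4 GB and panics if
 * there is none.  Without NUMA, or with a single domain, 0 is returned.
 */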
int
vm_phys_domain_match(int prefer, vm_paddr_t low, vm_paddr_t high)
{
#ifdef NUMA
	domainset_t mask;
	int i;

	if (vm_ndomains == 1 || mem_affinity == NULL)
		return (0);

	DOMAINSET_ZERO(&mask);
	/*
	 * Check for any memory that overlaps low, high.
	 */
	for (i = 0; mem_affinity[i].end != 0; i++)
		if (mem_affinity[i].start <= high &&
		    mem_affinity[i].end >= low)
			DOMAINSET_SET(mem_affinity[i].domain, &mask);
	if (prefer != -1 && DOMAINSET_ISSET(prefer, &mask))
		return (prefer);
	if (DOMAINSET_EMPTY(&mask))
		panic("vm_phys_domain_match:  Impossible constraint");
	return (DOMAINSET_FFS(&mask) - 1);
#else
	return (0);
#endif
}

/*
 * Outputs the state of the physical memory allocator, specifically,
 * the amount of physical memory in each free list.
 */
static int
sysctl_vm_phys_free(SYSCTL_HANDLER_ARGS)
{
	struct sbuf sbuf;
	struct vm_freelist *fl;
	int dom, error, flind, oind, pind;

	error = sysctl_wire_old_buffer(req, 0);
	if (error != 0)
		return (error);
	sbuf_new_for_sysctl(&sbuf, NULL, 128 * vm_ndomains, req);
	for (dom = 0; dom < vm_ndomains; dom++) {
		sbuf_printf(&sbuf,"\nDOMAIN %d:\n", dom);
		for (flind = 0; flind < vm_nfreelists; flind++) {
			sbuf_printf(&sbuf, "\nFREE LIST %d:\n"
			    "\n  ORDER (SIZE)  |  NUMBER"
			    "\n              ", flind);
			for (pind = 0; pind < VM_NFREEPOOL; pind++)
				sbuf_printf(&sbuf, "  |  POOL %d", pind);
			sbuf_printf(&sbuf, "\n--            ");
			for (pind = 0; pind < VM_NFREEPOOL; pind++)
				sbuf_printf(&sbuf, "-- --      ");
			sbuf_printf(&sbuf, "--\n");
			for (oind = VM_NFREEORDER - 1; oind >= 0; oind--) {
				sbuf_printf(&sbuf, "  %2d (%6dK)", oind,
				    1 << (PAGE_SHIFT - 10 + oind));
				for (pind = 0; pind < VM_NFREEPOOL; pind++) {
					fl = vm_phys_free_queues[dom][flind][pind];
					sbuf_printf(&sbuf, "  |  %6d",
					    fl[oind].lcnt);
				}
				sbuf_printf(&sbuf, "\n");
			}
		}
	}
	error = sbuf_finish(&sbuf);
	sbuf_delete(&sbuf);
	return (error);
}
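For context, handlers like the one above are normally exposed as read-only string sysctls. The registration below is a sketch only; the node name, flags, and description string are assumptions for illustration, not a quote of this file.

/*
 * Sketch only: expose the handler above as a read-only string node,
 * readable as e.g. "sysctl vm.phys_free".  Name and flags are
 * illustrative choices.
 */
SYSCTL_OID(_vm, OID_AUTO, phys_free,
    CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
    sysctl_vm_phys_free, "A", "Phys Free Info");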

/*
 * Outputs the set of physical memory segments.
 */
static int
sysctl_vm_phys_segs(SYSCTL_HANDLER_ARGS)
{
	struct sbuf sbuf;
	struct vm_phys_seg *seg;
	int error, segind;

	error = sysctl_wire_old_buffer(req, 0);
	if (error != 0)
		return (error);
	sbuf_new_for_sysctl(&sbuf, NULL, 128, req);
	for (segind = 0; segind < vm_phys_nsegs; segind++) {
		sbuf_printf(&sbuf, "\nSEGMENT %d:\n\n", segind);
		seg = &vm_phys_segs[segind];
		sbuf_printf(&sbuf, "start:     %#jx\n",
		    (uintmax_t)seg->start);
		sbuf_printf(&sbuf, "end:       %#jx\n",
		    (uintmax_t)seg->end);
		sbuf_printf(&sbuf, "domain:    %d\n", seg->domain);
		sbuf_printf(&sbuf, "free list: %p\n", seg->free_queues);
	}
	error = sbuf_finish(&sbuf);
	sbuf_delete(&sbuf);
	return (error);
}

Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating whenever someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* Tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* Add syscalls to control changing thread and process defaults.
* Add a global NUMA VM domain policy.
* Implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; otherwise use the default policy.
* Processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* Add a simple tool (numactl) to query and modify default thread/process
policies.
* Add documentation for the new syscalls, for numa and for numactl.
* Re-enable first-touch NUMA again by default, as policies can now be
set in a variety of ways.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
I.e., if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thank you norse!)
* sandy bridge, 2 socket (thank you dell!)
* ivy bridge, 2 socket (thank you norse!)
* westmere-EX, 4 socket / 1TB RAM (thank you norse!)
* haswell, 2 socket (thank you norse!)
* haswell v3, 2 socket (thank you dell)
* haswell v3, 2x18 core (thank you scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but has not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail, leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00

/*
 * Return affinity, or -1 if there's no affinity information.
 */
int
vm_phys_mem_affinity(int f, int t)
{

#ifdef NUMA
	if (mem_locality == NULL)
		return (-1);
	if (f >= vm_ndomains || t >= vm_ndomains)
		return (-1);
	return (mem_locality[f * vm_ndomains + t]);
#else
	return (-1);
#endif
}
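Callers must treat a negative return as "no information". As a minimal sketch, assuming a SLIT-style metric where smaller values mean closer, choosing the nearest remote domain could look like the helper below; the function name and the fallback to domain 0 are assumptions, not part of this file.

/*
 * Sketch only: find the domain with the smallest locality value
 * relative to "from", skipping "from" itself.  Falls back to
 * domain 0 when no locality table is available.
 */
static int
vm_phys_nearest_domain_sketch(int from)
{
	int best, best_dist, d, dist;

	best = 0;
	best_dist = INT_MAX;
	for (d = 0; d < vm_ndomains; d++) {
		if (d == from)
			continue;
		dist = vm_phys_mem_affinity(from, d);
		if (dist < 0)
			continue;	/* no locality information */
		if (dist < best_dist) {
			best_dist = dist;
			best = d;
		}
	}
	return (best);
}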

#ifdef NUMA
/*
 * Outputs the VM locality table.
 */
static int
sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS)
{
	struct sbuf sbuf;
	int error, i, j;

	error = sysctl_wire_old_buffer(req, 0);
	if (error != 0)
		return (error);
	sbuf_new_for_sysctl(&sbuf, NULL, 128, req);

	sbuf_printf(&sbuf, "\n");

	for (i = 0; i < vm_ndomains; i++) {
		sbuf_printf(&sbuf, "%d: ", i);
		for (j = 0; j < vm_ndomains; j++) {
			sbuf_printf(&sbuf, "%d ", vm_phys_mem_affinity(i, j));
		}
		sbuf_printf(&sbuf, "\n");
	}
	error = sbuf_finish(&sbuf);
	sbuf_delete(&sbuf);
	return (error);
}
#endif

static void
vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int tail)
{

	m->order = order;
	if (tail)
		TAILQ_INSERT_TAIL(&fl[order].pl, m, listq);
	else
		TAILQ_INSERT_HEAD(&fl[order].pl, m, listq);
	fl[order].lcnt++;
}

static void
vm_freelist_rem(struct vm_freelist *fl, vm_page_t m, int order)
{

	TAILQ_REMOVE(&fl[order].pl, m, listq);
	fl[order].lcnt--;
	m->order = VM_NFREEORDER;
}
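These two helpers are the only entry points to the per-order free queues, so the buddy operations elsewhere in this file are built on them. As a rough sketch of that usage, splitting a free block of order "oind" down to "order" returns each upper half to the queue of its smaller order; the function name below is made up and the sketch omits the assertions the real splitting code carries.

/*
 * Sketch only: split a free block of order "oind" down to "order",
 * returning each upper half to the free queue of its (smaller) order.
 */
static void
vm_phys_split_sketch(vm_page_t m, int oind, struct vm_freelist *fl, int order)
{
	vm_page_t m_buddy;

	while (oind > order) {
		oind--;
		m_buddy = &m[1 << oind];	/* upper half of the block */
		vm_freelist_add(fl, m_buddy, oind, 0);
	}
}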
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequently, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change addresses reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00

/*
 * Create a physical memory segment.
 */
static void
_vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end, int domain)
{
	struct vm_phys_seg *seg;

	KASSERT(vm_phys_nsegs < VM_PHYSSEG_MAX,
	    ("vm_phys_create_seg: increase VM_PHYSSEG_MAX"));
	KASSERT(domain >= 0 && domain < vm_ndomains,
	    ("vm_phys_create_seg: invalid domain provided"));
	seg = &vm_phys_segs[vm_phys_nsegs++];
	while (seg > vm_phys_segs && (seg - 1)->start >= end) {
		*seg = *(seg - 1);
		seg--;
	}
	seg->start = start;
	seg->end = end;
	seg->domain = domain;
}

static void
vm_phys_create_seg(vm_paddr_t start, vm_paddr_t end)
{
#ifdef NUMA
	int i;

	if (mem_affinity == NULL) {
		_vm_phys_create_seg(start, end, 0);
		return;
	}

	for (i = 0;; i++) {
		if (mem_affinity[i].end == 0)
			panic("Reached end of affinity info");
		if (mem_affinity[i].end <= start)
			continue;
		if (mem_affinity[i].start > start)
			panic("No affinity info for start %jx",
			    (uintmax_t)start);
		if (mem_affinity[i].end >= end) {
			_vm_phys_create_seg(start, end,
			    mem_affinity[i].domain);
			break;
		}
		_vm_phys_create_seg(start, mem_affinity[i].end,
		    mem_affinity[i].domain);
		start = mem_affinity[i].end;
	}
#else
	_vm_phys_create_seg(start, end, 0);
#endif
}
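The mem_affinity table consulted above is filled in by platform code (on x86, from the ACPI SRAT) before vm_phys runs. A purely hypothetical two-domain layout makes the loop concrete: under the table below, a segment spanning the 8 GB mark would be split into a domain 0 piece and a domain 1 piece. The array name and addresses are illustrative only.

/*
 * Hypothetical example of the table vm_phys_create_seg() walks:
 * the first 8 GB belongs to domain 0, the next 8 GB to domain 1,
 * and an entry with end == 0 terminates the table.
 */
static struct mem_affinity example_mem_affinity[] = {
	{ .start = 0x000000000, .end = 0x200000000, .domain = 0 },
	{ .start = 0x200000000, .end = 0x400000000, .domain = 1 },
	{ .start = 0,           .end = 0,           .domain = 0 },
};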

/*
 * Add a physical memory segment.
 */
void
vm_phys_add_seg(vm_paddr_t start, vm_paddr_t end)
{
	vm_paddr_t paddr;

	KASSERT((start & PAGE_MASK) == 0,
	    ("vm_phys_define_seg: start is not page aligned"));
	KASSERT((end & PAGE_MASK) == 0,
	    ("vm_phys_define_seg: end is not page aligned"));

	/*
	 * Split the physical memory segment if it spans two or more free
	 * list boundaries.
	 */
	paddr = start;
#ifdef VM_FREELIST_LOWMEM
	if (paddr < VM_LOWMEM_BOUNDARY && end > VM_LOWMEM_BOUNDARY) {
		vm_phys_create_seg(paddr, VM_LOWMEM_BOUNDARY);
		paddr = VM_LOWMEM_BOUNDARY;
	}
#endif
#ifdef VM_FREELIST_DMA32
	if (paddr < VM_DMA32_BOUNDARY && end > VM_DMA32_BOUNDARY) {
		vm_phys_create_seg(paddr, VM_DMA32_BOUNDARY);
		paddr = VM_DMA32_BOUNDARY;
	}
#endif
	vm_phys_create_seg(paddr, end);
}
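Splitting segments at VM_LOWMEM_BOUNDARY and VM_DMA32_BOUNDARY is what gives the allocator distinct pools of low physical addresses. A driver limited to 32-bit DMA addressing would typically reach the below-4 GB pool through contigmalloc(9); the sketch below uses an illustrative size, malloc type, and wait flag, and the wrapper function name is made up.

/*
 * Sketch only: allocate one physically contiguous, page-aligned
 * 64 KB buffer whose physical address lies below 4 GB.
 */
static void *
dma32_buf_alloc_sketch(void)
{

	return (contigmalloc(64 * 1024, M_DEVBUF, M_WAITOK,
	    0,			/* lowest acceptable physical address */
	    0xffffffffUL,	/* highest acceptable physical address */
	    PAGE_SIZE,		/* alignment of the allocation */
	    0));		/* no boundary-crossing restriction */
}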

/*
 * Initialize the physical memory allocator.
 *
 * Requires that vm_page_array is initialized!
 */
void
vm_phys_init(void)
{
	struct vm_freelist *fl;
	struct vm_phys_seg *end_seg, *prev_seg, *seg, *tmp_seg;
	u_long npages;
	int dom, flind, freelist, oind, pind, segind;

	/*
	 * Compute the number of free lists, and generate the mapping from the
	 * manifest constants VM_FREELIST_* to the free list indices.
	 *
	 * Initially, the entries of vm_freelist_to_flind[] are set to either
	 * 0 or 1 to indicate which free lists should be created.
	 */
	npages = 0;
	for (segind = vm_phys_nsegs - 1; segind >= 0; segind--) {
		seg = &vm_phys_segs[segind];
#ifdef VM_FREELIST_LOWMEM
		if (seg->end <= VM_LOWMEM_BOUNDARY)
			vm_freelist_to_flind[VM_FREELIST_LOWMEM] = 1;
		else
#endif
#ifdef VM_FREELIST_DMA32
		if (
#ifdef VM_DMA32_NPAGES_THRESHOLD
		    /*
		     * Create the DMA32 free list only if the amount of
		     * physical memory above physical address 4G exceeds the
		     * given threshold.
		     */
		    npages > VM_DMA32_NPAGES_THRESHOLD &&
#endif
		    seg->end <= VM_DMA32_BOUNDARY)
			vm_freelist_to_flind[VM_FREELIST_DMA32] = 1;
		else
#endif
		{
			npages += atop(seg->end - seg->start);
			vm_freelist_to_flind[VM_FREELIST_DEFAULT] = 1;
		}
	}
	/* Change each entry into a running total of the free lists. */
	for (freelist = 1; freelist < VM_NFREELIST; freelist++) {
		vm_freelist_to_flind[freelist] +=
		    vm_freelist_to_flind[freelist - 1];
	}
	vm_nfreelists = vm_freelist_to_flind[VM_NFREELIST - 1];
	KASSERT(vm_nfreelists > 0, ("vm_phys_init: no free lists"));
	/* Change each entry into a free list index. */
	for (freelist = 0; freelist < VM_NFREELIST; freelist++)
		vm_freelist_to_flind[freelist]--;

	/*
	 * Initialize the first_page and free_queues fields of each physical
	 * memory segment.
	 */
#ifdef VM_PHYSSEG_SPARSE
	npages = 0;
#endif
	for (segind = 0; segind < vm_phys_nsegs; segind++) {
		seg = &vm_phys_segs[segind];
#ifdef VM_PHYSSEG_SPARSE
|
|
|
seg->first_page = &vm_page_array[npages];
|
|
|
|
npages += atop(seg->end - seg->start);
|
2014-11-15 23:40:44 +00:00
|
|
|
#else
|
|
|
|
seg->first_page = PHYS_TO_VM_PAGE(seg->start);
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
#endif
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
#ifdef VM_FREELIST_LOWMEM
|
|
|
|
if (seg->end <= VM_LOWMEM_BOUNDARY) {
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_LOWMEM];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: LOWMEM flind < 0"));
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
#ifdef VM_FREELIST_DMA32
|
|
|
|
if (seg->end <= VM_DMA32_BOUNDARY) {
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_DMA32];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: DMA32 flind < 0"));
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
flind = vm_freelist_to_flind[VM_FREELIST_DEFAULT];
|
|
|
|
KASSERT(flind >= 0,
|
|
|
|
("vm_phys_init: DEFAULT flind < 0"));
|
|
|
|
}
|
|
|
|
seg->free_queues = &vm_phys_free_queues[seg->domain][flind];
|
Add a new physical memory allocator. However, do not yet connect it
to the build.
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
2007-06-10 00:49:16 +00:00
|
|
|
}
|
The physical memory allocator supports the use of distinct free lists for
managing pages from different address ranges. Generally speaking, this
feature is used to increase the likelihood that physical pages are
available that can meet special DMA requirements or can be accessed through
a limited-coverage direct mapping (e.g., MIPS). However, prior to this
change, the configuration of the free lists was static, i.e., it was
determined at compile time. Consequentally, free lists could be created
for address ranges that held no actual pages, for example, on 32-bit MIPS-
based systems with 512 MB or less of physical memory. This change makes
the creation of the free lists dynamic, i.e., it is based on the available
physical memory at boot time.
On 64-bit x86-based systems with 64 GB or more of physical memory, create
free lists for managing pages with physical addresses below 4 GB. This
change is to address reported problems with initializing devices that
require the allocation of physical pages below 4 GB on some systems with
128 GB or more of physical memory.
PR: 185727
Differential Revision: https://reviews.freebsd.org/D1274
Reviewed by: jhb, kib
MFC after: 3 weeks
Sponsored by: EMC / Isilon Storage Division
2014-12-31 00:54:38 +00:00
|
|
|
|
2018-09-02 18:29:38 +00:00
|
|
|
/*
|
|
|
|
* Coalesce physical memory segments that are contiguous and share the
|
|
|
|
* same per-domain free queues.
|
|
|
|
*/
|
|
|
|
prev_seg = vm_phys_segs;
|
|
|
|
seg = &vm_phys_segs[1];
|
|
|
|
end_seg = &vm_phys_segs[vm_phys_nsegs];
|
|
|
|
while (seg < end_seg) {
|
|
|
|
if (prev_seg->end == seg->start &&
|
|
|
|
prev_seg->free_queues == seg->free_queues) {
|
|
|
|
prev_seg->end = seg->end;
|
|
|
|
KASSERT(prev_seg->domain == seg->domain,
|
|
|
|
("vm_phys_init: free queues cannot span domains"));
|
|
|
|
vm_phys_nsegs--;
|
|
|
|
end_seg--;
|
|
|
|
for (tmp_seg = seg; tmp_seg < end_seg; tmp_seg++)
|
|
|
|
*tmp_seg = *(tmp_seg + 1);
|
|
|
|
} else {
|
|
|
|
prev_seg = seg;
|
|
|
|
seg++;
|
|
|
|
}
|
|
|
|
}
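A brief illustration of the coalescing step above (hypothetical addresses, not taken from this file):

        /*
         * Example: if vm_phys_segs[] holds [1 GB, 2 GB) and [2 GB, 3 GB) in
         * the same domain with the same free_queues pointer, the loop above
         * merges them into a single segment [1 GB, 3 GB) and shifts the
         * remaining entries down by one.
         */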

        /*
         * Initialize the free queues.
         */
        for (dom = 0; dom < vm_ndomains; dom++) {
                for (flind = 0; flind < vm_nfreelists; flind++) {
                        for (pind = 0; pind < VM_NFREEPOOL; pind++) {
                                fl = vm_phys_free_queues[dom][flind][pind];
                                for (oind = 0; oind < VM_NFREEORDER; oind++)
                                        TAILQ_INIT(&fl[oind].pl);
                        }
                }
        }

        rw_init(&vm_phys_fictitious_reg_lock, "vmfctr");
}

/*
 * Register info about the NUMA topology of the system.
 *
 * Invoked by platform-dependent code prior to vm_phys_init().
 */
void
vm_phys_register_domains(int ndomains, struct mem_affinity *affinity,
    int *locality)
{
#ifdef NUMA
        int d, i;

        /*
         * For now the only override value that we support is 1, which
         * effectively disables NUMA-awareness in the allocators.
         */
        d = 0;
        TUNABLE_INT_FETCH("vm.numa.disabled", &d);
        if (d)
                ndomains = 1;

        if (ndomains > 1) {
                vm_ndomains = ndomains;
                mem_affinity = affinity;
                mem_locality = locality;
        }

        for (i = 0; i < vm_ndomains; i++)
                DOMAINSET_SET(i, &all_domains);
#else
        (void)ndomains;
        (void)affinity;
        (void)locality;
#endif
}
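A hedged usage note on the override above: TUNABLE_INT_FETCH() reads the kernel environment, so the value can be supplied at boot time; the loader.conf line below is illustrative only and not part of this commit.

        /*
         * Hypothetical /boot/loader.conf entry:
         *
         *      vm.numa.disabled=1
         *
         * forces a single-domain configuration before vm_phys_init() runs.
         */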

int
_vm_phys_domain(vm_paddr_t pa)
{
#ifdef NUMA
        int i;

Fix boot on systems where NUMA domain 0 is unpopulated.
- Add vm_phys_early_add_seg(), complementing vm_phys_early_alloc(), to
ensure that segments registered during hammer_time() are placed in the
right domain. Otherwise, since the SRAT is not parsed at that point,
we just add them to domain 0, which may be incorrect and results in a
domain with only several MB worth of memory.
- Fix uma_startup1() to try allocating memory for zones from any domain.
If domain 0 is unpopulated, the allocation will simply fail, resulting
in a page fault slightly later during boot.
- Change _vm_phys_domain() to return -1 for addresses not covered by the
affinity table, and change vm_phys_early_alloc() to handle wildcard
domains. This is necessary on amd64, where the page array is dense
and pmap_page_array_startup() may allocate page table pages for
non-existent page frames.
Reported and tested by: Rafael Kitover <rkitover@gmail.com>
Reviewed by: cem (earlier version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25001
2020-05-28 19:41:00 +00:00
        if (vm_ndomains == 1)
                return (0);
        for (i = 0; mem_affinity[i].end != 0; i++)
                if (mem_affinity[i].start <= pa &&
                    mem_affinity[i].end >= pa)
                        return (mem_affinity[i].domain);
        return (-1);
#else
        return (0);
#endif
}

/*
 * Split a contiguous, power of two-sized set of physical pages.
 *
 * When this function is called by a page allocation function, the caller
 * should request insertion at the head unless the order [order, oind) queues
 * are known to be empty. The objective is to reduce the likelihood of
 * long-term fragmentation by promoting contemporaneous allocation and
 * (hopefully) deallocation.
 */
static __inline void
vm_phys_split_pages(vm_page_t m, int oind, struct vm_freelist *fl, int order,
    int tail)
{
        vm_page_t m_buddy;

        while (oind > order) {
                oind--;
                m_buddy = &m[1 << oind];
                KASSERT(m_buddy->order == VM_NFREEORDER,
                    ("vm_phys_split_pages: page %p has unexpected order %d",
                    m_buddy, m_buddy->order));
                vm_freelist_add(fl, m_buddy, oind, tail);
        }
}
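A worked illustration of the splitting loop above, using hypothetical numbers rather than anything from this file:

        /*
         * Example: splitting an order-3 block at m down to order 0 re-queues
         * the order-2 buddy at m + 4, the order-1 buddy at m + 2, and the
         * order-0 buddy at m + 1, leaving only the single page m for the
         * caller.
         */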

/*
 * Add the physical pages [m, m + npages) at the end of a power-of-two aligned
 * and sized set to the specified free list.
 *
 * When this function is called by a page allocation function, the caller
 * should request insertion at the head unless the lower-order queues are
 * known to be empty. The objective is to reduce the likelihood of long-term
 * fragmentation by promoting contemporaneous allocation and (hopefully)
 * deallocation.
 *
 * The physical page m's buddy must not be free.
 */
static void
vm_phys_enq_range(vm_page_t m, u_int npages, struct vm_freelist *fl, int tail)
{
        u_int n;
        int order;

        KASSERT(npages > 0, ("vm_phys_enq_range: npages is 0"));
        KASSERT(((VM_PAGE_TO_PHYS(m) + npages * PAGE_SIZE) &
            ((PAGE_SIZE << (fls(npages) - 1)) - 1)) == 0,
            ("vm_phys_enq_range: page %p and npages %u are misaligned",
            m, npages));
        do {
                KASSERT(m->order == VM_NFREEORDER,
                    ("vm_phys_enq_range: page %p has unexpected order %d",
                    m, m->order));
                order = ffs(npages) - 1;
                KASSERT(order < VM_NFREEORDER,
                    ("vm_phys_enq_range: order %d is out of range", order));
                vm_freelist_add(fl, m, order, tail);
                n = 1 << order;
                m += n;
                npages -= n;
        } while (npages > 0);
}
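A worked illustration of the decomposition above, again with hypothetical numbers:

        /*
         * Example: with npages = 11 and m positioned so that m + 11 pages
         * ends on an 8-page aligned boundary, the loop frees an order-0
         * block at m, an order-1 block at m + 1, and an order-3 block at
         * m + 3, i.e. 1 + 2 + 8 = 11 pages, each block naturally aligned
         * for its order.
         */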

/*
 * Tries to allocate the specified number of pages from the specified pool
 * within the specified domain. Returns the actual number of allocated pages
 * and a pointer to each page through the array ma[].
 *
 * The returned pages may not be physically contiguous. However, in contrast
 * to performing multiple, back-to-back calls to vm_phys_alloc_pages(..., 0),
 * calling this function once to allocate the desired number of pages will
 * avoid wasted time in vm_phys_split_pages().
 *
 * The free page queues for the specified domain must be locked.
 */
int
vm_phys_alloc_npages(int domain, int pool, int npages, vm_page_t ma[])
{
        struct vm_freelist *alt, *fl;
        vm_page_t m;
        int avail, end, flind, freelist, i, need, oind, pind;

        KASSERT(domain >= 0 && domain < vm_ndomains,
            ("vm_phys_alloc_npages: domain %d is out of range", domain));
        KASSERT(pool < VM_NFREEPOOL,
            ("vm_phys_alloc_npages: pool %d is out of range", pool));
        KASSERT(npages <= 1 << (VM_NFREEORDER - 1),
            ("vm_phys_alloc_npages: npages %d is out of range", npages));
        vm_domain_free_assert_locked(VM_DOMAIN(domain));
        i = 0;
        for (freelist = 0; freelist < VM_NFREELIST; freelist++) {
                flind = vm_freelist_to_flind[freelist];
                if (flind < 0)
                        continue;
                fl = vm_phys_free_queues[domain][flind][pool];
                for (oind = 0; oind < VM_NFREEORDER; oind++) {
                        while ((m = TAILQ_FIRST(&fl[oind].pl)) != NULL) {
                                vm_freelist_rem(fl, m, oind);
                                avail = 1 << oind;
                                need = imin(npages - i, avail);
                                for (end = i + need; i < end;)
                                        ma[i++] = m++;
                                if (need < avail) {
                                        /*
                                         * Return excess pages to fl. Its
                                         * order [0, oind) queues are empty.
                                         */
                                        vm_phys_enq_range(m, avail - need, fl,
                                            1);
                                        return (npages);
                                } else if (i == npages)
                                        return (npages);
                        }
                }
                for (oind = VM_NFREEORDER - 1; oind >= 0; oind--) {
                        for (pind = 0; pind < VM_NFREEPOOL; pind++) {
                                alt = vm_phys_free_queues[domain][flind][pind];
                                while ((m = TAILQ_FIRST(&alt[oind].pl)) !=
                                    NULL) {
                                        vm_freelist_rem(alt, m, oind);
                                        vm_phys_set_pool(pool, m, oind);
                                        avail = 1 << oind;
                                        need = imin(npages - i, avail);
                                        for (end = i + need; i < end;)
                                                ma[i++] = m++;
                                        if (need < avail) {
                                                /*
                                                 * Return excess pages to fl.
                                                 * Its order [0, oind) queues
                                                 * are empty.
                                                 */
                                                vm_phys_enq_range(m, avail -
                                                    need, fl, 1);
                                                return (npages);
                                        } else if (i == npages)
                                                return (npages);
                                }
                        }
                }
        }
        return (i);
}
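A hypothetical sketch (not part of this file) of a caller using the function above to grab a small batch of pages. The wrapper name and the choice of VM_FREEPOOL_DEFAULT are assumptions, and the per-domain free-queue lock is taken explicitly because the function only asserts that it is held.

static int
example_alloc_batch(int domain, vm_page_t ma[], int npages)
{
        int got;

        /* npages must not exceed 1 << (VM_NFREEORDER - 1), per the KASSERT. */
        vm_domain_free_lock(VM_DOMAIN(domain));
        got = vm_phys_alloc_npages(domain, VM_FREEPOOL_DEFAULT, npages, ma);
        vm_domain_free_unlock(VM_DOMAIN(domain));
        return (got);
}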

/*
 * Allocate a contiguous, power of two-sized set of physical pages
 * from the free lists.
 *
 * The free page queues must be locked.
 */
vm_page_t
vm_phys_alloc_pages(int domain, int pool, int order)
{
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condition (noted by alc@).
The changes are:
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in a 32-bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default (direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
        vm_page_t m;
        int freelist;
Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policies.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
I.e., if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thank you norse!)
* sandy bridge, 2 socket (thank you dell!)
* ivy bridge, 2 socket (thank you norse!)
* westmere-EX, 4 socket / 1TB RAM (thank you norse!)
* haswell, 2 socket (thank you norse!)
* haswell v3, 2 socket (thank you dell)
* haswell v3, 2x18 core (thank you scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00

        for (freelist = 0; freelist < VM_NFREELIST; freelist++) {
                m = vm_phys_alloc_freelist_pages(domain, freelist, pool, order);
                if (m != NULL)
                        return (m);
        }
        return (NULL);
}

/*
 * Allocate a contiguous, power of two-sized set of physical pages from the
 * specified free list. The free list must be specified using one of the
 * manifest constants VM_FREELIST_*.
 *
 * The free page queues must be locked.
 */
vm_page_t
vm_phys_alloc_freelist_pages(int domain, int freelist, int pool, int order)
{
        struct vm_freelist *alt, *fl;
        vm_page_t m;
        int oind, pind, flind;

        KASSERT(domain >= 0 && domain < vm_ndomains,
            ("vm_phys_alloc_freelist_pages: domain %d is out of range",
            domain));
        KASSERT(freelist < VM_NFREELIST,
            ("vm_phys_alloc_freelist_pages: freelist %d is out of range",
            freelist));
        KASSERT(pool < VM_NFREEPOOL,
            ("vm_phys_alloc_freelist_pages: pool %d is out of range", pool));
        KASSERT(order < VM_NFREEORDER,
            ("vm_phys_alloc_freelist_pages: order %d is out of range", order));

        flind = vm_freelist_to_flind[freelist];
        /* Check if freelist is present */
        if (flind < 0)
                return (NULL);

        vm_domain_free_assert_locked(VM_DOMAIN(domain));
        fl = &vm_phys_free_queues[domain][flind][pool][0];
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
for (oind = order; oind < VM_NFREEORDER; oind++) {
|
|
|
|
m = TAILQ_FIRST(&fl[oind].pl);
|
|
|
|
if (m != NULL) {
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m, oind);
|
2018-07-05 02:08:57 +00:00
|
|
|
/* The order [order, oind) queues are empty. */
|
|
|
|
vm_phys_split_pages(m, oind, fl, order, 1);
|
Redo the page table page allocation on MIPS, as suggested by
alc@.
The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.
This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condtion(noted by alc@).
The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default(direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().
Reviewed by: alc
2010-07-21 09:27:00 +00:00
|
|
|
return (m);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The given pool was empty. Find the largest
|
|
|
|
* contiguous, power-of-two-sized set of pages in any
|
|
|
|
* pool. Transfer these pages to the given pool, and
|
|
|
|
* use them to satisfy the allocation.
|
|
|
|
*/
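        /*
         * For example, an order-0 request for a VM_FREEPOOL_DIRECT page may
         * be satisfied by stealing the largest available block from
         * VM_FREEPOOL_DEFAULT, reassigning it to the direct pool, and
         * splitting off the excess (pool names as defined on amd64/i386).
         */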
        for (oind = VM_NFREEORDER - 1; oind >= order; oind--) {
                for (pind = 0; pind < VM_NFREEPOOL; pind++) {
                        alt = &vm_phys_free_queues[domain][flind][pind][0];
                        m = TAILQ_FIRST(&alt[oind].pl);
                        if (m != NULL) {
                                vm_freelist_rem(alt, m, oind);
                                vm_phys_set_pool(pool, m, oind);
                                /* The order [order, oind) queues are empty. */
                                vm_phys_split_pages(m, oind, fl, order, 1);
                                return (m);
                        }
                }
        }
        return (NULL);
}

/*
 * Find the vm_page corresponding to the given physical address.
 */
vm_page_t
vm_phys_paddr_to_vm_page(vm_paddr_t pa)
{
        struct vm_phys_seg *seg;
        int segind;

        for (segind = 0; segind < vm_phys_nsegs; segind++) {
                seg = &vm_phys_segs[segind];
                if (pa >= seg->start && pa < seg->end)
                        return (&seg->first_page[atop(pa - seg->start)]);
        }
        return (NULL);
}
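
/*
 * Find the vm_page belonging to a registered range of fictitious pages that
 * corresponds to the given physical address "pa", or return NULL if "pa"
 * does not fall within any registered range.
 */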
vm_page_t
vm_phys_fictitious_to_vm_page(vm_paddr_t pa)
{
        struct vm_phys_fictitious_seg tmp, *seg;
        vm_page_t m;

        m = NULL;
        tmp.start = pa;
        tmp.end = 0;

        rw_rlock(&vm_phys_fictitious_reg_lock);
        seg = RB_FIND(fict_tree, &vm_phys_fictitious_tree, &tmp);
        rw_runlock(&vm_phys_fictitious_reg_lock);
        if (seg == NULL)
                return (NULL);

        m = &seg->first_page[atop(pa - seg->start)];
        KASSERT((m->flags & PG_FICTITIOUS) != 0, ("%p not fictitious", m));

        return (m);
}
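
/*
 * Initialize a range of "page_count" fictitious pages that starts at
 * physical address "start": zero the backing vm_page structures, mark each
 * page as fictitious with the given memory attribute, and leave it managed
 * and unbusied.
 */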
static inline void
vm_phys_fictitious_init_range(vm_page_t range, vm_paddr_t start,
    long page_count, vm_memattr_t memattr)
{
        long i;

        bzero(range, page_count * sizeof(*range));
        for (i = 0; i < page_count; i++) {
                vm_page_initfake(&range[i], start + PAGE_SIZE * i, memattr);
                range[i].oflags &= ~VPO_UNMANAGED;
                range[i].busy_lock = VPB_UNBUSIED;
        }
}
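
/*
 * Register the physical address range [start, end) as a range of fictitious
 * pages.  On VM_PHYSSEG_DENSE configurations, vm_page structures from
 * vm_page_array are reused for the portion of the range that they cover; any
 * remainder is allocated with malloc(9).  A range backed entirely by
 * vm_page_array is not entered into the registration tree.  Returns 0 on
 * success and EINVAL if the range extends both before and after
 * vm_page_array.
 */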
int
vm_phys_fictitious_reg_range(vm_paddr_t start, vm_paddr_t end,
    vm_memattr_t memattr)
{
        struct vm_phys_fictitious_seg *seg;
        vm_page_t fp;
        long page_count;
#ifdef VM_PHYSSEG_DENSE
        long pi, pe;
        long dpage_count;
#endif

        KASSERT(start < end,
            ("Start of segment isn't less than end (start: %jx end: %jx)",
            (uintmax_t)start, (uintmax_t)end));

        page_count = (end - start) / PAGE_SIZE;

#ifdef VM_PHYSSEG_DENSE
        pi = atop(start);
        pe = atop(end);
        if (pi >= first_page && (pi - first_page) < vm_page_array_size) {
                fp = &vm_page_array[pi - first_page];
                if ((pe - first_page) > vm_page_array_size) {
                        /*
                         * We have a segment that starts inside
                         * of vm_page_array, but ends outside of it.
                         *
                         * Use vm_page_array pages for those that are
                         * inside of the vm_page_array range, and
                         * allocate the remaining ones.
                         */
                        dpage_count = vm_page_array_size - (pi - first_page);
                        vm_phys_fictitious_init_range(fp, start, dpage_count,
                            memattr);
                        page_count -= dpage_count;
                        start += ptoa(dpage_count);
                        goto alloc;
                }
                /*
                 * We can allocate the full range from vm_page_array,
                 * so there's no need to register the range in the tree.
                 */
                vm_phys_fictitious_init_range(fp, start, page_count, memattr);
                return (0);
        } else if (pe > first_page && (pe - first_page) < vm_page_array_size) {
                /*
                 * We have a segment that ends inside of vm_page_array,
                 * but starts outside of it.
                 */
                fp = &vm_page_array[0];
                dpage_count = pe - first_page;
                vm_phys_fictitious_init_range(fp, ptoa(first_page), dpage_count,
                    memattr);
                end -= ptoa(dpage_count);
                page_count -= dpage_count;
                goto alloc;
        } else if (pi < first_page && pe > (first_page + vm_page_array_size)) {
                /*
                 * Trying to register a fictitious range that expands before
                 * and after vm_page_array.
                 */
                return (EINVAL);
        } else {
alloc:
#endif
                fp = malloc(page_count * sizeof(struct vm_page), M_FICT_PAGES,
                    M_WAITOK);
#ifdef VM_PHYSSEG_DENSE
        }
#endif
        vm_phys_fictitious_init_range(fp, start, page_count, memattr);

        seg = malloc(sizeof(*seg), M_FICT_PAGES, M_WAITOK | M_ZERO);
        seg->start = start;
        seg->end = end;
        seg->first_page = fp;

        rw_wlock(&vm_phys_fictitious_reg_lock);
        RB_INSERT(fict_tree, &vm_phys_fictitious_tree, seg);
        rw_wunlock(&vm_phys_fictitious_reg_lock);

        return (0);
}
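
/*
 * Unregister the fictitious page range [start, end).  The range must match a
 * previously registered range exactly; otherwise the kernel panics.  Pages
 * that were allocated with malloc(9) at registration time are freed, while
 * pages that came from vm_page_array are left in place.
 */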
void
vm_phys_fictitious_unreg_range(vm_paddr_t start, vm_paddr_t end)
{
        struct vm_phys_fictitious_seg *seg, tmp;
#ifdef VM_PHYSSEG_DENSE
        long pi, pe;
#endif

        KASSERT(start < end,
            ("Start of segment isn't less than end (start: %jx end: %jx)",
            (uintmax_t)start, (uintmax_t)end));

#ifdef VM_PHYSSEG_DENSE
        pi = atop(start);
        pe = atop(end);
        if (pi >= first_page && (pi - first_page) < vm_page_array_size) {
                if ((pe - first_page) <= vm_page_array_size) {
                        /*
                         * This segment was allocated using vm_page_array
                         * only, there's nothing to do since those pages
                         * were never added to the tree.
                         */
                        return;
                }
                /*
                 * We have a segment that starts inside
                 * of vm_page_array, but ends outside of it.
                 *
                 * Calculate how many pages were added to the
                 * tree and free them.
                 */
                start = ptoa(first_page + vm_page_array_size);
        } else if (pe > first_page && (pe - first_page) < vm_page_array_size) {
                /*
                 * We have a segment that ends inside of vm_page_array,
                 * but starts outside of it.
                 */
                end = ptoa(first_page);
        } else if (pi < first_page && pe > (first_page + vm_page_array_size)) {
                /* Since it's not possible to register such a range, panic. */
                panic(
                    "Unregistering not registered fictitious range [%#jx:%#jx]",
                    (uintmax_t)start, (uintmax_t)end);
        }
#endif
        tmp.start = start;
        tmp.end = 0;

        rw_wlock(&vm_phys_fictitious_reg_lock);
        seg = RB_FIND(fict_tree, &vm_phys_fictitious_tree, &tmp);
        if (seg->start != start || seg->end != end) {
                rw_wunlock(&vm_phys_fictitious_reg_lock);
                panic(
                    "Unregistering not registered fictitious range [%#jx:%#jx]",
                    (uintmax_t)start, (uintmax_t)end);
        }
        RB_REMOVE(fict_tree, &vm_phys_fictitious_tree, seg);
        rw_wunlock(&vm_phys_fictitious_reg_lock);
        free(seg->first_page, M_FICT_PAGES);
        free(seg, M_FICT_PAGES);
}

/*
 * Free a contiguous, power of two-sized set of physical pages.
 *
 * The free page queues must be locked.
 */
void
vm_phys_free_pages(vm_page_t m, int order)
{
        struct vm_freelist *fl;
        struct vm_phys_seg *seg;
        vm_paddr_t pa;
        vm_page_t m_buddy;

        KASSERT(m->order == VM_NFREEORDER,
            ("vm_phys_free_pages: page %p has unexpected order %d",
            m, m->order));
        KASSERT(m->pool < VM_NFREEPOOL,
            ("vm_phys_free_pages: page %p has unexpected pool %d",
            m, m->pool));
        KASSERT(order < VM_NFREEORDER,
            ("vm_phys_free_pages: order %d is out of range", order));
        seg = &vm_phys_segs[m->segind];
        vm_domain_free_assert_locked(VM_DOMAIN(seg->domain));
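        /*
         * Coalesce the freed block with its buddy while the buddy is itself
         * a free block of the same order within the same segment.  The
         * buddy's address differs from "pa" only in bit (PAGE_SHIFT +
         * order); for example, assuming PAGE_SHIFT == 12, the order-0 buddy
         * of the page at 0x1000 is the page at 0x0000.
         */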
        if (order < VM_NFREEORDER - 1) {
                pa = VM_PAGE_TO_PHYS(m);
                do {
                        pa ^= ((vm_paddr_t)1 << (PAGE_SHIFT + order));
                        if (pa < seg->start || pa >= seg->end)
                                break;
                        m_buddy = &seg->first_page[atop(pa - seg->start)];
                        if (m_buddy->order != order)
                                break;
                        fl = (*seg->free_queues)[m_buddy->pool];
                        vm_freelist_rem(fl, m_buddy, order);
                        if (m_buddy->pool != m->pool)
                                vm_phys_set_pool(m->pool, m_buddy, order);
                        order++;
                        pa &= ~(((vm_paddr_t)1 << (PAGE_SHIFT + order)) - 1);
                        m = &seg->first_page[atop(pa - seg->start)];
                } while (order < VM_NFREEORDER - 1);
        }
        fl = (*seg->free_queues)[m->pool];
        vm_freelist_add(fl, m, order, 1);
}

/*
 * Return the largest possible order of a set of pages starting at m.
 */
static int
max_order(vm_page_t m)
{

        /*
         * Unsigned "min" is used here so that "order" is assigned
         * "VM_NFREEORDER - 1" when "m"'s physical address is zero
         * or the low-order bits of its physical address are zero
         * because the size of a physical address exceeds the size of
         * a long.
         */
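        /*
         * For example, assuming PAGE_SHIFT == 12, a page at physical address
         * 0x6000 can begin a run of order at most 1, while a page at 0x8000
         * can begin a run of order up to 3.
         */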
        return (min(ffsl(VM_PAGE_TO_PHYS(m) >> PAGE_SHIFT) - 1,
            VM_NFREEORDER - 1));
}

/*
 * Free a contiguous, arbitrarily sized set of physical pages, without
 * merging across set boundaries.
 *
 * The free page queues must be locked.
 */
void
vm_phys_enqueue_contig(vm_page_t m, u_long npages)
{
        struct vm_freelist *fl;
        struct vm_phys_seg *seg;
        vm_page_t m_end;
        int order;

        /*
         * Avoid unnecessary coalescing by freeing the pages in the largest
         * possible power-of-two-sized subsets.
         */
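        /*
         * For example, enqueueing 13 pages that start at page frame 2 frees
         * blocks of 2 and 4 pages on the way up to alignment, and then
         * blocks of 4, 2, and 1 pages to cover the remainder.
         */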
        vm_domain_free_assert_locked(vm_pagequeue_domain(m));
        seg = &vm_phys_segs[m->segind];
        fl = (*seg->free_queues)[m->pool];
        m_end = m + npages;
        /* Free blocks of increasing size. */
        while ((order = max_order(m)) < VM_NFREEORDER - 1 &&
            m + (1 << order) <= m_end) {
                KASSERT(seg == &vm_phys_segs[m->segind],
                    ("%s: page range [%p,%p) spans multiple segments",
                    __func__, m_end - npages, m));
                vm_freelist_add(fl, m, order, 1);
                m += 1 << order;
        }
        /* Free blocks of maximum size. */
        while (m + (1 << order) <= m_end) {
                KASSERT(seg == &vm_phys_segs[m->segind],
                    ("%s: page range [%p,%p) spans multiple segments",
                    __func__, m_end - npages, m));
                vm_freelist_add(fl, m, order, 1);
                m += 1 << order;
        }
        /* Free blocks of diminishing size. */
        while (m < m_end) {
                KASSERT(seg == &vm_phys_segs[m->segind],
                    ("%s: page range [%p,%p) spans multiple segments",
                    __func__, m_end - npages, m));
                order = flsl(m_end - m) - 1;
                vm_freelist_add(fl, m, order, 1);
                m += 1 << order;
        }
}

/*
 * Free a contiguous, arbitrarily sized set of physical pages.
 *
 * The free page queues must be locked.
 */
void
vm_phys_free_contig(vm_page_t m, u_long npages)
{
        int order_start, order_end;
        vm_page_t m_start, m_end;

        vm_domain_free_assert_locked(vm_pagequeue_domain(m));

        m_start = m;
        order_start = max_order(m_start);
        if (order_start < VM_NFREEORDER - 1)
                m_start += 1 << order_start;
        m_end = m + npages;
        order_end = max_order(m_end);
        if (order_end < VM_NFREEORDER - 1)
                m_end -= 1 << order_end;
        /*
         * Avoid unnecessary coalescing by freeing the pages at the start and
         * end of the range last.
         */
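        /*
         * For example, freeing 10 pages that start at page frame 3 enqueues
         * frames [4, 12) here, and then frees the single pages at frames 3
         * and 12 below, where they may coalesce with neighboring free
         * blocks.
         */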
        if (m_start < m_end)
                vm_phys_enqueue_contig(m_start, m_end - m_start);
        if (order_start < VM_NFREEORDER - 1)
                vm_phys_free_pages(m, order_start);
        if (order_end < VM_NFREEORDER - 1)
                vm_phys_free_pages(m_end, order_end);
}

/*
 * Scan physical memory between the specified addresses "low" and "high" for a
 * run of contiguous physical pages that satisfy the specified conditions, and
 * return the lowest page in the run.  The specified "alignment" determines
 * the alignment of the lowest physical page in the run.  If the specified
 * "boundary" is non-zero, then the run of physical pages cannot span a
 * physical address that is a multiple of "boundary".
 *
 * "npages" must be greater than zero.  Both "alignment" and "boundary" must
 * be a power of two.
 */
vm_page_t
vm_phys_scan_contig(int domain, u_long npages, vm_paddr_t low, vm_paddr_t high,
    u_long alignment, vm_paddr_t boundary, int options)
{
        vm_paddr_t pa_end;
        vm_page_t m_end, m_run, m_start;
        struct vm_phys_seg *seg;
        int segind;

        KASSERT(npages > 0, ("npages is 0"));
        KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
        KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
        if (low >= high)
                return (NULL);
        for (segind = 0; segind < vm_phys_nsegs; segind++) {
                seg = &vm_phys_segs[segind];
                if (seg->domain != domain)
                        continue;
                if (seg->start >= high)
                        break;
                if (low >= seg->end)
                        continue;
                if (low <= seg->start)
                        m_start = seg->first_page;
                else
                        m_start = &seg->first_page[atop(low - seg->start)];
                if (high < seg->end)
                        pa_end = high;
                else
                        pa_end = seg->end;
                if (pa_end - VM_PAGE_TO_PHYS(m_start) < ptoa(npages))
                        continue;
                m_end = &seg->first_page[atop(pa_end - seg->start)];
                m_run = vm_page_scan_contig(npages, m_start, m_end,
                    alignment, boundary, options);
                if (m_run != NULL)
                        return (m_run);
        }
        return (NULL);
}

/*
 * Set the pool for a contiguous, power of two-sized set of physical pages.
 */
void
vm_phys_set_pool(int pool, vm_page_t m, int order)
{
        vm_page_t m_tmp;

        for (m_tmp = m; m_tmp < &m[1 << order]; m_tmp++)
                m_tmp->pool = pool;
}

/*
 * Search for the given physical page "m" in the free lists.  If the search
 * succeeds, remove "m" from the free lists and return TRUE.  Otherwise, return
 * FALSE, indicating that "m" is not in the free lists.
 *
 * The free page queues must be locked.
 */
boolean_t
vm_phys_unfree_page(vm_page_t m)
{
        struct vm_freelist *fl;
        struct vm_phys_seg *seg;
        vm_paddr_t pa, pa_half;
        vm_page_t m_set, m_tmp;
        int order;

        /*
         * First, find the contiguous, power of two-sized set of free
         * physical pages containing the given physical page "m" and
         * assign it to "m_set".
         */
        seg = &vm_phys_segs[m->segind];
        vm_domain_free_assert_locked(VM_DOMAIN(seg->domain));
        for (m_set = m, order = 0; m_set->order == VM_NFREEORDER &&
            order < VM_NFREEORDER - 1; ) {
                order++;
                pa = m->phys_addr & (~(vm_paddr_t)0 << (PAGE_SHIFT + order));
                if (pa >= seg->start)
                        m_set = &seg->first_page[atop(pa - seg->start)];
                else
                        return (FALSE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
2007-12-20 22:45:54 +00:00
|
|
|
if (m_set->order < order)
|
|
|
|
return (FALSE);
|
|
|
|
if (m_set->order == VM_NFREEORDER)
|
|
|
|
return (FALSE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
KASSERT(m_set->order < VM_NFREEORDER,
|
|
|
|
("vm_phys_unfree_page: page %p has unexpected order %d",
|
|
|
|
m_set, m_set->order));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Next, remove "m_set" from the free lists. Finally, extract
|
|
|
|
* "m" from "m_set" using an iterative algorithm: While "m_set"
|
|
|
|
* is larger than a page, shrink "m_set" by returning the half
|
|
|
|
* of "m_set" that does not contain "m" to the free lists.
|
|
|
|
*/
|
|
|
|
fl = (*seg->free_queues)[m_set->pool];
|
|
|
|
order = m_set->order;
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_rem(fl, m_set, order);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
while (order > 0) {
|
|
|
|
order--;
|
|
|
|
pa_half = m_set->phys_addr ^ (1 << (PAGE_SHIFT + order));
|
|
|
|
if (m->phys_addr < pa_half)
|
|
|
|
m_tmp = &seg->first_page[atop(pa_half - seg->start)];
|
|
|
|
else {
|
|
|
|
m_tmp = m_set;
|
|
|
|
m_set = &seg->first_page[atop(pa_half - seg->start)];
|
|
|
|
}
|
2013-05-13 15:40:51 +00:00
|
|
|
vm_freelist_add(fl, m_tmp, order, 0);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
|
|
|
|
KASSERT(m_set == m, ("vm_phys_unfree_page: fatal inconsistency"));
|
2007-12-20 22:45:54 +00:00
|
|
|
return (TRUE);
|
Change the management of cached pages (PQ_CACHE) in two fundamental
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
|
|
|
}
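/*
 * The buddy arithmetic used above can be illustrated in isolation: XORing a
 * block's address with the half-block size yields the other half after a
 * split, so each loop iteration frees the half that does not contain "m" and
 * keeps the one that does.  The sketch below is a standalone userland
 * program, not kernel code; the 4 KB page size and the sample addresses (a
 * page at 0x5000 inside an order-2 block at 0x4000) are assumptions chosen
 * for illustration.
 */
#if 0
#include <stdio.h>
#include <stdint.h>

#define EX_PAGE_SHIFT   12      /* assumed 4 KB pages */

int
main(void)
{
        uint64_t m = 0x5000;    /* page to carve out of the free block */
        uint64_t set = 0x4000;  /* order-2 free block containing it */
        uint64_t half;
        int order = 2;

        while (order > 0) {
                order--;
                /* The other half's address differs in exactly one bit. */
                half = set ^ ((uint64_t)1 << (EX_PAGE_SHIFT + order));
                if (m < half)
                        printf("free upper half %#jx at order %d\n",
                            (uintmax_t)half, order);
                else {
                        printf("free lower half %#jx at order %d\n",
                            (uintmax_t)set, order);
                        set = half;
                }
        }
        printf("left holding the page at %#jx\n", (uintmax_t)set);
        return (0);
}
#endif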
/*
 * Allocate a contiguous set of physical pages of the given size
 * "npages" from the free lists.  All of the physical pages must be at
 * or above the given physical address "low" and below the given
 * physical address "high".  The given value "alignment" determines the
 * alignment of the first physical page in the set.  If the given value
 * "boundary" is non-zero, then the set of physical pages cannot cross
 * any physical address boundary that is a multiple of that value.  Both
 * "alignment" and "boundary" must be a power of two.
 */
vm_page_t
vm_phys_alloc_contig(int domain, u_long npages, vm_paddr_t low, vm_paddr_t high,
    u_long alignment, vm_paddr_t boundary)
{
        vm_paddr_t pa_end, pa_start;
        vm_page_t m_run;
        struct vm_phys_seg *seg;
        int segind;

        KASSERT(npages > 0, ("npages is 0"));
        KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
        KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
        vm_domain_free_assert_locked(VM_DOMAIN(domain));
        if (low >= high)
                return (NULL);
        m_run = NULL;
        for (segind = vm_phys_nsegs - 1; segind >= 0; segind--) {
                seg = &vm_phys_segs[segind];
                if (seg->start >= high || seg->domain != domain)
                        continue;
                if (low >= seg->end)
                        break;
                if (low <= seg->start)
                        pa_start = seg->start;
                else
                        pa_start = low;
                if (high < seg->end)
                        pa_end = high;
                else
                        pa_end = seg->end;
                if (pa_end - pa_start < ptoa(npages))
                        continue;
                m_run = vm_phys_alloc_seg_contig(seg, npages, low, high,
                    alignment, boundary);
                if (m_run != NULL)
                        break;
        }
        return (m_run);
}
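/*
 * As the assertion above indicates, the caller is expected to hold the
 * per-domain free lock.  A minimal sketch of a caller follows; it is
 * hypothetical and not part of this file, the vm_domain_free_lock()/
 * vm_domain_free_unlock() pairing and the particular constraints (16 pages
 * below 4 GB, 64 KB aligned, not crossing a 1 MB boundary, domain 0) are
 * illustrative assumptions, and most consumers would normally go through the
 * higher level vm_page_alloc_contig(9) interface instead.
 */
#if 0
static vm_page_t
example_grab_contig_run(void)
{
        vm_page_t m;

        vm_domain_free_lock(VM_DOMAIN(0));
        m = vm_phys_alloc_contig(0, 16, 0, (vm_paddr_t)4 << 30,
            64 * 1024, 1024 * 1024);
        vm_domain_free_unlock(VM_DOMAIN(0));
        return (m);     /* NULL if no run satisfies the constraints */
}
#endif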
/*
 * Allocate a run of contiguous physical pages from the free list for the
 * specified segment.
 */
static vm_page_t
vm_phys_alloc_seg_contig(struct vm_phys_seg *seg, u_long npages,
    vm_paddr_t low, vm_paddr_t high, u_long alignment, vm_paddr_t boundary)
{
        struct vm_freelist *fl;
        vm_paddr_t pa, pa_end, size;
        vm_page_t m, m_ret;
        u_long npages_end;
        int oind, order, pind;

        KASSERT(npages > 0, ("npages is 0"));
        KASSERT(powerof2(alignment), ("alignment is not a power of 2"));
        KASSERT(powerof2(boundary), ("boundary is not a power of 2"));
        vm_domain_free_assert_locked(VM_DOMAIN(seg->domain));
        /* Compute the queue that is the best fit for npages. */
        order = flsl(npages - 1);
        /* Search for a run satisfying the specified conditions. */
        size = npages << PAGE_SHIFT;
        for (oind = min(order, VM_NFREEORDER - 1); oind < VM_NFREEORDER;
            oind++) {
                for (pind = 0; pind < VM_NFREEPOOL; pind++) {
                        fl = (*seg->free_queues)[pind];
                        TAILQ_FOREACH(m_ret, &fl[oind].pl, listq) {
                                /*
                                 * Is the size of this allocation request
                                 * larger than the largest block size?
                                 */
                                if (order >= VM_NFREEORDER) {
                                        /*
                                         * Determine if a sufficient number of
                                         * subsequent blocks to satisfy the
                                         * allocation request are free.
                                         */
                                        pa = VM_PAGE_TO_PHYS(m_ret);
                                        pa_end = pa + size;
                                        if (pa_end < pa)
                                                continue;
                                        for (;;) {
                                                pa += 1 << (PAGE_SHIFT +
                                                    VM_NFREEORDER - 1);
                                                if (pa >= pa_end ||
                                                    pa < seg->start ||
                                                    pa >= seg->end)
                                                        break;
                                                m = &seg->first_page[atop(pa -
                                                    seg->start)];
                                                if (m->order != VM_NFREEORDER -
                                                    1)
                                                        break;
                                        }
                                        /* If not, go to the next block. */
                                        if (pa < pa_end)
                                                continue;
                                }

                                /*
                                 * Determine if the blocks are within the
                                 * given range, satisfy the given alignment,
                                 * and do not cross the given boundary.
                                 */
                                pa = VM_PAGE_TO_PHYS(m_ret);
                                pa_end = pa + size;
                                if (pa >= low && pa_end <= high &&
                                    (pa & (alignment - 1)) == 0 &&
                                    rounddown2(pa ^ (pa_end - 1), boundary) == 0)
                                        goto done;
                        }
                }
        }
        return (NULL);
done:
        for (m = m_ret; m < &m_ret[npages]; m = &m[1 << oind]) {
                fl = (*seg->free_queues)[m->pool];
                vm_freelist_rem(fl, m, oind);
                if (m->pool != VM_FREEPOOL_DEFAULT)
                        vm_phys_set_pool(VM_FREEPOOL_DEFAULT, m, oind);
        }
        /* Return excess pages to the free lists. */
        npages_end = roundup2(npages, 1 << oind);
        if (npages < npages_end) {
                fl = (*seg->free_queues)[VM_FREEPOOL_DEFAULT];
                vm_phys_enq_range(&m_ret[npages], npages_end - npages, fl, 0);
        }
        return (m_ret);
}
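/*
 * The alignment and boundary test above is compact enough to deserve a
 * worked example.  "(pa & (alignment - 1)) == 0" checks natural alignment,
 * and "rounddown2(pa ^ (pa_end - 1), boundary) == 0" checks that the first
 * and last byte of the run agree in every bit at or above log2(boundary),
 * i.e. that the run does not straddle a boundary multiple; a boundary of
 * zero masks every bit away and so imposes no constraint.  The program below
 * is a standalone userland sketch with illustrative numbers, not kernel code.
 */
#if 0
#include <stdio.h>
#include <stdint.h>

#define ex_rounddown2(x, y)     ((x) & ~((y) - 1))      /* y is a power of 2 */

/* Does [pa, pa + size) straddle a multiple of "boundary"? */
static int
ex_crosses(uint64_t pa, uint64_t size, uint64_t boundary)
{
        return (ex_rounddown2(pa ^ (pa + size - 1), boundary) != 0);
}

int
main(void)
{
        /* 0x1f000-0x21fff straddles the 128 KB mark, 0x20000-0x22fff does not. */
        printf("%d\n", ex_crosses(0x1f000, 0x3000, 0x20000));  /* prints 1 */
        printf("%d\n", ex_crosses(0x20000, 0x3000, 0x20000));  /* prints 0 */
        return (0);
}
#endif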
/*
 * Return the index of the first unused slot which may be the terminating
 * entry.
 */
static int
vm_phys_avail_count(void)
{
        int i;

        for (i = 0; phys_avail[i + 1]; i += 2)
                continue;
        if (i > PHYS_AVAIL_ENTRIES)
                panic("Improperly terminated phys_avail %d entries", i);

        return (i);
}
/*
 * Assert that a phys_avail entry is valid.
 */
static void
vm_phys_avail_check(int i)
{
        if (phys_avail[i] & PAGE_MASK)
                panic("Unaligned phys_avail[%d]: %#jx", i,
                    (intmax_t)phys_avail[i]);
        if (phys_avail[i + 1] & PAGE_MASK)
                panic("Unaligned phys_avail[%d + 1]: %#jx", i,
                    (intmax_t)phys_avail[i + 1]);
        if (phys_avail[i + 1] < phys_avail[i])
                panic("phys_avail[%d] start %#jx > end %#jx", i,
                    (intmax_t)phys_avail[i], (intmax_t)phys_avail[i + 1]);
}
/*
 * Return the index of an overlapping phys_avail entry or -1.
 */
#ifdef NUMA
static int
vm_phys_avail_find(vm_paddr_t pa)
{
        int i;

        for (i = 0; phys_avail[i + 1]; i += 2)
                if (phys_avail[i] <= pa && phys_avail[i + 1] > pa)
                        return (i);
        return (-1);
}
#endif
/*
 * Return the index of the largest entry.
 */
int
vm_phys_avail_largest(void)
{
        vm_paddr_t sz, largesz;
        int largest;
        int i;

        largest = 0;
        largesz = 0;
        for (i = 0; phys_avail[i + 1]; i += 2) {
                sz = vm_phys_avail_size(i);
                if (sz > largesz) {
                        largesz = sz;
                        largest = i;
                }
        }

        return (largest);
}
vm_paddr_t
vm_phys_avail_size(int i)
{

        return (phys_avail[i + 1] - phys_avail[i]);
}
/*
 * Split an entry at the address 'pa'.  Return zero on success or errno.
 */
static int
vm_phys_avail_split(vm_paddr_t pa, int i)
{
        int cnt;

        vm_phys_avail_check(i);
        if (pa <= phys_avail[i] || pa >= phys_avail[i + 1])
                panic("vm_phys_avail_split: invalid address");
        cnt = vm_phys_avail_count();
        if (cnt >= PHYS_AVAIL_ENTRIES)
                return (ENOSPC);
        memmove(&phys_avail[i + 2], &phys_avail[i],
            (cnt - i) * sizeof(phys_avail[0]));
        phys_avail[i + 1] = pa;
        phys_avail[i + 2] = pa;
        vm_phys_avail_check(i);
        vm_phys_avail_check(i + 2);

        return (0);
}
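/*
 * The effect of the split is easiest to see on a toy copy of the
 * phys_avail[] layout, which stores {start, end} pairs terminated by a zero
 * entry.  The standalone sketch below (illustrative addresses, not kernel
 * code) performs the same memmove-and-duplicate step as above: splitting
 * [0x1000, 0x8000) at 0x4000 yields [0x1000, 0x4000) and [0x4000, 0x8000)
 * while later entries shift down by one pair.
 */
#if 0
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int
main(void)
{
        uint64_t avail[8] = { 0x1000, 0x8000, 0x9000, 0xc000, 0, 0, 0, 0 };
        int cnt = 4;            /* index of the first unused (terminating) slot */
        int i = 0;              /* entry being split */
        uint64_t pa = 0x4000;   /* split address */

        memmove(&avail[i + 2], &avail[i], (cnt - i) * sizeof(avail[0]));
        avail[i + 1] = pa;
        avail[i + 2] = pa;
        for (i = 0; avail[i + 1] != 0; i += 2)
                printf("[%#jx, %#jx)\n", (uintmax_t)avail[i],
                    (uintmax_t)avail[i + 1]);
        return (0);
}
#endif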
void
vm_phys_early_add_seg(vm_paddr_t start, vm_paddr_t end)
{
        struct vm_phys_seg *seg;

        if (vm_phys_early_nsegs == -1)
                panic("%s: called after initialization", __func__);
        if (vm_phys_early_nsegs == nitems(vm_phys_early_segs))
                panic("%s: ran out of early segments", __func__);

        seg = &vm_phys_early_segs[vm_phys_early_nsegs++];
        seg->start = start;
        seg->end = end;
}
/*
 * This routine allocates NUMA node specific memory before the page
 * allocator is bootstrapped.
 */
vm_paddr_t
vm_phys_early_alloc(int domain, size_t alloc_size)
{
        int i, mem_index, biggestone;
        vm_paddr_t pa, mem_start, mem_end, size, biggestsize, align;

        KASSERT(domain == -1 || (domain >= 0 && domain < vm_ndomains),
            ("%s: invalid domain index %d", __func__, domain));

        /*
         * Search the mem_affinity array for the biggest address
         * range in the desired domain.  This is used to constrain
         * the phys_avail selection below.
         */
        biggestsize = 0;
        mem_index = 0;
        mem_start = 0;
        mem_end = -1;
#ifdef NUMA
        if (mem_affinity != NULL) {
                for (i = 0;; i++) {
                        size = mem_affinity[i].end - mem_affinity[i].start;
                        if (size == 0)
                                break;
                        if (domain != -1 && mem_affinity[i].domain != domain)
                                continue;
                        if (size > biggestsize) {
                                mem_index = i;
                                biggestsize = size;
                        }
                }
                mem_start = mem_affinity[mem_index].start;
                mem_end = mem_affinity[mem_index].end;
        }
#endif

        /*
         * Now find the biggest physical segment within the desired
         * NUMA domain.
         */
        biggestsize = 0;
        biggestone = 0;
        for (i = 0; phys_avail[i + 1] != 0; i += 2) {
                /* skip regions that are out of range */
                if (phys_avail[i + 1] - alloc_size < mem_start ||
                    phys_avail[i + 1] > mem_end)
                        continue;
                size = vm_phys_avail_size(i);
                if (size > biggestsize) {
                        biggestone = i;
                        biggestsize = size;
                }
        }
        alloc_size = round_page(alloc_size);

        /*
         * Grab single pages from the front to reduce fragmentation.
         */
        if (alloc_size == PAGE_SIZE) {
                pa = phys_avail[biggestone];
                phys_avail[biggestone] += PAGE_SIZE;
                vm_phys_avail_check(biggestone);
                return (pa);
        }

        /*
         * Naturally align large allocations.
         */
        align = phys_avail[biggestone + 1] & (alloc_size - 1);
        if (alloc_size + align > biggestsize)
                panic("cannot find a large enough size");
        if (align != 0 &&
            vm_phys_avail_split(phys_avail[biggestone + 1] - align,
            biggestone) != 0)
                /* Wasting memory. */
                phys_avail[biggestone + 1] -= align;

        phys_avail[biggestone + 1] -= alloc_size;
        vm_phys_avail_check(biggestone);
        pa = phys_avail[biggestone + 1];
        return (pa);
}
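/*
 * The tail-carving arithmetic above can be checked with plain numbers: the
 * wasted tail is the end address modulo the allocation size, and after
 * trimming it (the function prefers to split it off into its own phys_avail
 * entry and only discards it when the split fails) the allocation is taken
 * from the new, naturally aligned end.  The sketch below is a standalone
 * userland program with illustrative values, not kernel code.
 */
#if 0
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t end = 0x7c3a000;                       /* end of the biggest region */
        uint64_t alloc_size = 0x10000;                  /* 64 KB request */
        uint64_t align = end & (alloc_size - 1);        /* 0xa000 tail */

        end -= align;           /* 0x7c30000, now 64 KB aligned */
        end -= alloc_size;      /* carve the allocation itself */
        printf("allocation at %#jx\n", (uintmax_t)end); /* 0x7c20000 */
        return (0);
}
#endif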
void
vm_phys_early_startup(void)
{
        struct vm_phys_seg *seg;
        int i;

        for (i = 0; phys_avail[i + 1] != 0; i += 2) {
                phys_avail[i] = round_page(phys_avail[i]);
                phys_avail[i + 1] = trunc_page(phys_avail[i + 1]);
        }

        for (i = 0; i < vm_phys_early_nsegs; i++) {
                seg = &vm_phys_early_segs[i];
                vm_phys_add_seg(seg->start, seg->end);
        }
        vm_phys_early_nsegs = -1;

#ifdef NUMA
        /* Force phys_avail to be split by domain. */
        if (mem_affinity != NULL) {
                int idx;

                for (i = 0; mem_affinity[i].end != 0; i++) {
                        idx = vm_phys_avail_find(mem_affinity[i].start);
                        if (idx != -1 &&
                            phys_avail[idx] != mem_affinity[i].start)
                                vm_phys_avail_split(mem_affinity[i].start, idx);
                        idx = vm_phys_avail_find(mem_affinity[i].end);
                        if (idx != -1 &&
                            phys_avail[idx] != mem_affinity[i].end)
                                vm_phys_avail_split(mem_affinity[i].end, idx);
                }
        }
#endif
}
#ifdef DDB
/*
 * Show the number of physical pages in each of the free lists.
 */
DB_SHOW_COMMAND(freepages, db_show_freepages)
{
        struct vm_freelist *fl;
        int flind, oind, pind, dom;

        for (dom = 0; dom < vm_ndomains; dom++) {
                db_printf("DOMAIN: %d\n", dom);
                for (flind = 0; flind < vm_nfreelists; flind++) {
                        db_printf("FREE LIST %d:\n"
                            "\n  ORDER (SIZE)  |  NUMBER"
                            "\n              ", flind);
                        for (pind = 0; pind < VM_NFREEPOOL; pind++)
                                db_printf("  |  POOL %d", pind);
                        db_printf("\n--            ");
                        for (pind = 0; pind < VM_NFREEPOOL; pind++)
                                db_printf("-- --      ");
                        db_printf("--\n");
                        for (oind = VM_NFREEORDER - 1; oind >= 0; oind--) {
                                db_printf("  %2.2d (%6.6dK)", oind,
                                    1 << (PAGE_SHIFT - 10 + oind));
                                for (pind = 0; pind < VM_NFREEPOOL; pind++) {
                                        fl = vm_phys_free_queues[dom][flind][pind];
                                        db_printf("  |  %6.6d", fl[oind].lcnt);
                                }
                                db_printf("\n");
                        }
                        db_printf("\n");
                }
                db_printf("\n");
        }
}
#endif