This function is responsible for setting pc_domain in each pcpu
structure. Call it from the main function that starts APs, rather than
a separate SYSINIT. This makes it easier to close the window where
UMA's per-CPU slab allocator may be called while pc_domain is
uninitialized. In particular, the allocator uses pc_domain to allocate
domain-local pages, so allocations before this point end up using domain
0 for everything.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24757
This moves the architecture independent parts of sys/x86/acpica/srat.c
to sys/dev/acpica/acpi_pxm.c, to be used later on arm64. The function
declarations are moved to sys/dev/acpica/acpivar.h
We also need to update sys/conf/files.{i386,amd64} to use the new file.
No functional changes.
Reviewed by: markj, imp
Differential Revision: https://reviews.freebsd.org/D17941
The SLIT and SRAT ACPI tables needs to be parsed on arm64 as well, on
systems that use UEFI/ACPI firmware and support NUMA. To do this, we
need to move most of the logic of x86/acpica/srat.c to dev/acpica and
provide an API that architectures can use to parse and configure ACPI
NUMA information.
This commit adds the API in srat.c as a first step, without making any
functional changes. We will move the common code to sys/dev/acpica
as the next step.
The functions added are:
* int acpi_pxm_init(int ncpus, vm_paddr_t maxphys) - to allocate and
initialize data structures used
* void acpi_pxm_parse_tables(void) - parse SRAT/SLIT, save the cpu and
memory proximity information
* void acpi_pxm_set_mem_locality(void) - use the saved data to set
memory locality
* void acpi_pxm_set_cpu_locality(void) - use the saved data to set cpu
locality
* void acpi_pxm_free(void) - free data structures allocated by init
On arm64, we do not have an cpu APIC id that can be used as index to
store CPU data, we need to use the Processor Uid. To help with this,
define internal functions cpu_add, cpu_find, cpu_get_info to store
and get CPU proximity information.
Reviewed by: markj, jhb (previous version)
Differential Revision: https://reviews.freebsd.org/D17940
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration. With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.
This method still has some performance overhead relative to disabling
NUMA support at compile time.
PR: 231460
Reviewed by: alc, gallatin, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17439
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.
The refactoring will make it easier to add NUMA support for other
architectures.
No functional change intended.
Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
(defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)
- Allocate page from the correct domain for a given cpu.
- Don't initialize pc_domain to non-zero value if NUMA is not defined
There are some misconceptions surrounding this field. It is the
_VM_ NUMA domain and should only ever correspond to valid domain
values as understood by the VM.
The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.
Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15933
userspace to control NUMA policy administratively and programmatically.
Implement domainset based iterators in the page layer.
Remove the now legacy numa_* syscalls.
Cleanup some header polution created by having seq.h in proc.h.
Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Fix from fallout introduced in r322348 that moved the cpus array to a
dynamic allocation without zeroing the area.
Reported by: mjg
MFC with: r322348
Reviewed by: mjg
Differential revision: https://reviews.freebsd.org/D12220
So that MAX_APIC_ID can be bumped without wasting memory.
Note that the usage of MAX_APIC_ID in the SRAT parsing forces the
parser to allocate memory directly from the phys_avail physical memory
array, which is not the best approach probably, but I haven't found
any other way to allocate memory so early in boot. This memory is not
returned to the system afterwards, but at least it's sized according
to the maximum APIC ID found in the MADT table.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11912
Populate the lapics arrays and call cpu_add/lapic_create in the setup
phase instead. Also store the max APIC ID found in the newly
introduced max_apic_id global variable.
This is a requirement in order to make the static arrays currently
using MAX_LAPIC_ID dynamic.
Sponsored by: Citrix Systems R&D
MFC after: 1 month
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D11911
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
Ignore them like it's done in the MADT parser. This allows booting on a box
with SRAT and APIC IDs > 255.
Reported by: Wei Liu <wei.liu2@citrix.com>
Tested by: Wei Liu <wei.liu2@citrix.com>
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: Citrix Systems R&D
Instead of panicking when parsing an invalid ACPI SRAT table,
just ignore it, effectively disabling NUMA.
https://lists.freebsd.org/pipermail/freebsd-current/2016-May/060984.html
Reported and tested by: Bill O'Hanlon (bill.ohanlon at gmail.com)
Reviewed by: jhb
MFC after: 1 week
Relnotes: If dmesg shows "SRAT: Duplicate local APIC ID",
try updating your BIOS to fix NUMA support.
Sponsored by: Dell Inc.
If we reached MAXMEMDOM, we would previously try to insert an additional
element and only detect overflow after causing (probably trivial) memory
overflow. Instead, detect the ndomain > MAXMEMDOM case before we write past
the end.
Reported by: Coverity
CID: 1354783
Sponsored by: EMC / Isilon Storage Division
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().
MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782
a basic ACPI SLIT table parser.
For now this just exports the map via sysctl; it'll eventually be useful
to userland when there's more useful NUMA support in -HEAD.
* Add an optional mem_locality map;
* add a mapping function taking from/to domain and returning the
relative cost, or -1 if it's not available;
* Add a very basic SLIT parser to x86 ACPI.
Differential Revision: https://reviews.freebsd.org/D2460
Reviewed by: rpaulo, stas, jhb
Sponsored by: Norse Corp, Inc (hardware, coding); Dell (hardware)
kernel via the global cpuset_domain[] array. To export these to userland,
add a CPU_WHICH_DOMAIN level that can be used to fetch the mask for a
specific domain. Add a -d flag to cpuset(1) that can be used to fetch
the mask for a given domain.
Differential Revision: https://reviews.freebsd.org/D1232
Submitted by: jeff (kernel bits)
Reviewed by: adrian, jeff
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.
The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.
Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.
The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.
Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.
Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
freelist.
o Split the pool of free pages queues really by domain and not rely on
definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
function that is called when calculating the allocation domain.
The RR counter is kept, currently, per-thread.
In the future it is expected that such function evolves in a real
policy decision referee, based on specific informations retrieved by
per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
It is responsibility for every architecture willing to support multiple
memory domains to correctly probe vm_ndomains along with mem_affinity
segments attributes. Those two values are supposed to remain always
consistent.
Please also note that vm_ndomains and td_dom_rr_idx are both int
because segments already store domains as int. Ideally u_int would
have much more sense. Probabilly this should be cleaned up in the
future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().
Sponsored by: EMC / Isilon storage division
Partly obtained from: jeff
Reviewed by: alc
Tested by: jeff
and printing at boot.
For reference on table informations and purposes please review ACPI specs.
Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: jhb (earlier version)
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.
Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc
- Actually increment ndomain when building our list of known domains
so that we can properly renumber them to be 0-based and dense.
- If the number of domains exceeds the configured maximum (VM_NDOMAIN),
bail out of processing the SRAT and disable NUMA rather than hitting an
obscure panic later.
- Don't bother parsing the SRAT at all if VM_NDOMAIN is set to 1 to
disable NUMA (the default).
Reported by: phk (2)
MFC after: 1 week
existing phys_avail[] table. If a hw.physmem setting causes a memory
domain to not be present in phys_avail[], the SRAT table will now be
ignored rather than triggering a panic when a CPU in the missing domain
tries to allocate a page.
MFC after: 1 week