Commit Graph

290 Commits

Author SHA1 Message Date
Mark Johnston
013072f04c Fix pre-SI_SUB_CPU initialization of per-CPU counters.
r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the
backend allocator for PCPU UMA zones.  Unlike page_alloc(), it does
not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that.

r336020 also changed counter(9) to initialize each counter using a
CPU_FOREACH() loop instead of an SMP rendezvous.  Before SI_SUB_CPU,
smp_rendezvous() will only execute the callback on the current CPU
(i.e., CPU 0), so only one counter gets zeroed.  The rest are zeroed
by virtue of the fact that UMA gratuitously zeroes slabs when importing
them into a zone.

Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't
zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no
effect, and pcpu_page_alloc() didn't honour M_ZERO.  Fix this by
iterating over the full range of CPU IDs when zeroing counters,
ignoring whether the corresponding bits in all_cpus are set.

Reported and tested by:	pho (previous version)
Reviewed by:		kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D16190
2018-07-10 00:18:12 +00:00
Sean Bruno
a03af34228 Wrap the declaration and assignment of "stripe" with #ifdef NUMA declarations
as not all targets are NUMA aware.

Found with gcc.

Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16113
2018-07-07 13:37:44 +00:00
Matt Macy
ab3059a8e7 Back pcpu zone with domain correct pages
- Change pcpu zone consumers to use a stride size of PAGE_SIZE.
  (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier)

- Allocate page from the correct domain for a given cpu.

- Don't initialize pc_domain to non-zero value if NUMA is not defined
  There are some misconceptions surrounding this field. It is the
  _VM_ NUMA domain and should only ever correspond to valid domain
  values as understood by the VM.

The former slab size of sizeof(struct pcpu) was somewhat arbitrary.
The new value is PAGE_SIZE because that's the smallest granularity
which the VM can allocate a slab for a given domain. If you have
fewer than PAGE_SIZE/8 counters on your system there will be some
memory wasted, but this is obviously something where you want the
cache line to be coming from the correct domain.

Reviewed by: jeff
Sponsored by: Limelight Networks
Differential Revision:  https://reviews.freebsd.org/D15933
2018-07-06 02:06:03 +00:00
Ian Lepore
c5b7751fa2 Eliminate a spurious panic on non-SMP systems (occurred on shutdown/reboot). 2018-06-22 20:22:26 +00:00
Ruslan Bukin
b47999470d Fix uma_zalloc_pcpu_arg() operation in case of !SMP build.
Reviewed by:	mjg
Sponsored by:	DARPA, AFRL
2018-06-21 11:43:54 +00:00
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
Mateusz Guzik
4e180881ae uma: implement provisional api for per-cpu zones
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.
2018-06-08 21:40:03 +00:00
Mateusz Guzik
b8af2820f6 uma: fix up r334824
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
2018-06-08 05:40:36 +00:00
Mateusz Guzik
ea99223ec9 uma: remove M_ZERO support for pcpu zones
Nothing in the tree uses it and pcpu zones have a fundamentally different use
case than the regular zones - they are not supposed to be allocated and freed
all the time.

This reduces pollution in the allocation fast path.
2018-06-08 03:16:16 +00:00
Gleb Smirnoff
c5deaf0452 UMA memory debugging enabled with INVARIANTS consists of two things:
trashing freed memory and checking that allocated memory is properly
trashed, and also of keeping a bitset of freed items. Trashing/checking
creates a lot of CPU cache poisoning, while keeping debugging bitsets
consistent creates a lot of contention on UMA zone lock(s). The performance
difference between INVARIANTS kernel and normal one is mostly attributed
to UMA debugging, rather than to all KASSERT checks in the kernel.

Add loader tunable vm.debug.divisor that allows either to turn off UMA
debugging completely, or turn it on only for a fraction of allocations,
while still running all KASSERTs in kernel. That allows to run INVARIANTS
kernels in production environments without reducing load by orders of
magnitude, but still doing useful extra checks.

Default value is 1, meaning debug every allocation. Value of 0 would
disable UMA debugging completely. Values above 1 enable debugging only
for every N-th item. It isn't possible to strictly follow the number,
but still amount of debugging is reduced roughly by (N-1)/N percent.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D15199
2018-06-08 00:15:08 +00:00
Mateusz Guzik
e825ab8d89 uma: whack main zone counter update in the slow path
Cached counters are typically zero at this point so it performs
avoidable atomics. Everything reading them also reads the cached
ones, thus there is really no point.

Reviewed by:		jeff
2018-04-27 05:37:35 +00:00
Mark Johnston
7e28037a09 Add a UMA zone flag to disable the use of buckets.
This allows the creation of zones which don't do any caching in front of
the keg. If the zone is a cache zone, this means that UMA will not
attempt any memory allocations when allocating an item from the backend.
This is intended for use after a panic by netdump, but likely has other
applications.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15184
2018-04-24 20:05:45 +00:00
Gleb Smirnoff
b92b26ad08 Use UMA_SLAB_SPACE macro. No functional change here. 2018-04-02 05:15:25 +00:00
Gleb Smirnoff
96a10340ce In uma_startup_count() handle special case when zone will fit into
single slab, but with alignment adjustment it won't. Again, when
there is only one item in a slab alignment can be ignored. See
previous revision of this file for more info.

PR:		227116
2018-04-02 05:14:31 +00:00
Gleb Smirnoff
1ca6ed4589 Handle a special case when a slab can fit only one allocation,
and zone has a large alignment. With alignment taken into
account uk_rsize will be greater than space in a slab. However,
since we have only one item per slab, it is always naturally
aligned.

Code that will panic before this change with 4k page:

	z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
	uma_zalloc(z, M_WAITOK);

A practical scenario to hit the panic is a machine with 56 CPUs
and 2 NUMA domains, which yields in zone size of 3984.

PR:		227116
MFC after:	2 weeks
2018-04-02 05:11:59 +00:00
Jeff Roberson
e8bb2dc7c9 Add the flag ZONE_NOBUCKETCACHE. This flag instructions UMA not to keep
a cache of fully populated buckets.  This will be used in a follow-on
commit.

The flag idea was originally from markj.

Reviewed by:	markj, kib
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-04-01 04:47:05 +00:00
Konstantin Belousov
63b5d112b6 For vm_zone_stats() sysctl handler, do not drain sbuf calling
copyout(9) while owning zone lock.

Despite old value sysctl buffer is wired, spurious faults might still
occur.

Note that we still own the uma_rwlock there, but this lock does not
participate in sensitive lock orders.

Reported and tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-03-24 13:48:53 +00:00
Gleb Smirnoff
f7d3578564 Fix boot_pages exhaustion on machines with many domains and cores, where
size of UMA zone allocation is greater than page size. In this case zone
of zones can not use UMA_MD_SMALL_ALLOC, and we  need to postpone switch
off of this zone from startup_alloc() until full launch of VM.

o Always supply number of VM zones to uma_startup_count(). On machines
  with UMA_MD_SMALL_ALLOC ignore it completely, unless zsize goes over
  a page. In the latter case account VM zones for number of allocations
  from the zone of zones.
o Rewrite startup_alloc() so that it will immediately switch off from
  itself any zone that is already capable of running real alloc.
  In worst case scenario we may leak a single page here. See comment
  in uma_startup_count().
o Hardcode call to uma_startup2() into vm_mem_init(). Otherwise some
  extra SYSINITs, e.g. vm_page_init() may sneak in before.
o While here, remove uma_boot_pages_mtx. With recent changes to boot
  pages calculation, we are guaranteed to use all of the boot_pages
  in the early single threaded stage.

Reported & tested by:	mav
2018-02-09 04:45:39 +00:00
Gleb Smirnoff
5073a08328 Fix three miscalculations in amount of boot pages:
o Most of startup zones have struct uma_slab embedded into the slab,
  so provide macro UMA_SLAB_SPACE and use it instead of UMA_SLAB_SIZE,
  when calculating how many pages would certain kind of allocations
  require. Some zones are offpage, so we might have a positive inaccuracy.
o The keg for the zone of zones is allocated "dynamically", so we
  need +1 when calculating amount of pages for kegs. [1]
o The zones of zones and zones of kegs have arbitrary alignment of 32,
  and this also needs to be accounted for. [2]

While here, spread more comments and improve diagnostic messages.

Reported by:	pho [1], jtl [2]
2018-02-07 18:32:51 +00:00
Gleb Smirnoff
d2be4a1e4f Use correct arithmetic to calculate how many pages we need for kegs
and hashes.  There is no functional change with current sizes.
2018-02-06 22:13:40 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Gleb Smirnoff
1616767dfc Improve DIAGNOSTIC printf. Report using a boot page every time regardless
of booted status.
2018-02-06 22:08:43 +00:00
Gleb Smirnoff
f4bef67c9c Followup on r302393 by cperciva, improving calculation of boot pages required
for UMA startup.

o Introduce another stage of UMA startup, which is entered after
  vm_page_startup() finishes. After this stage we don't yet enable buckets,
  but we can ask VM for pages. Rename stages to meaningful names while here.
  New list of stages: BOOT_COLD, BOOT_STRAPPED, BOOT_PAGEALLOC, BOOT_BUCKETS,
  BOOT_RUNNING.
  Enabling page alloc earlier allows us to dramatically reduce number of
  boot pages required. What is more important number of zones becomes
  consistent across different machines, as no MD allocations are done before
  the BOOT_PAGEALLOC stage. Now only UMA internal zones actually need to use
  startup_alloc(), however that may change, so vm_page_startup() provides
  its need for early zones as argument.
o Introduce uma_startup_count() function, to avoid code duplication. The
  functions calculates sizes of zones zone and kegs zone, and calculates how
  many pages UMA will need to bootstrap.
  It counts not only of zone structures, but also of kegs, slabs and hashes.
o Hide uma_startup_foo() declarations from public file.
o Provide several DIAGNOSTIC printfs on boot_pages usage.
o Bugfix: when calculating zone of zones size use (mp_maxid + 1) instead of
  mp_ncpus. Use resulting number not only in the size argument to zone_ctor()
  but also as args.size.

Reviewed by:		imp, gallatin (earlier version)
Differential Revision:	https://reviews.freebsd.org/D14054
2018-02-06 04:16:00 +00:00
Jeff Roberson
b6715dab8f Move VM_NUMA_ALLOC and DEVICE_NUMA under the single global config option NUMA.
Sponsored by:	Netflix, Dell/EMC Isilon
Discussed with:	jhb
2018-01-14 03:36:03 +00:00
Jeff Roberson
ab3185d15e Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants.  UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise.  It handles
iteration for round-robin directly.  The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed.  Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by:	Attilio (a slightly older version)
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-01-12 23:25:05 +00:00
Jeff Roberson
ad5b0f5b51 Fix arc after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR:		224330 (partial), 224080
Reviewed by:	markj, avg
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13494
2018-01-02 04:35:56 +00:00
Konstantin Belousov
200f8117ba Perform all accesses to uma_reclaim_needed using atomic(9) KPI.
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D13534
2017-12-19 10:06:55 +00:00
Mark Johnston
952a29c04b Fix the UMA reclaim worker after r326347.
atomic_set_*() sets a bit in the target memory location, so
atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
it does.

PR:		224080
Reviewed by:	jeff, kib
Differential Revision:	https://reviews.freebsd.org/D13412
2017-12-07 19:38:09 +00:00
Jeff Roberson
2e47807c21 Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.  When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA.  On 32bit architectures this should behave much more
gracefully as we exhaust KVA.  On 64bit the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Konstantin Belousov
772c8b6749 Fix operator priority.
Sponsored by:	The FreeBSD Foundation
2017-11-08 23:25:05 +00:00
Jeff Roberson
8d6fbbb867 Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated.  This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons.  This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by:	alc, kib, markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2017-11-08 02:39:37 +00:00
Mark Johnston
2934eb8a22 Fix a logic error in the item size calculation for internal UMA zones.
Kegs for internal zones always keep the slab header in the slab itself.
Therefore, when determining the allocation size, we need to take the
slab header size into account.

Reported and tested by:	ae, rakuco
Reviewed by:	avg
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12342
2017-09-13 15:44:54 +00:00
Mateusz Guzik
fe933c1d88 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
Gleb Smirnoff
77e1943785 When we are in UMA_STARTUP use startup_alloc() for any zone, not for
internal zones only.  This allows to create new zones at early stages
of boot, without need to mark them as internal to UMA, which isn't
always true.

Reviewed by:	alc
2017-06-08 21:33:19 +00:00
Gleb Smirnoff
1431a74845 As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs. 2017-06-01 18:36:52 +00:00
Gleb Smirnoff
ac0a6fd015 Simplify boot pages management in UMA.
It is simply a contigous virtual memory pointer and number of pages.
There is no need to build a linked list here.  Just increment pointer
and decrement counter.  The only functional difference to old allocator
is that before we gave pages from topmost and down to lowest, and now
we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by:	alc
2017-06-01 18:26:57 +00:00
John Baldwin
a5a355788e Assert that the align parameter to uma_zcreate() is valid.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D10100
2017-04-04 16:26:46 +00:00
Andriy Gapon
57223e9994 uma: fix pages <-> items conversions at several places
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after:	1 month (if ever)
2017-03-11 16:43:38 +00:00
Andriy Gapon
a55ebb7cd5 uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
Andriy Gapon
9b43bc27c4 call vm_lowmem hook in uma_reclaim_worker
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook.  No handler makes use of the flags yet.

Reviewed by:	markj, kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9764
2017-02-25 16:39:21 +00:00
Justin Hibbits
b5345ef10e Print flags in hex instead of decimal.
Hex is easier to grok for flags, and consistent with other prints.
2017-01-02 16:50:52 +00:00
Mark Johnston
829be5168d Simplify keg_drain() a bit by using LIST_FOREACH_SAFE.
MFC after:	1 week
2016-10-20 23:10:27 +00:00
Mark Johnston
afa5d70339 Release the second critical section in uma_zfree_arg() slightly earlier.
It is only needed when removing a full bucket from the per-CPU cache. The
bucket cache (uz_buckets) is protected by the zone mutex and thus the
critical section can be released before inserting into that list.

MFC after:	1 week
2016-07-20 01:01:50 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Mark Johnston
bc9d08e1cf Fix memguard(9) in kernels with INVARIANTS enabled.
With r284861, UMA zones use the trash ctor and dtor by default. This is
incompatible with memguard, which frees the backing page when the item
is freed. Modify the UMA debug functions to be no-ops if the item was
allocated from memguard. This also fixes constructors such as
mb_ctor_pack(), which invokes the trash ctor in addition to performing
some initialization.

Reviewed by:	glebius
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D6562
2016-06-01 22:31:35 +00:00
Pedro F. Giffuni
763df3ec55 sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
Gleb Smirnoff
cfcae3f86f Remove UMA_ZONE_REFCNT feature, now unused.
Blessed by:	jeff
2016-03-01 00:33:32 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00