Coverity flagged the scaling by sizeof(uzd). That is the type
of the pointer, so the scaling was already done by pointer arithmetic.
However, this was also passing a stack frame pointer to kvm_read,
so it was doubly wrong.
Move ZDOM_GET into the !_KERNEL section and use it in libmemstat.
Reported by: Coverity
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26213
This gives much better concurrency when there are a large number of
cores per-domain and multiple domains. Avoid taking the lock entirely
if it will not be productive. ROUNDROBIN domains will have mixed
memory in each domain and will load balance to all domains.
While here refactor the zone/domain separation and bucket limits to
simplify callers.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D23673
accounting for each NUMA domain. Independent keg domain locks are important
with cross-domain frees. Hashed zones are non-numa and use a single keg
lock to protect the hash table.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22829
cache area. This allows us to check on bucket space for all per-cpu
buckets with a single cacheline access and fewer branches.
Reviewed by: markj, rlibby
Differential Revision: https://reviews.freebsd.org/D22825
r355706 added an instance of offsetof() to the UMA private kernel header
file uma_int.h. Userspace memstat_uma.c includes that header, and
chokes on offsetof() because apparently the definition in sys/types.h is
ifdef _KERNEL. Now, include sys/stddef.h which has an identical
definition.
Pointyhat to: rlibby
Sponsored by: Dell EMC Isilon
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
was freed on a different domain from where it was allocated. This is
only used for UMA_ZONE_NUMA (first-touch) zones.
- UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
zones. This tries to maintain locality for kernel memory.
Reviewed by: gallatin, alc, kib
Tested by: pho, gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20929
reports snap counts of how much a zone alloced and how much it freed. It
may happen that snap values doesn't match, e.g alloced - freed < 0.
Workaround that in memstat library.
Reported by: pho
two zones sharing a keg may have different limits. Now this is going
to work:
zone = uma_zcreate();
uma_zone_set_max(zone, limit);
zone2 = uma_zsecond_create(zone);
uma_zone_set_max(zone2, limit2);
Kegs no longer have uk_maxpages field, but zones have uz_items. When
set, it may be rounded up to minimum possible CPU bucket cache size.
For small limits bucket cache can also be reconfigured to be smaller.
Counter uz_items is updated whenever items transition from keg to a
bucket cache or directly to a consumer. If zone has uz_maxitems set and
it is reached, then we are going to sleep.
o Since new limits don't play well with multi-keg zones, remove them. The
idea of multi-keg zones was introduced exactly 10 years ago, and never
have had a practical usage. In discussion with Jeff we came to a wild
agreement that if we ever want to reintroduce the idea of a smart allocator
that would be able to choose between two (or more) totally different
backing stores, that choice should be made one level higher than UMA,
e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
by the caller.
o Sleeping code is improved to account number of sleepers and wake them one
by one, to avoid thundering herd problem.
o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
KPI added. Having no bucket cache basically means setting maxcache to 0.
o Now with many fields added and many removed (no multi-keg zones!) make
sure that struct uma_zone is perfectly aligned.
Reviewed by: markj, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D17773
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.
The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.
The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.
Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using mis-identified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
vmpage requires struct pmap to exist and contain a pm_stats field. As of
r308817, either AIM or BOOKE is required to be set in order to get their
respective pmap structs. Rather than expose them both, or try to unify them
unnecessarily, add a third option which contains only a pm_stats field, and
change the two existing pmap structures to place the common fields at the
beginning of the struct. This actually fixes the stats collection by libkvm on
AIM hardware, because before it was accessing a possibly different offset, which
would cause it to read garbage.
Bump __FreeBSD_version to denote this ABI change, so that ports which depend on
libkvm can be rebuilt.
performance.
- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.
Sponsored by: EMC / Isilon Storage Division
dynamic memory allocation to hold per-CPU memory types data (sized to
mp_maxid for UMA, and to mp_maxcpus for malloc to match the kernel).
That fixes libmemstat with arbitrary large MAXCPU values and therefore
eliminates MEMSTAT_ERROR_TOOMANYCPUS error type.
Reviewed by: jhb
Approved by: re (kib)
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
of times the system was forced to sleep when requesting a new allocation.
Expand the debugger hook, db_show_uma, to display these results as well.
This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.
Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.
but then sizes the containing data structure at run-time to make room
for per-cpu cache data. Modify libmemstat to separately allocate a
buffer to hold per-cpu cache data, sized based on the run-time mp_maxid
variable when using libkvm to access UMA data. This avoids reading
invalid cache data from beyond the end of the uma_zone data structure
on the stack, which can result in invalid statistics and/or reads from
invalid kernel addresses.
Foot target practice by: ps
MFC after: 3 days
return a KVM error rather than an out of memory error, so that the caller
reports the KVM error state. This replaces a misleading error message
with a more accurate although equally confusing one.
MFC after: 3 days
cpu mask before looking at the cache entries for the CPU. For systems
with sparse CPU id arrays, this skips otherwise uninitialized cache
structures.
MFC after: 3 days
list, which could cause problems for multi-threaded applications
using libmemstat to monitor UMA in more than one thread
simultaneously.
MFC after: 3 days
opt_vmpage.h.
Remove definition of _KERNEL, it is no longer required in order to
include uma_int.h, as the sensitive parts of uma_int.h (a number of
inlines depending on kernel-only constants) are now protected by
_KERNEL.
that knows how to extract UMA(9) allocator statistics from a core dump or
live memory image using kvm(3). The caller is expected to provide the
necessary kvm_t handle, which is then used by libmemstat(3).
With these changes, it is trivially straight forward to re-introduce
vmstat -z support on core dumps, which was lost when UMA was introduced.
In the short term, this requires including vm/ include files that are not
intended for extra-kernel use, requiring in turn some ugliness.
- Move memory_type_list flushing logic from memstat_mtl_free() to
_memstat_mtl_empty(), a libmemstat-internal function that can
be called from other parts of the library. Invoke
_memstat_mtl_empty() from memstat_mtl_free(), which also frees
the containing list structure.
Invoke _memstat_mtl_empty() instead of memstat_mtl_free() in
various error cases in memstat_malloc.c and memstat_uma.c, which
previously resulted in the list being freed prematurely.
- Reverse the order of updating the mt_kegfree and mt_free fields
of the memory_type in memstat_uma.c, otherwise keg free items
won't be counted properly for non-secondary zones.
MFC after: 3 days
- Define a set of libmemstat(3) error constants, which are used by all
libmemstat(3) methods except for memstat_mtl_alloc(), which allocates
a memory type list and may return ENOMEM via errno.
- Define a per-memory_type_list current error value, which is set when a
call associated with a memory list fails. This requires wrapping a
structure around the queue(9) list head data structure, but this change
is not visible to libmemstat(3) consumers due to using access methods.
- Add a new accessor method, memstat_mtl_geterror() to retrieve the error
number.
- Consistently set the error number in a number of failure modes where
previously some combination of setting errno and printf'ing error
descriptions was used. libmemstat(3) will now no longer print to stdio
under any circumstances. Returns of NULL/-1 for errors remain the
same.
This avoids use of stdio, misuse of error numbers, and should make it
easier to program a libmemstat(3) consumer able to print useful error
messages. Currently, no error-to-string function is provided, as I'm
unsure how to address internationalization concerns.
MFC after: 1 day
try and discourage use outside the library.
Remove duplicate declaration of memstat_mtl_free() from memstat_internal.h,
as it's not internal, and the memstat.h definition suffices.
on top of a primary zone, sharing the same allocation "keg". When
reporting statistics for zones, do not report the free items in the
keg as part of the free items in the zone, or those free items will
be reported more than once: for the primary zone, and then any
secondary zones off the primary zone. Separately record and maintain
a kegfree statistic, and export via memstat_get_kegfree(), which is
available for use if needed. Since items free'd back to the keg are
not fully initialized, and hence may not actually be available (since
secondary zone ctor-time initialization can fail), this makes some
amount of sense.
This change corrects a bug made visible in the libmemstat(3)
modifications to netstat: mbufs freed back to the keg from the
packet zone would be counted twice, resulting in negative values
being printed in the mbuf free count.
Some further refinement of reporting relating to secondary zones may
still be required.
Reported by: ssouhlal
MFC after: 3 days
applications in tracking kernel memory statistics. It provides an
abstracted interface to uma(9) and malloc(9) statistics, wrapped
around the recently added binary stream sysctls for the allocators.
Using this interface, it is easy to build monitoring tools, query
specific memory types for usage information, etc. Facilities are
provided for binding caller-provided data to memory types,
incremental updates of memory types, and queries that span multiple
allocators.
Support for additional allocators is (relatively) easy to add.
The API for libmemstat(3) will probably change some over time as
consumers are written, and requirements evolve. It is written to
avoid encoding ABIs for data structure layout into consuming
applications for this reason.
MFC after: 1 week