Commit Graph

4187 Commits

Author SHA1 Message Date
markj
bbf6b587b5 Correct uma_prealloc()'s use of domainset iterators after r339925.
The iterator should be reinitialized after every successful slab
allocation.  A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]

Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.

Reported and tested by:	slavah
Diagnosed by:	kib [1]
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-01-23 18:58:15 +00:00
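
A toy model of the corrected loop shape described in the commit above; every name here is a stand-in for illustration, not the real UMA/domainset KPI:

    #include <stdbool.h>
    #include <stdio.h>

    #define NDOMAINS 2

    /* Pretend per-domain slab allocator: fails once a domain is empty. */
    static bool
    try_alloc_slab(int domain, int free_slabs[])
    {
            if (free_slabs[domain] == 0)
                    return (false);
            free_slabs[domain]--;
            return (true);
    }

    int
    main(void)
    {
            int free_slabs[NDOMAINS] = { 3, 5 };
            int wanted = 6, domain = 0, failures = 0, got = 0;

            while (wanted > 0) {
                    if (try_alloc_slab(domain, free_slabs)) {
                            wanted--;
                            got++;
                            domain = 0;     /* success: restart the iterator */
                            failures = 0;
                            continue;
                    }
                    /* Failure: advance; give up only when every domain failed. */
                    domain = (domain + 1) % NDOMAINS;
                    if (++failures == NDOMAINS) {
                            printf("all domains exhausted, would sleep here\n");
                            break;
                    }
            }
            printf("preallocated %d slabs\n", got);
            return (0);
    }
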
kib
db5030058f MI VM: Make it possible to set size of superpage at boot instead of compile time.
In order to allow a single kernel to use PAE pagetables on i386 if
the hardware supports it, and fall back to classic two-level paging
structures if not, the superpage code should be able to adapt to
either a 2M or 4M superpage size.  To that end, make the MI VM
structures large enough to track the biggest possible superpage, by
allowing an architecture to define the VM_NFREEORDER_MAX and
VM_LEVEL_0_ORDER_MAX constants.  The corresponding VM_NFREEORDER and
VM_LEVEL_0_ORDER symbols can be defined as runtime values and must be
less than the _MAX constants.  If an architecture does not define the
_MAX constants, they are assumed to be equal to the normal constants.

Reviewed by:	markj
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18853
2019-01-18 13:35:06 +00:00
glebius
39991209c1 Do not reserve KVA for paging bufs in vm_ksubmap_init(), since it is
now allocated in pbuf_init(). This should have been done together
with r343030.
2019-01-16 20:14:16 +00:00
kib
f34dbcf7d0 Implement shmat(2) flag SHM_REMAP.
Based on the description in the Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
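
A minimal userland sketch of the new flag, assuming the Linux-compatible semantics described above (attach a SysV segment over an existing mapping); error handling is kept short:

    #include <sys/mman.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #include <err.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
            size_t len = 1024 * 1024;
            void *addr, *p;
            int id;

            /* Reserve an address range with a placeholder anonymous mapping. */
            addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED)
                    err(1, "mmap");

            if ((id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600)) == -1)
                    err(1, "shmget");

            /* SHM_REMAP: replace the existing mapping at 'addr'. */
            p = shmat(id, addr, SHM_REMAP);
            if (p == (void *)-1)
                    err(1, "shmat");

            strcpy(p, "hello");
            printf("wrote \"%s\" at %p\n", (char *)p, p);

            shmdt(p);
            shmctl(id, IPC_RMID, NULL);
            return (0);
    }
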
glebius
7de9bd7cd4 Whitespace. 2019-01-16 04:02:08 +00:00
glebius
37fd65a0e1 Fix compilation failures on arches whose vm_machdep.c is not aware of
counter_u64_t, by including counter.h in uma_int.h. I'm not happy about
this inclusion, but it fixes compilation ASAP.
2019-01-15 19:33:47 +00:00
glebius
25c8562625 style(9): break long line. 2019-01-15 18:50:11 +00:00
glebius
7546fbe224 Remove a harmless leftover from the code that cycles over a zone's kegs:
just use + instead of +=. There is no functional change.
2019-01-15 18:49:31 +00:00
glebius
8c4ff6ac75 Only do uz_items accounting for zones that have a limit set in uz_max_items.
This reduces the amount of locking required for these zones.

Also, for cache-only zones (UMA_ZFLAG_CACHE), uz_items accounting wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA, underflowing uz_items.

Tested by:	pho
2019-01-15 18:32:26 +00:00
glebius
60d5d98bc3 Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces the amount of data protected by the zone lock.

During startup, point these fields at EARLY_COUNTER.  After startup,
allocate them for all early zones.

Tested by:	pho
2019-01-15 18:24:34 +00:00
glebius
dd582940fe Fix compilation on 32-bit. 2019-01-15 03:43:46 +00:00
glebius
7ee1aa34d4 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit(), create pbuf_zone and start accounting how many
  pbufs we are going to need.
  In the various subsystems that use pbufs, create private zones via calls
  to pbuf_zsecond_create(). The latter calls uma_zsecond_create() and sets
  a limit on the created zone. After startup, preallocate pbufs according
  to the requirements of all pbuf zones.

  Subsystems that used to have a private limit with the old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch the tunable value of kern.nswbuf from init_param2(), and while here
  move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, which held only
  this option.
  Default values aren't touched by this commit, but they should probably be
  reviewed with respect to modern hardware.

This change removes a tight bottleneck from the sendfile(2) operation,
which uses pbufs in the vnode pager. Other pagers would also benefit
from faster allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
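
A hedged kernel-side sketch of the interface introduced above; the subsystem name and limit are invented, and the exact pbuf_zsecond_create() prototype is assumed from the description:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/buf.h>
    #include <sys/malloc.h>

    #include <vm/uma.h>

    static uma_zone_t mysubsys_pbuf_zone;

    /* Create a private pbuf zone with its own limit (here a share of nswbuf). */
    static void
    mysubsys_pbuf_init(void)
    {
            mysubsys_pbuf_zone = pbuf_zsecond_create("mysubsyspbuf", nswbuf / 4);
    }

    static struct buf *
    mysubsys_getpbuf(void)
    {
            /* M_WAITOK sleeps once the zone's limit is reached. */
            return (uma_zalloc(mysubsys_pbuf_zone, M_WAITOK));
    }

    static void
    mysubsys_relpbuf(struct buf *bp)
    {
            uma_zfree(mysubsys_pbuf_zone, bp);
    }
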
glebius
f1a8621cf2 o Move zone limit from keg level up to zone level. This means that now
  two zones sharing a keg may have different limits. Now this is going
  to work:

  zone = uma_zcreate();
  uma_zone_set_max(zone, limit);
  zone2 = uma_zsecond_create(zone);
  uma_zone_set_max(zone2, limit2);

  Kegs no longer have the uk_maxpages field; zones have a uz_max_items
  limit and a uz_items counter instead.  When set, the limit may be
  rounded up to the minimum possible CPU bucket cache size.  For small
  limits the bucket cache can also be reconfigured to be smaller.  The
  uz_items counter is updated whenever items transition from the keg to
  a bucket cache or directly to a consumer.  If a zone has uz_max_items
  set and the limit is reached, then we are going to sleep.

o Since the new limits don't play well with multi-keg zones, remove them.
  The idea of multi-keg zones was introduced exactly 10 years ago and never
  had a practical use. In discussion with Jeff we came to a wild agreement
  that if we ever want to reintroduce the idea of a smart allocator that
  would be able to choose between two (or more) totally different backing
  stores, that choice should be made one level higher than UMA, e.g. in
  malloc(9) or in mget(), and the choice should be controlled by the
  caller.

o The sleeping code is improved to count the number of sleepers and wake
  them one by one, to avoid a thundering herd problem.

o The UMA_ZONE_NOBUCKETCACHE flag is removed; a uma_zone_set_maxcache()
  KPI is added instead. Having no bucket cache basically means setting
  maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
  sure that struct uma_zone is perfectly aligned.

Reviewed by:	markj, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D17773
2019-01-15 00:02:06 +00:00
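
A small sketch of the per-zone limit plus the maxcache KPI from the entry above; the item type, zone name, and numbers are invented, and uma_zone_set_maxcache()'s exact prototype is assumed to mirror uma_zone_set_max():

    #include <sys/param.h>
    #include <vm/uma.h>

    struct myitem {
            int     field;
    };

    static uma_zone_t myitem_zone;

    static void
    myitem_zone_init(void)
    {
            myitem_zone = uma_zcreate("myitem", sizeof(struct myitem),
                NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
            /* Per-zone item limit; may be rounded up to the bucket cache size. */
            uma_zone_set_max(myitem_zone, 512);
            /* Replaces UMA_ZONE_NOBUCKETCACHE: 0 disables the bucket cache. */
            uma_zone_set_maxcache(myitem_zone, 0);
    }
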
glebius
74d4127eda Fix a regression in r331368 that broke dumping of UMA startup pages
when WITNESS is present.

Discussed with:	markj
2019-01-07 23:17:09 +00:00
kib
54e59fff7b Add a tunable which changes mincore(2) algorithm to only report data
from the local mapping.

Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161

Reviewed by:	markj
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18764
2019-01-07 22:10:48 +00:00
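
For context, a minimal userland mincore(2) example; with the tunable above enabled, residency that exists only through other mappings of the same object would no longer be reported for this mapping:

    #include <sys/mman.h>

    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
            size_t len = 4 * pagesz, npages = len / pagesz;
            char *vec, *addr;

            addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED)
                    err(1, "mmap");
            addr[0] = 1;                    /* fault in only the first page */

            if ((vec = malloc(npages)) == NULL)
                    err(1, "malloc");
            if (mincore(addr, len, vec) == -1)
                    err(1, "mincore");
            for (size_t i = 0; i < npages; i++)
                    printf("page %zu: %sresident\n", i,
                        (vec[i] & MINCORE_INCORE) ? "" : "not ");

            free(vec);
            return (0);
    }
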
kib
d57c248b7d Add 'v' modifier to the ddb 'show pginfo' command to display vm_page
backing the provided kernel virtual address.

Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-30 15:58:18 +00:00
mjg
7e31d1de7e Remove unused argument to priv_check_cred.
Patch mostly generated with coccinelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
mjg
5d5736ca94 vm: use fcmpset for vmspace reference counting
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:22:54 +00:00
brooks
3456e102d5 Normalize COMPAT_43 syscall declarations.
Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases, this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15816
2018-12-04 16:48:47 +00:00
kib
61e5f27315 Change the vm_ooffset_t type to unsigned.
The type represents a byte offset in the vm_object_t data space, which
does not span negative offsets in the FreeBSD VM.  The change matches
the byte offset's signedness with the unsignedness of vm_pindex_t,
which is the type of the page indexes in the objects.

This allows removing the UOFF_TO_IDX() macro, which was used when we
had to forcibly interpret the type as unsigned anyway.  It also fixes
a lot of implicit bugs in the device drivers' d_mmap methods.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2018-12-02 13:16:46 +00:00
kib
58c3cabfa8 Allow creating a swap zone larger than v_page_count / 2.
If the user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt.  Before, the tunable could
only reduce the zone size.

While there, correct the message, which was wrong when zone creation
rounded the size up.

Reported by:	jmg
Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18381
2018-12-01 16:50:12 +00:00
vangyzen
dc0101512a Include path for tmpfs objects in vm.objects sysctl
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by:	kib, dab
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:59:43 +00:00
vangyzen
70ed60ac3d Add assertions and comment to vm_object_vnode()
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:18:31 +00:00
markj
84154174b3 Update the free page count when blacklisting pages.
Otherwise the free page count will not accurately reflect the physical
page allocator's state.  On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync.  On 12 the
bug would manifest as threads looping in vm_page_alloc().

PR:		231296
Reported by:	mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by:	alc, kib, mav
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18374
2018-11-29 16:31:01 +00:00
glebius
7b43f3229b Fix yet another edge case in uma_startup_count(). If the zone size fits
into several pages but leaves no space for struct uma_slab at the end, we
miscalculate the number of pages by one. Mimic the keg_large_init() math
exactly here to cover that problem.

Reported by:	gallatin
2018-11-28 19:54:02 +00:00
glebius
82a2de386c For non-offpage zones the slab is placed at the end of the page. The keg's
uk_pgoff is calculated to guarantee that struct uma_slab is placed at
pointer-size alignment. The calculation of the real struct uma_slab size is
done in keg_ctor() and yet again in keg_large_init(), to check if we need
an extra page. This calculation can actually be performed at compile time.

- Add a SIZEOF_UMA_SLAB macro to calculate the size of a struct uma_slab
  placed at the end of a page with the alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is not
  a functional change.
- Use SIZEOF_UMA_SLAB in the UMA_SLAB_SPACE definition and in
  keg_small_init(). This is a potential bugfix, but in reality I don't
  think any systems are affected, since the compiler aligns struct
  uma_slab anyway.
2018-11-28 19:17:27 +00:00
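
An illustrative, self-contained version of the compile-time math above; the stand-in slab header and PAGE_SIZE are made up, and the real macro in uma_int.h may differ in detail:

    #include <stdio.h>

    #define PAGE_SIZE       4096

    /* Stand-in for struct uma_slab; the real layout lives in vm/uma_int.h. */
    struct fake_uma_slab {
            void            *us_link;
            unsigned char   us_freecount;
            unsigned char   us_flags;
    };

    /* Header size rounded up to pointer alignment, placed at the page's end. */
    #define SIZEOF_UMA_SLAB                                         \
            ((sizeof(struct fake_uma_slab) + sizeof(void *) - 1) &  \
            ~(sizeof(void *) - 1))
    #define UMA_SLAB_SPACE  (PAGE_SIZE - SIZEOF_UMA_SLAB)

    int
    main(void)
    {
            printf("slab header: %zu bytes (struct is %zu)\n",
                (size_t)SIZEOF_UMA_SLAB, sizeof(struct fake_uma_slab));
            printf("item space per page: %zu bytes\n", (size_t)UMA_SLAB_SPACE);
            return (0);
    }
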
kib
65de11809e Avoid unneeded check in vmspace_alloc().
All vmspace_alloc() callers know which kind of pmap they allocate.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18329
2018-11-25 17:56:49 +00:00
bwidawsk
11416ef4a8 linuxkpi: Use pageproc instead of vmproc
According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages.  vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary.  It's a
somewhat legacy component of the system and isn't required.  You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which case vm_swapout_dummy.c is compiled
instead of vm_swapout.c.

Based on this, we want pageproc to emulate kswapd, not vmproc.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D18061
2018-11-21 04:34:18 +00:00
bwidawsk
82426acbdd linuxkpi: Add some basic swap functions
These are used by kms-drm to determine various heuristics related to
memory conditions.

The number of free swap pages is just a variable, and reading it could be
made much cheaper by either adding a new getter or simply extern'ing
swap_total. However, this patch opts to use the more expensive, existing
interface, since this isn't an operation in a high-performance path.

This allows us to remove some more GPL linuxkpi code and do the following
in kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h

Reviewed by:    mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by:    emaste (mentor)
Differential Revision:  https://reviews.freebsd.org/D18052
2018-11-20 22:49:19 +00:00
alc
0f0e29d6b8 Use swp_pager_isondev() throughout.
Change swp_pager_isondev()'s return type to bool.

Submitted by:	ota@j.email.ne.jp
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16712
2018-11-19 17:17:23 +00:00
alc
4edb7810ba Tidy up vm_map_simplify_entry() and its recently introduced helper
functions.  Notably, reflow the text of some comments so that they
occupy fewer lines, and introduce an assertion in one of the new
helper functions so that it is not misused by a future caller.

In collaboration with:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17635
2018-11-18 01:27:17 +00:00
markj
e8333c0398 Add accounting to per-domain UMA full bucket caches.
In particular, track the current size of the cache and maintain an
estimate of its working set size.  This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages.  As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with:	alc, glebius, jeff
Tested by:	pho (previous version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D16666
2018-11-13 19:44:40 +00:00
markj
110bc77162 Re-apply r336984, reverting r339934.
r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down.  Now that the bug is fixed, we
can revert the revert.

Discussed with:	alc
MFC after:	3 days
2018-11-10 20:33:08 +00:00
markj
4bb5dea678 Fix a use-after-free in swp_pager_meta_free().
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934.  In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed.  After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17897
2018-11-07 23:28:11 +00:00
markj
c259f44e9a Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
markj
e3f6c2eda5 Initialize last_target in the laundry thread control loop.
In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.

CID:		1387997
MFC after:	1 week
2018-11-06 02:52:54 +00:00
markj
92bd4105f9 Initialize the eflags field of vm_map headers.
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14005
2018-11-02 16:26:44 +00:00
markj
8bf5ba5706 Revert r336984.
It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.

Requested by:	alc
MFC after:	now
2018-10-30 22:40:40 +00:00
markj
f931b753dd Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
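
A brief kernel-side sketch of the new KPI with a DOMAINSET_PREF() policy, as described above; the malloc type and helpers are invented for illustration:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/domainset.h>
    #include <sys/malloc.h>

    static MALLOC_DEFINE(M_EXAMPLE, "example", "NUMA-aware example buffers");

    /*
     * Prefer the given domain but allow fallback, so M_WAITOK cannot block
     * forever on a depleted domain (the problem described above).
     */
    static void *
    example_alloc(size_t size, int domain)
    {
            return (malloc_domainset(size, M_EXAMPLE, DOMAINSET_PREF(domain),
                M_WAITOK | M_ZERO));
    }

    static void
    example_free(void *p)
    {
            free(p, M_EXAMPLE);
    }
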
markj
4f4e1af172 Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
alc
cfcce55c38 Eliminate typically pointless calls to vm_fault_prefault() on soft, copy-
on-write faults.  On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighoring
pages.  For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the case of soft, copy-on-write faults, the effort very
rarely pays off.  (See the review for some specific data.)

Reviewed by:	kib, markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D17367
2018-10-27 17:49:46 +00:00
markj
f2fc591269 Add a !NUMA definition for vm_domainset_iter_policy_ref_init().
Pointy hat:	markj
X-MFC with:	r339661
Sponsored by:	The FreeBSD Foundation
2018-10-24 17:09:20 +00:00
markj
239b1074f3 Add an #include required after r339686.
X-MFC with:	r339686
Sponsored by:	The FreeBSD Foundation
2018-10-24 16:49:16 +00:00
markj
ac2a239a60 Use a vm_domainset iterator in keg_fetch_slab().
Previously, it used a hand-rolled round-robin iterator.  This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with:	jeff
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17420
2018-10-24 16:41:47 +00:00
markj
875935f8c5 Initialize static domainsets regardless of whether an SRAT is present.
Reported by:	yuripv
X-MFC with:	r339452
Sponsored by:	The FreeBSD Foundation
2018-10-23 18:07:16 +00:00
markj
6c262608dd Refactor domainset iterators for use by malloc(9) and UMA.
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc".  The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy.  Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations.  To that end, refactor the vm_domainset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by:	jeff (previous version)
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17417
2018-10-23 16:35:58 +00:00
markj
0dd92926d8 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
markj
816ca801ac Fix the build after r339601.
I committed some patches out of order and didn't build-test one of them.

Reported by:	Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with:	r339601
2018-10-22 17:19:48 +00:00
markj
87de4fcb0a Avoid a redundancy in a comment updated by r339601.
Reported by:	alc
X-MFC with:	r339601
2018-10-22 17:17:30 +00:00
markj
9ce499cca5 Swap in processes unless there's a global memory shortage.
On NUMA systems, we would not swap in processes unless all domains
had some free pages.  This is too conservative in general.  Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.

Reported and tested by:	pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17304
2018-10-22 17:04:04 +00:00