Commit Graph

4324 Commits

Author SHA1 Message Date
Mark Johnston
38547d5980 Assert that the refcount value is not VPRC_BLOCKED in vm_page_drop().
VPRC_BLOCKED is a refcount flag used to indicate that a thread is
tearing down mappings of a page.  When set, it causes attempts to wire a
page via a pmap lookup to fail.  It should never represent the last
reference to a page, so assert this.

Suggested by:	kib
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:16:48 +00:00
Mark Johnston
923da43e7c Fix a race in vm_page_dequeue_deferred_free() after r352110.
This function loaded the page's queue index before setting PGA_DEQUEUE.
In this window the page daemon may have deactivated the page, updating
its queue index.  Make the operation atomic using vm_page_pqstate_cmpset();
the page daemon will not modify the page once it observes that PGA_DEQUEUE
is set.

Reported and tested by:	pho
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:12:49 +00:00
Mark Johnston
47aef898ea Fix a page leak in vm_page_reclaim_run().
After r352110 the attempt to remove mappings of the page being replaced
may fail if the page is wired.  In this case we must free the replacement
page.

Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:09:31 +00:00
Mark Johnston
e8bcf6966b Revert r352406, which contained changes I didn't intend to commit. 2019-09-16 15:04:45 +00:00
Mark Johnston
41fd4b9422 Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:03:12 +00:00
Hans Petter Selasky
11b57401e6 Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision:	https://reviews.freebsd.org/D21620
Reviewed by:	kib@, markj@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-09-12 16:26:59 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
 - vm_page_grab_valid() will grab and page in if necessary.
 - vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00
Jeff Roberson
4cdea4a853 Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state.  The object lock is still used as an interlock
to ensure that the identity stays valid.  Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21255
2019-09-10 18:27:45 +00:00
Mark Johnston
fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Konstantin Belousov
0e79619e1e vm_object_deallocate(): Remove no longer needed code.
We track text mappings explicitly, there is no removal of the text
refs on the object deallocate any more, so tmpfs objects should not be
treated specially. Doing so causes excess deref.

Reported and tested by:	gallatin
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 16:01:45 +00:00
Konstantin Belousov
4f49b435c8 vm_object_coalesce(): avoid extending any nosplit objects, not only
that which back tmpfs nodes.

Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 15:58:48 +00:00
Konstantin Belousov
bf5661f4a1 madvise(MADV_FREE): Quick fix to time rewind.
Don't free pages in a shadowing object.  While this degrades MADV_FREE
to a no-op (and we could, instead, choose to fall back to
MADV_DONTNEED, at the cost of changing pmap_madvise), this is
presently considered a temporary fix. We may prefer to risk a little
fragmentation of the map by creating a zero/OBJT_DEFAULT entry over
top of the existing object and, simultaneously, revert to the existing
marking any pages in the former shadowing object in the advised region
as reclaimable.  At least one consumer of MADV_FREE (snmalloc) may use
mmap() to construct zeroed pages "eventually" here anyway, so the
fragmentation may be coming anyway.

Submitted by:	Nathaniel Filardo <nwf20@cl.cam.ac.uk>
PR:	240061
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21517
2019-09-04 20:28:16 +00:00
Kyle Evans
fe7bcbaf50 vm pager: writemapping accounting for OBJT_SWAP
Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21456
2019-09-03 20:31:48 +00:00
Konstantin Belousov
fe69291ff4 Add procctl(PROC_STACKGAP_CTL)
It allows a process to request that stack gap was not applied to its
stacks, retroactively.  Also it is possible to control the gaps in the
process after exec.

PR:	239894
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21352
2019-09-03 18:56:25 +00:00
Mark Johnston
7cdeaf3309 Add preliminary support for atomic updates of per-page queue state.
Queue operations on a page use the page lock when updating the page to
reflect the desired queue state, and the page queue lock when physically
enqueuing or dequeuing a page.  Multiple pages share a given page lock,
but queue state is per-page; this false sharing results in heavy lock
contention.

Take a small step towards the use of atomic_cmpset to synchronize
updates to per-page queue state by introducing vm_page_pqstate_cmpset()
and using it in the page daemon.  In the longer term the plan is to stop
using the page lock to protect page identity and rely only on the object
and page busy locks.  However, since the page daemon avoids acquiring
the object lock except when necessary, some synchronization with a
concurrent free of the page is required.  vm_page_pqstate_cmpset() can
be used to ensure that queue state updates are successful only if the
page is not scheduled for a dequeue, which is sufficient for the page
daemon.

Add vm_page_swapqueue(), which moves a page from one queue to another
using vm_page_pqstate_cmpset().  Use it in the active queue scan, which
does not use the object lock.  Modify vm_page_dequeue_deferred() to
use vm_page_pqstate_cmpset() as well.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21257
2019-09-03 14:29:58 +00:00
Mark Johnston
9d75f0dc75 Map the vm_page array into KVA on amd64.
r351198 allows the kernel to use domain-local memory to back the vm_page
array (up to 2MB boundaries) and reserves a separate PML4 entry for that
purpose.  One consequence of that change is that the vm_page array is no
longer present in minidumps, which only adds pages mapped above
VM_MIN_KERNEL_ADDRESS.

To avoid the friction caused by having kernel data structures mapped
below VM_MIN_KERNEL_ADDRESS, map the vm_page array starting at
VM_MIN_KERNEL_ADDRESS instead of using a dedicated PML4 entry.

Reviewed by:	kib
Discussed with:	jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21491
2019-09-03 13:18:51 +00:00
Mark Johnston
08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Mateusz Guzik
2614bd9699 vm: only lock tmpfs vnode shared in vm_object_deallocate
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21455
2019-08-28 19:28:27 +00:00
Mark Johnston
b5d239cb97 Wire pages in vm_page_grab() when appropriate.
uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it.  Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.

In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE).  The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21440
2019-08-28 16:08:06 +00:00
Mark Johnston
a70e17eeca Fix a few nits in vm_pqbatch_process_page().
- Don't bother masking off non-queue state flags when loading the
  page's atomic state, since it is only required for one of the
  function's assertions.  Update the assertion instead.
- Remove an incorrect comment regarding synchronization with the
  page daemon.  The page daemon only ever checks for PGA_ENQUEUED
  with the page queue lock held.
- When clearing requeue flags, only clear the flags that have been
  acted upon.

Reviewed by:	kib (previous version)
Discussed with:	alc
Tested by:	pho (part of a larger patch)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21368
2019-08-26 20:20:10 +00:00
Mark Johnston
b48d4efe75 Handle UMA_ANYDOMAIN in kstack_import().
The kernel thread stack zone performs first-touch allocations by
default, and must handle the case where the local memory domain
is empty.  For most UMA zones this is handled in the keg layer,
but cache zones currently must implement a policy for this case.
Simply use a round-robin policy if UMA_ANYDOMAIN is passed.

Reported and tested by:	bcran
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
2019-08-25 21:14:46 +00:00
Konstantin Belousov
783a68aa33 Move OBJT_VNODE specific code from vm_object_terminate() to
vnode_destroy_vobject().

Reviewed by:	alc, jeff (previous version), markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21357
2019-08-25 13:26:06 +00:00
Doug Moore
83ea714f4f vm_map_simplify_entry considers merging an entry with its two
neighbors, and is used in a way so that if entries a and b cannot be
merged, we consider them twice, first not-merging a with its successor
b, and then not-merging b with its predecessor a. This change replaces
vm_map_simplify_entry with vm_map_try_merge_entries, which compares
two adjacent entries only, and uses it to avoid duplicated
merge-checks.

Tested by: pho
Reviewed by: alc
Approved by: markj (implicit)
Differential Revision: https://reviews.freebsd.org/D20814
2019-08-25 07:06:51 +00:00
Konstantin Belousov
a7751d328a Make stack grow use the same gap as stack create.
Store stack_guard_page * PAGE_SIZE into the gap->next_read field at
the time of the stack creation.  This makes the used guard size
consistent between stack creation and stack grow time.

Suggested by:	alc
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21384
2019-08-24 14:29:13 +00:00
Mateusz Guzik
5b596b9fa5 Remove the obsolete pcpu_zone_ptr zone.
It was only used by flowtable (removed in r321618).

Sponsored by:	The FreeBSD Foundation
2019-08-24 00:01:19 +00:00
Mark Johnston
f93670b7b9 Stop clearing page flags in vm_page_pqbatch_submit().
All existing callers guarantee that the page does not have a
pre-existing dequeue pending.  Thus, if the page is dequeued before
pqbatch_submit() acquires the page queue lock, we do not need to do
anything since vm_page_dequeue_complete() takes care of clearing all
page queue state flags for us.

With this change, vm_page_pqbatch_submit() has the nice property that it
does not directly modify any fields in the page structure.

Reviewed by:	alc, kib
Tested by:	pho (part of a larger change)
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21372
2019-08-23 19:53:11 +00:00
Mark Johnston
386eba08bd Make vm_pqbatch_submit_page() externally visible.
It will become useful for the page daemon to be able to directly create
a batch queue entry for a page, and without modifying the page
structure.  Rename vm_pqbatch_submit_page() to vm_page_pqbatch_submit()
to keep the namespace consistent.  No functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21369
2019-08-23 19:49:29 +00:00
Mark Johnston
8b90607f20 Simplify vm_page_dequeue() and fix an assertion.
- Add a vm_pagequeue_remove() function to physically remove a page
  from its queue and update the queue length.
- Remove vm_page_pagequeue_lockptr() and let vm_page_pagequeue()
  return NULL for dequeued pages.
- Avoid unnecessarily reloading the queue index if vm_page_dequeue()
  loses a race with a concurrent queue operation.
- Correct an always-true assertion: vm_page_dequeue() may be called
  from the page allocator with the page unlocked.  The assertion
  m->order == VM_NFREEORDER simply tests whether the page has been
  removed from the vm_phys free lists; instead, check whether the
  page belongs to an object.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21341
2019-08-21 16:11:12 +00:00
Mark Johnston
acad79e66f Unconditionally enable debug.vm_lowmem.
It is useful for testing purposes to be able to drain UMA caches, so
do not limit the sysctl to DIAGNOSTIC kernels.

MFC after:	1 week
Sponsored by:	Netflix
2019-08-21 16:01:17 +00:00
Mark Johnston
930b195263 Don't requeue active pages in vm_swapout_object_deactivate_pages().
As of r332974 the page daemon does not requeue pages during a scan
of the active queue, so there is not much value in doing so here
either.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21343
2019-08-21 15:52:10 +00:00
Jeff Roberson
cf27e0d125 Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by:	markj
Tested by:	pho (as part of a larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21311
2019-08-19 23:09:38 +00:00
Jeff Roberson
4153054a7c Permit vm_pager_has_page() to run with a shared lock. Introduce
VM_OBJECT_DROP/VM_OBJECT_PICKUP to handle functions that are called with
uncertain lock state.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21310
2019-08-19 22:25:28 +00:00
Jeff Roberson
3e5e1b5135 Allocate amd64's page array using pages and page directory pages from the
NUMA domain that the pages describe.  Patch original from gallatin.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-18 23:07:56 +00:00
Konstantin Belousov
bb9e2184f0 Change locking requirements for VOP_UNSET_TEXT().
Require the vnode to be locked for the VOP_UNSET_TEXT() call.  This
will be used by the following bug fix for a tmpfs issue.

Tested by:	sbruno, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-18 20:24:52 +00:00
Jeff Roberson
3921068f1e Remove unnecessary debugging from r351181 that caused powerpc build to fail.
Tested by:	make universe TARGETS=powerpc
2019-08-18 08:07:31 +00:00
Jeff Roberson
be3f5f298b vm_phys_avail_find is only used on NUMA kernels. Fix a build error. 2019-08-18 07:43:15 +00:00
Jeff Roberson
b7565d44df Encapsulate phys_avail manipulation in a set of simple routines. Add a
NUMA aware boot time memory allocator that will be used to allocate early
domain correct structures.  Code partially submitted by gallatin.

Reviewed by:	gallatin, kib
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21251
2019-08-18 07:06:31 +00:00
Aleksandr Rybalko
6b821a7455 Check paddr for overflow.
Fix panic on initialize of "vm reserv" per-superpage lock in case when RAM ends at upper boundary of address space.
Observed on ARM32 board BPI-R2 (2GB RAM 0x80000000-0xffffffff).

PR:		235362
Reviewed by:	kib, markj, alc
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D21272
2019-08-16 19:27:05 +00:00
Konstantin Belousov
245139c69d Fix OOM handling of some corner cases.
In addition to pagedaemon initiating OOM, also do it from the
vm_fault() internals.  Namely, if the thread waits for a free page to
satisfy page fault some preconfigured amount of time, trigger OOM.
These triggers are rate-limited, due to a usual case of several
threads of the same multi-threaded process to enter fault handler
simultaneously.  The faults from pagedaemon threads participate in the
calculation of OOM rate, but are not under the limit.

Reviewed by:	markj (previous version)
Tested by:	pho
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D13671
2019-08-16 09:43:49 +00:00
Jeff Roberson
2194393787 Move phys_avail definition into MI code. It is consumed in the MI layer and
doing so adds more flexibility with less redundant code.

Reviewed by:	jhb, markj, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21250
2019-08-16 00:45:14 +00:00
Doug Moore
504f5e294e swap_pager.c reserves 2 blocks for a bsd label. Change that 2 to the
expression howmany(BBSIZE, PAGE_SIZE), where BBSIZE is the size of the
boot block area.  That can be less than 2 if PAGE_SIZE is big.

swapon(8) has an option to trim (delete) all the blocks of a device at
startup.  However, if the first of those blocks is a bsd label, then
trimming those blocks is destructive.  Change swapon to leave the
first BBSIZE bytes untrimmed.

Update manual pages to reflect changes in how swapon and how it may be
used, espeically in association with savecore.

Reviewed by: alc
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21191
2019-08-15 02:30:44 +00:00
Konstantin Belousov
10ae16c7fe Fix stack grow for init.
During early stages of kern_exec(), including strings copyout,
p_textvp for init is NULL.  This prevented stack grow from working for
init execution.

Without stack gap enabled, initial stack segment size is enough for
strings passed by kernel to init.  With the gap enabled, the used
address might fall out of the initial segment, which kills init.

Exclude initproc from the check for contexts which should not cause
stack grow in the target map.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-08 16:48:19 +00:00
Jeff Roberson
0b26119b21 Cache kernel stacks in UMA. This gives us NUMA support, better concurrency,
and more statistics.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20931
2019-08-06 23:15:34 +00:00
Jeff Roberson
eda1b01647 Implement a MINBUCKET zone flag so we can use minimal caching on zones that
may be expensive to cache.

Reviewed by:	markj, kib
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D20930
2019-08-06 23:04:59 +00:00
Jeff Roberson
c168508655 Add two new kernel options to control memory locality on NUMA hardware.
- UMA_XDOMAIN enables an additional per-cpu bucket for freed memory that
   was freed on a different domain from where it was allocated.  This is
   only used for UMA_ZONE_NUMA (first-touch) zones.
 - UMA_FIRSTTOUCH sets the default UMA policy to be first-touch for all
   zones.  This tries to maintain locality for kernel memory.

Reviewed by:	gallatin, alc, kib
Tested by:	pho, gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20929
2019-08-06 21:50:34 +00:00
Mark Johnston
98549e2dc6 Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().
Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page.  Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O.  This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by:	alc, kib
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20986
2019-07-29 22:01:28 +00:00
Doug Moore
23612f0df3 In swap_pager_putpages, move the initialization of a free-blocks
counter, and the final freeing of freed swap blocks, outside the
region where an object lock is held.  Correct some style(9) and
spelling errors.  Change a panic() to a KASSERT().  Change a boolean_t
to a bool.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D21093
2019-07-28 19:32:23 +00:00
Mark Johnston
b16e57a6c9 Rename vm_page_{import,release}() to vm_page_zone_{import,release}().
I would like to use the name vm_page_release() for a different purpose,
and vm_page_{import,release}() are local to vm_page.c.

Reviewed by:	kib
MFC after:	1 week
2019-07-20 18:25:41 +00:00
Doug Moore
312df2c1dd Define vm_map_entry_in_transition to handle an in-transition map
entry, combining code currently in vm_map_unwire and
vm_map_wire_locked into a single function, called by each of them for
entries in transition.

Discussed with: kib, markj
Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20833
2019-07-19 20:47:35 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Mark Johnston
46736e306c Elide the vm_reserv_free_page() call when PG_PCPU_CACHE is set.
Pages with PG_PCPU_CACHE set cannot have been allocated from a
reservation, so as an optimization, skip the call to
vm_reserv_free_page() in this case.  Otherwise, the access of
the corresponding reservation structure often results in a cache
miss.

Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20859
2019-07-08 19:02:40 +00:00
Mark Johnston
d9a73522e3 Add a per-CPU page cache per VM free pool.
Some workloads benefit from having a per-CPU cache for
VM_FREEPOOL_DIRECT pages.

Reviewed by:	dougm, kib
Discussed with:	alc, jeff
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20858
2019-07-08 18:56:30 +00:00
Doug Moore
7b9bcad939 A style-related change, r349791, made unclear the meaning of a
comment. Rewrite that comment to improve its clarity.

Reported by: cem
Reviewed by: alc, cem
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20871
2019-07-07 06:57:04 +00:00
Doug Moore
0cab71bcee Fix style(9) violations involving division by PAGE_SIZE.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20847
2019-07-06 15:55:16 +00:00
Doug Moore
31c82722c1 Change blist_next_leaf_alloc so that it can examine more than one leaf
after the one where the possible block allocation begins, and allocate
a larger number of blocks than the current limit. This does not affect
the limit on minimum allocation size, which still cannot exceed
BLIST_MAX_ALLOC.

Use this change to modify swp_pager_getswapspace and its callers, so
that they can allocate more than BLIST_MAX_ALLOC blocks if they are
available.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20579
2019-07-06 06:15:03 +00:00
Doug Moore
56948d177e Based on work posted at https://reviews.freebsd.org/D13484, change
swap_pager_swapoff_object and swp_pager_force_pagein so that they can
page in multiple pages at a time to a swap device, rather than doing
one I/O operation for each page.

Tested by: pho
Submitted by: ota_j.email.ne.jp (Yoshihiro Ota)
Reviewed by: alc, markj, kib
Approved by: kib, markj (mentors)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D20635
2019-07-05 16:49:34 +00:00
Doug Moore
d2860f22a4 Move an assignment, drop a label, and change gotos to break statements
in vm_map_unwire. The code generated on amd86 is unchanged.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20850
2019-07-04 19:25:30 +00:00
Doug Moore
b71f9b0de6 Replace a 'goto' with an 'else' in vm_map_wire_locked.
Reviewed by: alc
Approved by: markj (mentor)
Differential Revision:	https://reviews.freebsd.org/D20855
2019-07-04 19:17:55 +00:00
Doug Moore
9a0cdf9440 Change boolean_t variables in vm_map_unwire and vm_map_wire_locked to
bool. Drop result variable. Add holes_ok bool to replace repeated
masking of flags parameter.

Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20846
2019-07-04 19:12:13 +00:00
Doug Moore
723413be0c Drop a temp variable from vm_map_insert, with no effect on the
resulting amd64 machine code.

Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20849
2019-07-04 18:28:49 +00:00
Doug Moore
38e220e8df Eliminate a goto and a label in vm_map_wire_locked by inserting an 'else'.
Reviewed by: alc
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20845
2019-07-03 22:41:54 +00:00
Ed Maste
b93a053ca2 correct pmap_ts_referenced return type
pmap_ts_referenced returns a count, not a boolean, and is supposed to
have int as the return type not boolean_t.

This worked previously because boolean_t is an int typedef.

Discussed with:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-07-03 19:59:56 +00:00
Mark Johnston
d70f0ab38d Cache the next queue element when traversing a page queue.
When QUEUE_MACRO_DEBUG_TRASH is configured, removing a queue element
invalidates its queue linkage pointers.  vm_pageout_collect_batch()
was relying on these pointers remaining valid after a removal, so
modify it to fetch the next queued page before dequeuing the current
page.

Submitted by:	Don Morris <dgmorris@earthlink.net>
Reviewed by:	cem, vangyzen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20842
2019-07-03 18:46:39 +00:00
Mark Johnston
9f74cdbf78 Mark pages allocated from the per-CPU cache.
Only free pages to the cache when they were allocated from that cache.
This mitigates rapid fragmentation of physical memory seen during
poudriere's dependency calculation phase.  In particular, pages
belonging to broken reservations are no longer freed to the per-CPU
cache, so they get a chance to coalesce with freed pages during the
break.  Otherwise, the optimized CoW handler may create object
chains in which multiple objects contain pages from the same
reservation, and the order in which we do object termination means
that the reservation is broken before all of those pages are freed,
so some of them end up in the per-CPU cache and thus permanently
fragment physical memory.

The flag may also be useful for eliding calls to vm_reserv_free_page(),
thus avoiding memory accesses for data that is likely not present
in the CPU caches.

Reviewed by:	alc
Discussed with:	jeff
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20763
2019-07-02 19:51:40 +00:00
Konstantin Belousov
5dc7e31a09 Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement.  Provide procctl(2) verbs to set and query
implicit PROTMAX handling.  The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by:	emaste, markj (previous version)
Discussed with:	brooks
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:07:17 +00:00
Konstantin Belousov
3730695151 Use traditional 'p' local to designate td->td_proc in kern_mmap.
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20795
2019-07-02 19:01:14 +00:00
Doug Moore
5201cbabf5 Remove a call to vm_map_simplify_entry from _vm_map_clip_start.
Recent changes to vm_map_protect have made it unnecessary.

Reviewed by: alc
Approved by: kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20633
2019-06-30 02:08:13 +00:00
Doug Moore
a72dce340d If vm_map_protect fails with KERN_RESOURCE_SHORTAGE, be sure to
simplify modified entries before returning.

Reviewed by: alc, markj (earlier version), kib (earlier version)
Approved by: kib, markj (mentors, implicit)
Differential Revision: https://reviews.freebsd.org/D20753
2019-06-28 02:14:54 +00:00
Mark Johnston
0fd977b3fa Add a return value to vm_page_remove().
Use it to indicate whether the page may be safely freed following
its removal from the object.  Also change vm_page_remove() to assume
that the page's object pointer is non-NULL, and have callers perform
this check instead.

This is a step towards an implementation of an atomic reference counter
for each physical page structure.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20758
2019-06-26 17:37:51 +00:00
Doug Moore
d1d3f7e1d1 Revert r349393, which leads to an assertion failure on bootup, in vm_map_stack_locked.
Reported by: ler@lerctr.org
Approved by: kib, markj (mentors, implicit)
2019-06-26 03:12:57 +00:00
Doug Moore
52499d1739 Eliminate some uses of the prev and next fields of vm_map_entry_t.
Since the only caller to vm_map_splay is vm_map_lookup_entry, move the
implementation of vm_map_splay into vm_map_lookup_helper, called by
vm_map_lookup_entry.

vm_map_lookup_entry returns the greatest entry less than or equal to a
given address, but in many cases the caller wants the least entry
greater than or equal to the address and uses the next pointer to get
to it. Provide an alternative interface to lookup,
vm_map_lookup_entry_ge, to provide the latter behavior, and let
callers use one or the other rather than having them use the next
pointer after a lookup miss to get what they really want.

In vm_map_growstack, the caller wants an entry that includes a given
address, and either the preceding or next entry depending on the value
of eflags in the first entry. Incorporate that behavior into
vm_map_lookup_helper, the function that implements all of these
lookups.

Eliminate some temporary variables used with vm_map_lookup_entry, but
inessential.

Reviewed by: markj (earlier version)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20664
2019-06-25 20:25:16 +00:00
Doug Moore
18cd8bb800 vm_map_protect may return an INVALID_ARGUMENT or PROTECTION_FAILURE
error response after clipping the first map entry in the region to be
reserved. This creates a pair of matching entries that should have
been "simplified" back into one, or never created. This change defers
the clipping of that entry until those two vm_map_protect failure
cases have been ruled out.

Reviewed by: alc
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20711
2019-06-25 07:44:37 +00:00
Brooks Davis
74a1b66cf4 Extend mmap/mprotect API to specify the max page protections.
A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions.  If
present, these flags specify the maximum permissions.

While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.

This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete.  This complements W^X protections allowing more
precise control by the programmer.

This change alters mprotect argument checking and returns an error when
unhandled protection flags are set.  This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.

In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1.  PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation.  A
final version this is expected to provide per-binary and per-process
opt-in/out options and this sysctl will go away in its current form.
As such it is undocumented.

Reviewed by:	emaste, kib (prior version), markj
Additional suggestions from:	alc
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18880
2019-06-20 18:24:16 +00:00
Mark Johnston
ee1f168540 Group vm_page_activate()'s definition with other related functions.
No functional change intended.

MFC after:	3 days
2019-06-19 21:36:00 +00:00
Doug Moore
4766eba1df Critical comments were lost in r349203. This patch seeks to restore
the lost information in new comments.

Reported by: alc
Reviewed by: alc
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20632
2019-06-15 04:30:13 +00:00
Doug Moore
771315283b Avoid using the prev field of vm_map_entry_t in two functions that
iterate over consecutive vm_map entries, and that can easily just
'remember' the prev value instead of looking it up.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20628
2019-06-14 03:15:54 +00:00
Doug Moore
af1d6d6a11 Create a function for creating objects to back map entries, and one
for giving cred to a map entry backed by an object, and use them
instead of the code duplicated inline now.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20370
2019-06-13 20:09:07 +00:00
Doug Moore
e65d58a0fe To test to see if a free space is big enough compare the required
length to the difference of the two offsets that define the gap, to
avoid overflow, rather that adding the length to an offset and
comparing that to another offset.

This addresses an overflow issue reported by Peter Holm on i386.

Reported by: pho
Tested by: pho
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20594
2019-06-11 22:41:39 +00:00
Doug Moore
f8c8b2e8a0 r348879 introduced a wrong-way comparison that broke mmap.
This change rights that comparison.

Reported by: pho
Approved by: markj (mentor)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D20595
2019-06-10 22:06:40 +00:00
Doug Moore
5a0879da80 The computations of vm_map_splay_split and vm_map_splay_merge touch both
children of every entry on the search path as part of updating values of
the max_free field. By comparing the max_free values of an entry and its
child on the search path, the code can avoid accessing the child off the
path in cases where the max_free value decreases along the path.

Specifically, this patch changes splay_split so that the max_free field
of every entry on the search path is replaced, temporarily, by the
max_free field from its child not on the search path or, if the child
in that direction is NULL, then a difference between start and end
values of two pointers already available in the split code, without
following any next or prev pointers. However, to find that max_free
value does not require looking toward that other child if either the
child on the search path has a lower max_free value, or the current max_free
value is zero, because in either case we know that the value of max_free for
the other child is the value we already have. So, the changes to
vm_entry_splay_split make sure that we know all the off-search-path entries
we will need to complete the splay, without looking at all of them. There is
an exception at the bottom of the search path where we cannot rely on the
max_free value in the direction of the NULL pointer that ends the search,
because of the behavior of entry-clipping code.

The corresponding change to vm_splay_entry_merge makes it simpler, since it's
just reversing pointers and updating running maxima.

In a test intended to exercise vigorously the vm_map implementation, the
effect of this change was to reduce the data cache miss rate by 10-14% and
the running time by 5-7%.

Tested by: pho
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19826
2019-06-10 21:34:07 +00:00
Doug Moore
77555b849d Change the check for 'size' wrapping around to zero in kern_mmap to account
for both the lower and upper bound modifications. Change the error returned
to ENOMEM. Rename the parameter size to len and make size a local variable
that stores the value of len after it has been modified.

This addresses concerns expressed by Bruce Evans after r348843.

Reported by: brde@optusnet.com.au
Reviewed by: kib, markj (mentors)
MFC after: 3 days
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20592
2019-06-10 21:26:14 +00:00
John Baldwin
0b96ca3310 Remove an overly-aggressive assertion.
While it is true that the new vmspace passed to vmspace_switch_aio
will always have a valid reference due to the AIO job or the extra
reference on the original vmspace in the worker thread, it is not true
that the old vmspace being switched away from will have more than one
reference.

Specifically, when a process with queued AIO jobs exits, the exit hook
in aio_proc_rundown will only ensure that all of the AIO jobs have
completed or been cancelled.  However, the last AIO job might have
completed and woken up the exiting process before the worker thread
servicing that job has switched back to its original vmspace.  In that
case, the process might finish exiting dropping its reference to the
vmspace before the worker thread resulting in the worker thread
dropping the last reference.

Reported by:	np
Reviewed by:	alc, markj, np, imp
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D20542
2019-06-10 19:01:54 +00:00
Doug Moore
97220a279f There are times when a len==0 parameter to mmap is okay. But on a
32-bit machine, a len parameter just a few bytes short of 4G, rounded
up to a page boundary and hitting zero then, is not okay. Return
failure in that case.

Reported by: pho
Reviewed by: alc, kib (mentor)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D20580
2019-06-10 03:07:10 +00:00
Konstantin Belousov
452a2db863 Style MAP_ENTRY_ and MAP_ definitions.
Spell all bits in the hex constants.
Since all lines are modified, consistently use <tab> after #define.

Reviewed by:	alc (previous version), dougm
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D20560
2019-06-08 20:28:04 +00:00
Doug Moore
7c022327ab Simple code refactoring originally in D13484.
Extract swp_pager_force_dirty() and swp_pager_force_launder() out of
swp_pager_force_pagein().

Extract swap_pager_swapoff_object() out of swap_pager_swapoff().

Submitted by: ota_j.email.ne.jp
Reviewed by: alc, dougm
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20545
2019-06-08 17:49:17 +00:00
Mark Johnston
88ea538a98 Replace uses of vm_page_unwire(m, PQ_NONE) with vm_page_unwire_noq(m).
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone.  But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place.  No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20470
2019-06-07 18:23:29 +00:00
Alexander Motin
3b2f2cb8e9 Allow UMA hash tables to expand faster then 2x in 20 seconds.
ZFS ABD allocates tons of 4KB chunks via UMA, requiring huge hash tables.
With initial hash table size of only 32 elements it takes ~20 expansions
or ~400 seconds to adapt to handling 220GB ZFS ARC.  During that time not
only the hash table is highly inefficient, but also each of those expan-
sions takes significant time with the lock held, blocking operation.

On my test system with 256GB of RAM and ZFS pool of 28 HDDs this change
reduces time needed to first time read 240GB from ~300-400s, during which
system is quite busy and unresponsive, to only ~150s with light CPU load
and just 5 sub-second CPU spikes to expand the hash table.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-06 23:57:28 +00:00
Doug Moore
f96e8a0bab The means of finding ranges of free pages was changed for
vm_reserv_break in r348484, and there was found to improve performance
minutely and reduce code size. This change applies a similar change to
vm_reserv_reclaim_config, expecting similar benefits. This change also
allows quick rejection of page ranges that are unsuitable on account
of alignment or boundary issues, where those issues are processed a
page at a time in the current implementation.  For contrived test
cases, this can make finding a reservation satisfying a major
alignment requirement around 30 times faster.

Tested by: pho
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20274
2019-06-06 16:28:34 +00:00
Mark Johnston
fbd9585915 Add sysctls for uma_kmem_{limit,total}.
Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20514
2019-06-06 16:26:58 +00:00
Mark Johnston
058f0f7464 Remove the volatile qualifer from uma_kmem_total.
No functional change intended.

Reviewed by:	alc, dougm, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20514
2019-06-06 16:23:44 +00:00
Konstantin Belousov
32d2014dde In vm_map_entry_set_vnode_text(), tolerate tmpfs mappings for which
vnode is no longer resident.

Mapping of tmpfs file does not bump use count on the vnode, because
backing object has swap type.  As result, even during normal
operations, and of course on forced unmount, we might end up with text
mapping from tmpfs node which has no vnode in memory.  In this case,
there is no v_writecount to clear (this was done during reclaim), and
no reason to assert that the vnode is present.

Restructure the code to silently ignore OBJ_SWAP objects with
OBJ_TMPFS_NODE flag set, but OBJ_TMPFS flag clear.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-05 20:21:17 +00:00
Mark Johnston
2d2748710a Remove an outdated header comment for vm_page.c.
The listed rules were incomplete and outdated.  There is a much more
comprehensive comment in vm_page.h.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20503
2019-06-04 18:38:27 +00:00
Konstantin Belousov
21d7728498 Remove dead store.
sw_flags is set to the function argument several lines later.

Reported by:	danfe using PVS-studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-06-03 15:19:11 +00:00
Alan Cox
2d5039db18 Retire vm_reserv_extend_{contig,page}(). These functions were introduced
as part of a false start toward fine-grained reservation locking.  In the
end, they were not needed, so eliminate them.

Order the parameters to vm_reserv_alloc_{contig,page}() consistently with
the vm_page functions that call them.

Update the comments about the locking requirements for
vm_reserv_alloc_{contig,page}().  They no longer require a free page
queues lock.

Wrap several lines that became too long after the "req" and "domain"
parameters were added to vm_reserv_alloc_{contig,page}().

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20492
2019-06-03 05:15:36 +00:00
Mark Johnston
d842aa5114 Add a vm_page_wired() predicate.
Use it instead of accessing the wire_count field directly.  No
functional change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20485
2019-06-02 01:00:17 +00:00
Doug Moore
b8590dae50 The function vm_phys_free_contig invokes vm_phys_free_pages for every
power-of-two page block it frees, launching an unsuccessful search for
a buddy to pair up with each time.  The only possible buddy-up mergers
are across the boundaries of the freed region, so change
vm_phys_free_contig simply to enqueue the freed interior blocks, via a
new function vm_phys_enqueue_contig, and then call vm_phys_free_pages
on the bounding blocks to create as big a cross-boundary block as
possible after buddy-merging.

The only callers of vm_phys_free_contig at the moment call it in
situations where merging blocks across the boundary is clearly
impossible, so just call vm_phys_enqueue_contig in those places and
avoid trying to buddy-up at all.

One beneficiary of this change is in breaking reservations.  For the
case where memory is freed in breaking a reservation with only the
first and last pages allocated, the number of cycles consumed by the
operation drops about 11% with this change.

Suggested by: alc
Reviewed by: alc
Approved by: kib, markj (mentors)
Differential Revision: https://reviews.freebsd.org/D16901
2019-05-31 21:02:42 +00:00
Mark Johnston
42447bb506 Remove a redundant vm_page_remove() call.
vm_page_free_prep() removes the page from its object.  No functional
change intended.

Reviewed by:	alc, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D20469
2019-05-31 14:59:40 +00:00
Gleb Smirnoff
4a9f6ba75b In r343857 the referred comment moved to uma_vm_zone_stats(). 2019-05-29 22:33:37 +00:00
Doug Moore
e67a5068ec Reduce the code size and number of ffsl calls in vm_reserv_break. Use
xor to find where free ranges begin and end.

Tested by: pho
Reviewed by:alc
Approved by:markj, kib (mentors)
Differential Revision:	https://reviews.freebsd.org/D20256
2019-05-28 00:51:23 +00:00
Doug Moore
73f1145140 Fix typo from r348128: _func__ -> __func__
Reported by: LINT
2019-05-23 02:10:41 +00:00
Doug Moore
fa581662af Cleanups made necessary by r348115, or reactions to it:
1. Change size_t to vm_size_t in some places.
2. Rename vm_map_entry_resize_free to drop the _free part.
3. Fix whitespace errors.
4. Fix screwups in patch-conflict-management that left out important
changes related to growing and shrinking objects.

Reviewed by: alc
Approved by: kib (mentor)
2019-05-22 23:11:16 +00:00
Doug Moore
1895f5202a Passing a parameter to vm_map_entry_resize_free that describes the
amount of resizing reduces the number of functions changing the vm_map
invariants regarding the max_free field of map entries.

Reviewed by: markj (mentor)
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20356
2019-05-22 17:40:54 +00:00
Conrad Meyer
daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Mark Johnston
ccc5d6dd97 Use M_NEXTFIT in memguard(9).
memguard(9) wants to avoid reuse of freed addresses for as long as
possible.  Previously it maintained a racily updated cursor which was
passed to vmem_xalloc(9) as the minimum address.  However, vmem will
not in general return the lowest free address in the arena, so this
trick only really works until the cursor has wrapped around the first
time.

Reported by:	alc
Reviewed by:	alc
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D17227
2019-05-18 02:02:14 +00:00
Mark Johnston
8cd6a80d7d Restore the pre-r347532 behaviour of ignoring wiring failures in mmap().
The error handling added in r347532 is not right when mapping vnodes
and will be fixed separately.

Reported by:	syzbot+1d2cc393bd6c88a548be@syzkaller.appspotmail.com
MFC with:	r347532
2019-05-13 18:40:01 +00:00
Mark Johnston
54a3a11421 Provide separate accounting for user-wired pages.
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes.  User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.

The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2).  Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks.  In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process.  The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.

The choice to count virtual user-wired pages rather than physical
pages was done for simplicity.  There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.

The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded.  For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.

Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM.  Users that wish to exceed the limit must tune
vm.max_user_wired.

Reviewed by:	kib, ngie (mlock() test changes)
Tested by:	pho (earlier version)
MFC after:	45 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19908
2019-05-13 16:38:48 +00:00
Doug Moore
87ae0686a2 A new parameter to blist_alloc specifies an upper bound on the size of
the allocation request, so that the blocks allocated are from the next
set of free blocks big enough to satisfy the minimum requirements of
the request, and the number of blocks allocated are as many as
possible, up to the specified maximum. The implementation of
swp_pager_getswapspace uses this parameter to ask for a number of
blocks between the new halved request size and the previous failed
request size. Thus a request for 32 blocks may fail, but instead of
getting only 16 blocks instead, the caller asks for 16 to 31 next, and
might get 19 or 27, which is closer to what they originally wanted.

I expect this to lead to bigger block allocations and less block
fragmentation, at least in some cases.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20001
2019-05-11 16:15:13 +00:00
Doug Moore
48e98a2afc Callers of swp_pager_getswapspace get either as many blocks as they
requested, or none, and in the latter case it is up to them to pick a
smaller request to make - which they always do by halving the failed
request. This change to swp_pager_getswapspace leaves the task of
downsizing the request to the function and not its caller. It still
does so by halving the original request.

Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D20228
2019-05-11 10:16:43 +00:00
Konstantin Belousov
12487941f4 Noted by: alc
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2019-05-06 08:46:11 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Konstantin Belousov
7f1446052f Do not collapse objects with OBJ_NOSPLIT backing swap object.
NOSPLIT swap objects are not anonymous, they are used by tmpfs regular
files and POSIX shared memory.  For such objects, collapse is not
permitted.

Reported by:	mjg
Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:06:19 +00:00
Doug Moore
64f8d2575a fls() should find the most significant bit of an int faster than a
linear search can, so use it to avoid a linear search in isqrt.

Approved by: kib (mentor), markj (mentor)
Differential Revision: https://reviews.freebsd.org/D20102
2019-05-03 02:55:54 +00:00
Konstantin Belousov
19f5d9f27f Fix another race between vm_map_protect() and vm_map_wire().
vm_map_wire() increments entry->wire_count, after that it drops the
map lock both for faulting in the entry' pages, and for marking next
entry in the requested region as IN_TRANSITION. Only after all entries
are faulted in, MAP_ENTRY_USER_WIRE flag is set.

This makes it possible for vm_map_protect() to run while other entry'
MAP_ENTRY_IN_TRANSITION flag is handled, and vm_map_busy() lock does
not prevent it. In particular, if the call to vm_map_protect() adds
VM_PROT_WRITE to CoW entry, it would fail to call
vm_fault_copy_entry(). There are at least two consequences of the
race: the top object in the shadow chain is not populated with
writeable pages, and second, the entry eventually get contradictory
flags MAP_ENTRY_NEEDS_COPY | MAP_ENTRY_USER_WIRED with VM_PROT_WRITE
set.

Handle it by waiting for all MAP_ENTRY_IN_TRANSITION flags to go away
in vm_map_protect(), which does not drop map lock afterwards. Note
that vm_map_busy_wait() is left as is.

Reported and tested by:	pho (previous version)
Reviewed by:	Doug Moore <dougm@rice.edu>, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20091
2019-05-01 13:15:06 +00:00
Mark Johnston
c4e5de7e75 Disable vm map consistency checking by default on INVARIANTS kernels.
The checks are too expensive for a general-purpose kernel.  Enable the
checks when DIAGNOSTIC is defined and provide a sysctl to enable the
checks in a non-DIAGNOSTIC INVARIANTS kernel.

Reviewed by:	kib
Discussed with:	Doug Moore <dougm@rice.edu>
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19999
2019-04-22 11:23:35 +00:00
Tycho Nightingale
323ad38632 for a cache-only zone the destructor tries to destroy a non-existent keg
Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19835
2019-04-12 12:46:25 +00:00
Konstantin Belousov
a5a02ef49f Fix mis-merge.
Amusingly, it is nop.

Noted by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-rev:	r345702
2019-04-05 16:12:35 +00:00
Konstantin Belousov
9f70117263 Eliminate adj_free field from vm_map_entry.
Drop the adj_free field from vm_map_entry_t. Refine the max_free field
so that p->max_free is the size of the largest gap with one endpoint
in the subtree rooted at p. Change vm_map_findspace so that, first,
the address-based splay is restricted to tree nodes with large-enough
max_free value, to avoid searching for the right starting point in a
subtree where all the gaps are too small. Second, when the address
search leads to a tree search for the first large-enough gap, that gap
is the subject of a splay-search that brings the gap to the top of the
tree, so that an immediate insertion will take constant time.

Break up the splay code into separate components, one for searching
and breaking up the tree and another for reassembling it. Use these
components, and not splay itself, for linking and unlinking. Drop the
after-where parameter to link, as it is computed as a side-effect of
the splay search.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj
Tested by:	pho
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D17794
2019-03-29 16:53:46 +00:00
Edward Tomasz Napierala
0b208315f4 Improve error reporting when the swap pager runs out of memory.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D19699
2019-03-26 19:11:15 +00:00
Konstantin Belousov
5019dac98a ASLR: check for max_addr after applying randomization, not before.
Otherwise resulting address from vm_map_find() migh not satisfy the
upper limit.  For instance, it could affect MAP_32BIT flag from 64bit
processes.

Found by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, Doug Moore <dougm@rice.edu>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19688
2019-03-23 16:36:18 +00:00
Mark Johnston
64087fd7f3 Disallow preemptive creation of wired superpage mappings.
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped.  If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages.  However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails.  The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.

For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.

Reviewed by:	kib
Reported by:	syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19670
2019-03-21 19:52:50 +00:00
Konstantin Belousov
45d72c7d7f vm_fault_copy_entry: accept invalid source pages.
Either msync(MS_INVALIDATE) or the object unlock during vnode
truncation can expose invalid pages backing wired entries.  Accept
them, but do not install them into destrination pmap.  We must create
copied pages in the copy case, because e.g. vm_object_unwire() expects
that the entry is fully backed.

Reported by:	syzkaller, via emaste
Reported by:	syzbot+514d40ce757a3f8b15bc@syzkaller.appspotmail.com
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19615
2019-03-20 13:07:57 +00:00
Mark Johnston
3b5b20292b Implement minidump support for RISC-V.
Submitted by:	Mitchell Horne <mhorne063@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D18320
2019-03-06 00:01:06 +00:00
Mateusz Guzik
75d6d57634 vm: remove seq.h inclusion made obsolete by NUMA rewrite
Sponsored by:	The FreeBSD Foundation
2019-02-27 22:42:29 +00:00
Jason A. Harmening
40a5168449 Fix incorrect assertion in vnode_pager_generic_getpages()
Reviewed by:	kib, glebius
MFC after:	1 week
2019-02-26 04:50:46 +00:00
Mark Johnston
2b6010705c Improve vmem tuning for platforms without a direct map.
On platforms without a direct map (i.e., platforms without
UMA_MD_SMALL_ALLOC defined), the boundary tag allocator reserves a
number of tags for use when allocating a new slab of boundary tags,
as such platforms require free boundary tags in order to allocate
boundary tags.  r327899 increased the number of boundary tags required
for a KVA allocation in the worst case, and the aforementioned
reservation was not updated accordingly.  In some cases, this could
lead to a system hang.  Fix the problem by increasing this reservation.

Also reduce KVA_QUANTUM on systems lacking superpage support.
The previous import quantum (4MB with a 4KB page size) was quite large
for systems with limited KVA, and fragmentation in kernel_arena could
cause kernel memory allocation failures even with a substantial amount
of free KVA.

Reported and tested by:	jhibbits
Reviewed by:	alc, kib
No objections:	jeff
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19337
2019-02-25 19:22:13 +00:00
Mark Johnston
46e39081f4 Clear pointers to indicate that the respective locks are released.
This fixes a problem in r344231: vm_pageout_launder() may scan two
queues when swap is disabled.

Reported by:	pho
MFC with:	r344231
2019-02-21 15:44:32 +00:00
Konstantin Belousov
e7a9df16e6 Add kernel support for Intel userspace protection keys feature on
Skylake Xeons.

See SDM rev. 68 Vol 3 4.6.2 Protection Keys and the description of the
RDPKRU and WRPKRU instructions.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D18893
2019-02-20 09:51:13 +00:00
Mark Johnston
602566044a Remove a redundant flag variable.
Use the object pointer itself to determine whether the object is locked.
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19215
2019-02-17 16:35:19 +00:00
Gleb Smirnoff
66fb0b1ad7 For 32-bit machines rollback the default number of vnode pager pbufs
back to the lever before r343030.  For 64-bit machines reduce it slightly,
too.  Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.

Provide a loader tunable to change vnode pager pbufs count. Document it.
2019-02-15 23:36:22 +00:00
Konstantin Belousov
484e9d0322 Make anon clustering more compatible.
Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.

Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster.  Default value is 1.

Requested by: peter
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19194
2019-02-14 15:45:53 +00:00
Mark Johnston
f6893f09d5 Implement transparent 2MB superpage promotion for RISC-V.
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.

The changes are largely modelled after amd64.  arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page.  RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED.  Instead, pmap_remove_l2() always invalidates the entire
2MB address range.

pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits.  Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.

Reviewed by:	kib (earlier version)
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18863
Differential Revision:	https://reviews.freebsd.org/D18864
Differential Revision:	https://reviews.freebsd.org/D18865
Differential Revision:	https://reviews.freebsd.org/D18866
Differential Revision:	https://reviews.freebsd.org/D18867
Differential Revision:	https://reviews.freebsd.org/D18868
2019-02-13 17:19:37 +00:00
Pedro F. Giffuni
6929b7d1ab UMA: unsign some variables related to allocation in hash_alloc().
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.

While here also unsign uh_hashmask, it makes little sense to keep that
signed.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19148
2019-02-12 04:33:05 +00:00
Konstantin Belousov
f6d281e8aa struct xswdev on amd64 requires compat32 shims after ino64.
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-10 19:01:05 +00:00
Konstantin Belousov
fa50a3552d Implement Address Space Layout Randomization (ASLR)
With this change, randomization can be enabled for all non-fixed
mappings.  It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory.  This
implementation is relatively small and happens at the correct
architectural level.  Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection.  It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation.  The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in.  The initial location for the anon group range
is, of course, randomized.  This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386.  This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs.  Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
  to force ASLR off for the given binary.  (A tool to edit the feature
  control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR:	208580 (exp runs)
Exp-runs done by:	antoine
Reviewed by:	markj (previous version)
Discussed with:	emaste
Tested by:	pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D5603
2019-02-10 17:19:45 +00:00
Konstantin Belousov
5dddee2d65 i386: honor kern.elf32.read_exec for ommap(2) and break(2), as already
done on amd64.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-09 03:56:48 +00:00
Konstantin Belousov
a7f67facdf Normalize the declaration of i386_read_exec variable.
It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable.  Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.

Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-02-09 03:51:51 +00:00
Gleb Smirnoff
ad66f95865 Now that there is only one way to allocate a slab, remove uz_slab method.
Discussed with:	jeff
2019-02-07 03:55:05 +00:00
Gleb Smirnoff
b47acb0a4d Report cache zones in UMA stats sysctl, that 'vmstat -z' uses. This
should had been part of r251826.
2019-02-07 03:32:45 +00:00
Konstantin Belousov
d22ff6e6a2 contigmalloc: handle M_EXEC.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D19092
2019-02-07 02:00:23 +00:00
Mark Johnston
1e2b3e6f92 Allow vm_page_free_prep() to dequeue pages without the page lock.
This is a step towards being able to free pages without the page
lock held.  The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held.  Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.

No functional change intended.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19065
2019-02-03 18:43:20 +00:00
Mark Johnston
d0488e698f Fix a race in vm_page_dequeue_deferred().
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order.  Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued.  Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.

Reviewed by:	kib
Tested by:	pho
MFC after:	3 days
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19039
2019-02-03 18:38:58 +00:00
Alexander Motin
59568a0e52 Fix integer math overflow in UMA hash_alloc().
512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items.  To manage
them UMA tries to allocate 2GB hash table, which size does not fit into
the int variable, causing later allocation failure, which makes ARC shrink
back below the 512GB, not letting it to use more RAM.  With this change I
easily reached >700GB ARC size on 768GB RAM machine.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-02-02 04:11:59 +00:00
Gleb Smirnoff
37125720b9 In zone_alloc_bucket() max argument was calculated based on uz_count.
Then bucket_alloc() also selects bucket size based on uz_count. However,
since zone lock is dropped, uz_count may reduce. In this case max may
be greater than ub_entries and that would yield into writing beyond end
of the allocation.

Reported by:	pho
2019-01-31 17:52:48 +00:00
Mark Johnston
862203935e Correct uma_prealloc()'s use of domainset iterators after r339925.
The iterator should be reinitialized after every successful slab
allocation.  A request to advance the iterator is interpreted as
an allocation failure, so a sufficiently large preallocation would
cause the iterator to believe that all domains were exhausted,
resulting in a sleep with the keg lock held. [1]

Also, keg_alloc_slab() should pass the unmodified wait flag to the
item initialization routine, which may use it to perform allocations
from other zones.

Reported and tested by:	slavah
Diagnosed by:	kib [1]
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-01-23 18:58:15 +00:00
Konstantin Belousov
f2a496d667 MI VM: Make it possible to set size of superpage at boot instead of compile time.
In order to allow single kernel to use PAE pagetables on i386 if
hardware supports it, and fall back to classic two-level paging
structures if not, superpage code should be able to adopt to either 2M
or 4M superpages size.  There I make MI VM structures large enough to
track the biggest possible superpage, by allowing architecture to
define VM_NFREEORDER_MAX and VM_LEVEL_0_ORDER_MAX constants.
Corresponding VM_NFREEORDER and VM_LEVEL_0_ORDER symbols can be
defined as runtime values and must be less than the _MAX constants.
If architecture does not define _MAXs, it is assumed that _MAX ==
normal constant.

Reviewed by:	markj
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18853
2019-01-18 13:35:06 +00:00
Gleb Smirnoff
46b0292a82 Do not reserve KVA for paging bufs in vm_ksubmap_init(), since now
they allocate it in pbuf_init(). This should have been done together
with r343030.
2019-01-16 20:14:16 +00:00
Konstantin Belousov
ea7e7006db Implement shmat(2) flag SHM_REMAP.
Based on the description in Linux man page.

Reviewed by:	markj, ngie (previous version)
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18837
2019-01-16 05:15:57 +00:00
Gleb Smirnoff
b68d692a3d Whitespace. 2019-01-16 04:02:08 +00:00
Gleb Smirnoff
396694153f Fix compilation failures on different arches that have vm_machdep.c not
aware of counter_u64_t by including counter.h into uma_int.h. I'm not
happy about this inclusion, but it fixes compilation ASAP.
2019-01-15 19:33:47 +00:00
Gleb Smirnoff
e7e4bcd856 style(9): break long line. 2019-01-15 18:50:11 +00:00
Gleb Smirnoff
f8c86a5fde Remove harmless leftover from code that cycles over zone's kegs. Just use +
instead of +=. There is no functional change.
2019-01-15 18:49:31 +00:00
Gleb Smirnoff
bb45b411e2 Only do uz_items accounting for zones that have a limit set in uz_max_items.
This reduces amount of locking required for these zones.

Also, for cache only zones (UMA_ZFLAG_CACHE) accounting uz_items wasn't
correct at all, since they may allocate items directly from their backing
store and then free them via UMA underflowing uz_items.

Tested by:	pho
2019-01-15 18:32:26 +00:00
Gleb Smirnoff
2efcc8cbca Make uz_allocs, uz_frees and uz_fails counter(9). This removes some
atomic updates and reduces amount of data protected by zone lock.

During startup point these fields to EARLY_COUNTER. After startup
allocate them for all early zones.

Tested by:	pho
2019-01-15 18:24:34 +00:00
Gleb Smirnoff
5a8eee2bb4 Fix compilation on 32-bit. 2019-01-15 03:43:46 +00:00
Gleb Smirnoff
756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Gleb Smirnoff
bb15d1c778 o Move zone limit from keg level up to zone level. This means that now
two zones sharing a keg may have different limits. Now this is going
  to work:

  zone = uma_zcreate();
  uma_zone_set_max(zone, limit);
  zone2 = uma_zsecond_create(zone);
  uma_zone_set_max(zone2, limit2);

  Kegs no longer have uk_maxpages field, but zones have uz_items. When
  set, it may be rounded up to minimum possible CPU bucket cache size.
  For small limits bucket cache can also be reconfigured to be smaller.
  Counter uz_items is updated whenever items transition from keg to a
  bucket cache or directly to a consumer. If zone has uz_maxitems set and
  it is reached, then we are going to sleep.

o Since new limits don't play well with multi-keg zones, remove them. The
  idea of multi-keg zones was introduced exactly 10 years ago, and never
  have had a practical usage. In discussion with Jeff we came to a wild
  agreement that if we ever want to reintroduce the idea of a smart allocator
  that would be able to choose between two (or more) totally different
  backing stores, that choice should be made one level higher than UMA,
  e.g. in malloc(9) or in mget(), or whatever and choice should be controlled
  by the caller.

o Sleeping code is improved to account number of sleepers and wake them one
  by one, to avoid thundering herd problem.

o Flag UMA_ZONE_NOBUCKETCACHE removed, instead uma_zone_set_maxcache()
  KPI added. Having no bucket cache basically means setting maxcache to 0.

o Now with many fields added and many removed (no multi-keg zones!) make
  sure that struct uma_zone is perfectly aligned.

Reviewed by:	markj, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D17773
2019-01-15 00:02:06 +00:00
Gleb Smirnoff
9cc36b3dab Fix regression in r331368, that broke dumping of UMA startup pages
when WITNESS is present.

Discussed with:	markj
2019-01-07 23:17:09 +00:00
Konstantin Belousov
3fbc2e00d1 Add a tunable which changes mincore(2) algorithm to only report data
from the local mapping.

Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161

Reviewed by:	markj
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18764
2019-01-07 22:10:48 +00:00
Konstantin Belousov
7af4985245 Add 'v' modifier to the ddb 'show pginfo' command to display vm_page
backing the provided kernel virtual address.

Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-30 15:58:18 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mateusz Guzik
83764b446a vm: use fcmpset for vmspace reference counting
Sponsored by:	The FreeBSD Foundation
2018-12-07 16:22:54 +00:00
Brooks Davis
d48719bd96 Normalize COMPAT_43 syscall declarations.
Have ogetkerninfo, ogetpagesize, ogethostname, osethostname, and oaccept
declare o<foo>_args structs rather than non-compat ones. Due to a
failure to use NOARGS in most cases this adds only one new declaration.

No changes required in freebsd32 as only ogetpagesize() is implemented
and it has a 32-bit specific implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15816
2018-12-04 16:48:47 +00:00
Konstantin Belousov
10d9120c44 Change the vm_ooffset_t type to unsigned.
The type represents byte offset in the vm_object_t data space, which
does not span negative offsets in FreeBSD VM.  The change matches byte
offset signess with the unsignedness of the vm_pindex_t which
represents the type of the page indexes in the objects.

This allows to remove the UOFF_TO_IDX() macro which was used when we
have to forcibly interpret the type as unsigned anyway.  Also it fixes
a lot of implicit bugs in the device drivers d_mmap methods.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2018-12-02 13:16:46 +00:00
Konstantin Belousov
a823302783 Allow to create swap zone larger than v_page_count / 2.
If user configured the maxswapzone tunable, just take the literal
value for the initial zone sizing attempt.  Before, it was only
possible to reduce the zone by the tunable.

While there, correct the message which was not correct when zone
creation rounded the size up.

Reported by:	jmg
Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D18381
2018-12-01 16:50:12 +00:00
Eric van Gyzen
5e38e3f5eb Include path for tmpfs objects in vm.objects sysctl
This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by:	kib, dab
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:59:43 +00:00
Eric van Gyzen
0951bd362c Add assertions and comment to vm_object_vnode()
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D2724
2018-11-30 04:18:31 +00:00
Mark Johnston
e31fc3ab13 Update the free page count when blacklisting pages.
Otherwise the free page count will not accurately reflect the physical
page allocator's state.  On 11 this can trigger panics in
vm_page_alloc() since the allocator state and free page count are
updated atomically and we expect them to stay in sync.  On 12 the
bug would manifest as threads looping in vm_page_alloc().

PR:		231296
Reported by:	mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by:	alc, kib, mav
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18374
2018-11-29 16:31:01 +00:00
Gleb Smirnoff
0b2e3aead3 Fix yet another edge case in uma_startup_count(). If zone size fits into
several pages, but leaves no space for struct uma_slab at the end we
miscalculate number of pages by one. Totally mimic keg_large_init() math
here to cover that problem.

Reported by:	gallatin
2018-11-28 19:54:02 +00:00
Gleb Smirnoff
3d5e3df73f For not offpage zones the slab is placed at the end of page. Keg's uk_pgoff
is calculated to guarantee that struct uma_slab is placed at pointer size
alignment. Calculation of real struct uma_slab size is done in keg_ctor()
and yet again in keg_large_init(), to check if we need an extra page. This
calculation can actually be performed at compile time.

- Add SIZEOF_UMA_SLAB macro to calculate size of struct uma_slab placed at
  an end of a page with alignment requirement.
- Use SIZEOF_UMA_SLAB in keg_ctor() and in keg_large_init(). This is a not
  a functional change.
- Use SIZEOF_UMA_SLAB in UMA_SLAB_SPACE definition and in keg_small_init().
  This is a potential bugfix, but in reality I don't think there are any
  systems affected, since compiler aligns struct uma_slab anyway.
2018-11-28 19:17:27 +00:00
Konstantin Belousov
6e00f3a311 Avoid unneeded check in vmspace_alloc().
All vmspace_alloc() callers know which kind of pmap they allocate.

Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D18329
2018-11-25 17:56:49 +00:00
Ben Widawsky
f82dd310bb linuxkpi: Use pageproc instead of vmproc
According to markj@:
pageproc contains the page daemon and laundry threads, which are
responsible for managing the LRU page queues and writing back dirty
pages.  vmproc's main task is to swap out kernel stacks when the system
is under memory pressure, and swap them back in when necessary.  It's a
somewhat legacy component of the system and isn't required.  You can
build a kernel without it by specifying "options NO_SWAPPING" (which is
a somewhat misleading name), in which vm_swapout_dummy.c is compiled
instead of vm_swapout.c.

Based on this, we want pageproc to emulate kswapd, not vmproc.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D18061
2018-11-21 04:34:18 +00:00
Ben Widawsky
c3f4f28c63 linuxkpi: Add some basic swap functions
These are used by kms-drm to determine various heuristics relate
memory conditions.

The number of free swap pages is just a variable, and it can be
much cheaper by either adding a new getter, or simply extern'ing
swap_total. However, this patch opts to use the more expensive,
existing interface - since this isn't an operation in a high per
path.

This allows us to remove some more gpl linuxkpi and do the follo
kms-drm:
git rm linuxkpi/gplv2/include/linux/swap.h

Reviewed by:    mmacy, Johannes Lundberg <johalun0@gmail.com>
Approved by:    emaste (mentor)
Differential Revision:  https://reviews.freebsd.org/D18052
2018-11-20 22:49:19 +00:00
Alan Cox
541a117532 Use swp_pager_isondev() throughout. Submitted by: ota@j.email.ne.jp
Change swp_pager_isondev()'s return type to bool.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16712
2018-11-19 17:17:23 +00:00
Alan Cox
92e78c1012 Tidy up vm_map_simplify_entry() and its recently introduced helper
functions.  Notably, reflow the text of some comments so that they
occupy fewer lines, and introduce an assertion in one of the new
helper functions so that it is not misused by a future caller.

In collaboration with:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17635
2018-11-18 01:27:17 +00:00
Mark Johnston
0f9b7bf37a Add accounting to per-domain UMA full bucket caches.
In particular, track the current size of the cache and maintain an
estimate of its working set size.  This will be used to decide how
much to shrink various caches when the kernel attempts to reclaim
pages.  As a secondary effect, it makes statistics aggregation (done
by, e.g., vmstat -z) cheaper since sysctl_vm_zone_stats() no longer
needs to iterate over lists of cached buckets.

Discussed with:	alc, glebius, jeff
Tested by:	pho (previous version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D16666
2018-11-13 19:44:40 +00:00
Mark Johnston
0e48e06807 Re-apply r336984, reverting r339934.
r336984 exposed the bug fixed in r340241, leading to the initial revert
while the bug was being hunted down.  Now that the bug is fixed, we
can revert the revert.

Discussed with:	alc
MFC after:	3 days
2018-11-10 20:33:08 +00:00
Mark Johnston
150d384e5c Fix a use-after-free in swp_pager_meta_free().
This was introduced in r326329 and explains the crashes mentioned in
the commit log message for r339934.  In particular, on INVARIANTS
kernels, UMA trashing causes the loop to exit early, leaving swap
blocks behind when they should have been freed.  After r336984 this
became more problematic since new anonymous mappings were more
likely to reuse swapped-out subranges of existing VM objects, so faults
would trigger pageins of freed memory rather than returning zeroed
pages.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17897
2018-11-07 23:28:11 +00:00
Mark Johnston
07702f72e5 Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map.
These submaps are used for mapping pipe buffers and execv() argument
strings respectively, so there's no need for such mappings to have
execute permissions.

Reported by:	jhb
Reviewed by:	alc, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17827
2018-11-06 21:57:03 +00:00
Mark Johnston
8002c3a495 Initialize last_target in the laundry thread control loop.
In practice it is always initialized because nfreed must be positive
in order to trigger background laundering, but this isn't obvious.

CID:		1387997
MFC after:	1 week
2018-11-06 02:52:54 +00:00
Mark Johnston
2203c46d87 Initialize the eflags field of vm_map headers.
Initializing the eflags field of the map->header entry to a value with a
unique new bit set makes a few comparisons to &map->header unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib
Tested by:	pho
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14005
2018-11-02 16:26:44 +00:00
Mark Johnston
5d277a85ad Revert r336984.
It appears to be responsible for random segfaults observed when lots
of paging activity is taking place, but the root cause is not yet
understood.

Requested by:	alc
MFC after:	now
2018-10-30 22:40:40 +00:00
Mark Johnston
9978bd996b Add malloc_domainset(9) and _domainset variants to other allocator KPIs.
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain.  The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted.  Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with:	gallatin, jeff
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17418
2018-10-30 18:26:34 +00:00
Mark Johnston
920239efde Fix some problems that manifest when NUMA domain 0 is empty.
- In uma_prealloc(), we need to check for an empty domain before the
  first allocation attempt, not after.  Fix this by switching
  uma_prealloc() to use a vm_domainset iterator, which addresses the
  secondary issue of using a signed domain identifier in round-robin
  iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
  excluding empty domains; otherwise we may frequently specify an empty
  domain when calling in to the page allocator, wasting CPU time.
  Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
  total page counts for now: some vm_phys segments are created before
  the SRAT is parsed and are thus always identified as being in domain 0
  even when they are not.  Then, when bootstrap pages are freed, they
  are added to a domain that we had previously thought was empty.  Until
  this is corrected, we simply exclude them from the per-domain page
  count.

Reported and tested by:	Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17704
2018-10-30 17:57:40 +00:00
Alan Cox
9f1abe3df4 Eliminate typically pointless calls to vm_fault_prefault() on soft, copy-
on-write faults.  On a page fault, when we call vm_fault_prefault(), it
probes the pmap and the shadow chain of vm objects to see if there are
opportunities to create read and/or execute-only mappings to neighoring
pages.  For example, in the case of hard faults, such effort typically pays
off, that is, mappings are created that eliminate future soft page faults.
However, in the the case of soft, copy-on-write faults, the effort very
rarely pays off.  (See the review for some specific data.)

Reviewed by:	kib, markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D17367
2018-10-27 17:49:46 +00:00
Mark Johnston
17fbf3cf34 Add a !NUMA definition for vm_domainset_iter_policy_ref_init().
Pointy hat:	markj
X-MFC with:	r339661
Sponsored by:	The FreeBSD Foundation
2018-10-24 17:09:20 +00:00
Mark Johnston
7571e24901 Add an #include required after r339686.
X-MFC with:	r339686
Sponsored by:	The FreeBSD Foundation
2018-10-24 16:49:16 +00:00
Mark Johnston
194a979ee9 Use a vm_domainset iterator in keg_fetch_slab().
Previously, it used a hand-rolled round-robin iterator.  This meant that
the minskip logic in r338507 didn't apply to UMA allocations, and also
meant that we would call vm_wait() for individual domains rather than
permitting an allocation from any domain with sufficient free pages.

Discussed with:	jeff
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17420
2018-10-24 16:41:47 +00:00
Mark Johnston
87ab1a10b1 Initialize static domainsets regardless of whether an SRAT is present.
Reported by:	yuripv
X-MFC with:	r339452
Sponsored by:	The FreeBSD Foundation
2018-10-23 18:07:16 +00:00
Mark Johnston
4c29d2de67 Refactor domainset iterators for use by malloc(9) and UMA.
Before this change we had two flavours of vm_domainset iterators: "page"
and "malloc".  The latter was only used for kmem_*() and hard-coded its
behaviour based on kernel_object's policy.  Moreover, its use contained
a race similar to that fixed by r338755 since the kernel_object's
iterator was being run without the object lock.

In some cases it is useful to be able to explicitly specify a policy
(domainset) or policy+iterator (domainset_ref) when performing memory
allocations.  To that end, refactor the vm_dominset_* KPI to permit
this, and get rid of the "malloc" domainset_iter KPI in the process.

Reviewed by:	jeff (previous version)
Tested by:	pho (part of a larger patch)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17417
2018-10-23 16:35:58 +00:00
Mark Johnston
b61f314290 Make it possible to disable NUMA support with a tunable.
This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration.  With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR:		231460
Reviewed by:	alc, gallatin, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17439
2018-10-22 20:13:51 +00:00
Mark Johnston
2801dd08d7 Fix the build after r339601.
I committed some patches out of order and didn't build-test one of them.

Reported by:	Jenkins, O. Hartmann <ohartmann@walstatt.org>
X-MFC with:	r339601
2018-10-22 17:19:48 +00:00
Mark Johnston
2a843ae7d9 Avoid a redundancy in a comment updated by r339601.
Reported by:	alc
X-MFC with:	r339601
2018-10-22 17:17:30 +00:00
Mark Johnston
b00581965d Swap in processes unless there's a global memory shortage.
On NUMA systems, we would not swap in processes unless all domains
had some free pages.  This is too conservative in general.  Instead,
permit swapins so long as at least one domain has free pages, and add
a kernel stack NUMA policy which ensures that we will try to allocate
kernel stack pages from any domain.

Reported and tested by:	pho, Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, kib
Discussed with:	jeff
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17304
2018-10-22 17:04:04 +00:00
Gleb Smirnoff
81c0d72c60 If we lost race or were migrated during bucket allocation for the per-CPU
cache, then we put new bucket on generic bucket cache. However, code didn't
honor UMA_ZONE_NOBUCKETCACHE flag, so potentially we could start a cache
on a zone that clearly forbids that. Fix this.

Reviewed by:	markj
2018-10-22 15:48:07 +00:00
Konstantin Belousov
17afd2beec Unindent vm_map_simplify_entry() after r339506.
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17632
2018-10-21 00:11:56 +00:00
Konstantin Belousov
074244628b Reduce code duplication in merging vm_entry neighbors.
Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17610
2018-10-20 23:08:04 +00:00
Mark Johnston
662e7fa8d9 Create some global domainsets and refactor NUMA registration.
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by:	alc, gallatin, jeff, kib
Tested by:	pho (part of a larger patch)
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17416
2018-10-20 17:36:00 +00:00
Matt Macy
e8bb589d56 eliminate locking surrounding ui_vmsize and swap reserve by using atomics
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.

Results in mmap speed up and a 70% increase in brk calls / second.

Reviewed by:	alc@, markj@, kib@
Approved by:	re (delphij@)
Differential Revision:	https://reviews.freebsd.org/D16273
2018-10-05 05:50:56 +00:00
Mark Johnston
93db904d19 Use an unsigned iterator for domain sets.
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).

Reported by:	pho
Reviewed by:	kib
Approved by:	re (rgrimes)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17374
2018-10-01 18:51:39 +00:00
Andrew Gallatin
30c5525b3c Allow empty NUMA memory domains to support Threadripper2
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2  memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
    rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
    Thanks to Jeff for suggesting this strategy.

Reviewed by:	alc, markj
Approved by:	re (gjb@)
Differential Revision:	https://reviews.freebsd.org/D1683
2018-10-01 14:14:21 +00:00
Konstantin Belousov
c62637d679 Correct vm_fault_copy_entry() handling of backing file truncation
after the file mapping was wired.

if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated.  vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace.  If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:38 +00:00
Konstantin Belousov
9f25ab83f9 In vm_fault_copy_entry(), we should not assert that entry is charged
if the dst_object is not of swap type.

It can only happen when entry does not require copy, otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where swap object was allocated in the vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.

Reported by:	andrew using syzkaller
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:11:01 +00:00
Konstantin Belousov
a60d3db15e In vm_fault_copy_entry(), collect the code to initialize a newly
allocated dst_object in a single place.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17323
2018-09-28 14:10:12 +00:00
Mark Johnston
463406ac4a Add more NUMA-specific low memory predicates.
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.

Reviewed by:	alc, jeff, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17278
2018-09-24 19:24:17 +00:00
Alan Cox
f5fbe90de4 Passing UMA_ZONE_NOFREE to uma_zcreate() for swpctrie_zone and swblk_zone is
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone.  (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)

Reviewed by:	kib, markj
Approved by:	re (gjb)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17296
2018-09-24 16:49:02 +00:00
Mark Johnston
3d14a7bb43 Ensure that "domain" is initialized when vm_ndomains == 1.
Reported by:	alc
Approved by:	re (gjb)
2018-09-24 15:32:46 +00:00
Mark Johnston
969e147aff Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned.
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.

Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.

Reported by:	alc
Reviewed by:	alc, kib (previous version)
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17249
2018-09-20 18:29:55 +00:00
Mark Johnston
25ed23cfbb Change the domain selection policy in kmem_back().
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.

PR:		231038
Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17248
2018-09-20 15:45:12 +00:00
Mark Johnston
1aed6d48a8 Move kernel vmem arena initialization to vm_kern.c.
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.

No functional change intended.

Reviewed by:	alc
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17247
2018-09-19 19:13:43 +00:00
Mateusz Guzik
c035292545 vm: check for empty kstack cache before locking
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).

Stacks are cached earlier for each main thread.

As a result the code is rarely executed, but when it is then (on boxes like
the above) it always fails. Since there are no provisions made for NUMA and
release time is approaching, just do a quick check to avoid acquiring the
lock.

Approved by:	re (kib)
2018-09-19 16:02:33 +00:00
Mark Johnston
26fe2217bf Only update the domain cursor once in keg_fetch_slab().
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor.  This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.

Reported and tested by:	Jan Bramkamp <crest@bultmann.eu>
Reviewed by:	alc, cem
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17209
2018-09-18 17:51:45 +00:00
Mateusz Guzik
2554f86a8d vm: stop taking proc lock in mmap to satisfy racct if it is disabled
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.

This eliminates second most common place of taking the lock while pkg building.

While here don't take the lock in mlockall either.

Reviewed by:	kib
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17210
2018-09-18 01:24:30 +00:00
Mark Johnston
7a364d458a Split some checks in vm_page_activate() to make it easier to read.
No functional change intended.

Reviewed by:	alc, kib
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17028
2018-09-10 18:59:23 +00:00
Mark Johnston
5a7f993702 Relax an assertion in vm_pqbatch_process_page().
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page.  In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.

Reviewed by:	alc, kib
Reported and tested by:	pho
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17027
2018-09-08 21:49:43 +00:00
Mark Johnston
c56c7299c2 Use the correct terminology.
Reported by:    kib
Approved by:	re (gjb)
Differential revision:  https://reviews.freebsd.org/D16191
2018-09-06 20:02:19 +00:00
Mark Johnston
23984ce5cd Avoid resource deadlocks when one domain has exhausted its memory. Attempt
other allowed domains if the requested domain is below the minimum paging
threshold.  Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.

Submitted by:	jeff
Reported by:	mjg
Reviewed by:	alc, kib, markj
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D16191
2018-09-06 19:28:52 +00:00
Mark Johnston
21f01f4584 Remove vm_page_remque().
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276.  As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.

Reviewed by:	kib
Tested by:	pho
Approved by:	re (marius)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17025
2018-09-06 16:17:45 +00:00
Alan Cox
72aebdd742 Recent changes have created, for the first time, physical memory segments
that can be coalesced.  To be clear, fragmentation of phys_avail[] is not
the cause.  This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time.  Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.

Reviewed by:	kib, markj
Approved by:	re (rgrimes)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16976
2018-09-02 18:29:38 +00:00
Konstantin Belousov
f0165b1ca6 Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.

Based on the submission by:	royger
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Approved by:	re (marius)
Differential revision:	https://reviews.freebsd.org/D16881
2018-08-29 12:24:19 +00:00
Mark Murray
19fa89e938 Remove the Yarrow PRNG algorithm option in accordance with due notice
given in random(4).

This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.

Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.

PR:		230870
Reviewed by:	cem
Approved by:	so(delphij,gtetlow)
Approved by:	re(marius)
Differential Revision:	https://reviews.freebsd.org/D16898
2018-08-26 12:51:46 +00:00
Alan Cox
49bfa624ac Eliminate the arena parameter to kmem_free(). Implicitly this corrects an
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().

Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions.  Use this flag to determine which
arena the kmem virtual addresses are returned to.

Eliminate UMA_SLAB_KRWX.  The introduction of VPO_KMEM_EXEC makes it
redundant.

Update the nearby comment for UMA_SLAB_KERNEL.

Reviewed by:	kib, markj
Discussed with:	jeff
Approved by:	re (marius)
Differential Revision:	https://reviews.freebsd.org/D16845
2018-08-25 19:38:08 +00:00
Gleb Smirnoff
306abf0f35 Either "free" or "allocated" is misleading here, since an item
in a bucket is free from perspective of UMA consumer, and it is
allocated from perspective of keg.

Discussed with:	markj
Approved by:	re (kib)
2018-08-24 18:47:50 +00:00
Gleb Smirnoff
a307fb5b0c Fix comment. The actual meaning of ub_cnt is the opposite. 2018-08-23 23:24:28 +00:00
Mark Johnston
899fe184c7 Add a per-pagequeue pdpages counter.
Expose these counters under the vm.domain sysctl node.  The existing
vm.stats.vm.v_pdpages sysctl is preserved.

Reviewed by:	alc (previous version)
Differential Revision:	https://reviews.freebsd.org/D14666
2018-08-23 21:03:45 +00:00
Mark Johnston
99d92d732f Ensure that queue state is cleared when vm_page_dequeue() returns.
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held.  When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held.  Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.

Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.

While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked().  Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.

Reported and tested by:	pho
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D15980
2018-08-23 20:34:22 +00:00
Alan Cox
83a90bffd8 Eliminate kmem_malloc()'s unused arena parameter. (The arena parameter
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)

Reviewed by:	kib, markj
Discussed with:	jeff, re@
Differential Revision:	https://reviews.freebsd.org/D16825
2018-08-21 16:43:46 +00:00
Alan Cox
44d0efb215 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
Alan Cox
db7c2a4822 Eliminate the unused arena parameter from kmem_alloc_attr().
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16793
2018-08-18 22:07:48 +00:00
Alan Cox
067fd85894 Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice.  In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16713
2018-08-18 18:33:50 +00:00
Konstantin Belousov
c1344d2bbe Prevent some parallel swap-ins, rate-limit swapper swap-ins.
If faultin() was called outside swapper (from PHOLD()), do not allow
swapper to initiate additional swap-ins.  Swapper' initiated swap-ins
are serialized because they are synchronous and executed in the
context of the thread0.  With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage, usually they do not need to.

Rate-limit swapper' swap-ins to one in the MAXSLP / 2 seconds
interval, counting faultin() swapins.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16610
2018-08-13 16:48:46 +00:00
Mark Johnston
b50a4ea646 Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages.  On some systems a significant amount of page
reclamation happens this way.  However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers.  Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by:	alc, kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16606
2018-08-09 18:25:49 +00:00
Alan Cox
2bf8cb3804 Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping.  (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.

Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).

In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter().  Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[].  All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.

Reviewed by:	kib, markj, mmel
Tested by:	mmel
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16555
2018-08-08 16:55:01 +00:00
Alan Cox
78f1deeffe Defer and aggregate swap_pager_meta_build frees.
Before swp_pager_meta_build replaces an old swapblk with an new one,
it frees the old one.  To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj (an earlier version)
Tested by:	pho
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13707
2018-08-08 02:30:34 +00:00
Konstantin Belousov
a70e9a1388 Swap in WKILLED processes.
Swapped-out process that is WKILLED must be swapped in as soon as
possible.  The reason is that such process can be killed by OOM and
its pages can be only freed if the process exits.  To exit, the kernel
stack of the process must be mapped.

When allocating pages for the stack of the WKILLED process on swap in,
use VM_ALLOC_SYSTEM requests to increase the chance of the allocation
to succeed.

Add counter of the swapped out processes to avoid unneeded iteration
over the allprocs list when there is no work to do, reducing the
allproc_lock ownership.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16489
2018-08-04 20:45:43 +00:00
Mark Johnston
c16bd872dc Add the required page accounting to kmem_bootstrap_free().
Reviewed by:	alc, kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16581
2018-08-03 16:35:37 +00:00
Konstantin Belousov
e45b89d23d Add pmap_is_valid_memattr(9).
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D15583
2018-08-01 18:45:51 +00:00
Konstantin Belousov
6e1d2cf679 For compat32, emulate the same wraparound check as occurs on the real
ILP32 system.

Reported by and discussed with:	asomers
PR:	230162
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16525
2018-07-31 18:00:47 +00:00
Alan Cox
005783a0a6 Allow vm object coalescing to occur in the midst of a vm object when the
OBJ_ONEMAPPING flag is set.  In other words, allow recycling of existing
but unused subranges of a vm object when the OBJ_ONEMAPPING flag is set.

Such situations are increasingly common with jemalloc >= 5.0.  This
change has the expected effect of reducing the number of vm map entry and
object allocations and increasing the number of superpage promotions.

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16501
2018-07-31 17:41:48 +00:00
Alan Cox
737e25f7eb To date, mlockall(MCL_FUTURE) has had the unfortunate side effect of
blocking vm map entry and object coalescing for the calling process.
However, there is no reason that mlockall(MCL_FUTURE) should block
such coalescing.  This change enables it.

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16413
2018-07-28 04:06:33 +00:00
Warner Losh
67d33338c0 Rename VM_FREELIST_ISADMA to VM_FREELIST_LOWMEM.
There's no differene between VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM
except for the default boundary (16MB on x86 and 256MB on MIPS, but
they are otherwise the same). We don't need both for any system we
support (there were some really old ARC systems that did have ISA/EISA
bus, but we never ran on them and they are too old to ever grow
support for).

Differential Review: https://reviews.freebsd.org/D16290
2018-07-27 18:34:20 +00:00
Mark Johnston
6c85795a25 Fix handling of KVA in kmem_bootstrap_free().
Do not use vm_map_remove() to release KVA back to the system.  Because
kernel map entries do not have an associated VM object, with r336030
the vm_map_remove() call will not update the kernel page tables.  Avoid
relying on the vm_map layer and instead update the pmap and release KVA
to the kernel arena directly in kmem_bootstrap_free().

Because the pmap updates will generally result in superpage demotions,
modify pmap_init() to insert PTPs shadowed by superpage mappings into
the kernel pmap's radix tree.

While here, port r329171 to i386.

Reported by:	alc
Reviewed by:	alc, kib
X-MFC with:	r336505
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16426
2018-07-27 15:46:34 +00:00
Li-Wen Hsu
03154ade2a Use __riscv to determine building for RISC-V
Reviewed by:	br
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16398
2018-07-23 19:49:54 +00:00
Mark Johnston
398a929f42 Add support for pmap_enter(psind = 1) to the arm64 pmap.
See the commit log messages for r321378 and r336288 for descriptions of
this functionality.

Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D16303
2018-07-20 16:37:04 +00:00
Mark Johnston
483f692ea6 Have preload_delete_name() free pages backing preloaded data.
On i386 and amd64, add a vm_phys segment for physical memory used to
store the kernel binary and other preloaded data.  This makes it
possible to free such memory back to the system once it is no longer
needed, e.g., when a preloaded kernel module is unloaded.  Previously,
it would have remained unused.

Reviewed by:	kib, royger
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16330
2018-07-19 20:00:28 +00:00
Alan Cox
103cc0f6ea Revert r329254. The underlying cause for the copy-on-write problem in
multithreaded programs that was addressed by r329254 was in the
implementation of pmap_enter() on some architectures, notably, amd64.
kib@, markj@ and I have audited all of the pmap_enter() implementations,
and fixed the broken ones, specifically, amd64 (r335784, r335971), i386
(r336092), mips (r336248), and riscv (r336294).

To be clear, the reason to address the problem within pmap_enter() and
revert r329254 is not just a matter of principle.  An effect of r329254
was that a copy-on-write fault actually entailed two page faults, not
one, even for single-threaded programs.  Now, in the expected case for
either single- or multithreaded programs, we are back to a single page
fault to complete a copy-on-write operation.  (In extremely rare
circumstances, a multithreaded program could suffer two page faults.)

Reviewed by:	kib, markj
Tested by:	truckman
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D16301
2018-07-19 17:01:10 +00:00
Alan Cox
d7aeb429a0 Test PGA_REFERENCED after calling pmap_ts_referenced(), rather than before,
so that a reference from a concurrently destroyed mapping is observed
during the current scan.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16277
2018-07-15 19:25:15 +00:00
Alan Cox
8c0873714c Add support for pmap_enter(..., psind=1) to the i386 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 2 or 4 MB
page mapping.  (Essentially, this feature allows the machine-independent
layer to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pde() that specifies whether it should fail or
reclaim a PV entry when one is not available.

Refactor pmap_enter_pde() into two functions, one by the same name, that is
a general-purpose function for creating PDE PG_PS mappings, and another,
pmap_enter_4mpage(), that is used to prefault 2 or 4 MB read- and/or
execute-only mappings for execve(2), mmap(2), and shmat(2).

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D16246
2018-07-14 17:20:27 +00:00
Mateusz Guzik
efb6d4a479 uma: whack main zone counter update in the slow path, freeing side
See r333052.
2018-07-12 22:35:52 +00:00