Commit Graph

559 Commits

Author SHA1 Message Date
Konstantin Belousov
19ea042eb8 Make vm_map_max/min/pmap KBI stable.
There are out of tree consumers of vm_map_min() and vm_map_max(), and
I believe there are consumers of vm_map_pmap(), although the later is
arguably less in the need of KBI-stable interface. For the consumers
benefit, make modules using this KPI not depended on the struct vm_map
layout.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14902
2018-03-30 10:55:31 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Konstantin Belousov
1c5196c3ed Assign map->header values to avoid boundary checks.
In several places, entry start and end field are checked, after
excluding the possibility that the entry is map->header.  By assigning
max and min values to the start and end fields of map->header in
vm_map_init, the explicit map->header checks become unnecessary.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	alc, kib, markj (previous version)
Tested by:	pho (previous version)
MFC after:	1 week
Differential Revision:  https://reviews.freebsd.org/D13735
2018-01-20 12:19:02 +00:00
Alan Cox
fec296887f Refactor vm_map_find(), creating a separate function, vm_map_alignspace(),
for finding aligned free space in the given map.  With this change, we
always return KERN_NO_SPACE when we fail to find free space.  Whereas,
previously, we might return KERN_INVALID_ADDRESS.  Also, with this change,
we explicitly check for address wrap, rather than relying upon the map's
min and max addresses to establish sentinel-like regions.

This refactoring was inspired by the problem that we addressed in r326098.

Reviewed by:	kib
Tested by:	pho
Discussed with:	markj
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D13346
2017-12-26 17:59:37 +00:00
Konstantin Belousov
e8502826ce Add comment for vm_map_find_min().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
X-Differential revision:	https://reviews.freebsd.org/D13155
2017-12-01 10:53:08 +00:00
Pedro F. Giffuni
796df753f4 SPDX: Consider code from Carnegie-Mellon University.
Interesting cases, most likely from CMU Mach sources.
2017-11-30 15:48:35 +00:00
Jeff Roberson
2e47807c21 Eliminate kmem_arena and kmem_object in preparation for further NUMA commits.
The arena argument to kmem_*() is now only used in an assert.  A follow-up
commit will remove the argument altogether before we freeze the API for the
next release.

This replaces the hard limit on kmem size with a soft limit imposed by UMA.  When
the soft limit is exceeded we periodically wakeup the UMA reclaim thread to
attempt to shrink KVA.  On 32bit architectures this should behave much more
gracefully as we exhaust KVA.  On 64bit the limits are likely never hit.

Reviewed by:	markj, kib (some objections)
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13187
2017-11-28 23:40:54 +00:00
Konstantin Belousov
9410cd7d9e Return different error code for the guard page layout violation.
On KERN_NO_SPACE error, as it is returned now, vm_map_find() continues
the loop searching for the suitable range for the requested mapping
with specific alignment.  Since the vm_map_findspace() succesfully
finds the same place, the loop never ends.

The errors returned from vm_map_stack() completely repeat the behavior
of vm_map_insert() now, as suggested by Alan.

Reported by:	Arto Pekkanen <aksyom@gmail.com>
PR:	223732
Reviewed by:	alc, markj
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D13186
2017-11-22 16:45:27 +00:00
Alan Cox
4d572bb3ed When vm_map_find(find_space = VMFS_OPTIMAL_SPACE) fails to find space, a
second scan of the address space with find_space = VMFS_ANY_SPACE is
performed.  Previously, vm_map_find() released and reacquired the map lock
between the first and second scans.  However, there is no compelling
reason to do so.  This revision modifies vm_map_find() to retain the map
lock.

Reviewed by:	jhb, kib, markj
MFC after:	1 week
X-Differential Revision:	https://reviews.freebsd.org/D13155
2017-11-22 16:39:24 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Konstantin Belousov
eb5ea8788f Disable stack growth when accessed by AIO daemons.
Commit message for r321173 incorrectly stated that the change disables
automatic stack growth from the AIO daemons contexts, with explanation
that this is currently prevents applying wrong resource limits.  Fix
this by actually disabling the growth.

Noted by:	alc
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-19 19:00:32 +00:00
Konstantin Belousov
f758aadd07 Convert assertion that only vmspace owner grows the stack, into a
check blocking grow from other processes accesses.

Debugger may access stack grow area with ptrace(2).  In this case,
real state of the process is to not have the stack grown, which
provides more accurate inspection.  Technical reason to avoid the grow
is to avoid applying wrong process (debugger) stack limit.

This change also has a consequence of making aio workers accesses past
the bottom of stacks into EFAULT, arguably the situation is a
programmers mistake.

Reported by:	jhb
Discussed with:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-18 20:26:41 +00:00
Alan Cox
8830260128 Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by:	kib, markj
MFC after:	1 week
2017-07-14 02:15:48 +00:00
Konstantin Belousov
7683ad70d3 Fix loop termination in vm_map_find_min().
Reported by:	antoine
Tested by:	Stefan Ehmann <shoesoft@gmx.net>,
       Jan Kokemueller <jan.kokemueller@gmail.com>
PR:	220493
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-09 15:41:49 +00:00
Alan Cox
201f03b8e7 Modify vm_map_growstack() to protect itself from the possibility of the
gap entry in the vm map being smaller than the sysctl-derived stack guard
size.  Otherwise, the value of max_grow can suffer from overflow, and the
roundup(grow_amount, sgrowsiz) will not be properly capped, resulting in
an assertion failure.

In collaboration with:	kib
MFC after:	3 days
2017-07-01 23:39:49 +00:00
Alan Cox
8056df6e25 Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space.  Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit.  It will
never be asked to wire the stack.  Moreover, it doesn't even implement
wiring of the stack.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11421
2017-06-30 15:49:36 +00:00
Konstantin Belousov
6a97a3f756 Treat the addr argument for mmap(2) request without MAP_FIXED flag as
a hint.

Right now, for non-fixed mmap(2) calls, addr is de-facto interpreted
as the absolute minimal address of the range where the mapping is
created.  The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS].  This is too restrictive, the mmap(2) call might
unduly fail if there is no free addresses above addr but a lot of
usable space below it.

Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before.  If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.

One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr.  Under some
configuration conditions, libthr tries to hint kernel to reuse the
main thread stack grow area for the new stacks.  This cannot work by
design now after grow area is converted to stack, and there is no
unallocated VA above the main stack.  Interpreting requested stack
base address as the hint provides compatibility with old libthr and
with (mis-)configured current libthr.

Reviewed by:	alc
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-28 04:02:36 +00:00
Konstantin Belousov
8a89ca9425 For now, allow mprotect(2) over the guards to succeed regardless of
the requested protection.

The syscall returns success without changing the protection of the
guard.  This is consistent with the current mprotect(2) behaviour on
the unmapped ranges.  More important, the calls performed by libc and
libthr to allow execution of stacks, if requested by the loaded ELF
objects, do the expected change instead of failing on the grow space
guard.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 23:16:37 +00:00
Konstantin Belousov
19f49ad30f Correctly handle small MAP_STACK requests.
If mmap(2) is called with the MAP_STACK flag and the size which is
less or equal to the initial stack mapping size plus guard,
calculation of the mapping layout created zero-sized guard.  Attempt
to create such entry failed in vm_map_insert(), causing the whole
mmap(2) call to fail.

Fix it by adjusting the initial mapping size to have space for
non-empty guard.  Reject MAP_STACK requests which are shorter or equal
to the configured guard pages size.

Reported and tested by:	Manfred Antar <null@pozo.com>
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 20:06:05 +00:00
Konstantin Belousov
ae5bb0cac8 Remove stale part of the comment.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 19:59:39 +00:00
Konstantin Belousov
f141ed73a2 Style.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 18:40:59 +00:00
Konstantin Belousov
19bd0d9c85 Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range.  It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).

Use guards to reimplement stack grow code.  Explicitely track stack
grow area with the guard, including the stack guard page.  On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion.  Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().

As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.

Enable stack guard page by default.

Reviewed by:	alc, markj
Man page update reviewed by:	alc, bjk, emaste, markj, pho
Tested by:	pho, Qualys
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
Konstantin Belousov
546bb2d7f0 Do not try to unmark MAP_ENTRY_IN_TRANSITION marked by other thread.
The issue is catched by "vm_map_wire: alien wire" KASSERT at the end
of the vm_map_wire().  We currently check for MAP_ENTRY_WIRE_SKIPPED
flag before ensuring that the wiring_thread is curthread. For HOLESOK
wiring, this means that we might see WIRE_SKIPPED entry from different
wiring.

The fix it by only checking WIRE_SKIPPED if the entry is put
IN_TRANSITION by us.  Also fixed a typo in the comment explaining the
situation.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 16:47:41 +00:00
Konstantin Belousov
0ec97ffc10 Call pmap_copy() only for map entries which have the backing object
instantiated.

Calling pmap_copy() on non-faulted anonymous memory entries is useless.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:54:28 +00:00
Konstantin Belousov
00de677313 Assert that the protection of a new map entry is a subset of the max
protection.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:51:30 +00:00
Konstantin Belousov
212e02c836 Ignore the P_SYSTEM process flag, and do not request
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.

System maps do not create auto-grown stack.  Any stack we handled,
even for P_SYSTEM, must be for user address space.  P_SYSTEM processes
with mapped user space is either init(8) or an aio worker attached to
other user process with aio buffer pointing into stack area.  In either
case, VM_MAP_WIRE_USER mode should be used.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-19 20:40:59 +00:00
Mark Johnston
e1cb9d3747 Busy the map in vm_map_protect().
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 21:01:42 +00:00
Mark Johnston
1c2d20a1d0 Consistently use for-loops in vm_map_protect().
No functional change.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 20:57:16 +00:00
Mark Johnston
ed11e4d701 Add some bounds assertions to the vm_map_entry clip functions.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision: https://reviews.freebsd.org/D10349
2017-04-10 20:55:42 +00:00
Xin LI
83d37aaf01 The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
Konstantin Belousov
d1780e8dac Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
Xin LI
78d7964b46 Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it would be zeroed
upon fork().

This would be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
Konstantin Belousov
2b6d1a639b Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Konstantin Belousov
1569205f0a Style fixes for vm_map_insert().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-01 18:49:46 +00:00
Konstantin Belousov
9a4ee196dd Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-30 13:04:43 +00:00
John Baldwin
a9546a6b17 Use db_lookup_proc() in the DDB 'show procvm' command.
This allows processes to be identified by PID as well as a pointer address.

MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
2016-12-13 19:22:43 +00:00
Alan Cox
3453bca864 Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
Alan Cox
381b724280 Change the type of the map entry's next_read field from a vm_pindex_t to a
vm_offset_t.  (This field is used to detect sequential access to the virtual
address range represented by the map entry.)  There are three reasons to
make this change.  First, a vm_offset_t is smaller on 32-bit architectures.
Consequently, a struct vm_map_entry is now smaller on 32-bit architectures.
Second, a vm_offset_t can be written atomically, whereas it may not be
possible to write a vm_pindex_t atomically on a 32-bit architecture.  Third,
using a vm_pindex_t makes the next_read field dependent on which object in
the shadow chain is being read from.

Replace an "XXX" comment.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2016-07-07 20:58:16 +00:00
Pedro F. Giffuni
763df3ec55 sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
John Baldwin
164a37a55a Trim redundant message.
WITNESS_WARN() appends "with non-sleepable lock" to the caller's message.

Sponsored by:	Chelsio Communications
2016-04-27 21:51:24 +00:00
Konstantin Belousov
ba7c64d17b Typo in comment. 2016-01-24 13:38:41 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
Konstantin Belousov
b8db977617 Remove a check which caused spurious SIGSEGV on usermode access to the
mapped address without valid pte installed, when parallel wiring of
the entry happen.  The entry must be copy on write.  If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear.  After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check.  Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.

There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it.  Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869.  The r251901
revision re-introduced the r24666 fix for the current VM.

Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW.  In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared.  Put the
assert stating the invariant, instead of returning the error.

Reported and debugging help by:	peter
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 06:19:33 +00:00
Konstantin Belousov
6a875bf929 Do not pretend that vm_fault(9) supports unwiring the address. Rename
the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE.  Assert that the
flag is only passed when faulting on the wired map entry.  Remove the
vm_page_unwire() call, which should be never reachable.

Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro
is reduced to the testing of the fs.object having a default pager.
Inline the check.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
MFC after:	1 week
2015-07-30 18:28:34 +00:00
Konstantin Belousov
be930a2021 Account for the main process stack being one page below the highest
user address when ABI uses shared page.

Note that the change is no-op for correctness, since shared page does
not fault.  The mapping for the shared page is installed at the
address space creation, the page is unmanaged and its pte/pv entry
cannot be reclaimed.

Submitted by:	Oliver Pinter
Review:	https://reviews.freebsd.org/D2954
MFC after:	1 week
2015-07-02 15:22:13 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Edward Tomasz Napierala
4b5c9cf62f Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
Ryan Stone
423521aa33 vmspace_release() may sleep if the last reference is being released,
so add a WITNESS_WARN() to catch cases where it is called with a
non-sleepable lock held.

MFC after:	1 month
Sponsored by:	Sandvine Inc.
2015-01-24 16:59:38 +00:00
Konstantin Belousov
54432196db vm_map_pmap_enter() and pmap_enter_object() are currently not aware of
the wired attribute of the mapping.  As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost.  Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-09-23 18:54:23 +00:00