550 Commits

Author SHA1 Message Date
kib
34e529e98b Disable stack growth when accessed by AIO daemons.
Commit message for r321173 incorrectly stated that the change disables
automatic stack growth from the AIO daemons contexts, with explanation
that this is currently prevents applying wrong resource limits.  Fix
this by actually disabling the growth.

Noted by:	alc
Reviewed by:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-19 19:00:32 +00:00
kib
65f611547e Convert assertion that only vmspace owner grows the stack, into a
check blocking grow from other processes accesses.

Debugger may access stack grow area with ptrace(2).  In this case,
real state of the process is to not have the stack grown, which
provides more accurate inspection.  Technical reason to avoid the grow
is to avoid applying wrong process (debugger) stack limit.

This change also has a consequence of making aio workers accesses past
the bottom of stacks into EFAULT, arguably the situation is a
programmers mistake.

Reported by:	jhb
Discussed with:	alc, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-18 20:26:41 +00:00
alc
80e6bc7371 Generalize vm_page_ps_is_valid() to support testing other predicates on
the (super)page, renaming the function to vm_page_ps_test().

Reviewed by:	kib, markj
MFC after:	1 week
2017-07-14 02:15:48 +00:00
kib
d2c5c68142 Fix loop termination in vm_map_find_min().
Reported by:	antoine
Tested by:	Stefan Ehmann <shoesoft@gmx.net>,
       Jan Kokemueller <jan.kokemueller@gmail.com>
PR:	220493
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-07-09 15:41:49 +00:00
alc
7c0d38a1e3 Modify vm_map_growstack() to protect itself from the possibility of the
gap entry in the vm map being smaller than the sysctl-derived stack guard
size.  Otherwise, the value of max_grow can suffer from overflow, and the
roundup(grow_amount, sgrowsiz) will not be properly capped, resulting in
an assertion failure.

In collaboration with:	kib
MFC after:	3 days
2017-07-01 23:39:49 +00:00
alc
0b28a56ef7 Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space.  Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit.  It will
never be asked to wire the stack.  Moreover, it doesn't even implement
wiring of the stack.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11421
2017-06-30 15:49:36 +00:00
kib
91d820a4f9 Treat the addr argument for mmap(2) request without MAP_FIXED flag as
a hint.

Right now, for non-fixed mmap(2) calls, addr is de-facto interpreted
as the absolute minimal address of the range where the mapping is
created.  The VA allocator only allocates in the range [addr,
VM_MAXUSER_ADDRESS].  This is too restrictive, the mmap(2) call might
unduly fail if there is no free addresses above addr but a lot of
usable space below it.

Lift this implementation limitation by allocating VA in two passes.
First, try to allocate above addr, as before.  If that fails, do the
second pass with less restrictive constraints for the start of
allocation by specifying minimal allocation address at the max bss
end, if this limit is less than addr.

One important case where this change makes a difference is the
allocation of the stacks for new threads in libthr.  Under some
configuration conditions, libthr tries to hint kernel to reuse the
main thread stack grow area for the new stacks.  This cannot work by
design now after grow area is converted to stack, and there is no
unallocated VA above the main stack.  Interpreting requested stack
base address as the hint provides compatibility with old libthr and
with (mis-)configured current libthr.

Reviewed by:	alc
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-28 04:02:36 +00:00
kib
aa808958eb For now, allow mprotect(2) over the guards to succeed regardless of
the requested protection.

The syscall returns success without changing the protection of the
guard.  This is consistent with the current mprotect(2) behaviour on
the unmapped ranges.  More important, the calls performed by libc and
libthr to allow execution of stacks, if requested by the loaded ELF
objects, do the expected change instead of failing on the grow space
guard.

Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 23:16:37 +00:00
kib
a34566f9cc Correctly handle small MAP_STACK requests.
If mmap(2) is called with the MAP_STACK flag and the size which is
less or equal to the initial stack mapping size plus guard,
calculation of the mapping layout created zero-sized guard.  Attempt
to create such entry failed in vm_map_insert(), causing the whole
mmap(2) call to fail.

Fix it by adjusting the initial mapping size to have space for
non-empty guard.  Reject MAP_STACK requests which are shorter or equal
to the configured guard pages size.

Reported and tested by:	Manfred Antar <null@pozo.com>
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 20:06:05 +00:00
kib
fd9763bf72 Remove stale part of the comment.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 19:59:39 +00:00
kib
20f226eed3 Style.
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-25 18:40:59 +00:00
kib
34de63e92d Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range.  It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).

Use guards to reimplement stack grow code.  Explicitely track stack
grow area with the guard, including the stack guard page.  On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion.  Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().

As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.

Enable stack guard page by default.

Reviewed by:	alc, markj
Man page update reviewed by:	alc, bjk, emaste, markj, pho
Tested by:	pho, Qualys
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
kib
eb2cba616a Do not try to unmark MAP_ENTRY_IN_TRANSITION marked by other thread.
The issue is catched by "vm_map_wire: alien wire" KASSERT at the end
of the vm_map_wire().  We currently check for MAP_ENTRY_WIRE_SKIPPED
flag before ensuring that the wiring_thread is curthread. For HOLESOK
wiring, this means that we might see WIRE_SKIPPED entry from different
wiring.

The fix it by only checking WIRE_SKIPPED if the entry is put
IN_TRANSITION by us.  Also fixed a typo in the comment explaining the
situation.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 16:47:41 +00:00
kib
3677635c7b Call pmap_copy() only for map entries which have the backing object
instantiated.

Calling pmap_copy() on non-faulted anonymous memory entries is useless.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:54:28 +00:00
kib
1892b89561 Assert that the protection of a new map entry is a subset of the max
protection.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:51:30 +00:00
kib
bdbf5f6e59 Ignore the P_SYSTEM process flag, and do not request
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.

System maps do not create auto-grown stack.  Any stack we handled,
even for P_SYSTEM, must be for user address space.  P_SYSTEM processes
with mapped user space is either init(8) or an aio worker attached to
other user process with aio buffer pointing into stack area.  In either
case, VM_MAP_WIRE_USER mode should be used.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-19 20:40:59 +00:00
markj
c4c6e9ae09 Busy the map in vm_map_protect().
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 21:01:42 +00:00
markj
6976fe28f2 Consistently use for-loops in vm_map_protect().
No functional change.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 20:57:16 +00:00
markj
278be1664c Add some bounds assertions to the vm_map_entry clip functions.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision: https://reviews.freebsd.org/D10349
2017-04-10 20:55:42 +00:00
delphij
d308d4fa6c The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
kib
8cf2af841c Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
delphij
6c1ff20b8d Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it would be zeroed
upon fork().

This would be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
kib
dada1184e8 Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
kib
e345e3955b Style fixes for vm_map_insert().
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-01 18:49:46 +00:00
kib
9763564a8d Style.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-12-30 13:04:43 +00:00
jhb
6b0130d5b3 Use db_lookup_proc() in the DDB 'show procvm' command.
This allows processes to be identified by PID as well as a pointer address.

MFC after:	2 weeks
Sponsored by:	DARPA / AFRL
2016-12-13 19:22:43 +00:00
alc
023ed8da8d Eliminate every mention of PG_CACHED pages from the comments in the machine-
independent layer of the virtual memory system.  Update some of the nearby
comments to eliminate redundancy and improve clarity.

In vm/vm_reserv.c, do not use hyphens after adverbs ending in -ly per
The Chicago Manual of Style.

Update the comment in vm/vm_page.h defining the four types of page queues to
reflect the elimination of PG_CACHED pages and the introduction of the
laundry queue.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8752
2016-12-12 17:47:09 +00:00
alc
694dd903a3 Change the type of the map entry's next_read field from a vm_pindex_t to a
vm_offset_t.  (This field is used to detect sequential access to the virtual
address range represented by the map entry.)  There are three reasons to
make this change.  First, a vm_offset_t is smaller on 32-bit architectures.
Consequently, a struct vm_map_entry is now smaller on 32-bit architectures.
Second, a vm_offset_t can be written atomically, whereas it may not be
possible to write a vm_pindex_t atomically on a 32-bit architecture.  Third,
using a vm_pindex_t makes the next_read field dependent on which object in
the shadow chain is being read from.

Replace an "XXX" comment.

Reviewed by:	kib
Approved by:	re (gjb)
Sponsored by:	EMC / Isilon Storage Division
2016-07-07 20:58:16 +00:00
pfg
8327dbfe4d sys/vm: minor spelling fixes in comments.
No functional change.
2016-05-02 20:16:29 +00:00
jhb
d6559c9525 Trim redundant message.
WITNESS_WARN() appends "with non-sleepable lock" to the caller's message.

Sponsored by:	Chelsio Communications
2016-04-27 21:51:24 +00:00
kib
e840a5e833 Typo in comment. 2016-01-24 13:38:41 +00:00
jhb
897a9bc5d3 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
kib
b4d1918765 Remove a check which caused spurious SIGSEGV on usermode access to the
mapped address without valid pte installed, when parallel wiring of
the entry happen.  The entry must be copy on write.  If entry is COW
but was already copied, and parallel wiring set
MAP_ENTRY_IN_TRANSITION, vm_fault() would sleep waiting for the
MAP_ENTRY_IN_TRANSITION flag to clear.  After that, the fault handler
is restarted and vm_map_lookup() or vm_map_lookup_locked() trip over
the check.  Note that this is race, if the address is accessed after
the wiring is done, the entry does not fault at all.

There is no reason in the current kernel to disallow write access to
the COW wired entry if the entry permissions allow it.  Initially this
was done in r24666, since that kernel did not supported proper
copy-on-write for wired text, which was fixed in r199869.  The r251901
revision re-introduced the r24666 fix for the current VM.

Note that write access must clear MAP_ENTRY_NEEDS_COPY entry flag by
performing COW.  In reverse, when MAP_ENTRY_NEEDS_COPY is set in
vmspace_fork(), the MAP_ENTRY_USER_WIRED flag is cleared.  Put the
assert stating the invariant, instead of returning the error.

Reported and debugging help by:	peter
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 06:19:33 +00:00
kib
34996cf7cd Do not pretend that vm_fault(9) supports unwiring the address. Rename
the VM_FAULT_CHANGE_WIRING flag to VM_FAULT_WIRE.  Assert that the
flag is only passed when faulting on the wired map entry.  Remove the
vm_page_unwire() call, which should be never reachable.

Since VM_FAULT_WIRE flag implies wired map entry, the TRYPAGER() macro
is reduced to the testing of the fs.object having a default pager.
Inline the check.

Suggested and reviewed by:	alc
Tested by:	pho (previous version)
MFC after:	1 week
2015-07-30 18:28:34 +00:00
kib
eee3eeb0fa Account for the main process stack being one page below the highest
user address when ABI uses shared page.

Note that the change is no-op for correctness, since shared page does
not fault.  The mapping for the shared page is installed at the
address space creation, the page is unmanaged and its pte/pv entry
cannot be reclaimed.

Submitted by:	Oliver Pinter
Review:	https://reviews.freebsd.org/D2954
MFC after:	1 week
2015-07-02 15:22:13 +00:00
mjg
d7bc9285a6 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
trasz
802017a04b Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision:	https://reviews.freebsd.org/D2369
Reviewed by:	kib@
MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2015-04-29 10:23:02 +00:00
rstone
520ad84555 vmspace_release() may sleep if the last reference is being released,
so add a WITNESS_WARN() to catch cases where it is called with a
non-sleepable lock held.

MFC after:	1 month
Sponsored by:	Sandvine Inc.
2015-01-24 16:59:38 +00:00
kib
74ff69d0ef vm_map_pmap_enter() and pmap_enter_object() are currently not aware of
the wired attribute of the mapping.  As result, some pmap
implementations clear the wired state of the page table entries, which
breaks invariants and allows the entries to be lost.  Avoid calling
vm_map_pmap_enter() for the MADV_WILLNEED on the wired entry, the
pages must be already mapped.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-09-23 18:54:23 +00:00
alc
1771ced43b Oops. vm_map_simplify_entry() is used by mac_proc_vm_revoke_recurse(), so
it can't be static.
2014-09-08 02:25:01 +00:00
alc
6a1691365d Make two functions static and eliminate an unused #define. 2014-09-08 00:19:03 +00:00
alc
eacf4d0259 Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable
"rv" is uninitialized.

Reported by:	bz
2014-08-02 17:58:20 +00:00
alc
f873c17deb Handle wiring failures in vm_map_wire() with the new functions
pmap_unwire() and vm_object_unwire().

Retire vm_fault_{un,}wire(), since they are no longer used.

(See r268327 and r269134 for the motivation behind this change.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-08-02 16:10:24 +00:00
alc
2b42c1ddc8 When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap.  If, for example, the
application has performed an mprotect(..., PROT_NONE) on any part of the
wired region, then those pages will no longer be mapped by the pmap.
So, using the pmap to lookup the wired pages in order to unwire them
doesn't always work, and when it doesn't work wired pages are leaked.

To avoid the leak, introduce and use a new function vm_object_unwire()
that locates the wired pages by traversing the object and its backing
objects.

At the same time, switch from using pmap_change_wiring() to the recently
introduced function pmap_unwire() for unwiring the region's mappings.
pmap_unwire() is faster, because it operates a range of virtual addresses
rather than a single virtual page at a time.  Moreover, by operating on
a range, it is superpage friendly.  It doesn't waste time performing
unnecessary demotions.

Reported by:	markj
Reviewed by:	kib
Tested by:	pho, jmg (arm)
Sponsored by:	EMC / Isilon Storage Division
2014-07-26 18:10:18 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
alc
50c074c333 Delay the call to crhold() in vm_map_insert() until we know that we won't
have to undo it by calling crfree().  This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.

Eliminate an unnecessary variable from vm_map_insert().

Reviewed by:	kib
Tested by:	pho
2014-06-26 16:04:03 +00:00
alc
aeda9d4c41 Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.

Reviewed by:	kib
2014-06-25 03:30:03 +00:00