Commit Graph

3849 Commits

Author SHA1 Message Date
alc
63fd4c37fb Increase the pageout cluster size to 32 pages.
Decouple the pageout cluster size from the size of the hash table entry
used by the swap pager for mapping (object, pindex) to a block on the
swap device(s), and keep the size of a hash table entry at its current
size.

Eliminate a pointless macro.

Reviewed by:	kib, markj (an earlier version)
MFC after:	4 weeks
Differential Revision:	https://reviews.freebsd.org/D11305
2017-06-24 17:10:33 +00:00
kib
34de63e92d Implement address space guards.
Guard, requested by the MAP_GUARD mmap(2) flag, prevents the reuse of
the allocated address space, but does not allow instantiation of the
pages in the range.  It is useful for more explicit support for usual
two-stage reserve then commit allocators, since it prevents accidental
instantiation of the mapping, e.g. by mprotect(2).

Use guards to reimplement stack grow code.  Explicitely track stack
grow area with the guard, including the stack guard page.  On stack
grow, trivial shift of the guard map entry and stack map entry limits
makes the stack expansion.  Move the code to detect stack grow and
call vm_map_growstack(), from vm_fault() into vm_map_lookup().

As result, it is impossible to get random mapping to occur in the
stack grow area, or to overlap the stack guard page.

Enable stack guard page by default.

Reviewed by:	alc, markj
Man page update reviewed by:	alc, bjk, emaste, markj, pho
Tested by:	pho, Qualys
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11306 (man pages)
2017-06-24 17:01:11 +00:00
kib
eb2cba616a Do not try to unmark MAP_ENTRY_IN_TRANSITION marked by other thread.
The issue is catched by "vm_map_wire: alien wire" KASSERT at the end
of the vm_map_wire().  We currently check for MAP_ENTRY_WIRE_SKIPPED
flag before ensuring that the wiring_thread is curthread. For HOLESOK
wiring, this means that we might see WIRE_SKIPPED entry from different
wiring.

The fix it by only checking WIRE_SKIPPED if the entry is put
IN_TRANSITION by us.  Also fixed a typo in the comment explaining the
situation.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-24 16:47:41 +00:00
kib
3677635c7b Call pmap_copy() only for map entries which have the backing object
instantiated.

Calling pmap_copy() on non-faulted anonymous memory entries is useless.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:54:28 +00:00
kib
1892b89561 Assert that the protection of a new map entry is a subset of the max
protection.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:51:30 +00:00
alc
742947fba6 Eliminate an unused macro.
MFC after:	3 days
2017-06-21 03:55:45 +00:00
kib
bdbf5f6e59 Ignore the P_SYSTEM process flag, and do not request
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.

System maps do not create auto-grown stack.  Any stack we handled,
even for P_SYSTEM, must be for user address space.  P_SYSTEM processes
with mapped user space is either init(8) or an aio worker attached to
other user process with aio buffer pointing into stack area.  In either
case, VM_MAP_WIRE_USER mode should be used.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-19 20:40:59 +00:00
alc
3f2176677b Pages that are passed to swap_pager_putpages() should already be fully
dirty.  Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.

Reviewed by:	kib, markj
MFC after:	1 week
X-MFC after:	r319932
2017-06-17 03:05:25 +00:00
kib
7dae9de7e8 Some minor improvements to vnode_pager_generic_putpages().
- Add asserts that the pages to write are dirty.  The last page, if
  partially written, is only required to be dirty, while completely
  written pages should have all dirty bit set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-15 14:34:33 +00:00
glebius
8e755a8e23 When we are in UMA_STARTUP use startup_alloc() for any zone, not for
internal zones only.  This allows to create new zones at early stages
of boot, without need to mark them as internal to UMA, which isn't
always true.

Reviewed by:	alc
2017-06-08 21:33:19 +00:00
jhb
2a188f8a23 Fix an off-by-one error in the VM page array on some systems.
r31386 changed how the size of the VM page array was calculated to be
less wasteful.  For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages.  However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array.  Handle this case by explicitly excluding the page from
phys_avail[].

Reviewed by:	alc
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D11000
2017-06-08 16:18:41 +00:00
alc
662afbe1bf Starting in r118390, swaponsomething() began to reserve the blocks at the
beginning of a swap area for a disk label.  However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.

Reviewed by:	kib
MFC after:	5 days
2017-06-06 16:52:07 +00:00
alc
186046483d When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
alc
e5e6fc9910 The variable "breakout" is used like a Boolean, so actually define it as
one.

Reviewed by:	kib
MFC after:	5 days
2017-06-05 18:07:56 +00:00
alc
ce8181e2f8 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
glebius
18c82b0c32 As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs. 2017-06-01 18:36:52 +00:00
glebius
d20e709b83 Simplify boot pages management in UMA.
It is simply a contigous virtual memory pointer and number of pages.
There is no need to build a linked list here.  Just increment pointer
and decrement counter.  The only functional difference to old allocator
is that before we gave pages from topmost and down to lowest, and now
we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by:	alc
2017-06-01 18:26:57 +00:00
alc
a1b59a2a3e After r118390, the variable "dmmax" was neither the correct strip size
nor the correct maximum block size.  Moreover, after r318995, it serves
no purpose except to provide information to user space through a read-
sysctl.

This change eliminates the variable "dmmax" but retains the sysctl.  It
also corrects the value returned by the sysctl.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 21:46:00 +00:00
alc
33fb89b4ed In r118390, the swap pager's approach to striping swap allocation over
multiple devices was changed.  However, swapoff_one() was not fully and
correctly converted.  In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation.  Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().

This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER.  Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 16:40:00 +00:00
kib
e75ba1d5c4 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
kib
6b474b6405 Emulate pre-r317061 ABI.
This restores 32bit-sized accesses to vmcnt sysctls, making old
binaries like top(1), systat(8) and reboot(8) mostly functional on
newer kernel.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2017-05-02 18:40:41 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
glebius
5763443023 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
markj
c4c6e9ae09 Busy the map in vm_map_protect().
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 21:01:42 +00:00
markj
6976fe28f2 Consistently use for-loops in vm_map_protect().
No functional change.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 20:57:16 +00:00
markj
278be1664c Add some bounds assertions to the vm_map_entry clip functions.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision: https://reviews.freebsd.org/D10349
2017-04-10 20:55:42 +00:00
kib
2a5a630e5b Extract calculation of ioflags from the vm_pager_putpages flags into a
helper.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:56:04 +00:00
kib
dfb5980629 Some style fixes for vnode_pager_generic_putpages(), in the local
declaration block.

Reviewed by:	markj (as part of the larger patch)
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:45:00 +00:00
kib
6e5de5b2c0 Use int instead of boolean_t for flags argument type in
vnode_pager_generic_putpages() prototype; change the argument name to
reflect that it is flags.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:30:41 +00:00
jhb
bbbd5839ff Assert that the align parameter to uma_zcreate() is valid.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D10100
2017-04-04 16:26:46 +00:00
dchagin
57bf038283 Add kern_mincore() helper for micore() syscall.
Suggested by:	kib@
Reviewed by:	kib@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D10143
2017-03-30 19:42:49 +00:00
alc
8bad3c5e87 Two changes to vm_fault_populate():
Simplify the logic for clipping the range returned by the pager to fit
within the map entry.

Use atop() rather than OFF_TO_IDX() on addresses.

Reviewed by:	kib
MFC after:	1 week
2017-03-19 19:52:47 +00:00
kib
340b707be8 Fix off-by-one in the vm_fault_populate() code.
When re-calculating the last inclusive page index after the pager
call, -1 was erronously ommitted.  If the pager extended the run
(unlikely), the result would be insertion of the valid page mapping
outside the current map entry range.

Found by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-19 14:42:16 +00:00
delphij
d308d4fa6c The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
alc
89e63829ad Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
kib
8cf2af841c Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
delphij
6c1ff20b8d Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it would be zeroed
upon fork().

This would be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
markj
b6e7d5b313 Update a comment to reflect reality.
MFC after:	1 week
2017-03-13 18:45:25 +00:00
kib
dada1184e8 Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
avg
b0aaa6df92 uma: fix pages <-> items conversions at several places
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after:	1 month (if ever)
2017-03-11 16:43:38 +00:00
avg
de71736cdc uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
avg
e4abb6a4a9 call vm_lowmem hook in uma_reclaim_worker
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook.  No handler makes use of the flags yet.

Reviewed by:	markj, kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9764
2017-02-25 16:39:21 +00:00
kib
568d99bbad Properly handle possible underflow in vm_fault_prefault().
In vm_fault_prefault(), if backward count causes underflow in
calculation of
	starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries address
ranges, in particular, map at zero.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	alc
MFC after:	1 week
2017-02-24 08:09:16 +00:00
avg
25c75ccb77 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00
bz
21d1178863 Use %s __func__ to print the actual function name (been looking at
the wrong one for too often lately at first), and also use %#lx to
get the 0x prefix for the address.

MFC after:	1 week
2017-02-14 01:20:03 +00:00
kib
0dac2c5955 Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
kib
0bc2784093 Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers.
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-13 00:40:55 +00:00
kib
9bfd07fd2e Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate.  At the maping time,
check that offset is not negative.  Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment.  Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver.  Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
kib
95e32f2845 Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.
This makes the code to pass whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits.  The change provides temporal fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.

PR:	216976
Reported and tested by:	ngie
Sponsored by:	The FreeBSD Foundation
2017-02-11 20:27:39 +00:00