Commit Graph

3638 Commits

Author SHA1 Message Date
Konstantin Belousov
0ec97ffc10 Call pmap_copy() only for map entries which have the backing object
instantiated.

Calling pmap_copy() on non-faulted anonymous memory entries is useless.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:54:28 +00:00
Konstantin Belousov
00de677313 Assert that the protection of a new map entry is a subset of the max
protection.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-21 18:51:30 +00:00
Alan Cox
3a5d839ebc Eliminate an unused macro.
MFC after:	3 days
2017-06-21 03:55:45 +00:00
Konstantin Belousov
212e02c836 Ignore the P_SYSTEM process flag, and do not request
VM_MAP_WIRE_SYSTEM mode when wiring the newly grown stack.

System maps do not create auto-grown stack.  Any stack we handled,
even for P_SYSTEM, must be for user address space.  P_SYSTEM processes
with mapped user space is either init(8) or an aio worker attached to
other user process with aio buffer pointing into stack area.  In either
case, VM_MAP_WIRE_USER mode should be used.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-19 20:40:59 +00:00
Alan Cox
87b0ab69a9 Pages that are passed to swap_pager_putpages() should already be fully
dirty.  Assert that they are fully dirty rather than redundantly calling
vm_page_dirty() on them.

Reviewed by:	kib, markj
MFC after:	1 week
X-MFC after:	r319932
2017-06-17 03:05:25 +00:00
Konstantin Belousov
e6c44f65d4 Some minor improvements to vnode_pager_generic_putpages().
- Add asserts that the pages to write are dirty.  The last page, if
  partially written, is only required to be dirty, while completely
  written pages should have all dirty bit set.
- Use uintmax_t to print vm_page pindexes.
- Use NULL instead of casted zero.
- Remove if () test which duplicated the loop ending condition.
- Miscellaneous style fixes.

Reviewed by:	alc, markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-06-15 14:34:33 +00:00
Gleb Smirnoff
77e1943785 When we are in UMA_STARTUP use startup_alloc() for any zone, not for
internal zones only.  This allows to create new zones at early stages
of boot, without need to mark them as internal to UMA, which isn't
always true.

Reviewed by:	alc
2017-06-08 21:33:19 +00:00
John Baldwin
4bd7e351f1 Fix an off-by-one error in the VM page array on some systems.
r31386 changed how the size of the VM page array was calculated to be
less wasteful.  For most systems, the amount of memory is divided by
the overhead required by each page (a page of data plus a struct vm_page)
to determine the maximum number of available pages.  However, if the
remainder for the first non-available page was at least a page of data
(so that the only memory missing was a struct vm_page), this last page
was left in phys_avail[] but was not allocated an entry in the VM page
array.  Handle this case by explicitly excluding the page from
phys_avail[].

Reviewed by:	alc
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D11000
2017-06-08 16:18:41 +00:00
Alan Cox
761097c85e Starting in r118390, swaponsomething() began to reserve the blocks at the
beginning of a swap area for a disk label.  However, neither r118390 nor
r118544, which increased the reservation from one to two blocks, correctly
accounted for these blocks when updating the variable "swap_pager_avail".
This change corrects that error.

Reviewed by:	kib
MFC after:	5 days
2017-06-06 16:52:07 +00:00
Alan Cox
03bdd65f18 When the function blist_fill() was added to the kernel in r107913, the swap
pager used a different scheme for striping the allocation of swap space
across multiple devices.  And, although blist_fill() was intended to support
fill operations with large counts, the old striping scheme never performed a
fill larger than the stripe size.  Consequently, the misplacement of a
sanity check in blst_meta_fill() went undetected.  Now, moving forward in
time to r118390, a new scheme for striping was introduced that maintained a
blist allocator per device, but as noted in r318995, swapoff_one() was not
fully and correctly converted to the new scheme.  This change completes what
was started in r318995 by fixing the underlying bug in blst_meta_fill() that
stops swapoff_one() from simply performing a single blist_fill() operation.

Reviewed by:	kib
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11043
2017-06-06 03:32:17 +00:00
Alan Cox
3e78e98337 The variable "breakout" is used like a Boolean, so actually define it as
one.

Reviewed by:	kib
MFC after:	5 days
2017-06-05 18:07:56 +00:00
Alan Cox
064650c180 Halve the memory being internally allocated by the blist allocator. In
short, half of the memory that is allocated to implement the radix tree is
wasted because we did not change "u_daddr_t" to be a 64-bit unsigned int
when we changed "daddr_t" to be a 64-bit (signed) int.  (See r96849 and
r96851.)

Reviewed by:	kib, markj
Tested by:	pho
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D11028
2017-06-05 17:14:16 +00:00
Gleb Smirnoff
1431a74845 As old prophecy says, some day UMA_DEBUG printfs shall be made CTRs. 2017-06-01 18:36:52 +00:00
Gleb Smirnoff
ac0a6fd015 Simplify boot pages management in UMA.
It is simply a contigous virtual memory pointer and number of pages.
There is no need to build a linked list here.  Just increment pointer
and decrement counter.  The only functional difference to old allocator
is that before we gave pages from topmost and down to lowest, and now
we give them in normal ascending order.

While here remove padalign from a mutex that is unused at runtime.

Reviewed by:	alc
2017-06-01 18:26:57 +00:00
Alan Cox
07c348ea7b After r118390, the variable "dmmax" was neither the correct strip size
nor the correct maximum block size.  Moreover, after r318995, it serves
no purpose except to provide information to user space through a read-
sysctl.

This change eliminates the variable "dmmax" but retains the sysctl.  It
also corrects the value returned by the sysctl.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 21:46:00 +00:00
Alan Cox
fe71561af2 In r118390, the swap pager's approach to striping swap allocation over
multiple devices was changed.  However, swapoff_one() was not fully and
correctly converted.  In particular, with r118390's introduction of a per-
device blist, the maximum swap block size, "dmmax", became irrelevant to
swapoff_one()'s operation.  Moreover, swapoff_one() was performing out-of-
range operations on the per-device blist that were silently ignored by
blist_fill().

This change corrects both of these problems with swapoff_one(), which will
allow us to potentially increase MAX_PAGEOUT_CLUSTER.  Previously,
swapoff_one() would panic inside of blist_fill() if you increased
MAX_PAGEOUT_CLUSTER.

Reviewed by:	kib, markj
MFC after:	3 days
2017-05-27 16:40:00 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Konstantin Belousov
be10b9d5d7 Emulate pre-r317061 ABI.
This restores 32bit-sized accesses to vmcnt sysctls, making old
binaries like top(1), systat(8) and reboot(8) mostly functional on
newer kernel.

Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2017-05-02 18:40:41 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Mark Johnston
e1cb9d3747 Busy the map in vm_map_protect().
We are otherwise susceptible to a race with a concurrent vm_map_wire(),
which may drop the map lock to fault pages into the object chain. In
particular, vm_map_protect() will only copy newly writable wired pages
into the top-level object when MAP_ENTRY_USER_WIRED is set, but
vm_map_wire() only sets this flag after its fault loop. We may thus end
up with a writable wired entry whose top-level object does not contain the
entire range of pages.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 21:01:42 +00:00
Mark Johnston
1c2d20a1d0 Consistently use for-loops in vm_map_protect().
No functional change.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision:	https://reviews.freebsd.org/D10349
2017-04-10 20:57:16 +00:00
Mark Johnston
ed11e4d701 Add some bounds assertions to the vm_map_entry clip functions.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
X-Differential Revision: https://reviews.freebsd.org/D10349
2017-04-10 20:55:42 +00:00
Konstantin Belousov
65b9599a76 Extract calculation of ioflags from the vm_pager_putpages flags into a
helper.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:56:04 +00:00
Konstantin Belousov
3dbb0ca646 Some style fixes for vnode_pager_generic_putpages(), in the local
declaration block.

Reviewed by:	markj (as part of the larger patch)
Tested by:	pho (as part of the larger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:45:00 +00:00
Konstantin Belousov
53b6404819 Use int instead of boolean_t for flags argument type in
vnode_pager_generic_putpages() prototype; change the argument name to
reflect that it is flags.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:30:41 +00:00
John Baldwin
a5a355788e Assert that the align parameter to uma_zcreate() is valid.
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	DARPA / AFRL
Differential Revision:	https://reviews.freebsd.org/D10100
2017-04-04 16:26:46 +00:00
Dmitry Chagin
46dc8e9d6a Add kern_mincore() helper for micore() syscall.
Suggested by:	kib@
Reviewed by:	kib@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D10143
2017-03-30 19:42:49 +00:00
Alan Cox
8956418832 Two changes to vm_fault_populate():
Simplify the logic for clipping the range returned by the pager to fit
within the map entry.

Use atop() rather than OFF_TO_IDX() on addresses.

Reviewed by:	kib
MFC after:	1 week
2017-03-19 19:52:47 +00:00
Konstantin Belousov
bc27810671 Fix off-by-one in the vm_fault_populate() code.
When re-calculating the last inclusive page index after the pager
call, -1 was erronously ommitted.  If the pager extended the run
(unlikely), the result would be insertion of the valid page mapping
outside the current map entry range.

Found by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-19 14:42:16 +00:00
Xin LI
83d37aaf01 The adj_free and max_free values of new_entry will be calculated and
assigned by subsequent vm_map_entry_link(), therefore, remove the
pointless copying.

Submitted by:	alc
MFC after:	3 days
2017-03-16 05:44:16 +00:00
Alan Cox
52d1addaa1 Relax the locking requirements for vm_object_page_noreuse(). While
reviewing all uses of OFF_TO_IDX(), I observed that
vm_object_page_noreuse() is requiring an exclusive lock on the object
when, in fact, a shared lock suffices.

Reviewed by:	kib, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D10011
2017-03-15 17:43:45 +00:00
Konstantin Belousov
d1780e8dac Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-03-14 19:39:17 +00:00
Xin LI
78d7964b46 Implement INHERIT_ZERO for minherit(2).
INHERIT_ZERO is an OpenBSD feature.

When a page is marked as such, it would be zeroed
upon fork().

This would be used in new arc4random(3) functions.

PR:	182610
Reviewed by:	kib (earlier version)
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D427
2017-03-14 17:10:42 +00:00
Mark Johnston
e20ff1a4d4 Update a comment to reflect reality.
MFC after:	1 week
2017-03-13 18:45:25 +00:00
Konstantin Belousov
2b6d1a639b Follow-up to r313690.
Fix two missed places where vm_object offset to index calculation
should use unsigned shift, to allow handling of full range of unsigned
offsets used to create device mappings.

Reported and tested by:	royger (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-03-12 13:53:13 +00:00
Andriy Gapon
57223e9994 uma: fix pages <-> items conversions at several places
Those places were not taking into account uk_ppera.
At present one allocation is always used by one slab, so uk_ppera must
be used to convert between pages and slabs.
uk_ipers is used to convert between slabs and items.

MFC after:	1 month (if ever)
2017-03-11 16:43:38 +00:00
Andriy Gapon
a55ebb7cd5 uma: eliminate uk_slabsize field
The field was not used beyond the initial keg setup stage anyway.

MFC after:	1 month (if ever)
2017-03-11 16:35:36 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Andriy Gapon
9b43bc27c4 call vm_lowmem hook in uma_reclaim_worker
A comment near kmem_reclaim() implies that we already did that.
Calling the hook is useful, because some handlers, e.g. ARC,
might be able to release significant amounts of KVA.

Now that we have more than one place where vm_lowmem hook is called,
use this change as an opportunity to introduce flags that describe
a reason for calling the hook.  No handler makes use of the flags yet.

Reviewed by:	markj, kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9764
2017-02-25 16:39:21 +00:00
Konstantin Belousov
63cdcaaead Properly handle possible underflow in vm_fault_prefault().
In vm_fault_prefault(), if backward count causes underflow in
calculation of
	starta = addra - backward * PAGE_SIZE;
then starta must be clipped to entry->start, instead of zero.
Clipping to zero allowed mapping outside of the map entries address
ranges, in particular, map at zero.

Submitted by:	Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by:	alc
MFC after:	1 week
2017-02-24 08:09:16 +00:00
Andriy Gapon
937c1b0757 try to fix RACCT_RSS accounting
There could be a race between the vm daemon setting RACCT_RSS based on
the vm space and vmspace_exit (called from exit1) resetting RACCT_RSS to
zero.  In that case we can get a zombie process with non-zero RACCT_RSS.
If the process is jailed, that may break accounting for the jail.
There could be other consequences.

Fix this race in the vm daemon by updating RACCT_RSS only when a process
is in the normal state.  Also, make accounting a little bit more
accurate by refreshing the page resident count after calling
vm_pageout_map_deactivate_pages().
Finally, add an assert that the RSS is zero when a process is reaped.

PR:		210315
Reviewed by:	trasz
Differential Revision: https://reviews.freebsd.org/D9464
2017-02-14 13:54:05 +00:00
Bjoern A. Zeeb
05d58177e8 Use %s __func__ to print the actual function name (been looking at
the wrong one for too often lately at first), and also use %#lx to
get the 0x prefix for the address.

MFC after:	1 week
2017-02-14 01:20:03 +00:00
Konstantin Belousov
496ab0532d Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
Konstantin Belousov
04e89ffba8 Remove MPSAFE and ARGUSED annotations, ANSI-fy syscall handlers.
Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-02-13 00:40:55 +00:00
Konstantin Belousov
987ff18184 Consistently handle negative or wrapping offsets in the mmap(2) syscalls.
For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate.  At the maping time,
check that offset is not negative.  Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment.  Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver.  Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by:	royger (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-12 21:05:44 +00:00
Konstantin Belousov
1c2ad3e962 Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.
This makes the code to pass whole word of the mmap(2) syscall argument
prot to the syscall helper kern_vm_mmap(), which can validate all
bits.  The change provides temporal fix for sys/vm/mmap_test
mmap__bad_arguments, which was broken after r313352.

PR:	216976
Reported and tested by:	ngie
Sponsored by:	The FreeBSD Foundation
2017-02-11 20:27:39 +00:00
Edward Tomasz Napierala
69cdfcef2e Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by:	ed, dchagin, kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9378
2017-02-06 20:57:12 +00:00
Konstantin Belousov
5fca242374 Style, use tab after #define.
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-02-04 19:16:19 +00:00
Alan Cox
8a99f1cc59 Over the years, the code and comments in vm_page_startup() have diverged in
one respect.  When determining how many page structures to allocate,
contrary to what the comments say, the code does not account for the
overhead of a page structure per page of physical memory.  This revision
changes the code to match the comments.

Reviewed by:	kib, markj
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D9081
2017-02-04 05:23:10 +00:00