3388 Commits

Author SHA1 Message Date
alc
ab793415d0 MFC r265418
Prior to r254304, a separate function, vm_pageout_page_stats(), was used
  to periodically update the reference status of the active pages.  This
  function was called, instead of vm_pageout_scan(), when memory was not
  scarce.  The objective was to provide up to date reference status for
  active pages in case memory did become scarce and active pages needed to
  be deactivated.

  The active page queue scan performed by vm_pageout_page_stats() was
  virtually identical to that performed by vm_pageout_scan(), and so r254304
  eliminated vm_pageout_page_stats().  Instead, vm_pageout_scan() is
  called with the parameter "pass" set to zero.  The intention was that when
  pass is zero, vm_pageout_scan() would only scan the active queue.
  However, the variable page_shortage can still be greater than zero when
  memory is not scarce and vm_pageout_scan() is called with pass equal to
  zero.  Consequently, the inactive queue may be scanned and dirty pages
  laundered even though that was not intended by r254304.  This revision
  fixes that.
2014-05-13 05:26:43 +00:00
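  The shape of the fix is roughly the following (a sketch of the idea, not
  the verbatim diff; vm_paging_target() and deficit are the names used by
  vm_pageout_scan() in this era):

	/*
	 * Only compute an inactive-queue target on a real pageout pass.
	 * With pass == 0 the shortage stays zero, so the inactive queue
	 * is not scanned and no dirty pages are laundered.
	 */
	if (pass > 0)
		page_shortage = vm_paging_target() + deficit;
	else
		page_shortage = 0;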
alc
498371bb43 MFC r260567
Correctly update the count of stuck pages, "addl_page_shortage", in
  vm_pageout_scan().  There were missing increments in two less common
  cases.

  Don't conflate the count of stuck pages and the pageout deficit provided
  by vm_page_alloc{,_contig}().

  Handle held pages consistently in the inactive queue scan.  In the more
  common case, we did not move the page to the tail of the queue.  Whereas,
  in the less common case, we did.  There's no particular reason to move
  the page in the less common case, so remove it.

  Perform the calculation of the page shortage for the active queue scan a
  little earlier, before the active queue lock is acquired.  The correctness
  of this calculation doesn't depend on the active queue lock being held.

  Eliminate a redundant variable, "pcount".  Use the more descriptive
  variable, "maxscan", in its place.

  Apply a few nearby style fixes, e.g., eliminate stray whitespace and
  excess parentheses.
2014-05-13 05:21:54 +00:00
des
7885a006b9 MFH (r264966): add sysctl OIDs for actual swap zone size and capacity 2014-05-12 20:48:04 +00:00
kib
68fab25080 MFC r265100:
Fix the comparison for the end of range in vm_phys_fictitious_reg_range().
2014-05-06 12:20:07 +00:00
kib
839393d336 MFC r265002:
Fix vm_fault_copy_entry() operation on upgrade; allow it to find the
pages in the shadowed objects.
2014-05-04 07:19:37 +00:00
kib
ef58943ab3 MFC r263475:
Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, accesses to the direct map region should be checked against the
limit up to which the direct map is instantiated.

Second, for accesses to the kernel map, use a new thread-private flag,
TDP_DEVMEMIO, which instructs vm_fault() to return an error when a fault
happens on a MAP_ENTRY_NOFAULT entry, instead of panicking.

MFC r263498:
Add change forgotten in r263475.  Make dmaplimit accessible outside
amd64/pmap.c.
2014-03-28 15:38:38 +00:00
kib
b020ab10d3 MFC r263471:
Initialize the vm_map_entry member wiring_thread on map entry creation.
2014-03-24 12:40:53 +00:00
kib
2a9993c246 MFC r263095:
Initialize paddr to handle the case of zero size.
2014-03-19 13:09:17 +00:00
kib
741b07ba7d MFC r263092:
Do not vdrop() the tmpfs vnode until it is unlocked.  The hold
reference might be the last, and then vdrop() would free the vnode.
2014-03-19 13:04:16 +00:00
glebius
8a9528c4d0 Merge r261722, r261723, r261724, r261725 from head:
several minor improvements for UMA_ZPCPU_ZONE zones.
2014-03-04 14:46:30 +00:00
glebius
322a3c94d3 Merge 261593 from head:
Provide macros that make it easy to export uma(9) zone limits and
  current usage via sysctl(9).
2014-03-04 14:21:07 +00:00
attilio
cf0fa484f9 MFC r261867:
Use the right index to free swapspace after vm_page_rename().
2014-02-21 09:43:34 +00:00
dim
28bc8939f8 MFC r261896:
After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by:	glebius
2014-02-17 20:25:17 +00:00
marcel
60764eb6dd MFC r259908:
For ia64, use pmap_remove_pages() and not pmap_remove().
2014-02-16 20:54:26 +00:00
mav
b8172b7691 MFC r258716:
- Add a bucket size column to the `show uma` DDB command.
 - Add a `show umacache` command to show similar stats for cache-only UMA zones.
2014-01-04 23:43:18 +00:00
mav
00fb1dac34 MFC r258693:
Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones.  If the zone size is not a multiple of PAGE_SIZE, there
may be enough space for the header in the last page, so we may avoid the
extra header memory allocation and hash table update/lookup.

ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336).
This change puts at least some of the otherwise lost memory there to good use.
2014-01-04 23:42:24 +00:00
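To make the saving concrete (assuming 4 KB pages): a 5120-byte zone backed by
a two-page (8192-byte) slab holds one item and leaves 8192 - 5120 = 3072 bytes
free, which is ample room for an inline slab header.  Before this change the
header would have been allocated offpage anyway, costing extra memory plus a
hash table insert and lookup per slab.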
mav
91cfd3a7cc MFC r258691:
Don't count bucket allocation failures for UMA zones as their own failures.
There are good reasons for these to happen, such as recursion prevention,
etc., and they are not fatal since buckets are just an optimization mechanism.
Real bucket allocation failures are in any case counted by the bucket zones
themselves, and we don't need double accounting there.
2014-01-04 23:40:47 +00:00
mav
3ff6064c46 MFC r258340, r258497:
Implement a mechanism to safely but slowly purge UMA per-CPU caches.

This is a last resort for very low memory conditions, in case other measures
to free memory were ineffective.  Sequentially cycle through all CPUs and
extract per-CPU cache buckets into the zone cache, from where they can be freed.
2014-01-04 23:39:39 +00:00
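In outline, the purge works like this (a hedged kernel fragment:
cache_drain_cpu() is a hypothetical name for the per-CPU extraction step,
while CPU_FOREACH() and sched_bind()/sched_unbind() are stock primitives):

	int cpu;

	CPU_FOREACH(cpu) {
		thread_lock(curthread);
		sched_bind(curthread, cpu);	/* migrate to the target CPU */
		thread_unlock(curthread);
		cache_drain_cpu(zone);		/* hypothetical: push this CPU's
						 * buckets into the zone cache */
	}
	thread_lock(curthread);
	sched_unbind(curthread);
	thread_unlock(curthread);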
mav
4f2347ab59 MFC r258338:
Grow UMA zone bucket size also on lock congestion during item free.

Lock congestion is the same whether it happens on alloc or free, so
handle it equally.  Now that we have back pressure, there is no problem
with growing buckets a bit faster.  In any case, growth is much slower than in 9.x.
2014-01-04 23:38:06 +00:00
mav
363e273d8f MFC r258337:
Add two new UMA bucket zones to store 3 and 9 items per bucket.

These new buckets make bucket size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items.  While at
bigger sizes a difference of about 2x is fine, at the smallest ones it is 5x
and 2.6x respectively.  The new buckets make that line look like 1, 3, 5, 9,
13, 29, reducing the jumps between steps, making the algorithm work softer,
and allocating and freeing memory in better-fitting chunks.  Otherwise there
is quite a big gap between allocating 128K and 5x128K of RAM at once.
2014-01-04 23:37:01 +00:00
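Working out the steps from the sizes above: the old line 1, 5, 13, 29 jumps
by 5.0x, 2.6x, and 2.2x, while the new line 1, 3, 5, 9, 13, 29 jumps by 3.0x,
1.7x, 1.8x, 1.4x, and 2.2x, so the worst small-size step drops from 5x to 3x.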
mav
a5fd15da70 MFC r258336:
Implement soft pressure on UMA cache bucket sizes.

Every time the system detects a low memory condition, decrease the bucket
size for each zone by one item.  As a result, higher memory pressure will
push toward smaller bucket sizes, and so smaller per-CPU caches and more
efficient memory use.

Before this change there was no force to oppose bucket growth resulting from
practically inevitable zone lock contention, and after some run time per-CPU
caches could consume enough RAM to kill the system.
2014-01-04 23:35:34 +00:00
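The two opposing forces can be illustrated with a standalone sketch (not the
UMA code itself; the structure and function names are hypothetical):

	/* Hypothetical illustration of the policy, not UMA internals. */
	struct zone_cache {
		int	bucket_size;	/* current per-CPU bucket item cap */
		int	bucket_max;	/* hard upper bound for this zone */
	};

	static void
	on_lock_contention(struct zone_cache *z)
	{
		if (z->bucket_size < z->bucket_max)
			z->bucket_size++;	/* growth pressure from contention */
	}

	static void
	on_low_memory(struct zone_cache *z)
	{
		if (z->bucket_size > 0)
			z->bucket_size--;	/* soft counter-pressure, one item at a time */
	}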
glebius
a19f1f1902 Merge r258690 by mav from head:
Fix a bug introduced in r252226, where the udata argument passed to bucket_alloc()
  was used without first making sure that it was really passed for us.

  On some of my systems this bug made the user argument passed by ZFS code to
  uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.
2014-01-04 19:51:57 +00:00
kib
cc03f8b1e5 MFC r259951:
Do not coalesce stack entries.  Pass the MAP_STACK_GROWS_DOWN and
MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack().
2013-12-30 08:57:54 +00:00
dim
ecbf461bca MFC r259893:
In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as
argument, cast the incoming 0 argument to void *, to silence a warning
from clang 3.4 ("expression which evaluates to zero treated as a null
pointer constant of type 'void *' [-Wnon-literal-null-conversion]").
2013-12-28 02:07:29 +00:00
kib
758e3a9934 MFC r258039:
Avoid overflow for the page counts.

MFC r258365:
Revert back to use int for the page counts.
Rearrange the checks to correctly handle overflowing address arithmetic.
2013-12-17 09:21:56 +00:00
kib
37e02e5c7a MFC r258367:
Check for zero-length requests and treat them as always successful,
without performing any action on the address space.
2013-12-13 06:28:18 +00:00
kib
acce26c3d9 MFC r258366:
Add assertions to cover all places in the wiring and unwiring code
where MAP_ENTRY_IN_TRANSITION is set or cleared.
2013-12-13 06:25:08 +00:00
kib
f79abc87df MFC r257899:
If a filesystem declares that it supports shared locking for writes, use a
shared vnode lock for VOP_PUTPAGES() as well.
2013-12-13 06:12:21 +00:00
rodrigc
84898ef06b MFC r258737
In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console.  This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian
Approved by: re (gjb)
2013-12-04 07:46:53 +00:00
kib
faa83dc918 MFC r257680:
Do not coalesce if the swap object belongs to a tmpfs vnode.

Approved by:	re (glebius)
2013-11-12 08:01:58 +00:00
alc
83e71fe4f7 Tidy up the output of "sysctl vm.phys_free".
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-10-10 16:11:45 +00:00
alc
6e2676ddc1 Both the vm_map and vmspace zones are defined as "no free". So, there is no
point in defining a fini function for these zones.

Reviewed by:	kib
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-09-22 17:48:10 +00:00
neel
44c4dbefdb Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
  default pmap initialization routine 'pmap_pinit()'.

These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.

Reviewed by:	kib@, alc@
Approved by:	re (glebius)
2013-09-20 17:06:49 +00:00
alc
88a4d0f31a The pmap function pmap_clear_reference() is no longer used. Remove it.
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8.  Now, that call no
longer exists.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon Storage Division
2013-09-20 04:30:18 +00:00
jhb
d3ef75b6c7 Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
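A minimal userland sketch of the new interface (usage assembled from the
description above, not taken from the commit itself):

	#include <sys/procctl.h>
	#include <sys/wait.h>
	#include <err.h>
	#include <unistd.h>

	int
	main(void)
	{
		/*
		 * Protect this process from being killed on swap exhaustion,
		 * and have future children inherit the protection.
		 */
		int flags = PPROT_SET | PPROT_INHERIT;

		if (procctl(P_PID, getpid(), PROC_SPROTECT, &flags) == -1)
			err(1, "procctl");
		return (0);
	}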
kib
8ca067efb2 PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store a pointer to the slab.  Remove it.

Reviewed by:    alc
Sponsored by:   The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-17 07:35:26 +00:00
kib
6796656333 Remove the zero-copy sockets code.  It only worked for anonymous
memory, and the equivalent functionality is now provided by sendfile(2) over
a POSIX shared memory file descriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
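The replacement path looks roughly like this in userland (a sketch, relying
on the sendfile(2)-over-shared-memory support this message refers to):

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	/* Send len bytes through socket s from anonymous shared memory. */
	static int
	send_shm(int s, const char *data, size_t len)
	{
		int error, fd;
		void *p;

		fd = shm_open(SHM_ANON, O_RDWR, 0600);
		if (fd == -1 || ftruncate(fd, (off_t)len) == -1)
			return (-1);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return (-1);
		memcpy(p, data, len);
		error = sendfile(fd, s, 0, len, NULL, NULL, 0);
		munmap(p, len);
		close(fd);
		return (error);
	}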
kib
889b9d0e0b If the last page of the file is partially full and the whole valid
portion is invalidated, invalidate the whole page.  Otherwise, a
partially valid page appears on a page queue, which is wrong.  This
could only happen for the last page, because only then could the buffer
which triggered the invalidation fail to cover the whole page.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-14 10:11:38 +00:00
jhb
3c31e1fb75 Fix an off-by-one error when populating mincore(2) entries for
skipped entries.  lastvecindex references the last valid byte,
so the new bytes should come after it.

Approved by:	re (kib)
MFC after:	1 week
2013-09-12 20:46:32 +00:00
jhb
04bb6e10cd Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
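For example, a hedged userland sketch of the new flag:

	#include <sys/mman.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* Ask for an address below 2GB, as on Linux. */
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE | MAP_32BIT, -1, 0);

		if (p == MAP_FAILED)
			return (1);
		printf("mapped at %p\n", p);	/* expected: below 0x80000000 */
		return (0);
	}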
kib
56cc686058 Drain the xbusy state in two places which potentially do
pmap_remove_all().  Not doing the drain allows pmap_enter() to proceed
in parallel, making the effects of pmap_remove_all() void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
kib
7ab18d4990 vm_page_trysbusy() should not fail when the shared busy counter
or the VPB_BIT_WAITERS flag was changed between the read of busy_lock and
the CAS.  vm_page_sbusy(), which is the only user of vm_page_trysbusy() in
the tree, panics on the failure, which in these cases is transient and does
not mean that the current page state prevents sbusying.

Retry the operation inside vm_page_trysbusy() if the CAS failed; only
return a failure when VPB_BIT_SHARED is cleared.

Reported and tested by:	pho
Reviewed by:	attilio
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:54:40 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t type represents capability rights.  We used to use one bit
to represent one right, but we are running out of spare bits.  The new
structure currently provides room for 114 rights (50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, and we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain the
total number of elements in the array minus 2.  This means that if those two
bits are equal to 0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in every array element contain the array index.  Only one
bit is set, and its position within this five-bit range defines the array
index.  This means there can be at most five array elements in the future.

To define a new right, the CAPRIGHT() macro must be used.  The macro takes
two arguments: an array index and a bit to set, e.g.:

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine several rights, but the rights have to
belong to the same array element, e.g.:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is a new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights are passed to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions as a comma-separated
list, e.g.:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, e.g.:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as the array index, we can assert in those functions
that no two rights belonging to different array elements are provided
together.  For example, this is illegal and will be detected, because
CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but it is not advised; it should only be used to define aliases.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
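Putting the new API to work in a userland sketch (cap_rights_limit(2) is the
companion syscall assumed here; on recent FreeBSD the header is
<sys/capsicum.h>):

	#include <sys/capsicum.h>
	#include <err.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		cap_rights_t rights;
		char buf[64];
		int fd;

		fd = open("/etc/passwd", O_RDONLY);
		if (fd == -1)
			err(1, "open");
		/* Limit the descriptor to reading and fstat(2) only. */
		if (cap_rights_limit(fd,
		    cap_rights_init(&rights, CAP_READ, CAP_FSTAT)) == -1)
			err(1, "cap_rights_limit");
		if (read(fd, buf, sizeof(buf)) == -1)
			err(1, "read");
		return (0);
	}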
mckusick
57ee6d3c5d Fix a bug introduced in the rewrite of keg_free_slab() in r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a callback from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by:   pho
2013-08-31 15:40:15 +00:00
alc
aa9a7bb9e6 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
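From userland the improvement is transparent; the same call simply gets
cheaper (a sketch):

	#include <sys/mman.h>
	#include <string.h>

	int
	main(void)
	{
		size_t len = 64 * 4096;
		char *p;

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_ANON | MAP_PRIVATE, -1, 0);
		if (p == MAP_FAILED)
			return (1);
		memset(p, 0xa5, len);
		/*
		 * One call covers the whole range; pmap_advise() lets the
		 * kernel walk it in a single pass instead of page by page.
		 */
		return (madvise(p, len, MADV_FREE) == -1);
	}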
glebius
088bcbe3ed Remove comment that is no longer relevant since r254182. 2013-08-26 14:14:25 +00:00
alc
1a535523cd Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop().  Detect this scenario so that the vnode's hold count is
correctly maintained.  Otherwise, we panic.

Reported by:	scottl
Tested by:	pho
Discussed with:	attilio, jeff, kib
2013-08-23 17:27:12 +00:00
kib
05a9dff802 Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone.  Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:12:24 +00:00
kib
ba12eedccd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
jeff
bef38f5afd - Eliminate the vm object lock from the active queue scan.  It is
   not necessary since we do not free or cache the page from the active queue
   anymore.  Document the one possible race that is harmless.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	alc
2013-08-21 22:39:19 +00:00