When allocating a new bucket for a bucket zone, never take it from the
zone itself, since that will almost certainly fail. Take the next bigger
zone instead.
This situation should not happen with the original bucket zone
configuration: the "32 Bucket" zone uses "64 Bucket" and vice versa. But
if the "64 Bucket" zone's lock is congested, the zone may grow its bucket
size and start biting itself.
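The selection can be sketched in plain C (names such as bucket_zones[]
and bucket_source() are illustrative stand-ins, not the actual UMA code):

    #include <stddef.h>

    #define NBUCKET_ZONES 4

    struct zone {
        const char *name;
        int bucket_items;           /* items per bucket */
    };

    static struct zone bucket_zones[NBUCKET_ZONES] = {
        { "32 Bucket", 32 }, { "64 Bucket", 64 },
        { "128 Bucket", 128 }, { "256 Bucket", 256 },
    };

    /* Pick the zone a new bucket for 'z' should come from. */
    static struct zone *
    bucket_source(struct zone *z)
    {
        int i;

        for (i = 0; i < NBUCKET_ZONES; i++) {
            if (&bucket_zones[i] != z)
                continue;
            /* 'z' is itself a bucket zone: never allocate from it;
             * take the next bigger bucket zone instead. */
            return (i + 1 < NBUCKET_ZONES ?
                &bucket_zones[i + 1] : NULL);
        }
        return (&bucket_zones[0]);  /* ordinary zone: pick by size */
    }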
With the new-and-improved vm_fault_copy_entry() (r265843), we can always
avoid soft page faults when adding write access to user-wired entries in
vm_map_protect(). Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write. In other words, we avoided the
page faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.
On fork, allow read-only wired pages to be copy-on-write shared between
the parent and child processes. Previously, we copied these pages even
though they are read-only. However, the reason for copying them is
historical and no longer exists. In recent times, vm_map_protect() has
developed the ability to copy pages when write access is added to wired
copy-on-write pages. So, in this case, copy-on-write sharing of wired
pages is nothing to fear; it will not lead to copy-on-write faults on
wired memory.
msync(2) must return ENOMEM, not EINVAL, when the address is outside the
allowed range or when one or more pages are not mapped. This is per The
Open Group Base Specifications Issue 7.
Sponsored by:	EMC / Isilon Storage Division
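A small userland demonstration of the specified behavior (mapping sizes
here are illustrative):

    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);

        if (p == MAP_FAILED)
            return (1);
        munmap(p, pagesz);

        /* The range is no longer mapped: POSIX requires ENOMEM. */
        if (msync(p, pagesz, MS_SYNC) == -1 && errno == ENOMEM)
            printf("msync on unmapped range fails with ENOMEM\n");
        return (0);
    }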
For the upgrade case in vm_fault_copy_entry(), when the entry does not
need COW and is writable, do not create a new backing object for the
entry.
MFC r265887:
Fix locking.
Prior to r254304, a separate function, vm_pageout_page_stats(), was used
to periodically update the reference status of the active pages. This
function was called, instead of vm_pageout_scan(), when memory was not
scarce. The objective was to provide up-to-date reference status for
active pages in case memory did become scarce and active pages needed to
be deactivated.
The active page queue scan performed by vm_pageout_page_stats() was
virtually identical to that performed by vm_pageout_scan(), and so r254304
eliminated vm_pageout_page_stats(). Instead, vm_pageout_scan() is
called with the parameter "pass" set to zero. The intention was that when
pass is zero, vm_pageout_scan() would only scan the active queue.
However, the variable page_shortage can still be greater than zero when
memory is not scarce and vm_pageout_scan() is called with pass equal to
zero. Consequently, the inactive queue may be scanned and dirty pages
laundered even though that was not intended by r254304. This revision
fixes that.
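Schematically, the fix makes the inactive queue scan conditional on an
actual shortage (a sketch with stand-in helpers, not the real
vm_pageout_scan()):

    /* Hypothetical helpers standing in for the real queue scans. */
    static void scan_inactive_queue(int shortage) { (void)shortage; }
    static void scan_active_queue(void) { }

    static void
    pageout_scan_sketch(int pass, int page_shortage)
    {
        /* Scan (and possibly launder from) the inactive queue only
         * on a real shortage pass. */
        if (pass > 0 && page_shortage > 0)
            scan_inactive_queue(page_shortage);

        /* A pass == 0 call merely refreshes the reference status of
         * the active pages. */
        scan_active_queue();
    }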
Correctly update the count of stuck pages, "addl_page_shortage", in
vm_pageout_scan(). There were missing increments in two less common
cases.
Don't conflate the count of stuck pages and the pageout deficit provided
by vm_page_alloc{,_contig}().
Handle held pages consistently in the inactive queue scan. In the more
common case, we did not move the page to the tail of the queue, whereas
in the less common case we did. There is no particular reason to move
the page in the less common case, so stop doing so.
Perform the calculation of the page shortage for the active queue scan a
little earlier, before the active queue lock is acquired. The correctness
of this calculation doesn't depend on the active queue lock being held.
Eliminate a redundant variable, "pcount". Use the more descriptive
variable, "maxscan", in its place.
Apply a few nearby style fixes, e.g., eliminate stray whitespace and
excess parentheses.
Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.
First, accesses to the direct map region must be checked against the
limit up to which the direct map is actually instantiated.
Second, for accesses to the kernel map, use a new thread-private flag,
TDP_DEVMEMIO, which instructs vm_fault() to return an error when a fault
happens on a MAP_ENTRY_NOFAULT entry, instead of panicking.
MFC r263498:
Add change forgotten in r263475. Make dmaplimit accessible outside
amd64/pmap.c.
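In outline, with hypothetical names (the real checks live in the amd64
/dev/mem handler and in vm_fault()):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Stand-ins for kernel state. */
    static uint64_t dmaplimit;      /* now visible outside amd64/pmap.c */
    static int curthread_pflags;
    #define TDP_DEVMEMIO 0x1

    /* First fix: a direct-map access must fall below dmaplimit. */
    static bool
    dmap_access_ok(uint64_t pa, size_t len)
    {
        return (pa + len >= pa && pa + len <= dmaplimit);
    }

    /* Second fix, schematically: flag the thread so that vm_fault()
     * fails gracefully on MAP_ENTRY_NOFAULT entries instead of
     * panicking while /dev/mem touches the kernel map. */
    static void
    kmem_access_sketch(void)
    {
        curthread_pflags |= TDP_DEVMEMIO;
        /* ... copy through the kernel map here ... */
        curthread_pflags &= ~TDP_DEVMEMIO;
    }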
Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones. If the zone size is not a multiple of PAGE_SIZE,
there may be enough space for the header in the last page, letting us
avoid the extra header memory allocation and hash table update/lookup.
ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240 and
14336 bytes). This change puts at least some of the otherwise lost
memory there to good use.
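The space check amounts to simple arithmetic (a sketch; the header
struct is a placeholder for the real slab header):

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096

    struct slab_hdr { void *links[2]; unsigned char freemap[8]; };

    /*
     * For a multi-page zone of 'size' bytes, check whether the tail
     * of the last page can hold the slab header, so that no separate
     * (offpage) header allocation and hash lookup is needed.
     */
    static bool
    header_fits_inline(size_t size)
    {
        size_t tail = size % PAGE_SIZE; /* 0 means exact multiple */

        if (tail == 0)
            return (false);             /* no slack in last page */
        return (PAGE_SIZE - tail >= sizeof(struct slab_hdr));
    }
    /* Example: a 10240-byte zone spans three pages and leaves
     * 3 * 4096 - 10240 = 2048 bytes of slack, plenty for a header. */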
Don't count bucket allocation failures for UMA zones as failures of the
zones themselves. There are good reasons for them to happen, such as
recursion prevention, and they are not fatal, since buckets are just an
optimization mechanism. Real bucket allocation failures are counted by
the bucket zones themselves anyway, so double accounting is unnecessary.
Implement a mechanism to safely but slowly purge UMA per-CPU caches.
This is a last resort for very low memory conditions, in case other
measures to free memory were ineffective. Sequentially cycle through all
CPUs and extract the per-CPU cache buckets into the zone cache, from
where they can be freed.
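The purge can be sketched as follows (stand-in names; the real code uses
UMA's internal cache structures and CPU binding primitives):

    struct zone { int lock; };

    /* Trivial stubs standing in for kernel primitives. */
    static void bind_to_cpu(int cpu) { (void)cpu; }
    static void unbind(void) { }
    static void zone_lock(struct zone *z) { z->lock = 1; }
    static void zone_unlock(struct zone *z) { z->lock = 0; }
    static void move_cpu_buckets_to_zone(struct zone *z, int cpu)
    { (void)z; (void)cpu; }

    static const int ncpus = 4;     /* illustrative */

    /* Last-resort purge: visit each CPU in turn and drain its private
     * buckets into the zone cache, where they can then be freed. */
    static void
    zone_drain_percpu_sketch(struct zone *z)
    {
        int cpu;

        for (cpu = 0; cpu < ncpus; cpu++) {
            bind_to_cpu(cpu);       /* run on that CPU only */
            zone_lock(z);
            move_cpu_buckets_to_zone(z, cpu);
            zone_unlock(z);
        }
        unbind();
    }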
Grow the UMA zone bucket size on lock congestion during item free as
well. Lock congestion is the same whether it happens on alloc or free,
so handle both equally. Now that we have back pressure, there is no
problem with growing buckets a bit faster. In any case, growth is much
slower than in 9.x.
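A sketch of the congestion heuristic (illustrative names; the real code
adjusts the zone's target bucket size on both paths):

    #include <stdbool.h>

    #define BUCKET_MAX 256          /* illustrative cap */

    struct zone { bool locked; int bucket_size; };

    static bool
    zone_trylock(struct zone *z)
    {
        if (z->locked)
            return (false);
        z->locked = true;
        return (true);
    }

    static void zone_lock_wait(struct zone *z) { z->locked = true; }

    /* On both alloc and free: failing the trylock means the lock is
     * congested, so grow the bucket size before waiting for it. */
    static void
    zone_lock_adaptive(struct zone *z)
    {
        if (!zone_trylock(z)) {
            if (z->bucket_size < BUCKET_MAX)
                z->bucket_size++;
            zone_lock_wait(z);
        }
    }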
Add two new UMA bucket zones to store 3 and 9 items per bucket.
These new buckets make bucket size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items. While at
bigger sizes a difference of about 2x between steps is fine, at the
smallest sizes the jumps are 5x and 2.6x. The new buckets change that
series to 1, 3, 5, 9, 13, 29, reducing the jumps between steps, making
the algorithm work more gently, and allocating and freeing memory in
better-fitting chunks. Otherwise there is quite a big gap between
allocating 128K and 5x128K of RAM at once.
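The step sizes before and after, as data:

    /* Items per bucket at the small end.  Adjacent ratios before:
     * 5/1 = 5.0x and 13/5 = 2.6x.  After: 3.0x, 1.7x, 1.8x and 1.4x,
     * so self-tuning moves in much finer steps. */
    static const int bucket_items_old[] = { 1, 5, 13, 29 };
    static const int bucket_items_new[] = { 1, 3, 5, 9, 13, 29 };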
Implement soft pressure on UMA cache bucket sizes.
Every time the system detects a low memory condition, decrease the bucket
size of each zone by one item. As a result, higher memory pressure pushes
toward smaller bucket sizes, hence smaller per-CPU caches and more
efficient memory use.
Before this change there was no force opposing bucket growth driven by
the practically inevitable zone lock conflicts, and after some run time
the per-CPU caches could consume enough RAM to kill the system.
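The pressure step itself is tiny (a sketch):

    struct zone { int bucket_size; };

    /* Called for every zone on each low-memory event: take one item
     * off the target bucket size, never going below a single item. */
    static void
    zone_bucket_pressure(struct zone *z)
    {
        if (z->bucket_size > 1)
            z->bucket_size--;
    }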
Fix a bug introduced in r252226, where the udata argument passed to
bucket_alloc() was used without first making sure that it was really
meant for us.
On some of my systems this bug made the user argument passed by ZFS code
to uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those
zones.
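The shape of the fix, schematically (the flag name is hypothetical):

    #include <stddef.h>

    #define ZFLAG_BUCKET 0x1    /* hypothetical "udata is ours" marker */

    struct zone { int flags; };

    /* Interpret udata only when the bucket layer itself supplied it;
     * otherwise it is an opaque caller argument (e.g. from ZFS) and
     * must be ignored here. */
    static void *
    bucket_alloc_sketch(struct zone *z, void *udata)
    {
        if ((z->flags & ZFLAG_BUCKET) == 0)
            udata = NULL;
        /* ... proceed, consulting udata only if non-NULL ... */
        return (udata);
    }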
In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as
argument, cast the incoming 0 argument to void *, to silence a warning
from clang 3.4 ("expression which evaluates to zero treated as a null
pointer constant of type 'void *' [-Wnon-literal-null-conversion]").
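That is, schematically:

    /* Stand-in for the real worker. */
    static void vm_pageout_worker(void *arg) { (void)arg; }

    static void
    start_pageout_sketch(void)
    {
        /* Passing a plain 0 where void * is expected trips clang
         * 3.4's -Wnon-literal-null-conversion; the explicit cast
         * silences it without changing behavior. */
        vm_pageout_worker((void *)0);
    }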
Avoid overflow for the page counts.
MFC r258365:
Revert to using int for the page counts.
Rearrange the checks to correctly handle overflowing address arithmetic.
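The usual pattern for such a rearrangement (a generic sketch, not the
exact committed code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Validate [addr, addr + len) against a limit without computing
     * addr + len, which can wrap around. */
    static bool
    range_ok(uintptr_t addr, size_t len, uintptr_t limit)
    {
        if (addr > limit)
            return (false);
        return (len <= (size_t)(limit - addr)); /* cannot overflow */
    }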
In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.
Suggested by: adrian
Approved by: re (gjb)
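Roughly, the diagnostic now reads (field names follow the UMA
convention; the exact wording may differ):

    #include <stdio.h>

    struct keg { const char *uk_name; int uk_free; int uk_pages; };

    static void
    keg_dtor_report(struct keg *keg)
    {
        if (keg->uk_free != 0)
            printf("Freed UMA keg (%s) was not empty (%d items). "
                "Lost %d pages of memory.\n",
                keg->uk_name, keg->uk_free, keg->uk_pages);
    }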
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
default pmap initialization routine 'pmap_pinit()'.
These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.
Reviewed by: kib@, alc@
Approved by: re (glebius)
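The shape of the new interface (types simplified; ept_pinit is a
hypothetical bhyve-style override):

    struct pmap;
    struct vmspace;

    /* A pmap initialization routine; NULL selects the default
     * pmap_pinit(). */
    typedef int (*pmap_pinit_t)(struct pmap *pmap);

    struct vmspace *vmspace_alloc(unsigned long min, unsigned long max,
        pmap_pinit_t pinit);

    /* A hypervisor can then supply its own routine for nested page
     * tables, e.g.: vmspace_alloc(0, limit, ept_pinit); */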
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8. Now, that call no
longer exists.
Approved by: re (kib)
Sponsored by: EMC / Isilon Storage Division
exhausted.
- Add a new protect(1) command that can be used to set or revoke
protection from arbitrary processes. Similar to ktrace, it can apply a
change to all existing descendants of a process as well as future
descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes (see the usage sketch after this entry).
MADV_PROTECT still works for backwards compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.
Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month
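A minimal usage sketch for the new interface (the flag combination is
illustrative, and setting protection requires privilege):

    #include <sys/procctl.h>
    #include <sys/wait.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* Protect this process and its existing descendants from
         * being killed under memory pressure; PPROT_INHERIT asks for
         * the new p_flag2-backed inheritance by future children. */
        int flags = PPROT_SET | PPROT_DESCEND | PPROT_INHERIT;

        if (procctl(P_PID, getpid(), PROC_SPROTECT, &flags) == -1) {
            perror("procctl");
            return (1);
        }
        return (0);
    }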
and the equivalent functionality is now provided by sendfile(2) over a
POSIX shared memory file descriptor.
Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.
Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)