Commit Graph

3278 Commits

Author SHA1 Message Date
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check
whether there is an environment variable which should initialize the SYSCTL
during early boot. This works for all SYSCTL types, both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case where a tunable sysctl has a custom initialization
function, allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be factored out into a common
macro, since several device drivers use it. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter-loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
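
A hedged sketch of what this enables, using a hypothetical driver sysctl
(the name and values are illustrative, not from this commit):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /* With CTLFLAG_TUN extended as described, this single declaration
     * also fetches "hw.example_limit" from the kernel environment
     * (e.g. loader.conf) at early boot; no separate TUNABLE_INT()
     * statement is needed. */
    static int example_limit = 8;
    SYSCTL_INT(_hw, OID_AUTO, example_limit, CTLFLAG_RDTUN,
        &example_limit, 0, "Example limit, settable as a tunable");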

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection with removing the no longer
needed TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, because malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Alan Cox
60169c88d9 Delay the call to crhold() in vm_map_insert() until we know that we won't
have to undo it by calling crfree().  This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.

Eliminate an unnecessary variable from vm_map_insert().

Reviewed by:	kib
Tested by:	pho
2014-06-26 16:04:03 +00:00
Alan Cox
eaaf9f7fce Now that vm_map_insert() sets MAP_ENTRY_GROWS_{DOWN,UP} on the stack entries
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.

Reviewed by:	kib
2014-06-25 03:30:03 +00:00
Konstantin Belousov
b5f8c226ab Use correct names for the flags. MAP_ENTRY_GROWS_* have the same
numerical values as MAP_STACK_GROWS_*, but the former are for entries'
eflags, while the latter are for the cow argument of vm_map_insert().

Submitted by:	alc
2014-06-23 07:03:47 +00:00
Konstantin Belousov
5831f5fc52 Assert that the new entry is inserted into the right location in the
map entries list, and that it does not overlap with the previous and
next entries.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-20 07:01:53 +00:00
Alan Cox
39c18ce157 Eliminate a pointless call to vm_map_clip_start() from vm_map_growstack().
For this call to do anything at all we would have to have two overlapping
map entries.

Submitted by:	kib
2014-06-19 21:05:07 +00:00
Alan Cox
712efe66e2 When MAP_STACK_GROWS_{DOWN,UP} are passed to vm_map_insert() set the
corresponding flag(s) in the new map entry.  Previously, the caller was
responsible for setting them after vm_map_insert() returned.

Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.

Together these changes slightly simplify the caller's task when creating a
downward growing stack.  In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-19 16:26:16 +00:00
Konstantin Belousov
11c42bcc54 Add a MAP_EXCL flag for mmap(2). It should be combined with MAP_FIXED,
and prevents the request from deleting existing mappings in the
region; the request fails instead.
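
A userspace sketch (hypothetical helper; the errno for a conflicting
range is believed to be ENOMEM):

    #include <sys/mman.h>
    #include <err.h>

    /* Claim 'len' bytes exactly at 'addr', but fail instead of
     * silently replacing any mapping that already exists there. */
    static void *
    map_exclusive(void *addr, size_t len)
    {
        void *p;

        p = mmap(addr, len, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_FIXED | MAP_EXCL, -1, 0);
        if (p == MAP_FAILED)
            warn("mmap(MAP_FIXED | MAP_EXCL)");
        return (p);
    }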

Reviewed by:	alc
Discussed with:	jhb
Tested by:	markj, pho (previous version, as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-19 05:00:39 +00:00
Attilio Rao
3ae10f7477 - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue in which to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the "active" parameter is today.
This makes adding new page queues quicker.
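
A hypothetical call site sketching the new KPI (the queue choice is
illustrative):

    /* Unwire a managed page and put it directly on the inactive queue;
     * PQ_NONE would leave it on no queue at all. */
    vm_page_lock(m);
    vm_page_unwire(m, PQ_INACTIVE);
    vm_page_unlock(m);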

This change effectively modifies the KPI.  __FreeBSD_version will,
however, be bumped only when the full cache of free pages is
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
Alan Cox
33314db034 Tidy up the early parts of vm_map_insert(), in particular, simplify one
of the assertions and eliminate a comment that has grown stale.

Reviewed by:	kib
MFC after:	1 week
2014-06-16 16:37:41 +00:00
Alan Cox
e1f92ccc73 One of the intentions behind r267254 was that the global variable "sgrowsiz"
would be read once and cached in a local variable so that the resource limit
check and map entry insertion would be guaranteed to use the same value.
However, the value being passed to vm_map_insert() is still from "sgrowsiz"
and not the local variable.  Correct this oversight.

Reviewed by:	kib
2014-06-15 07:52:59 +00:00
Alexander Motin
1aa6c75827 Introduce a new "256 Bucket" zone to split requests and reduce congestion
on the "128 Bucket" zone lock.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:57:07 +00:00
Alexander Motin
20d3ab87cd When allocating a new bucket for a bucket zone, never take it from the
zone itself, since that will almost certainly fail.  Take the next bigger
zone instead.

This situation should not happen with the original bucket zone configuration:
the "32 Bucket" zone uses "64 Bucket" and vice versa.  But if the "64 Bucket"
zone lock is congested, the zone may grow its bucket size and start biting
itself.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 11:36:22 +00:00
Alan Cox
3180f7573a Correct a bug in the management of the population map on big-endian
machines.  Specifically, there was a mismatch between how the routine
allocation and deallocation operations accessed the population map
and how the aggressively optimized reservation-breaking operation
accessed it.  So, problems only occurred when reservations were broken.
This change makes the routine operations access the population map in
the same way as the reservation breaking operation.

This bug was introduced in r259999.

PR:		187080
Tested by:	jmg (on an "armeb" machine)
Sponsored by:	EMC / Isilon Storage Division
2014-06-11 16:11:12 +00:00
Konstantin Belousov
4648ba0a0f Make mmap(MAP_STACK) search for the available address space, similar
to !MAP_STACK mapping requests.  For MAP_STACK | MAP_FIXED, clear any
mappings which could previously exist in the used range.

For this, teach vm_map_find() and vm_map_fixed() to handle
MAP_STACK_GROWS_DOWN or _UP cow flags, by calling a new
vm_map_stack_locked() helper, which is factored out from
vm_map_stack().

A side effect of the change is that MAP_STACK now obeys the
MAP_ALIGNMENT and MAP_32BIT flags.
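
A sketch of the newly possible combination (size and alignment are
illustrative):

    #include <sys/mman.h>

    /* A 1MB growable stack region, now placed by the normal
     * address-space search and honoring a 2MB (1 << 21) alignment
     * request via MAP_ALIGNED(). */
    void *stk = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_STACK | MAP_ALIGNED(21), -1, 0);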

Reported by:	rwatson
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:37:41 +00:00
Alan Cox
dd05fa1945 Add a page size field to struct vm_page. Increase the page size field when
a partially populated reservation becomes fully populated, and decrease this
field when a fully populated reservation becomes partially populated.

Use this field to simplify the implementation of pmap_enter_object() on
amd64, arm, and i386.

On all architectures where we support superpages, the cost of creating a
superpage mapping is roughly the same as creating a base page mapping.  For
example, both kinds of mappings entail the creation of a single PTE and PV
entry.  With this in mind, use the page size field to make the
implementation of vm_map_pmap_enter(..., MAP_PREFAULT_PARTIAL) a little
smarter.  Previously, if MAP_PREFAULT_PARTIAL was specified to
vm_map_pmap_enter(), that function would only map base pages.  Now, it will
create up to 96 base page or superpage mappings.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-06-07 17:12:26 +00:00
Konstantin Belousov
5930251a9d Remove the assert which can be triggered by userspace. The
situation checked by the assert is verified not to occur in
vm_map_wire(), and protection permissions on the wired entry can be
revoked afterward.

Reported by:	markj
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-28 00:45:35 +00:00
Alan Cox
fa2f411c4e There is no reason to perform the pmap_remove() on the kernel pmap while
the kmem object lock is held.  Do the pmap_remove() before acquiring the
kmem object lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-23 16:22:36 +00:00
Konstantin Belousov
2602a2ea88 Remove a redundant loop. The inner goto restarts the whole page
handling in a situation identical to the loop condition.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-21 08:19:04 +00:00
Konstantin Belousov
7032434e98 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successful exec.  Now, since the old vmspace is
destroyed, the paging structures are invalid, and threads are resumed on
non-existent pmaps (page tables), which leads to a triple fault on x86.

To fix this, postpone the freeing of the old vmspace until the threads are
resumed and exited.  To avoid modifying all image activators, all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
via a thread-private flag, TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
Alan Cox
afaa41f6b8 On a fork allow read-only wired pages to be copy-on-write shared between the
parent and child processes.  Previously, we copied these pages even though
they are read only.  However, the reason for copying them is historical and
no longer exists.  In recent times, vm_map_protect() has developed the
ability to copy pages when write access is added to wired copy-on-write
pages.  So, in this case, copy-on-write sharing of wired pages is not to be
feared.  It is not going to lead to copy-on-write faults on wired memory.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-13 13:20:23 +00:00
Konstantin Belousov
c8f780e3d6 Fix locking. The dst_object must remain locked on the retry of the
loop iteration.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2014-05-11 18:07:07 +00:00
Alan Cox
dd006a1b14 With the new-and-improved vm_fault_copy_entry() (r265843), we can always
avoid soft page faults when adding write access to user wired entries in
vm_map_protect().  Previously, we only avoided the soft page fault when
the underlying pages were copy-on-write.  In other words, we avoided the
page faults that might sleep on page allocation, but not the trivial
page faults to update the physical map.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-11 17:41:29 +00:00
Alan Cox
d9a9209abe About 9% of the pmap_protect() calls being performed by vm_map_copy_entry()
are unnecessary.  Eliminate the unnecessary calls.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-10 19:47:00 +00:00
Konstantin Belousov
0973283d6e For the upgrade case in vm_fault_copy_entry(), when the entry does not
need COW and is writeable (i.e. becoming writeable due to the
mprotect(2) operation), do not create a new backing object for the
entry.  The caller of the function is vm_map_protect(), the call is
made to ensure that wired entry has all pages resident and wired in
the top level object and to enable the write.  We might need to copy
read-only page from some backing objects into the top object or remap
the page with the write allowed.

This fixes the issue with mishandling of the swap accounting when a
read-only wired mapping is upgraded to write-enabled after fork.  The
previous code path did not account for the new object, but its creation
is redundant anyway, and the change provides an optimization for this
uncommon situation.

Reported by:	markj
Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 17:03:33 +00:00
Konstantin Belousov
44bbc3b77d When printing the map with the ddb 'show procvm' command, do not dump
page queues for the backing objects.  The queues are huge and clutter
the display, when mostly the map entries and their backing storage are
of interest.

The page queues can be seen with ddb 'show object' command.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:36:13 +00:00
Konstantin Belousov
3d95614f9d Print the entry address in addition to the object. The variable is
typically optimized out and debuggers cannot find its value.

Sponsored by:	    The FreeBSD Foundation
MFC after:	1 week
2014-05-10 16:30:48 +00:00
Peter Holm
e103f5b1c0 msync(2) must return ENOMEM and not EINVAL when the address is outside the
allowed range or when one or more pages are not mapped.  This is according
to The Open Group Base Specifications Issue 7.
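
A minimal sketch of the now-conforming behavior ('addr' and 'len' are
assumed to describe a previously established mapping):

    #include <sys/mman.h>
    #include <err.h>
    #include <errno.h>

    /* After this change, a hole (unmapped page) inside [addr, addr + len)
     * yields ENOMEM, as Issue 7 requires, rather than EINVAL. */
    if (msync(addr, len, MS_SYNC) == -1 && errno == ENOMEM)
        warnx("part of the range is not mapped");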

Discussed with:	 attilio, Bruce Evans
Reviewed by:	 alc, Garrett Cooper
Reported by:	 ATF
MFC after:	 2 weeks
Sponsored by:	EMC / Isilon storage division
2014-05-07 08:38:02 +00:00
Alan Cox
60196cda04 Prior to r254304, a separate function, vm_pageout_page_stats(), was used to
periodically update the reference status of the active pages.  This function
was called, instead of vm_pageout_scan(), when memory was not scarce.  The
objective was to provide up to date reference status for active pages in
case memory did become scarce and active pages needed to be deactivated.

The active page queue scan performed by vm_pageout_page_stats() was
virtually identical to that performed by vm_pageout_scan(), and so r254304
eliminated vm_pageout_page_stats().  Instead, vm_pageout_scan() is
called with the parameter "pass" set to zero.  The intention was that when
pass is zero, vm_pageout_scan() would only scan the active queue.  However,
the variable page_shortage can still be greater than zero when memory is not
scarce and vm_pageout_scan() is called with pass equal to zero.
Consequently, the inactive queue may be scanned and dirty pages laundered
even though that was not intended by r254304.  This revision fixes that.

Reported by:	avg
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2014-05-06 03:42:04 +00:00
Konstantin Belousov
a17937bdd0 For the VM_PHYSSEG_DENSE case, the check whether the requested range
falls into the area backed by vm_page_array wrongly compared end with
vm_page_array_size.  It should be adjusted by the first_page index to be
correct.

Also, the corner (and incorrect) case of the requested range extending
past the end of vm_page_array was wrongly handled by
allocating the segment.

Fix the comparison for the end of the range and return EINVAL if the end
extends beyond vm_page_array.

Discussed with:	royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-29 18:42:37 +00:00
Konstantin Belousov
4c74acf76a When vm_fault_copy_entry() is called from vm_map_protect() for a wired
entry and performs the upgrade of the entry permissions from read-only
to read-write, we must allow searching for the source pages in the
backing object, like we do in the case of forking the read-only wired
entry.  For the fork case, the behaviour is allowed by the src_readonly
boolean, which in fact is only used to assert that the read-write case
provides all source pages in the top-level object.

Eliminate the src_readonly variable.  Allow the copy loop to look
into the backing objects, and add explicit asserts to ensure that only
the read-only and upgrade cases actually do.

Expand comments.  Change the panic call into an assert.

Reported by:	markj
Tested by:	markj, pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-27 05:19:01 +00:00
Dag-Erling Smørgrav
612032773a Add sysctl OIDs showing the actual size and capacity of the swap zone.
MFC after:	1 week
2014-04-26 12:18:17 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff, the struct pcpu cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Konstantin Belousov
52f3c44efe Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, accesses to the direct map region should check against the limit
by which the direct map is instantiated.

Second, for accesses to the kernel map, success returned from
kernacc(9) does not guarantee that a subsequent attempt to read or write
the checked address will succeed, since another thread might invalidate
the address in the meantime.  Add a new thread-private flag TDP_DEVMEMIO,
which instructs vm_fault() to return an error when a fault happens on a
MAP_ENTRY_NOFAULT entry, instead of panicking.  The trap handler would
then see a page fault from the access, and recover in the normal way,
making /dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
Konstantin Belousov
997ac6905f Initialize vm_map_entry member wiring_thread on the map entry creation.
This was missed in r253190.

Reported by:	hps, peter
Tested by:	hps
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-03-21 13:55:57 +00:00
Attilio Rao
0d8243cc34 vm_page_grab() and vm_pager_get_pages() can drop the vm_object lock,
and then threads can sleep on the pip condition.
Avoid deadlocking such threads by correctly awakening the sleeping ones
after the pip is finished.
The swapoff side of the bug can likely result in shutdown deadlocks.

Sponsored by:	EMC / Isilon Storage Division
Reported by:	pho, pluknet
Tested by:	pho
2014-03-19 01:13:42 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Konstantin Belousov
7253a5ec63 Initialize paddr to handle the case of zero size.
Reported and reviewed by:	Conrad Meyer <cemeyer@uw.edu>
MFC after:	1 week
2014-03-12 16:38:55 +00:00
Konstantin Belousov
2309fa9b92 Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.

Reported and tested by:	bdrewery
MFC after:	1 week
2014-03-12 15:13:57 +00:00
Dimitry Andric
2367b4ddc4 After r251709, avoid a clang 3.4 warning about an unused static const
variable (uma_max_ipers), when asserts are disabled.

Reviewed by:	glebius
MFC after:	3 days
2014-02-14 17:47:18 +00:00
Attilio Rao
14a5dc1780 Fix up r254141: in the process of handling a failing vm_page_rename(),
a call to swap_pager_freespace() was moved around, now leading to freeing
the incorrect page because the pindex changes after vm_page_rename().

Get back to using the correct pindex when destroying the swap space.

Sponsored by:	EMC / Isilon storage division
Reported by:	avg
Tested by:	pho
MFC after:	7 days
2014-02-14 03:34:12 +00:00
Gleb Smirnoff
5f3563b0a5 Fix function name in KASSERT().
Submitted by:	hiren
2014-02-12 20:11:20 +00:00
John Baldwin
8add0ced70 Correct the assertion to check that the existing device VM object uses
the same type, rather than asserting in the case where we just created a
new VM object.

Reviewed by:	kib
2014-02-11 22:05:21 +00:00
Gleb Smirnoff
49fef6a202 Create two public UMA_ZONE_PCPU zones: 64 bit sized and pointer sized.
Sponsored by:	Nginx, Inc.
2014-02-10 19:59:46 +00:00
Gleb Smirnoff
f947570e35 Style. 2014-02-10 19:51:15 +00:00
Gleb Smirnoff
48343a2f34 Make M_ZERO flag work correctly on UMA_ZONE_PCPU zones.
Sponsored by:	Nginx, Inc.
2014-02-10 19:48:26 +00:00
Alan Cox
7b9b301c6b Don't call vm_fault_prefault() on zero-fill faults. It's a waste of time.
Successful prefaults after a zero-fill fault are extremely rare.
2014-02-09 01:59:52 +00:00
Gleb Smirnoff
0a5a3ccb81 Provide macros that allow easily exporting uma(9) zone limits and
current usage via sysctl(9):

  SYSCTL_UMA_MAX()
  SYSCTL_ADD_UMA_MAX()
  SYSCTL_UMA_CUR()
  SYSCTL_ADD_UMA_CUR()
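
A hypothetical dynamic usage sketch; the argument list is assumed to
parallel the other SYSCTL_ADD_* macros, and 'ctx', 'tree', and 'sc' are
assumed locals:

    /* Export the item limit of a driver-private zone under an
     * existing sysctl context/tree. */
    SYSCTL_ADD_UMA_MAX(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
        "zone_max", CTLFLAG_RW, sc->zone, "Zone item limit");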

Sponsored by:	Nginx, Inc.
2014-02-07 14:29:03 +00:00
Alan Cox
63281952f0 Make prefaulting more aggressive on hard faults. Previously, we would only
map a fraction of the pages that were fetched by vm_pager_get_pages() from
secondary storage.  Now, we map them all in order to avoid future soft
faults.  This effect is most evident when a memory-mapped file is accessed
sequentially.  Previously, there were 6 soft faults for every hard fault.
Now, these soft faults are eliminated.

Sponsored by:	EMC / Isilon Storage Division
2014-02-02 20:21:53 +00:00
Alan Cox
793d14076a In an effort to diagnose possible corruption of struct vm_page on some
sparc64 machines, make the page queue assert in vm_page_dequeue() more
precise.  While I'm here, switch the page lock assert to the newer style.
2014-01-24 19:08:42 +00:00
John Baldwin
ab46f63e8f Fix a couple of typos. 2014-01-21 03:27:47 +00:00
Gleb Smirnoff
7ebba1f8ff ANSIfy declarations.
Ok'ed by:	alc
2014-01-20 18:47:56 +00:00
Alan Cox
86fa24710e Style changes in vm_pageout_scan():
1. Be consistent in the style of "act_delta" manipulations between the
   inactive and active queue scans.

2. Explicitly compare to zero.

3. The deactivation of a page is based on its recent history
   and not just the current call to vm_pageout_scan().  The variable
   "act_delta" represents the current state of the page, and not its
   history.  Avoid possible confusion by not (ab)using "act_delta" for
   making the deactivation decision.

Submitted by:	kib [1]
Reviewed by:	kib [2,3]
2014-01-18 20:02:59 +00:00
Alan Cox
9099545af1 Correctly update the count of stuck pages, "addl_page_shortage", in
vm_pageout_scan().  There were missing increments in two less common cases.

Don't conflate the count of stuck pages and the pageout deficit provided by
vm_page_alloc{,_contig}().  (A proposed fix to the OOM code depends on this.)

Handle held pages consistently in the inactive queue scan.  In the more
common case, we did not move the page to the tail of the queue.  Whereas, in
the less common case, we did.  There's no particular reason to move the page
in the less common case, so remove it.

Perform the calculation of the page shortage for the active queue scan a
little earlier, before the active queue lock is acquired.  The correctness
of this calculation doesn't depend on the active queue lock being held.

Eliminate a redundant variable, "pcount".  Use the more descriptive
variable, "maxscan", in its place.

Apply a few nearby style fixes, e.g., eliminate stray whitespace and excess
parentheses.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2014-01-12 19:04:20 +00:00
Alan Cox
000fb817d8 Since the introduction of the popmap to reservations in r259999, there is
no longer any need for the page's PG_CACHED and PG_FREE flags to be set and
cleared while the free page queues lock is held.  Thus, vm_page_alloc(),
vm_page_alloc_contig(), and vm_page_alloc_freelist() can wait until after
the free page queues lock is released to clear the page's flags.  Moreover,
the PG_FREE flag can be retired.  Now that the reservation system no longer
uses it, its only uses are in a few assertions.  Eliminating these
assertions is no real loss.  Other assertions catch the same types of
misbehavior, like doubly freeing a page (see r260032) or dirtying a free
page (free pages are invalid and only valid pages can be dirtied).

Eliminate an unneeded variable from vm_page_alloc_contig().

Sponsored by:	EMC / Isilon Storage Division
2013-12-31 18:25:15 +00:00
Alan Cox
a08c151546 Add "popmap" assertions: The page being freed isn't already free, and the
page being allocated isn't already allocated.

Sponsored by:	EMC / Isilon Storage Division
2013-12-29 04:54:52 +00:00
Alan Cox
ec17932242 MFp4 alc_popmap
Change the way that reservations keep track of which pages are in use.
  Instead of using the page's PG_CACHED and PG_FREE flags, maintain a bit
  vector within the reservation.  This approach has a couple benefits.
  First, it makes breaking reservations much cheaper because there are
  fewer cache misses to identify the unused pages.  Second, it is a pre-
  requisite for supporting two or more reservation sizes.
2013-12-28 04:28:35 +00:00
Konstantin Belousov
b61a53d43d Do not coalesce the stack entry; vm_map_stack() asserts that the requested
region is claimed by a new entry.

Pass MAP_STACK_GROWS_DOWN and MAP_STACK_GROWS_UP flags to
vm_map_insert() from vm_map_stack(), to really turn off coalescing
code and call to vm_map_simplify_entry() [1].

Reported by:	avg, peter, many
Tested by:	avg, peter
Noted by:	avg [1]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-12-27 16:59:47 +00:00
Marcel Moolenaar
938b0f5b75 For ia64, use pmap_remove_pages() and not pmap_remove(). The problem is
that we don't have a good way (yet) to iterate over the mapped pages by
virtual address, so pmap_remove() simply tries each page within the range.
Given that we call pmap_remove() over the entire 2^63 bytes of address
space, it takes a while for pmap_remove() to have tried all 2^50 pages.
By using pmap_remove_pages() we use the PV list to find all mappings.

Change derived from a patch by: alc
2013-12-26 05:46:10 +00:00
Dimitry Andric
d395270d06 In sys/vm/vm_pageout.c, since vm_pageout_worker() takes a void * as
argument, cast the incoming 0 argument to void *, to silence a warning
from clang 3.4 ("expression which evaluates to zero treated as a null
pointer constant of type 'void *' [-Wnon-literal-null-conversion]").

MFC after:	3 days
2013-12-25 22:32:34 +00:00
Alan Cox
703b304f33 Eliminate a redundant parameter to vm_radix_replace().
Improve the wording of the comment describing vm_radix_replace().

Reviewed by:	attilio
MFC after:	6 weeks
Sponsored by:	EMC / Isilon Storage Division
2013-12-08 20:07:02 +00:00
Craig Rodrigues
a3845534a0 In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console.  This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian
2013-11-29 08:04:45 +00:00
Alexander Motin
03175483c2 - Add a bucket size column to the `show uma` DDB command.
- Add a `show umacache` command to show similar stats for cache-only UMA zones.
2013-11-28 19:20:49 +00:00
Alexander Motin
cec48e002a Make UMA not blindly force offpage slab header allocation for large
(> PAGE_SIZE) zones.  If the zone size is not a multiple of PAGE_SIZE, there
may be enough space for the header in the last page, so we may avoid the
extra header memory allocation and hash table update/lookup.

ZFS creates a bunch of odd-sized UMA zones (5120, 6144, 7168, 10240, 14336).
This change gives good use to at least some of the otherwise lost memory
there.

Reviewed by:	avg
2013-11-27 20:56:10 +00:00
Alexander Motin
f7104ccd94 Don't count bucket allocation failures for UMA zones as their own failures.
There are good reasons for this to happen, such as recursion prevention,
etc., and they are not fatal since buckets are just an optimization
mechanism.  Real bucket allocation failures are anyway counted by the
bucket zones themselves, and we don't need double accounting there.
2013-11-27 20:16:18 +00:00
Alexander Motin
e8a720fe60 Fix a bug introduced in r252226, where the udata argument passed to
bucket_alloc() was used without first making sure that it was really passed
for us.

On some of my systems this bug made the user argument passed by ZFS code to
uma_zalloc_arg() unexpectedly block UMA per-CPU caches for those zones.
2013-11-27 19:55:42 +00:00
Alexander Motin
8a8d9d1475 When purging per-CPU UMA caches, do not return empty buckets into the
global full-bucket cache, to avoid triggering an assertion if an allocation
happens before that global cache gets purged.
2013-11-23 13:42:56 +00:00
Konstantin Belousov
79e9451f07 The VM map code performs clipping when a map entry covers a region
larger than the operational region.  If the op region size is zero,
clipping would create a zero-sized map entry.  The result is that the vm
map splay starts behaving inconsistently, sometimes returning the
zero-sized entry, sometimes the next (or previous) entry.

One step further, it could result in e.g. vm_map_wire() setting
MAP_ENTRY_IN_TRANSITION on the zero-sized entry, but failing to clear
it in the done part.  The vm_map_delete() then hangs forever waiting
for the flag removal.

Check for zero-length requests and treat them as always successful,
without performing any action on the address space.
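
A sketch of the shape of the fix (the exact placement in the individual
entry points is assumed):

    /* A zero-sized request succeeds without touching the map, so no
     * zero-sized clip entry can ever be created. */
    if (start == end)
        return (KERN_SUCCESS);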

Diagnosed by:	pho
Tested by:	pho (previous version)
Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-20 09:03:48 +00:00
Konstantin Belousov
ff3ae454c0 Add assertions to cover all places in the wiring and unwiring code
where MAP_ENTRY_IN_TRANSITION is set or cleared.

Tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-20 08:47:54 +00:00
Konstantin Belousov
7e14088d93 Revert back to using int for the page counts.  In vn_io_fault(), the i/o
is chunked to pieces limited by integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes integer max_count as the upper bound.

Rearrange the checks to correctly handle overflowing address arithmetic.

Submitted by:	bde
Tested by:	pho
Discussed with:	alc
MFC after:	1 week
2013-11-20 08:45:26 +00:00
Alexander Motin
a2de44abf5 Implement a mechanism to safely but slowly purge UMA per-CPU caches.
This is a last resort for very low memory conditions, in case other measures
to free memory were ineffective.  Sequentially cycle through all CPUs and
extract per-CPU cache buckets into the zone cache, from where they can be
freed.
2013-11-19 10:51:46 +00:00
Alexander Motin
4d104ba024 Grow the UMA zone bucket size also on lock congestion during item free.
Lock congestion is the same whether it happens on alloc or free, so
handle it equally.  Now that we have back pressure, there is no problem
with growing buckets a bit faster.  In any case, growth is much slower
than in 9.x.
2013-11-19 10:17:10 +00:00
Alexander Motin
f3932e9025 Add two new UMA bucket zones to store 3 and 9 items per bucket.
These new buckets make bucket-size self-tuning softer and more precise.
Without them there are buckets for 1, 5, 13, 29, ... items.  While at
bigger sizes a difference of about 2x is fine, at the smallest ones it is
5x and 2.6x respectively.  The new buckets make that line look like 1, 3,
5, 9, 13, 29, reducing the jumps between steps, making the algorithm work
more softly, and allocating and freeing memory in better-fitting chunks.
Otherwise there is quite a big gap between allocating 128K and 5x128K of
RAM at once.
2013-11-19 10:10:44 +00:00
Alexander Motin
ace66b568e Implement soft pressure on UMA cache bucket sizes.
Every time the system detects a low memory condition, decrease bucket sizes
for each zone by one item.  As a result, higher memory pressure will push
toward smaller bucket sizes, and so smaller per-CPU caches and more
efficient memory use.

Before this change there was no force to oppose bucket growth as a result
of practically inevitable zone lock conflicts, and after some run time
per-CPU caches could consume enough RAM to kill the system.
2013-11-19 10:05:53 +00:00
Konstantin Belousov
d005ed537c Avoid overflow for the page counts.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-12 08:47:58 +00:00
Konstantin Belousov
1bd7d0b7db If filesystem declares that it supports shared locking for writes, use
shared vnode lock for VOP_PUTPAGES() as well.  The only such
filesystem in the tree is ZFS, and it uses
vnode_pager_generic_putpages(), which performs the pageout with
VOP_WRITE().

Reviewed by:	alc
Discussed with:	avg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-11-09 20:36:29 +00:00
Konstantin Belousov
9ded9474d3 Do not coalesce if the swap object belongs to a tmpfs vnode.  The
coalesce would extend the object to keep pages for the anonymous
mapping created by the process.  The pages have no relation to the
tmpfs file content which could be written into the corresponding
range, causing aliasing between the anonymous mapping and the file
content, and subsequent corruption.

Another, lesser problem created by coalescing is over-accounting on
tmpfs node destruction, since the object size is subtracted from the
total count of the pages owned by the tmpfs mount.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-05 06:18:50 +00:00
Alan Cox
eb2f42fbb0 Tidy up the output of "sysctl vm.phys_free".
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-10-10 16:11:45 +00:00
Alan Cox
f872f6eaf5 Both the vm_map and vmspace zones are defined as "no free". So, there is no
point in defining a fini function for these zones.

Reviewed by:	kib
Approved by:	re (glebius)
Sponsored by:	EMC / Isilon Storage Division
2013-09-22 17:48:10 +00:00
Neel Natu
74d1d2b7cc Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
  default pmap initialization routine 'pmap_pinit()'.

These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.

Reviewed by:	kib@, alc@
Approved by:	re (glebius)
2013-09-20 17:06:49 +00:00
Alan Cox
deb179bb4c The pmap function pmap_clear_reference() is no longer used. Remove it.
pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8.  Now, that call no
longer exists.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon Storage Division
2013-09-20 04:30:18 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate,
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes; a usage sketch follows below.  MADV_PROTECT still
  works for backwards compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.
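
A minimal userspace usage sketch of the interface described above:

    #include <sys/procctl.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <unistd.h>

    /* Protect the current process from the swap-exhaustion killer,
     * and have new descendants inherit the protection. */
    int flags = PPROT_SET | PPROT_INHERIT;

    if (procctl(P_PID, getpid(), PROC_SPROTECT, &flags) == -1)
        err(1, "procctl(PROC_SPROTECT)");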

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
Konstantin Belousov
9eab548476 PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store a pointer to the slab. Remove it.

Reviewed by:    alc
Sponsored by:   The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-17 07:35:26 +00:00
Konstantin Belousov
3846a82284 Remove the zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over a
POSIX shared memory file descriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
Konstantin Belousov
196beb5359 If the last page of the file is partially full and the whole valid
portion is invalidated, invalidate the whole page.  Otherwise, a
partially valid page appears on a page queue, which is wrong.  This
could only happen for the last page, because only then could the buffer
which triggered the invalidation not cover the whole page.

Reported and tested by:	pho (previous version)
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-14 10:11:38 +00:00
John Baldwin
6a87d217e2 Fix an off-by-one error when populating mincore(2) entries for
skipped entries.  lastvecindex references the last valid byte,
so the new bytes should come after it.

Approved by:	re (kib)
MFC after:	1 week
2013-09-12 20:46:32 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.
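
A short usage sketch (size is illustrative):

    #include <sys/mman.h>

    /* Ask for an anonymous mapping placed in the low 2GB, e.g. for
     * code that must store pointers in 32-bit fields. */
    void *p = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_32BIT, -1, 0);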

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Konstantin Belousov
3aaea6efd5 Drain the xbusy state in two places which potentially do
pmap_remove_all().  Not doing the drain allows pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page being mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
Konstantin Belousov
7a4b2bc56c vm_page_trysbusy() should not fail when the shared busy counter or
the VPB_BIT_WAITERS flag was changed between the read of busy_lock and the
cas.  vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and does not mean that the current page state
prevents sbusying.

Retry the operation inside vm_page_trysbusy() if the cas failed; only
return a failure when VPB_BIT_SHARED is cleared.

Reported and tested by:	pho
Reviewed by:	attilio
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:54:40 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain the
total number of elements in the array minus 2. This means that if those two
bits are equal to 0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in each array element contain the array index. Only one
bit is set, and its position within this five-bit range defines the array
index. This means there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to
belong to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised.  It should only be used for defining aliases.
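
A short end-to-end sketch with the new API, assuming an already-open
descriptor fd and cap_rights_limit(2) as the consumer:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
	if (cap_rights_limit(fd, &rights) < 0)
		err(1, "cap_rights_limit");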

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Kirk McKusick
1645995b97 Fix a bug introduced in the rewrite of keg_free_slab() in r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by:   pho
2013-08-31 15:40:15 +00:00
Alan Cox
51321f7c31 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
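
A userspace sketch of the affected call ('base' and 'len' are assumed to
describe an existing anonymous mapping):

    #include <sys/mman.h>
    #include <err.h>

    /* One ranged hint: with pmap_advise() this becomes a single
     * pmap-level ranged operation instead of one per resident page. */
    if (madvise(base, len, MADV_FREE) == -1)
        warn("madvise(MADV_FREE)");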

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
Gleb Smirnoff
133dae887b Remove comment that is no longer relevant since r254182. 2013-08-26 14:14:25 +00:00
Alan Cox
776cad90ff Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop().  Detect this scenario so that the vnode's hold count is
correctly maintained.  Otherwise, we panic.

Reported by:	scottl
Tested by:	pho
Discussed with:	attilio, jeff, kib
2013-08-23 17:27:12 +00:00
Konstantin Belousov
e68c64f0ba Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone.  Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:12:24 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for vm_page_grab(9).
The flag has been mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc-retry semantic.
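
An illustrative call, before and after (object/pindex names assumed):

    /* Before: m = vm_page_grab(obj, pidx, VM_ALLOC_NORMAL | VM_ALLOC_RETRY);
     * after this change the flag is gone and retry is implied: */
    m = vm_page_grab(obj, pidx, VM_ALLOC_NORMAL);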

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Jeff Roberson
274132ac23 - Eliminate the vm object lock from the active queue scan. It is not
   necessary since we do not free or cache the page from the active queue
   anymore.  Document the one possible race that is harmless.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	alc
2013-08-21 22:39:19 +00:00
Alan Cox
28a288cbaa Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by:	sbruno
Sponsored by:	EMC / Isilon Storage Division
2013-08-21 15:31:43 +00:00
Jeff Roberson
c9612b2db8 - Increase the active lru refresh interval to 10 minutes. This has been
shown to negatively impact some workloads and the goal is only to
   eliminate worst case behaviors for very long periods of paging
   inactivity.  Eventually we should determine a more complex scaling
   factor for this feature.
 - Rate limit low memory callback handlers to limit thrashing.  Set the
   default to 10 seconds.

Sponsored by:	EMC / Isilon Storage Division
2013-08-19 23:54:24 +00:00