Commit Graph

3188 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Kirk McKusick
1645995b97 Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by:   pho
2013-08-31 15:40:15 +00:00
Alan Cox
51321f7c31 Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE).  Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE.  Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference().  Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2).  A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact.  For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures.  I will commit implementations for the
remaining architectures after further testing.  For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by:      pho (i386), marcel (ia64)
Sponsored by:   EMC / Isilon Storage Division
2013-08-29 15:49:05 +00:00
Gleb Smirnoff
133dae887b Remove comment that is no longer relevant since r254182. 2013-08-26 14:14:25 +00:00
Alan Cox
776cad90ff Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop().  Detect this scenario so that the vnode's hold count is
correctly maintained.  Otherwise, we panic.

Reported by:	scottl
Tested by:	pho
Discussed with:	attilio, jeff, kib
2013-08-23 17:27:12 +00:00
Konstantin Belousov
e68c64f0ba Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone.  Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by:	alc (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-22 18:12:24 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Jeff Roberson
274132ac23 - Eliminate the vm object lock from the active queue scan. It is not
necessary since we do not free or cache the page from active anymore.
   Document the one possible race that is harmless.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	alc
2013-08-21 22:39:19 +00:00
Alan Cox
28a288cbaa Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by:	sbruno
Sponsored by:	EMC / Isilon Storage Division
2013-08-21 15:31:43 +00:00
Jeff Roberson
c9612b2db8 - Increase the active lru refresh interval to 10 minutes. This has been
shown to negatively impact some workloads and the goal is only to
   eliminate worst case behaviors for very long periods of paging
   inactivity.  Eventually we should determine a more complex scaling
   factor for this feature.
 - Rate limit low memory callback handlers to limit thrashing.  Set the
   default to 10 seconds.

Sponsored by:	EMC / Isilon Storage Division
2013-08-19 23:54:24 +00:00
Jeff Roberson
d91722fb27 - Use an arbitrary but reasonably large import size for kva on architectures
that don't support superpages.  This keeps the number of spans and internal
   fragmentation lower.
 - When the user asks for alignment from vmem_xalloc adjust the imported size
   by 2*align to be certain we can satisfy the allocation.  This comes at
   the expense of potential failures when the backend can't supply enough
   memory but could supply the requested size and alignment.

Sponsored by:	EMC / Isilon Storage Division
2013-08-19 23:02:39 +00:00
Konstantin Belousov
949c918635 Remove the arbitrary binding of the pagedaemon threads to the domains,
update the comment accordingly and make it more precise.

Requested and reviewed by:	jeff (previous version)
2013-08-17 07:10:01 +00:00
John Baldwin
5aa60b6f21 Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
  Requests for n >= number of bits in a pointer or less than the size of
  a page fail with EINVAL.  This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED.  It can be used
  to optimize the chances of using large pages.  By default it will align
  the mapping on a large page boundary (the system is free to choose any
  large page size to align to that seems best for the mapping request).
  However, if the object being mapped is already using large pages, then
  it will align the virtual mapping to match the existing large pages in
  the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
  VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
  MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
  MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
  explicitly using VMFS_SUPER_SPACE.  All device objects are forced to
  use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
  equivalent.

Reviewed by:	alc
MFC after:	1 month
2013-08-16 21:13:55 +00:00
Jeff Roberson
114f62c6df - Fix bug in r254304. Use the ACTIVE pq count for the active list
processing, not inactive.  This was the result of a bad merge.

Reported by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-15 22:29:49 +00:00
Attilio Rao
a834cbaec8 On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind back the wiring bits otherwise we can end up freeing a
page that is considered wired.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc
2013-08-15 11:01:25 +00:00
Jeff Roberson
8441d1e842 - Add a statically allocated memguard arena since it is needed very early
on.
 - Pass the appropriate flags to vmem_xalloc() when allocating space for
   the arena from kmem_arena.

Sponsored by:	EMC / Isilon Storage Division
2013-08-13 22:40:43 +00:00
Jeff Roberson
d9e232109f Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

 - Change v_free_target to include the quantity previously represented by
   v_cache_min so we don't need to add them together everywhere we use them.
 - Add a pageout_wakeup_thresh that sets the free page count trigger for
   waking the page daemon.  Set this 10% above v_free_min so we wakeup before
   any phase transitions in vm users.
 - Adjust down v_free_target now that we're willing to accept more pagedaemon
   wakeups.  This means we process fewer pages in one iteration as well,
   leading to shorter lock hold times and less overall disruption.
 - Eliminate vm_pageout_page_stats().  This was a minor variation on the
   PQ_ACTIVE segment of the normal pageout daemon.  Instead we now process
   1 / vm_pageout_update_period pages every second.  This causes us to visit
   the whole active list every 60 seconds.  Previously we would only maintain
   the active LRU when we were short on pages which would mean it could be
   woefully out of date.

Reviewed by:	alc (slight variant of this)
Discussed with:	alc, kib, jhb
Sponsored by:	EMC / Isilon Storage Division
2013-08-13 21:56:16 +00:00
Attilio Rao
6006884122 Correct the recovery logic in vm_page_alloc_contig:
what is really needed on this code snipped is that all the pages that
are already fully inserted gets fully freed, while for the others the
object removal itself might be skipped, hence the object might be set to
NULL.

Sponsored by:	EMC / Isilon storage division
Reported by:	alc, kib
Reviewed by:	alc
2013-08-11 21:15:04 +00:00
Konstantin Belousov
c325e866f4 Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue.  Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked.  See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members.  Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-10 17:36:42 +00:00
Andrey Zonov
767cfe52cc Remove unused definition for CTL_VM_NAMES.
Suggested by:	bde
2013-08-09 23:47:43 +00:00
John Baldwin
cdc00bf7d2 Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by:	alc
2013-08-09 21:14:55 +00:00
David E. O'Brien
5d82a21469 Add missing 'VPO_BUSY' from r254141 to fix kernel build break. 2013-08-09 16:43:50 +00:00
Attilio Rao
e946b94934 On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Konstantin Belousov
449c2e92c9 Split the pagequeues per NUMA domains, and split pageademon process
into threads each processing queue in a single domain.  The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass.  This is not optimal, since it
could cause excessive page deactivation and freeing.  The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues.  Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-07 16:36:38 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Mark Johnston
c0432fc38b Fill in the description fields for M_FICT_PAGES.
Reviewed by:	kib
MFC after:	3 days
2013-08-07 00:20:30 +00:00
Attilio Rao
be99683637 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
Attilio Rao
3b6714cacb The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
Andrey Zonov
23b5c8fe3d Unbreak sysctl ABI changes introduced in r253662
Requested by:	bde
2013-07-29 18:48:51 +00:00
Jeff Roberson
bb7858ea20 Improve page LRU quality and simplify the logic.
- Don't short-circuit aging tests for unmapped objects.  This biases
   against unmapped file pages and transient mappings.
 - Always honor PGA_REFERENCED.  We can now use this after soft busying
   to lazily restart the LRU.
 - Don't transition directly from active to cached bypassing the inactive
   queue.  This frees recently used data much too early.
 - Rename actcount to act_delta to be more consistent with use and meaning.

Reviewed by:	kib, alc
Sponsored by:	EMC / Isilon Storage Division
2013-07-26 23:22:05 +00:00
Andrey Zonov
20dd2f38dc Remove define and documentation for vm_pageout_algorithm missed in r253587 2013-07-26 02:00:06 +00:00
Tim Kientzle
763d9566fe Clear entire map structure including locks so that the
locks don't accidentally appear to have been already
initialized.

In particular, this fixes a consistent kernel crash on
armv6 with:
  panic: lock "vm map (user)" 0xc09cc050 already initialized
that appeared with r251709.

PR: arm/180820
2013-07-25 03:48:37 +00:00
Andriy Gapon
785797c341 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
Gleb Smirnoff
dab12c75bb Since r251709 a slab no longer use 8-bit indicies to manage items,
thus remove a stale comment.

Reviewed by:	jeff
2013-07-24 06:13:00 +00:00
Jeff Roberson
90776bd730 - Remove the long obsolete 'vm_pageout_algorithm' experiment.
Discussed with:	alc
Sponsored by:	EMC / Isilon Storage Division
2013-07-24 01:25:56 +00:00
Jeff Roberson
c93dcf2235 - Correct a stale comment. We don't have vclean() anymore. The work is
done by vgonel() and destroy_vobject() should only be called once from
   VOP_INACTIVE().

Sponsored by:	EMC / Isilon Storage Division
2013-07-23 22:52:38 +00:00
Gleb Smirnoff
e28a647db6 Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This
allows us to init counter zone at early stage of boot.

Reviewed by:	kib
Tested by:	Lytochkin Boris <lytboris gmail.com>
2013-07-23 11:16:40 +00:00
Jeremie Le Hen
fc2b167929 Fix previous commit when option RACCT is not used.
MFC after:	7 days
2013-07-22 22:16:47 +00:00
Jeremie Le Hen
c92b506977 Fix a panic in the racct code when munlock(2) is called with incorrect values.
The racct code in sys_munlock() assumed that the boundaries provided by the
userland were correct as long as vm_map_unwire() returned successfully.
However the latter contains its own logic and sometimes manages to do something
out of those boundaries, even if they are buggy.  This change makes the racct
code to use the accounting done by the vm layer, as it is done in other places
such as vm_mlock().

Despite fixing the panic, Alan Cox pointed that this code is still race-y
though: two simultaneous callers will produce incorrect values.

Reviewed by:	alc
MFC after:	7 days
2013-07-22 21:47:14 +00:00
John Baldwin
ff74a3fa6b Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
  vm_map_find() that will try to alter the alignment of a mapping to match
  any existing superpage mappings of the object being mapped.  If no
  suitable address range is found with the necessary alignment,
  vm_map_find() will fall back to using the simple first-fit strategy
  (VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
  use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by:	alc (earlier version)
MFC after:	2 weeks
2013-07-19 19:06:15 +00:00
Konstantin Belousov
5a3c920f45 When swap pager allocates metadata in the pagedaemon context, allow it
to drain the reserve.  This was broken in r243040, causing deadlock.
Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon
would only wait for the v_pageout_free_min anyway.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-07-11 20:33:57 +00:00
Konstantin Belousov
4f9c9114a3 The vm_fault() should not be allowed to proceed on the map entry which
is being wired now.  The entry wired count is changed to non-zero in
advance, before the map lock is dropped.  This makes the vm_fault() to
perceive the entry as wired, and breaks the fragment which moves the
wire count from the shadowed page, to the upper page, making the code
unwiring non-wired page.

On the other hand, the vm_fault() calls from vm_fault_wire() should be
allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from
vm_fault() when wiring_thread is not current.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:58:28 +00:00
Konstantin Belousov
0acea7dfde The mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with
parallel creation of the map entries, e.g. by mmap() or stack growing.
It also breaks when other entry is wired in parallel.

The vm_map_wire() iterates over the map entries in the region, and
assumes that map entries it finds are marked as in transition before,
also that any entry marked as in transition, are marked by the current
invocation of vm_map_wire().  This is not true for new entries in the
holes.

Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct
vm_map_entry.  In vm_map_wire() and vm_map_unwire(), only process the
entries which transition owner is the current thread.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:55:08 +00:00
Konstantin Belousov
ebf5d94e82 Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE).  The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:47:26 +00:00
Konstantin Belousov
3abeb8113d In the vm_page_set_invalid() function, do not assert that the page is
not busy, since its only caller brelse() can legitimately call it on
busy page.  This happens for VOP_PUTPAGES() on filesystems that use
buffers and which VOP_WRITE() method marked the buffer containing page
as non-cacheable.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:38:39 +00:00
Konstantin Belousov
56ce850bf8 Fix typo in comment.
MFC after:	3 days
2013-07-09 13:22:30 +00:00
Neel Natu
6b5fbc1225 vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be
reset by pmap_page_init() right after being initialized in vm_page_initfake().

The statement above is with reference to the amd64 implementation of
pmap_page_init().

Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.

Reviewed by:	kib
MFC after:	2 weeks
2013-07-03 23:38:37 +00:00
Davide Italiano
a1dff92058 Remove a spurious keg lock acquisition. 2013-06-28 21:13:19 +00:00
Jeff Roberson
5f51836645 - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
Jeff Roberson
6fd34d6f67 - Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
 - Make some smaller buckets for large zones to further reduce memory
   waste.
 - Implement uma_zone_reserve().  This holds aside a number of items only
   for callers who specify M_USE_RESERVE.  buckets will never be filled
   from reserve allocations.

Sponsored by:	EMC / Isilon Storage Division
2013-06-26 00:57:38 +00:00
Gleb Smirnoff
4aa4cd8e92 Typo in comment. 2013-06-24 13:36:16 +00:00
Jeff Roberson
af5263743c - Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking.  This functionally changes
   almost nothing.
 - Add a size parameter to uma_zcache_create() so we can size the buckets.
 - Pass the zone to bucket_alloc() so it can modify allocation flags
   as appropriate.
 - Fix a bug in zone_alloc_bucket() where I missed an address of operator
   in a failure case.  (Found by pho)

Sponsored by:	EMC / Isilon Storage Division
2013-06-20 19:08:12 +00:00
Jeff Roberson
8aaf680e90 - Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by:	EMC / Isilon Storage Division
2013-06-19 02:30:32 +00:00
Dag-Erling Smørgrav
5b3e02570a Fix a bug that allowed a tracing process (e.g. gdb) to write
to a memory-mapped file in the traced process's address space
even if neither the traced process nor the tracing process had
write access to that file.

Security:	CVE-2013-2171
Security:	FreeBSD-SA-13:06.mmap
Approved by:	so
2013-06-18 07:02:35 +00:00
Jeff Roberson
fc03d22b17 Refine UMA bucket allocation to reduce space consumption and improve
performance.

 - Always free to the alloc bucket if there is space.  This gives LIFO
   allocation order to improve hot-cache performance.  This also allows
   for zones with a single bucket per-cpu rather than a pair if the entire
   working set fits in one bucket.
 - Enable per-cpu caches of buckets.  To prevent recursive bucket
   allocation one bucket zone still has per-cpu caches disabled.
 - Pick the initial bucket size based on a table driven maximum size
   per-bucket rather than the number of items per-page.  This gives
   more sane initial sizes.
 - Only grow the bucket size when we face contention on the zone lock, this
   causes bucket sizes to grow more slowly.
 - Adjust the number of items per-bucket to account for the header space.
   This packs the buckets more efficiently per-page while making them
   not quite powers of two.
 - Eliminate the per-zone free bucket list.  Always return buckets back
   to the bucket zone.  This ensures that as zones grow into larger
   bucket sizes they eventually discard the smaller sizes.  It persists
   fewer buckets in the system.  The locking is slightly trickier.
 - Only switch buckets in zalloc, not zfree, this eliminates pathological
   cases where we ping-pong between two buckets.
 - Ensure that the thread that fills a new bucket gets to allocate from
   it to give a better upper bound on allocation time.

Sponsored by:	EMC / Isilon Storage Division
2013-06-18 04:50:20 +00:00
Jeff Roberson
0095a78419 - Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
   pointer items.  These zones have no kegs.
 - Convert the regular keg based allocator to use the new import/release
   functions.
 - Move some stats to be atomics since they would require excessive zone
   locking/unlocking with the new import/release paradigm.  Make
   zone_free_item simpler now that callers can manage more stats.
 - Check for these cache-only zones in the public APIs and debugging
   code by checking zone_first_keg() against NULL.

Sponsored by:	EMC / Isilong Storage Division
2013-06-17 03:43:47 +00:00
Jeff Roberson
ef72505e6d - Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset.  This is much simpler, has lower space
   overhead and is cheaper in most cases.
 - Use a second bitmap for invariants asserts and improve the quality of
   the asserts as well as the number of erroneous conditions that we will
   catch.
 - Drastically simplify sizing code.  Special case refcnt zones since they
   will be going away.
 - Update stale comments.

Sponsored by:	EMC / Isilon Storage Division
2013-06-13 21:05:38 +00:00
Alan Cox
2051980f97 Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by:	EMC / Isilon Storage Division
2013-06-10 01:48:21 +00:00
Gleb Smirnoff
995d706909 Make sys_mlock() function just a wrapper around vm_mlock() function
that does all the job.

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:13:40 +00:00
Attilio Rao
002f377ab2 Complete r251452:
Avoid to busy/unbusy a page in cases where there is no need to drop the
vm_obj lock, more nominally when the page is full valid after
vm_page_grab().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-06 18:19:26 +00:00
Attilio Rao
dfd55c0c7b In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked, otherwise there is no reason to do so.
This will eliminate the necessity to busy pages most of the times.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-06-04 22:47:01 +00:00
Alan Cox
da38420832 Update a comment. 2013-06-04 05:44:52 +00:00
Alan Cox
e23b0a193e Relax the object locking in vm_pageout_map_deactivate_pages() and
vm_pageout_object_deactivate_pages().  A read lock suffices.

Sponsored by:	EMC / Isilon Storage Division
2013-06-04 02:28:47 +00:00
Konstantin Belousov
be6ec55376 Remove irrelevant comments.
Discussed with:	alc
MFC after:	3 days
2013-06-03 17:30:40 +00:00
Alan Cox
b417181250 Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag.  Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag.  So, in fact, this revision only changes some
comments and an assertion.  Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear().  (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().)  Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-03 01:22:54 +00:00
Alan Cox
b4e498071d Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by:	EMC / Isilon Storage Division
2013-06-01 20:32:34 +00:00
Alan Cox
ef5ba5a31d Simplify the definition of vm_page_lock_assert(). There is no compelling
reason to inline the implementation of vm_page_lock_assert() in the
!KLD_MODULES case.  Use the same implementation for both KLD_MODULES and
!KLD_MODULES.

Reviewed by:	kib
2013-05-31 16:00:42 +00:00
Konstantin Belousov
7560005c41 After the object lock was dropped, the object' reference count could
change.  Retest the ref_count and return from the function to not
execute the further code which assumes that ref_count == 1 if it is
not.  Also, do not leak vnode lock if other thread cleared OBJ_TMPFS
flag meantime.

Reported by:	bdrewery
Tested by:	bdrewery, pho
Sponsored by:	The FreeBSD Foundation
2013-05-30 20:00:19 +00:00
Konstantin Belousov
782d4a636b Remove the capitalization in the assertion message. Print the address
of the object to get useful information from optimizated kernels dump.
2013-05-30 19:53:31 +00:00
Attilio Rao
c25673ffd6 o Change the locking scheme for swp_bcount.
It can now be accessed with a write lock on the object containing it OR
  with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and
  swap_pager_unswapped() but keep the object locking references for
  documentation.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-28 22:07:23 +00:00
Attilio Rao
83b375ea16 Acquire read lock on the src object for vm_fault_copy_entry().
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-22 15:11:00 +00:00
Attilio Rao
9af6d512f5 o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
  to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
  operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() to work
  mostl of the times only with readlocks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
2013-05-21 20:38:19 +00:00
Konstantin Belousov
4fab678be2 Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:04:00 +00:00
Alan Cox
c141ae7f49 Relax the object locking in vm_fault_prefault(). A read lock suffices.
Reviewed by:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-05-17 19:02:36 +00:00
Alan Cox
767a6420bc Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by:    attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-17 18:49:43 +00:00
Attilio Rao
7e226537c7 o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free pages queues really by domain and not rely on
  definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
  function that is called when calculating the allocation domain.
  The RR counter is kept, currently, per-thread.
  In the future it is expected that such function evolves in a real
  policy decision referee, based on specific informations retrieved by
  per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
  It is responsibility for every architecture willing to support multiple
  memory domains to correctly probe vm_ndomains along with mem_affinity
  segments attributes.  Those two values are supposed to remain always
  consistent.
  Please also note that vm_ndomains and td_dom_rr_idx are both int
  because segments already store domains as int.  Ideally u_int would
  have much more sense. Probabilly this should be cleaned up in the
  future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().

Sponsored by:	EMC / Isilon storage division
Partly obtained from:	jeff
Reviewed by:	alc
Tested by:	jeff
2013-05-13 15:40:51 +00:00
Peter Wemm
df839389c5 Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.
2013-05-13 07:09:31 +00:00
Alan Cox
404eb1b3fd Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once.  This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by:	EMC / Isilon Storage Division
2013-05-12 16:50:18 +00:00
Alan Cox
9f2e600890 To reduce the amount of arithmetic performed in the various radix tree
functions, reverse the numbering scheme for the levels.  The highest
numbered level in the tree now appears near the root instead of the leaves.

Sponsored by:	EMC / Isilon Storage Division
2013-05-11 18:01:41 +00:00
Attilio Rao
d0b5855eb2 Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of
MAXMEMDOM.
This unbreak builds.

Sponsored by:	EMC / Isilon storage division
Reported by:	adrian, jeli
2013-05-08 10:55:39 +00:00
Attilio Rao
941646f5ec Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept.  The change should also be useful
for consolidation and consistency.

Sponsored by:	EMC / Isilon storage division
Obtained from:	jeff
Reviewed by:	alc
2013-05-07 22:46:24 +00:00
Alan Cox
bb0e1de4ab Remove a redundant call to panic() from vm_radix_keydiff(). The assertion
before the loop accomplishes the same thing.

Sponsored by:	EMC / Isilon Storage Division
2013-05-07 18:45:34 +00:00
Alan Cox
2d4b9a6438 Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically,
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node.  Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.

Sponsored by:	EMC / Isilon Storage Division
Reviewed by:	attilio
Tested by:	pho
2013-05-04 22:50:15 +00:00
John Baldwin
f5c4b077be Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist()
  to allocate a page from a specific freelist.  In the NUMA case it did not
  properly map the public VM_FREELIST_* constants to the correct backing
  freelists, nor did it try all NUMA domains for allocations from
  VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread and each call to
  vm_phys_alloc_freelist_pages() fetched the current domain to choose
  which freelist to use.  If a thread migrated domains during the loop
  in vm_phys_alloc_pages() it could skip one of the freelists.  If the
  other freelists were out of memory then it is possible that
  vm_phys_alloc_pages() would fail to allocate a page even though pages
  were available resulting in a panic in vm_page_alloc().

Reviewed by:	alc
MFC after:	1 week
2013-05-03 18:58:37 +00:00
Konstantin Belousov
53f5f8a0e1 Add a hint suggesting why tmpfs does not need a special case there. 2013-05-02 18:35:12 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Konstantin Belousov
e5f299ff76 Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode'
v_object of non OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:25:09 +00:00
Konstantin Belousov
9b8851faae Assert that the object type for the vnode' non-NULL v_object, passed
to vnode_pager_setsize(), is either OBJT_VNODE, or, if vnode was
already reclaimed, OBJT_DEAD.  Note that the later is only possible
due to some filesystems, in particular, nfsiods from nfs clients, call
vnode_pager_setsize() with unlocked vnode.

More, if the object is terminated, do not perform the resizing
operation.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 week
2013-04-28 19:19:26 +00:00
Konstantin Belousov
6ded84276d Convert panic() into KASSERT().
Reviewed by:	alc
MFC after:	1 week
2013-04-28 18:40:55 +00:00
Alan Cox
82af926a57 Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le().
This call is clearing bits from the key that will be set again by the next
line.

Sponsored by:	EMC / Isilon Storage Division
2013-04-28 08:29:00 +00:00
Alan Cox
40076ebc5c Avoid some lookup restarts in vm_radix_lookup_{ge,le}().
Sponsored by:	EMC / Isilon Storage Division
2013-04-27 16:44:59 +00:00
Gleb Smirnoff
08a3102c0b Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus
isn't yet initialized. Otherwise we will panic at first allocation later.

Sponsored by:	Nginx, Inc.
2013-04-22 09:02:23 +00:00
Alan Cox
384875a3a6 Simplify vm_radix_{add,dec}lev().
Sponsored by:	EMC / Isilon Storage Division
2013-04-22 01:26:13 +00:00
Alan Cox
880659fe81 When calculating the number of reserved nodes, discount the pages that will
be used to store the nodes.

Sponsored by:	EMC / Isilon Storage Division
2013-04-18 05:34:33 +00:00
Alan Cox
a08f2cf69e Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child.  This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes.  Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions.  For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-15 06:12:00 +00:00
Alan Cox
6f9c0b15bb Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child.  This change is the first step in
eliminating those unnecessary level zero interior nodes.  Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.

Reviewed by:	attilio, jeff (an earlier version)
Sponsored by:	EMC / Isilon Storage Division
2013-04-12 20:21:28 +00:00
Gleb Smirnoff
85dcf349c1 Convert UMA code to C99 uintXX_t types. 2013-04-09 17:43:48 +00:00
Gleb Smirnoff
04fc5741e0 Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by:	alc
2013-04-09 17:25:15 +00:00
Gleb Smirnoff
8cf455b8d9 Since now we support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but growth isn't
significant. Taking kmem zones as example, only the 32 byte
zone is affected, ipers is reduced from 113 to 112.

In collaboration with:	kib
2013-04-09 15:15:52 +00:00