Commit Graph

2464 Commits

Author SHA1 Message Date
Alan Cox
8fece8c367 Two small changes to vm_map_pmap_enter():
1) Eliminate an unnecessary check for fictitious pages.  Specifically,
only device-backed objects contain fictitious pages and the object is
not device-backed.

2) Change the types of "psize" and "tmpidx" to vm_pindex_t in order to
prevent possible wrap around with extremely large maps and objects,
respectively.  Observed by: tegge (last summer)
2007-03-25 19:33:40 +00:00
Alan Cox
768131d293 vm_page_busy() no longer requires the page queues lock to be held. Reduce
the scope of the page queues lock in vm_fault() accordingly.
2007-03-23 06:11:25 +00:00
Alan Cox
c5474b8f18 Change the order of lock reacquisition in vm_object_split() in order to
simplify the code slightly.  Add a comment concerning lock ordering.
2007-03-22 07:02:43 +00:00
Alan Cox
d8810d894d Use PCPU_LAZY_INC() to update page fault statistics. 2007-03-05 18:55:14 +00:00
John Baldwin
8db5fc58ff Use pause() in vm_object_deallocate() to yield the CPU to the lock holder
rather than a tsleep() on &proc0.  The only wakeup on &proc0 is intended
to awaken the swapper, not random threads blocked in
vm_object_deallocate().
2007-02-27 19:40:26 +00:00
John Baldwin
4d70511ac3 Use pause() rather than tsleep() on stack variables and function pointers. 2007-02-27 17:23:29 +00:00
Alan Cox
9f5c801b94 Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to a OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage().  This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS.  This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage().  It is no longer used.
2007-02-25 06:14:58 +00:00
Alan Cox
0cd31a0d75 Change the page's CLEANCHK flag from being a page queue mutex synchronized
flag to a vm object mutex synchronized flag.
2007-02-22 06:15:52 +00:00
Alan Cox
711585d087 Enable vm_page_free() and vm_page_free_zero() to be called on some pages
without the page queues lock being held, specifically, pages that are not
contained in a vm object and not a member of a page queue.
2007-02-18 05:54:42 +00:00
Alan Cox
ba000fb2c1 Remove a stale comment. Add punctuation to a nearby comment. 2007-02-17 19:37:00 +00:00
Alan Cox
d3d029bd62 Relax the page queue lock assertions in vm_page_remove() and
vm_page_free_toq() to account for recent changes that allow
vm_page_free_toq() to be called on some pages without the page queues lock
being held, specifically, pages that are not contained in a vm object and
not a member of a page queue.  (Examples of such pages include page table
pages, pv entry pages, and uma small alloc pages.)
2007-02-15 05:43:38 +00:00
Alan Cox
7d60988bad Avoid the unnecessary acquisition of the free page queues lock when a page
is actually being added to the hold queue, not the free queue.  At the same
time, avoid unnecessary tests to wake up threads waiting for free memory
and the idle thread that zeroes free pages.  (These tests will be performed
later when the page finally moves from the hold queue to the free queue.)
2007-02-14 07:05:55 +00:00
Robert Watson
1e319f6db3 Add uma_set_align() interface, which will be called at most once during
boot by MD code to indicated detected alignment preference.  Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment.  If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired.  This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.

Reviewed by:	jeff, alc
Submitted by:	attilio
2007-02-11 20:13:52 +00:00
Alan Cox
5351a2488a Use the free page queue mutex instead of the page queue mutex to
synchronize sleeping and waking of the zero idle thread.
2007-02-11 05:18:40 +00:00
John Baldwin
e8865caffb - Move 'struct swdevt' back into swap_pager.h and expose it to userland.
- Restore support for fetching swap information from crash dumps via
  kvm_get_swapinfo(3) to fix pstat -T/-s on crash dumps.

Reviewed by:	arch@, phk
MFC after:	1 week
2007-02-07 17:43:11 +00:00
Alan Cox
e9f995d824 Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the
vm page queue free mutex instead of the vm page queue mutex.
2007-02-07 06:37:30 +00:00
Alan Cox
3ae3919d0b Change the free page queue lock from a spin mutex to a default (blocking)
mutex.  With the demise of Alpha support, there is no longer a reason for
it to be a spin mutex.
2007-02-05 06:02:55 +00:00
Mohan Srinivasan
6c125b8df6 Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@
2007-01-25 01:05:23 +00:00
Mohan Srinivasan
7738029183 Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wakeup all such blocked
procsses instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewd by: ups@
2007-01-24 22:49:11 +00:00
Jeff Roberson
f0393f063a - Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty.  The few asserts and thread state
   setting were moved to the individual schedulers.  sched_add() was
   chosen to displace it for naming consistency reasons.
 - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
   different on all three schedulers where it was only called in one place
   each.
 - Remove the long ifdef'd out remrunqueue code.
 - Remove the now redundant ts_state.  Inspect the thread state directly.
 - Don't set TSF_* flags from kern_switch.c, we were only doing this to
   support a feature in one scheduler.
 - Change sched_choose() to return a thread rather than a td_sched.  Also,
   rely on the schedulers to return the idlethread.  This simplifies the
   logic in choosethread().  Aside from the run queue links kern_switch.c
   mostly does not care about the contents of td_sched.

Discussed with:	julian

 - Move the idle thread loop into the per scheduler area.  ULE wants to
   do something different from the other schedulers.

Suggested by:	jhb

Tested on:	x86/amd64 sched_{4BSD, ULE, CORE}.
2007-01-23 08:46:51 +00:00
Xin LI
f67af5c918 Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form. 2007-01-17 15:05:52 +00:00
Robert Watson
635fd50514 Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call.  This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after:	3 weeks
Discussed with:	jhb
2007-01-10 21:04:43 +00:00
Alan Cox
e6eaadba43 Declare the map entry created by kmem_init() for the range from
VM_MIN_KERNEL_ADDRESS to the end of the kernel's bootstrap data as
MAP_NOFAULT.
2007-01-07 07:32:04 +00:00
John Baldwin
663b416f16 - Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
  exhausted so that there's at least a warning before a box that runs out
  of swapzone space before running out of swap space deadlocks.

MFC after:	1 week
Reviwed by:	alc
2007-01-05 19:09:01 +00:00
Alan Cox
73000556e8 Optimize vm_object_split(). Specifically, make the number of iterations
equal to the number of physical pages that are renamed to the new object
rather than the new object's virtual size.
2006-12-17 20:14:43 +00:00
Alan Cox
95442adf05 Simplify the computation of the new object's size in vm_object_split(). 2006-12-16 08:17:07 +00:00
Kip Macy
35d10226b7 Remove the requirement that phys_avail be sorted in ascending order
by explicitly finding the lowest and highest addresses when calculating
the size of the vm_pages array

Reviewed by :alc
2006-12-08 08:44:47 +00:00
Julian Elischer
ad1e7d285a Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
Ruslan Ermilov
9bed18a493 The clean_map has been made local to vm_init.c long ago. 2006-11-20 16:23:34 +00:00
Ruslan Ermilov
ef1b7c4804 Remove a redundant pointer-type variable. 2006-11-20 08:33:55 +00:00
Ruslan Ermilov
276096bb3e When counting vm totals, skip unreferenced objects, including
vnodes representing mounted file systems.

Reviewed by:	alc
MFC after:	3 days
2006-11-20 00:16:00 +00:00
Alan Cox
0f3b612a06 There is no point in setting PG_REFERENCED on kmem_object pages because
they are "unmanaged", i.e., non-pageable, pages.

Remove a stale comment.
2006-11-13 00:27:02 +00:00
Alan Cox
44b8bd66f9 Make pmap_enter() responsible for setting PG_WRITEABLE instead
of its caller.  (As a beneficial side-effect, a high-contention
acquisition of the page queues lock in vm_fault() is eliminated.)
2006-11-12 21:48:34 +00:00
Alan Cox
49c3b92531 I misplaced the assertion that was added to vm_page_startup() in the
previous change.  Correct its placement.
2006-11-08 19:11:54 +00:00
Alan Cox
9ad3296a25 Simplify the construction of the free queues in vm_page_startup(). Add
an assertion to test a hypothesis concerning other redundant computation
in vm_page_startup().
2006-11-08 18:43:47 +00:00
Alan Cox
815bc69fb0 Ensure that the page's oflags field is initialized by contigmalloc(). 2006-11-08 06:23:29 +00:00
Robert Watson
acd3428b7d Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
John Birrell
8460a577a4 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
Robert Watson
ae4e9636ac Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after:	3 days
2006-10-26 12:55:32 +00:00
Alan Cox
66bdd5d619 The page queues lock is no longer required by vm_page_wakeup(). 2006-10-23 05:27:31 +00:00
Alan Cox
2a53696fb8 The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup().  Reduce or eliminate its use accordingly.
2006-10-22 21:18:48 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Alan Cox
9af80719db Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.
2006-10-22 04:28:14 +00:00
Alan Cox
9fea8cad08 Eliminate unnecessary PG_BUSY tests. They originally served a purpose
that is now handled by vm object locking.
2006-10-21 21:02:04 +00:00
Alan Cox
7e2393ff51 Long ago, revision 1.22 of vm/vm_pager.h introduced a bug. Specifically,
it introduced a check after the call to file system's get pages method
that assumes that the get pages method does not change the array of pages
that is passed to it.  In the case of vnode_pager_generic_getpages(),
this assumption has been incorrect.  The contents of the array of pages
may be shifted by vnode_pager_generic_getpages().  Likely, the problem
has been hidden by vnode_pager_haspage() limiting the set of pages that
are passed to vnode_pager_generic_getpages() such that a shift never
occurs.

The fix implemented herein is to adjust the pointer to the array of pages
rather than shifting the pages within the array.

MFC after: 3 weeks
Fix suggested by: tegge
2006-10-14 23:21:48 +00:00
Alan Cox
bff763439b Change vnode_pager_addr() such that on returning it distinguishes between
an error returned by VOP_BMAP() and a hole in the file.

Change the callers to vnode_pager_addr() such that they return
VM_PAGER_ERROR when VOP_BMAP fails instead of a zero-filled page.

Reviewed by: tegge
MFC after: 3 weeks
2006-10-14 22:09:03 +00:00
Kip Macy
600c53adf9 sun4v requires TSBs (translation storage buffers) to be contiguous and be
size aligned requiring heavy usage of vm_page_alloc_contig

This change makes vm_page_alloc_contig SMP safe

Approved by: scottl (acting as backup for mentor rwatson)
2006-10-12 04:41:39 +00:00
Alan Cox
1de11f1af3 Distinguish between two distinct kinds of errors from VOP_BMAP() in
vnode_pager_generic_getpages(): (1) that VOP_BMAP() is unsupported by the
underlying file system and (2) an error in performing the VOP_BMAP().
Previously, vnode_pager_generic_getpages() assumed that all errors were
of the first type.  If, in fact, the error was of the second type, the
likely outcome was for the process to become permanently blocked on a busy
page.

MFC after: 3 weeks
Reviewed by: tegge
2006-10-10 18:26:18 +00:00
Alan Cox
f4f83da02d Change vnode_pager_generic_getpages() so that it does not panic if the
given file is sparse.  Instead, it zeroes the requested page.

Reviewed by: tegge
PR: kern/98116
MFC after: 3 days
2006-10-08 20:26:16 +00:00
Ken Smith
a9a5d47c85 Fix two minor style(9) nits in v1.313 which were noticed during an
MFC review.  alc@ will be MFCing V1.313 plus style fix to RELENG_6.
2006-09-29 00:20:56 +00:00
Alan Cox
e1cb7bc081 Make vm_page_release_contig() static. 2006-09-03 22:24:08 +00:00
Alan Cox
eb4bbba83a Refactor vm_page_sleep_if_busy() so that the test for a busy page is
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep.  In this case, inlining the test actually makes the
kernel smaller.
2006-08-27 19:50:13 +00:00
Alan Cox
1f081553cc Prevent a call to contigmalloc() that asks for more physical memory than
the machine has from causing a panic.

Submitted by: Michael Plass
PR: 101668
MFC after: 3 days
2006-08-26 02:43:23 +00:00
Alan Cox
09ef0d6e0c The return value from vm_pageq_add_new_page() is not used. Eliminate it. 2006-08-25 04:36:19 +00:00
Alan Cox
b276ae6f6a Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and
eliminate their declarations from various source files.
2006-08-21 06:27:28 +00:00
Alan Cox
38498c2123 vm_page_zero_idle()'s return value serves no purpose. Eliminate it. 2006-08-21 00:55:05 +00:00
Alan Cox
4f9d17d8ab Page flags are reset on (re)allocation. There is no need to clear any
flags except for PG_ZERO in vm_page_free_toq().
2006-08-21 00:34:31 +00:00
Alan Cox
b146f9e5d2 Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag.  Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock.  Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().
2006-08-13 00:11:09 +00:00
Alan Cox
25017df472 Ensure that the page's new field for object-synchronized flags is always
initialized to zero.

Call vm_page_sleep_if_busy() instead of duplicating its implementation in
vm_page_grab().
2006-08-11 17:18:58 +00:00
Alan Cox
75db2abb2e Change vm_page_cowfault() so that it doesn't allocate a pre-busied page. 2006-08-10 04:48:29 +00:00
Alan Cox
5786be7cc7 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
Alan Cox
e7e56b2889 Eliminate the acquisition and release of the page queues lock around a call
to vm_page_sleep_if_busy().
2006-08-06 00:17:17 +00:00
Alan Cox
e74814b66a Change vm_page_sleep_if_busy() so that it no longer requires the caller to
hold the page queues lock.
2006-08-06 00:15:40 +00:00
Alan Cox
ab1661cb76 Remove a stale comment. 2006-08-05 19:07:07 +00:00
Alan Cox
91449ce98c When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.
2006-08-03 23:56:11 +00:00
Alan Cox
78985e424a Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write().  However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567.  The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)
2006-08-01 19:06:06 +00:00
Alan Cox
604c2bbc34 Export the number of object bypasses and collapses through sysctl. 2006-07-22 22:31:57 +00:00
Alan Cox
2cf139527c Retire debug.mpsafevm. None of the architectures supported in CVS require
it any longer.
2006-07-21 23:22:49 +00:00
Alan Cox
af51d7bf57 Eliminate OBJ_WRITEABLE. It hasn't been used in a long time. 2006-07-21 06:40:29 +00:00
Alan Cox
3cad40e517 Add pmap_clear_write() to the interface between the virtual memory
system's machine-dependent and machine-independent layers.  Once
pmap_clear_write() is implemented on all of our supported
architectures, I intend to replace all calls to pmap_page_protect() by
calls to pmap_clear_write().  Why?  Both the use and implementation of
pmap_page_protect() in our virtual memory system has subtle errors,
specifically, the management of execute permission is broken on some
architectures.  The "prot" argument to pmap_page_protect() should
behave differently from the "prot" argument to other pmap functions.
Instead of meaning, "give the specified access rights to all of the
physical page's mappings," it means "don't take away the specified
access rights from all of the physical page's mappings, but do take
away the ones that aren't specified."  However, owing to our i386
legacy, i.e., no support for no-execute rights, all but one invocation
of pmap_page_protect() specifies VM_PROT_READ only, when the intent
is, in fact, to remove only write permission.  Consequently, a
faithful implementation of pmap_page_protect(), e.g., ia64, would
remove execute permission as well as write permission.  On the other
hand, some architectures that support execute permission have
basically ignored whether or not VM_PROT_EXECUTE is passed to
pmap_page_protect(), e.g., amd64 and sparc64.  This change represents
the first step in replacing pmap_page_protect() by the less subtle
pmap_clear_write() that is already implemented on amd64, i386, and
sparc64.

Discussed with: grehan@ and marcel@
2006-07-20 17:48:41 +00:00
Robert Watson
a0d4b0aeaa Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by:	Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>
2006-07-18 01:13:18 +00:00
Alan Cox
2e9f4a698d Ensure that vm_object_deallocate() doesn't dereference a stale object
pointer: When vm_object_deallocate() sleeps because of a non-zero
paging in progress count on either object or object's shadow,
vm_object_deallocate() must ensure that object is still the shadow's
backing object when it reawakens.  In fact, object may have been
deallocated while vm_object_deallocate() slept.  If so, reacquiring
the lock on object can lead to a deadlock.

Submitted by: ups@
MFC after: 3 weeks
2006-07-17 06:45:03 +00:00
Robert Watson
eabadd9e4b Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.
2006-07-16 22:53:26 +00:00
Alan Cox
ff5ff76116 Set debug.mpsafevm to true on PowerPC. (Now, by default, all architectures
in CVS have debug.mpsafevm set to true.)

Tested by: grehan@
2006-07-10 07:08:05 +00:00
John Baldwin
9bdaa43379 Move the code to handle the vm.blacklist tunable up a layer into
vm_page_startup().  As a result, we now only lookup the tunable once
instead of looking it up once for every physical page of memory in the
system.  This cuts out about a 1 second or so delay in boot on x86
systems.  The delay is much larger and more noticable on sun4v apparently.

Reported by:	kmacy
MFC after:	1 week
2006-06-23 16:44:24 +00:00
Konstantin Belousov
455dd7d4c7 Make the mincore(2) return ENOMEM when requested range is not fully mapped.
Requested by:	Bruno Haible <bruno at clisp org>
Reviewed by:	alc
Approved by:	pjd (mentor)
MFC after:	1 month
2006-06-21 12:59:05 +00:00
Alan Cox
379fb6429d Use ptoa(psize) instead of size to compute the end of the mapping in
vm_map_pmap_enter().
2006-06-17 08:45:01 +00:00
Stephan Uphoff
2053c12705 Remove mpte optimization from pmap_enter_quick().
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.

Reviewed by: alc,jhb,peter,ps
2006-06-15 01:01:06 +00:00
Alan Cox
d2d9e24a89 Correct an error in the previous revision that could lead to a panic:
Found mapped cache page.  Specifically, if cnt.v_free_count dips below
cnt.v_free_reserved after p_start has been set to a non-NULL value,
then vm_map_pmap_enter() would break out of the loop and incorrectly
call pmap_enter_object() for the remaining address range.  To correct
this error, this revision truncates the address range so that
pmap_enter_object() will not map any cache pages.

In collaboration with: tegge@
Reported by: kris@
2006-06-14 17:48:45 +00:00
Alan Cox
e4c7a7b169 Enable debug.mpsafevm on arm by default.
Tested by: cognet@
2006-06-10 05:29:37 +00:00
Alan Cox
ce142d9ec0 Introduce the function pmap_enter_object(). It maps a sequence of resident
pages from the same object.  Use it in vm_map_pmap_enter() to reduce the
locking overhead of premapping objects.

Reviewed by: tegge@
2006-06-05 20:35:27 +00:00
Paul Saab
4cbb1c1aaa Fix minidumps to include pages allocated via pmap_map on amd64.
These pages are allocated from the direct map, and were not previous
tracked.  This included the vm_page_array and the early UMA bootstrap
pages.

Reviewed by:	peter
2006-05-31 22:55:23 +00:00
Tor Egge
57051fdc4b Close race between vmspace_exitfree() and exit1() and races between
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.

Factor out part of exit1() into new function vmspace_exit().  Attach
to vmspace0 to allow old vmspace to be freed earlier.

Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process.  Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.

Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().

Reviewed by:	alc
2006-05-29 21:28:56 +00:00
Robert Watson
4f538c7480 When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type.  The failure
will be separately recorded with the bucket type.  This my eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after:	2 weeks
Reported by:	ps, Peter J. Blok <pblok at bsd4all dot org>,
		OxY <oxy at field dot hu>,
		Gabor MICSKO <gmicskoa at szintezis dot hu>
2006-05-21 23:25:32 +00:00
Alan Cox
8f8790a76d Simplify the implementation of vm_fault_additional_pages() based upon the
object's memq being ordered.  Specifically, replace repeated calls to
vm_page_lookup() by two simple constant-time operations.

Reviewed by: tegge
2006-05-13 20:05:44 +00:00
Pawel Jakub Dawidek
61f73c79da Use better order here. 2006-05-10 06:50:44 +00:00
Alan Cox
fda28c1440 Add synchronization to vm_pageq_add_new_page() so that it can be called
safely after kernel initialization.  Remove GIANT_REQUIRED.

MFC after: 6 weeks
2006-04-25 17:27:24 +00:00
Tom Rhodes
89eae00b84 It seems that POSIX would rather ENODEV returned in place of EINVAL when
trying to mmap() an fd that isn't a normal file.

Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Submitted by:	fanf
2006-04-21 07:17:25 +00:00
Peter Wemm
c0345a84aa Introduce minidumps. Full physical memory crash dumps are still available
via the debug.minidump sysctl and tunable.

Traditional dumps store all physical memory.  This was once a good thing
when machines had a maximum of 64M of ram and 1GB of kvm.  These days,
machines often have many gigabytes of ram and a smaller amount of kvm.
libkvm+kgdb don't have a way to access physical ram that is not mapped
into kvm at the time of the crash dump, so the extra ram being dumped
is mostly wasted.

Minidumps invert the process.  Instead of dumping physical memory in
in order to guarantee that all of kvm's backing is dumped, minidumps
instead dump only memory that is actively mapped into kvm.

amd64 has a direct map region that things like UMA use.  Obviously we
cannot dump all of the direct map region because that is effectively
an old style all-physical-memory dump.  Instead, introduce a bitmap
and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that
allow certain critical direct map pages to be included in the dump.
uma_machdep.c's allocator is the intended consumer.

Dumps are a custom format.  At the very beginning of the file is a header,
then a copy of the message buffer, then the bitmap of pages present in
the dump, then the final level of the kvm page table trees (2MB mappings
are expanded into a 4K page mappings), then the sparse physical pages
according to the bitmap.  libkvm can now conveniently access the kvm
page table entries.

Booting my test 8GB machine, forcing it into ddb and forcing a dump
leads to a 48MB minidump.  While this is a best case, I expect minidumps
to be in the 100MB-500MB range.  Obviously, never larger than physical
memory of course.

minidumps are on by default.  It would want be necessary to turn them off
if it was necessary to debug corrupt kernel page table management as that
would mess up minidumps as well.

Both minidumps and regular dumps are supported on the same machine.
2006-04-21 04:24:50 +00:00
John Baldwin
0f180a7cce Change msleep() and tsleep() to not alter the calling thread's priority
if the specified priority is zero.  This avoids a race where the calling
thread could read a snapshot of it's current priority, then a different
thread could change the first thread's priority, then the original thread
would call sched_prio() inside msleep() undoing the change made by the
second thread.  I used a priority of zero as no thread that calls msleep()
or tsleep() should be specifying a priority of zero anyway.

The various places that passed 'curthread->td_priority' or some variant
as the priority now pass 0.
2006-04-17 18:20:38 +00:00
Pawel Jakub Dawidek
0909f38a3c On shutdown try to turn off all swap devices. This way GEOM providers are
properly closed on shutdown.

Requested by:	ru
Reviewed by:	alc
MFC after:	2 weeks
2006-04-10 10:03:41 +00:00
Peter Wemm
b9eee07e36 Remove the unused sva and eva arguments from pmap_remove_pages(). 2006-04-03 21:16:10 +00:00
Joseph Koshy
49874f6ea3 MFP4: Support for profiling dynamically loaded objects.
Kernel changes:

  Inform hwpmc of executable objects brought into the system by
  kldload() and mmap(), and of their removal by kldunload() and
  munmap().  A helper function linker_hwpmc_list_objects() has been
  added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
  the list of currently loaded kernel modules.

  The unused `MAPPINGCHANGE' event has been deprecated in favour
  of separate `MAP_IN' and `MAP_OUT' events; this change reduces
  space wastage in the log.

  Bump the hwpmc's ABI version to "2.0.00".  Teach hwpmc(4) to
  handle the map change callbacks.

  Change the default per-cpu sample buffer size to hold
  32 samples (up from 16).

  Increment __FreeBSD_version.

libpmc(3) changes:

  Update libpmc(3) to deal with the new events in the log file; bring
  the pmclog(3) manual page in sync with the code.

pmcstat(8) changes:

  Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
  (mapfile name), "-q"/"-v" (verbosity control).  Option "-k" now
  takes a kernel directory as its argument but will also work with
  the older invocation syntax.

  Rework string handling in pmcstat(8) to use an opaque type for
  interned strings.  Clean up ELF parsing code and add support for
  tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).

  Report statistics at the end of a log conversion run depending
  on the requested verbosity level.

Reviewed by:	jhb, dds (kernel parts of an earlier patch)
Tested by:	gallatin (earlier patch)
2006-03-26 12:20:54 +00:00
Warner Losh
62a59e8f0d Remove leading __ from __(inline|const|signed|volatile). They are
obsolete.  This should reduce diffs to NetBSD as well.
2006-03-08 06:31:46 +00:00
Tor Egge
34ef4672d2 Ignore dirty pages owned by "dead" objects. 2006-03-08 00:51:00 +00:00
Tor Egge
3b582b4e72 Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held.  Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
2006-03-02 22:13:28 +00:00
Tor Egge
6b085058e4 Hold extra reference to vm object while cleaning pages. 2006-03-02 21:38:38 +00:00
John Baldwin
ca95b5146a Lock the vm_object while checking its type to see if it is a vnode-backed
object that requires Giant in vm_object_deallocate().  This is somewhat
hairy in that if we can't obtain Giant directly, we have to drop the
object lock, then lock Giant, then relock the object lock and verify that
we still need Giant.  If we don't (because the object changed to OBJT_DEAD
for example), then we drop Giant before continuing.

Reviewed by:	alc
Tested by:	kris
2006-02-21 22:09:54 +00:00
Tor Egge
625e6c0af4 Expand scope of marker to reduce the number of page queue scan restarts. 2006-02-17 21:02:39 +00:00
Tor Egge
db27dcc0f0 Check return value from nonblocking call to vn_start_write(). 2006-02-17 18:22:19 +00:00
Stephan Uphoff
224409590d When the VM needs to allocated physical memory pages (for non interrupt use)
and it has not plenty of free pages it tries to free pages in the cache queue.
Unfortunately freeing a cached page requires the locking of the object that
owns the page. However in the context of allocating pages we may not be able
to lock the object and thus can only TRY to lock the object. If the locking try
fails the cache page can not be freed and is activated to move it out of the way
so that we may try to free other cache pages.

If all pages in the cache belong to objects that are currently locked the
cache queue can be emptied without freeing a single page. This scenario caused
two problems:

    1)  vm_page_alloc always failed allocation when it tried freeing pages from
        the cache queue and failed to do so. However if there are more than
        cnt.v_interrupt_free_min pages on the free list it should return pages
        when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause
        resource exhaustion deadlocks.

    2)  Threads than need to allocate pages spend a lot of time cleaning up the
        page queue without really getting anything done while the pagedaemon
         needs to work overtime to refill the cache.

This change fixes the first problem. (1)

Reviewed by:	tegge@
2006-02-15 22:29:53 +00:00
Robert Watson
082dc776db Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after:	3 days
2006-02-11 19:20:56 +00:00
Jeff Roberson
b73f64c484 - Fix silly VI locking that is used to check a single flag. The vnode
lock also protects this flag so it is not necessary.
 - Don't rely on v_mount to detect whether or not we've been recycled, use
   the more appropriate VI_DOOMED instead.

Sponsored by:	Isilon Systems, Inc.
MFC After:	1 week
2006-02-06 10:14:12 +00:00
Alan Cox
3b7db47d7e Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.
2006-02-04 22:37:10 +00:00
Tor Egge
44ed341759 Adjust old comment (present in rev 1.1) to match changes in rev 1.82.
PR:	kern/92509
Submitted by:   "Bryan Venteicher" <bryanv@daemoninthecloset.org>
2006-02-02 21:55:38 +00:00
Yaroslav Tykhiy
731959b118 Use off_t for file size passed to vnode_create_vobject().
The former type, size_t, was causing truncation to 32 bits on i386,
which immediately led to undersizing of VM objects backed by
files >4GB.  In particular, sendfile(2) was broken for such files.

PR:		kern/92243
MFC after:	5 days
2006-02-01 12:43:13 +00:00
Jeff Roberson
c05e22d44b - Install a temporary bandaid in vm_object_reference() that will stop
mtx_assert()s from triggering until I find a real long-term solution.
2006-02-01 09:47:02 +00:00
Alan Cox
6c237adcea Change #if defined(DIAGNOSTIC) to KASSERT. 2006-01-31 19:06:51 +00:00
Pawel Jakub Dawidek
847a2a1716 Add buffer corruption protection (RedZone) for kernel's malloc(9).
It detects both: buffer underflows and buffer overflows bugs at runtime
(on free(9) and realloc(9)) and prints backtraces from where memory was
allocated and from where it was freed.

Tested by:	kris
2006-01-31 11:09:21 +00:00
Scott Long
a5cbb43e43 The change a few years ago of having contigmalloc start its scan at the top
of physical RAM instead of the bottom was a sound idea, but the implementation
left a lot to be desired.  Scans would spend considerable time looking at
pages that are above of the address range given by the caller, and multiple
calls (like what happens in busdma) would spend more time on top of that
rescanning the same pages over and over.

Solve this, at least for now, with two simple optimizations.  The first is
to not bother scanning high ordered pages that are outside of the provided
address range.  Second is to cache the page index from the last successful
operation so that subsequent scans don't have to restart from the top.  This
is conditional on the numpages argument being the same or greater between
calls.

MFC After: 2 weeks
2006-01-29 08:24:54 +00:00
John Baldwin
ffaf2c55a8 Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored.  Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string where as
WITNESS_CHECK evaluates to 0.  I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK.
2006-01-27 22:20:15 +00:00
Olivier Houchard
100650dee1 Make sure b_vp and b_bufobj are NULL before calling relpbuf(), as it asserts
they are. They should be NULL at this point, except if we're coming from
swapdev_strategy().
It should only affect the case where we're swapping directly on a file over
NFS.
2006-01-27 21:11:50 +00:00
Alan Cox
0034fd6fff Style: Add blank line after local variable declarations. 2006-01-27 21:06:37 +00:00
Alan Cox
82eedee4a4 Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)
2006-01-27 08:35:32 +00:00
Alan Cox
997e1c252b Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)
2006-01-27 07:28:51 +00:00
Alan Cox
cfc26cd69c Plug a leak in the newer contigmalloc() implementation. Specifically, if
a multipage allocation was aborted midway, the pages that were already
allocated were not always returned to the free list.

Submitted by: tegge
2006-01-26 05:51:26 +00:00
Jeff Roberson
df59a0fee7 - Avoid calling vm_object_backing_scan() when collapsing an object when
the resident page count matches the object size.  We know it fully backs
   its parent in this case.

Reviewed by:	acl, tegge
Sponsored by:	Isilon Systems, Inc.
2006-01-25 08:42:58 +00:00
Alan Cox
0883c2d739 The previous revision incorrectly changed a switch statement into an if
statement.  Specifically, a break statement that previously broke out of
the enclosing switch was not changed.  Consequently, the enclosing loop
terminated prematurely.

This could result in "vm_page_insert: page already inserted" panics.

Submitted by: tegge
2006-01-25 06:45:57 +00:00
Alan Cox
39fd9b639f With the recent changes to the implementation of page coloring, the
the option PQ_NOOPT is used exclusively by vm_pageq.c.  Thus, the
include of opt_vmpage.h can be removed from vm_page.h.
2006-01-24 19:24:54 +00:00
Alan Cox
fc3c1bc471 In vm_page_set_invalid() invalidate all of the page's mappings as soon as
any part of the page's contents is invalidated.

Submitted by: tegge
2006-01-24 07:21:38 +00:00
Alan Cox
02dd83311a Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.
2006-01-22 23:56:20 +00:00
John Baldwin
ca49f12fdb Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).
2006-01-06 18:03:45 +00:00
Alexander Leidinger
924771865a Convert the PAGE_SIZE check into a CTASSERT.
Suggested by:	jhb
2006-01-04 19:19:42 +00:00
Alexander Leidinger
1442a4476a Prevent divide by zero, use default values in case one of the divisor's
is zero.

Tested by:	Randy Bush <randy@psg.com>
2006-01-04 18:26:54 +00:00
Alexander Leidinger
ef39c05baa MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
   this allows to try different coloring algorithms without the need to
   touch every file [1]
 - make the page queue tuning values readable: sysctl vm.stats.pagequeue
 - autotuning of the page coloring values based upon the cache size instead
   of options in the kernel config (disabling of the page coloring as a
   kernel option is still possible)

MD changes:
 - detection of the cache size: only IA32 and AMD64 (untested) contains
   cache size detection code, every other arch just comes with a dummy
   function (this results in the use of default values like it was the
   case without the autotuning of the page coloring)
 - print some more info on Intel CPU's (like we do on AMD and Transmeta
   CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by:	Chad David <davidc@acns.ab.ca> [1]
Reviewed by:		alc, arch (in 2004)
Discussed with:		alc, Chad David, arch (in 2004)
2005-12-31 14:39:20 +00:00
Pawel Jakub Dawidek
d362c40d3a Improve memguard a bit:
- Provide tunable vm.memguard.desc, so one can specify memory type without
  changing the code and recompiling the kernel.
- Allow to use memguard for kernel modules by providing sysctl
  vm.memguard.desc, which can be changed to short description of memory
  type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
  short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
  doesn't exist (will be loaded later) or when no memory is allocated yet.
  If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of memory types comparsion and make safer/slower the
  default.
2005-12-30 11:45:07 +00:00
Tor Egge
b898bb1be3 Don't access fs->first_object after dropping reference to it.
The result could be a missed or extra giant unlock.

Reviewed by:	alc
2005-12-20 12:27:59 +00:00
Alan Cox
da61b9a69e Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space.  There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail.  When it fails, the nascent process is aborted
(SIGABRT).  Whereas, this reimplementation using sf_buf_alloc()
sleeps.  (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster.  Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.

Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks
2005-12-16 18:34:14 +00:00
Alan Cox
984922d761 Assert that the page that is given to vm_page_free_toq() does not have any
managed mappings.
2005-12-13 19:59:09 +00:00
Alan Cox
05406e6f33 Remove unneeded calls to pmap_remove_all(). The given page is not mapped.
Reviewed by: tegge
2005-12-11 22:06:57 +00:00
Alan Cox
717f7d5962 Simplify vmspace_dofree(). 2005-12-04 22:55:41 +00:00
Alan Cox
51016cdfd7 Eliminate unneeded preallocation at initialization.
Reviewed by: tegge
2005-12-03 22:41:15 +00:00
Alan Cox
8215781ba2 Eliminate unneeded preallocation at initialization.
Reviewed by: tegge
2005-12-03 19:37:29 +00:00
Alan Cox
97a0c226d6 Eliminate pmap_init2(). It's no longer used. 2005-11-20 06:09:49 +00:00
Alan Cox
7a35a21e7b Reimplement the reclamation of PV entries. Specifically, perform
reclamation synchronously from get_pv_entry() instead of
asynchronously as part of the page daemon.  Additionally, limit the
reclamation to inactive pages unless allocation from the PV entry zone
or reclamation from the inactive queue fails.  Previously, reclamation
destroyed mappings to both inactive and active pages.  get_pv_entry()
still, however, wakes up the page daemon when reclamation occurs.  The
reason being that the page daemon may move some pages from the active
queue to the inactive queue, making some new pages available to future
reclamations.

Print the "reclaiming PV entries" message at most once per minute, but
don't stop printing it after the fifth time.  This way, we do not give
the impression that the problem has gone away.

Reviewed by: tegge
2005-11-09 08:19:21 +00:00
Alan Cox
7e9d944218 If a physical page is mapped by two or more virtual addresses, transmitted
by the zero-copy sockets method, and written to before the transmission
completes, we need to destroy all of the existing mappings to the page,
not just the one that we fault on.  Otherwise, the mappings will no longer
be to the same page and changes made through one of the mappings will not
be visible through the others.

Observed by: tegge
2005-11-08 06:33:21 +00:00
Paul Saab
dd498befc4 Rate limit vnode_pager_putpages printfs to once a second. 2005-11-01 23:00:24 +00:00
Alan Cox
674b706ea0 Consider the zero-copy transmission of a page that was wired by mlock(2).
If a copy-on-write fault occurs on the page, the new copy should inherit
a part of the original page's wire count.

Submitted by: tegge
MFC after: 1 week
2005-11-01 04:30:21 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Alan Cox
f6d8983846 Use of the ZERO_COPY_SOCKETS options can result in an unusual state that
vm_object_backing_scan() was not written to handle.  Specifically, a wired
page within a backing object that is shadowed by a page within the shadow
object.  Handle this state by removing the wired page from the backing
object.  The wired page will be freed by socow_iodone().

Stop masking errors: If a page is being freed by vm_object_backing_scan(),
assert that it is no longer mapped rather than quietly destroying any
mappings.

Tested by: Harald Schmalzbauer
2005-10-22 18:46:38 +00:00
Robert Watson
64a266f9e8 Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by:	pjd
2005-10-20 21:28:31 +00:00
Robert Watson
48c5777e3d Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones.  Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after:	2 weeks
2005-10-20 16:39:33 +00:00
Diomidis Spinellis
9f5c1d1955 Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by:	bde
MFC after:	2 weeks
2005-10-12 06:56:00 +00:00
Dag-Erling Smørgrav
3803b26bae As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES.  Change it to accept boot_pages
as an additional argument.

MFC after:	2 weeks
2005-10-08 21:03:54 +00:00
Diomidis Spinellis
1e3090039d Update the vnode's access time after an mmap operation on it.
Before this change a copy operation with cp(1) would not update the
file access times.

According to the POSIX mmap(2) documentation: the st_atime field
of the mapped file may be marked for update at any time between the
mmap() call and the corresponding munmap() call. The initial read
or write reference to a mapped region shall cause the file's st_atime
field to be marked for update if it has not already been marked for
update.
2005-10-04 14:58:58 +00:00
John Baldwin
b65089ccb5 Trim a couple of unneeded includes. 2005-09-29 19:13:52 +00:00
Olivier Houchard
3cfc7651b2 Make sure we have a bufobj before calling bstrategy().
I'm not sure this is the right thing to do, but at least I don't panic
anymore when swapping on a NFS file without using md(4).

X-MFC after:      proper review
2005-09-21 15:01:09 +00:00
Peter Wemm
749474f2f5 Remove unused (but initialized) variable 'objsize' from vm_mmap() 2005-09-20 22:08:27 +00:00
Alan Cox
f353d3388f Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free.  It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock.  See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days
2005-09-09 06:03:08 +00:00
Alan Cox
57b5187b16 Eliminate an incorrect cast. 2005-09-07 01:42:30 +00:00
Alan Cox
ba8bca610c Pass a value of type vm_prot_t to pmap_enter_quick() so that it determine
whether the mapping should permit execute access.
2005-09-03 18:20:20 +00:00
Alexander Kabaev
857b66d505 Do not use vm_pager_init() to initialize vnode_pbuf_freecnt variable.
vm_pager_init() is run before required nswbuf variable has been set
to correct value. This caused system to run with single pbuf available
for vnode_pager. Handle both cluster_pbuf_freecnt and vnode_pbuf_freecnt
variable in the same way.

Reported by:	ade
Obtained from:	alc
MFC after:	2 days
2005-08-13 20:21:33 +00:00
Tor Egge
1113a8b44a Check for marker pages when scanning active and inactive page queues.
Reviewed by:	alc
2005-08-12 18:17:40 +00:00
Dag-Erling Smørgrav
cfa22bcc4c Introduce the vm.boot_pages tunable and sysctl, which controls the number
of pages reserved to bootstrap the kernel memory allocator.

MFC after:	2 weeks
2005-08-12 12:24:19 +00:00
Tor Egge
8dbca793a9 Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE
due to the vm object being locked.

When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine.  If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.

Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation.  Detect and handle relevant page queue changes.

This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.

Reviewed by:	alc
2005-08-10 00:17:36 +00:00
Suleiman Souhlal
4f12e0acb0 Use atomic operations on runningbufspace.
PR:		kern/84318
Submitted by:	ade
MFC after:	3 days
2005-08-08 22:44:10 +00:00
Robert Watson
8be9063ea1 Don't perform a nested include of opt_vmpage.h if LIBMEMSTAT is defined,
as opt_vmpage.h will not be available to user space library builds.  A
similar existing check is present for KLD_MODULE for similar reasons.

MFC after:	3 days
2005-08-04 10:05:11 +00:00
Robert Watson
af17e9a922 Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be
used from memstat_uma.c for the purposes of kvm access without lots
of additional unsafe includes.

MFC after:	3 days
2005-08-04 10:03:53 +00:00
Robert Watson
cbbb4a0089 Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after:	1 day
2005-07-25 00:47:32 +00:00
Alan Cox
ec9c9e7363 Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function.  Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
2005-07-20 19:06:06 +00:00
Robert Watson
f4ff923bdd Further UMA statistics related changes:
- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it to
  to update the zone's uz_frees statistic.  Previously, the statistic was
  updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
  the caller is freeing an allocated item, to be differentiated from
  situations where uma_zfree_internal() is used to tear down the item
  during slab teardown in order to invoke its fini() method.  Also use
  the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
  freeing an item, flush cache statistics back to the zone (since the
  zone lock and critical section are both held) to match the allocation
  case.

MFC after: 3 days
2005-07-20 18:47:42 +00:00
Alan Cox
15d2d31372 Eliminate an incorrect (and unnecessary) cast. 2005-07-20 18:41:08 +00:00
Robert Watson
ab3a57c04d Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics.  UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by:	yongari
MFC after:	1 week
2005-07-16 11:03:06 +00:00
Robert Watson
08ecce74bc Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by:	bde
MFC after:		1 week
2005-07-16 09:51:52 +00:00
Robert Watson
2450bbb872 Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation.  So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by:	yongari
MFC after:	1 week
2005-07-16 09:40:34 +00:00
Mike Silbersack
2018f30c01 Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.
2005-07-16 02:23:41 +00:00
Robert Watson
2019094a35 Track UMA(9) allocation failures by zone, and export via sysctl.
Requested by:	victor cruceru <victor dot cruceru at gmail dot com>
MFC after:	1 week
2005-07-15 23:34:39 +00:00
John Baldwin
b9d3b80521 Convert a remaining !fs.map->system_map to
fs.first_object->flags & OBJ_NEEDGIANT test that was missed in an earlier
revision.  This fixes mutex assertion failures in the debug.mpsafevm=0
case.

Reported by:	ps
MFC after:	3 days
2005-07-14 21:18:07 +00:00
Robert Watson
7a52a97eb3 Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
  definition of MAXCPUs used in the stream, and the number of zone
  records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
  size, resource allocation limits, current pages allocated, preferred
  bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
  includes the number of allocations and frees, as well as the number
  of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
  series of type descriptions, each consisting of a type header
  followed by a series of MAXCPUs uma_percpu_stat structures holding
  per-CPU allocation information.  Typical values of MAXCPU will be
  1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user swpace buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days.  This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after:	1 week
2005-07-14 16:35:13 +00:00
Robert Watson
773df9ab16 In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after:	1 week
2005-07-14 16:17:21 +00:00
Robert Watson
2c743d361a In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU.  Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this.  Previously this case wasn't
accounted for in user-visible statistics.

MFC after:	1 week
2005-07-14 16:13:46 +00:00
Mike Silbersack
cb6e5c1aba Change the panic in trash_ctor into just a printf for now. Once the reports
of panics in trash_ctor relating to mbufs have been examined and a fix
found, this will be turned back into a panic.

Approved by: re (rwatson)
2005-06-26 23:44:07 +00:00
Alan Cox
eafc7b549a Increase UMA_BOOT_PAGES to prevent a crash during initialization. See
http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed
description of the crash.

Reported by: Eric Anderson
Approved by: re (scottl)
MFC after: 3 days
2005-06-16 17:06:34 +00:00
Brian Feldman
a534973af4 The new contigmalloc(9) has a bad degenerate case where there were
many regions checked again and again despite knowing the pages
contained were not usable and only satisfied the alignment constraints
This case was compounded, especially for large allocations, by the
practice of looping from the top of memory so as to keep out of the
important low-memory regions.  While the old contigmalloc(9) has the
same problem, it is not as noticeable due to looping from the low
memory to high.

This degenerate case is fixed, as well as reversing the sense of the
rest of the loops within it, to provide a tremendous speed increase.
This makes the best case O(n * VM overhead) much more likely than the
worst case O(4 * VM overhead).  For comparison, the worst case for old
contigmalloc would be O(5 * VM overhead) in addition to its strategy
of turning used memory into free being highly pessimal.

Also, fix a bug that in practice most likely couldn't have been triggered,
int the new contigmalloc(9): it walked backwards from the end of memory
without accounting for how many pages it needed.  Potentially, nonexistant
pages could have been mapped.  This hasn't occurred because the kernel
generally requests as its first contigmalloc(9) a single page.

Reported by: Nicolas Dehaine <nicko@stbernard.com>, wes
MFC After: 1 month
More testing by: Nicolas Dehaine <nicko@stbernard.com>, wes
2005-06-11 00:05:16 +00:00
Alan Cox
25f2e1c8cc Add a comment to the effect that fictitious pages do not require the
initialization of their machine-dependent fields.
2005-06-10 17:27:54 +00:00
Alan Cox
1c245ae7d1 Introduce a procedure, pmap_page_init(), that initializes the
vm_page's machine-dependent fields.  Use this function in
vm_pageq_add_new_page() so that the vm_page's machine-dependent and
machine-independent fields are initialized at the same time.

Remove code from pmap_init() for initializing the vm_page's
machine-dependent fields.

Remove stale comments from pmap_init().

Eliminate the Boolean variable pmap_initialized from the alpha, amd64,
i386, and ia64 pmap implementations.  Its use is no longer required
because of the above changes and earlier changes that result in physical
memory that is being mapped at initialization time being mapped without
pv entries.

Tested by: cognet, kensmith, marcel
2005-06-10 03:33:36 +00:00
Alan Cox
5f7679afd0 Update some comments to reflect the change from spl-based to lock-based
synchronization.
2005-05-28 17:56:18 +00:00
Stephan Uphoff
d13ec71369 Use low level constructs borrowed from interrupt threads to wait for
work in proc0.
Remove the TDP_WAKEPROC0 workaround.
2005-05-23 23:01:53 +00:00
Alan Cox
10c447fac2 Swap in can occur safely without Giant. Release Giant on entry to
scheduler().
2005-05-22 21:06:07 +00:00
Alan Cox
35cf2323f8 Remove GIANT_REQUIRED from swapout_procs(). 2005-05-22 00:30:50 +00:00
Alan Cox
071a1710d1 Reduce the number of times that we acquire and release locks in
swap_pager_getpages().

MFC after: 1 week
2005-05-20 21:26:05 +00:00
Alan Cox
1aececdb5a Remove calls to spl*(). 2005-05-19 06:11:13 +00:00
Alan Cox
d2d9c9aca7 Remove a stale comment concerning spl* usage. 2005-05-19 03:53:07 +00:00
Alan Cox
cf51adc0a1 Update some comments to reflect the change from spl-based to lock-based
synchronization.
2005-05-18 22:08:52 +00:00
Alan Cox
ab973359d5 Remove calls to spl*(). 2005-05-18 20:45:33 +00:00
Alan Cox
2e2a6fa28a Revert revision 1.270: swp_pager_async_iodone() need not perform
VM_LOCK_GIANT().

Discussed with: jeff
2005-05-18 17:48:04 +00:00
Bjoern A. Zeeb
f3aad9a6bb Correct 32 vs 64 bit signedness issues.
Approved by:	pjd (mentor)
MFC after:	2 weeks
2005-05-18 08:57:31 +00:00
Peter Grehan
10b00dd4f3 The final test in unlock_and_deallocate() to determine if GIANT needs to be
unlocked wasn't updated to check for OBJ_NEEDGIANT. This caused a WITNESS
panic when debug_mpsafevm was set to 0.

Approved by:	jeffr
2005-05-12 04:09:41 +00:00
Marcel Moolenaar
23110bedcd Enable debug_mpsafevm on ia64 due to the severe functional regression
caused by recent locking changes when it's off. Revert the logic to
trim down the conditional.

Clued-in by: alc@
2005-05-08 23:56:16 +00:00
Jeff Roberson
b8a0b997fd - We need to inhert the OBJ_NEEDGIANT flag from the original object in
vm_object_split().

Spotted by:	alc
2005-05-04 20:54:16 +00:00
Jeff Roberson
ed4fe4f4f5 - Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
 - In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
   set.
 - In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
2005-05-03 11:11:26 +00:00
Alan Cox
b7903e65fb Remove GIANT_REQUIRED from vmspace_exec().
Prodded by: jeff
2005-05-02 07:05:20 +00:00
Jeff Roberson
382a601cd7 - VM_LOCK_GIANT in the swap pager's iodone routine as VFS will soon call it
without Giant.

Sponsored by:	Isilon Systems, Inc.
2005-04-30 11:25:49 +00:00
Robert Watson
5d1ae027f0 Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP.  When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks.  We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access.  In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occured, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by:	bmilekic, jeff (somewhat)
Tested by:	rwatson, kris, gnn, scottl, mike at sentex dot net, others
2005-04-29 18:56:36 +00:00
Jeff Roberson
7625cbf3cc - Pass the ISOPEN flag to namei so filesystems will know we're about to
open them or otherwise access the data.
2005-04-27 09:05:19 +00:00
Kris Kennaway
f5fca0d8be Add the vm.exec_map_entries tunable and read-only sysctl, which controls
the number of entries in exec_map (maximum number of simultaneous execs
that can be handled by the kernel).  The default value of 16 is
insufficient on heavily loaded machines (particularly SMP machines), and
if it is exceeded then executing further processes will generate a SIGABRT.

This is a workaround until a better solution can be implemented.

Reviewed by:	alc
MFC after:	3 days
2005-04-25 19:22:05 +00:00
Dag-Erling Smørgrav
02dcaf2fd1 Unbreak the build on 64-bit architectures. 2005-04-16 12:37:16 +00:00
John Baldwin
3c3edcb445 Add a vm.blacklist tunable which can hold a space or comma seperated list
of physical addresses.  The pages containing these physical addresses will
not be added to the free list and thus will effectively be ignored by the
VM system.  This is mostly useful for the case when one knows of specific
physical addresses that have bit errors (such as from a memtest run) so
that one can blacklist the bad pages while waiting for the new sticks of
RAM to arrive.  The physical addresses of any ignored pages are listed in
the message buffer as well.
2005-04-15 21:45:02 +00:00
Christian S.J. Peron
c92163dcad Move MAC check_vnode_mmap entry point out from being exclusive to
MAP_SHARED so that the entry point gets executed un-conditionally.
This may be useful for security policies which want to perform access
control checks around run-time linking.

-add the mmap(2) flags argument to the check_vnode_mmap entry point
 so that we can make access control decisions based on the type of
 mapped object.
-update any dependent API around this parameter addition such as
 function prototype modifications, entry point parameter additions
 and the inclusion of sys/mman.h header file.
-Change the MLS, BIBA and LOMAC security policies so that subject
 domination routines are not executed unless the type of mapping is
 shared. This is done to maintain compatibility between the old
 vm_mmap_vnode(9) and these policies.

Reviewed by:	rwatson
MFC after:	1 month
2005-04-14 16:03:30 +00:00
John Baldwin
9fd0669542 Tidy vcnt() by moving a duplicated line above #ifdef and removing a useless
variable.
2005-04-12 23:15:28 +00:00