Commit Graph

2080 Commits

Author SHA1 Message Date
Alan Cox
af117d1278 Eliminate an unused but initialized variable. 2004-10-30 20:11:23 +00:00
Alan Cox
4b8a5c4095 Add an assignment statement that I omitted from the previous revision. 2004-10-30 07:09:46 +00:00
Alan Cox
f4d49654ae Assert that the containing vm object is locked in vm_page_cache() and
vm_page_try_to_cache().
2004-10-28 05:26:21 +00:00
Bosko Milekic
a5a262c6db Fix a INVARIANTS-only bug introduced in Revision 1.104:
IF INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.
2004-10-27 21:19:35 +00:00
Alan Cox
b08abf6cc0 During traversal of the active queue, try locking the page's containing
object before accessing the page's flags or the object's reference count.
If the trylock fails, handle the page as though it is busy.
2004-10-27 18:29:17 +00:00
Poul-Henning Kamp
6229cc508b Also check that the sectormask is bigger than zero.
Wrap this overly long KASSERT and remove newline.
2004-10-26 19:51:57 +00:00
Poul-Henning Kamp
5d9d81e7ea Put the I/O block size in bufobj->bo_bsize.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open().  The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.
2004-10-26 07:39:12 +00:00
Poul-Henning Kamp
ff4782b57d Don't clear flags we just checked were not set. 2004-10-26 05:57:29 +00:00
Alan Cox
63bb7041cc Assert that the containing vm object is locked in vm_page_flash(). 2004-10-25 19:52:44 +00:00
Alan Cox
75d0533847 Assert that the containing vm object is locked in vm_page_busy() and
vm_page_wakeup().
2004-10-24 23:53:47 +00:00
Poul-Henning Kamp
b792bebeea Move the buffer method vector (buf->b_op) to the bufobj.
Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
2004-10-24 20:03:41 +00:00
Alan Cox
e5526e6aa5 Acquire the vm object lock before rather than after calling
vm_page_sleep_if_busy().  (The motivation being to transition
synchronization of the vm_page's PG_BUSY flag from the global page queues
lock to the per-object lock.)
2004-10-24 19:32:19 +00:00
Alan Cox
ddf4bb37c8 Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup(). 2004-10-24 18:46:32 +00:00
Alan Cox
0f9f9bcb53 Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab()
that indicates that the caller does not want a page with its busy flag set.
In many places, the global page queues lock is acquired and released just
to clear the busy flag on a just allocated page.  Both the allocation of
the page and the clearing of the busy flag occur while the containing vm
object is locked.  So, the busy flag might as well never be set.
2004-10-24 06:15:36 +00:00
Poul-Henning Kamp
494eb176e7 Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.
Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.
2004-10-22 08:47:20 +00:00
Poul-Henning Kamp
a76d8f4ec9 Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj.  Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
2004-10-21 15:53:54 +00:00
Alan Cox
1e96d2a217 Correct two errors in PG_BUSY management by vm_page_cowfault(). Both
errors are in rarely executed paths.
1. Each time the retry_alloc path is taken, the PG_BUSY must be set again.
   Otherwise vm_page_remove() panics.
2. There is no need to set PG_BUSY on the newly allocated page before
   freeing it.  The page already has PG_BUSY set by vm_page_alloc().
   Setting it again could cause an assertion failure.

MFC after: 2 weeks
2004-10-18 08:11:59 +00:00
Alan Cox
36aeb90e34 Assert that the containing object is locked in vm_page_io_start() and
vm_page_io_finish().  The motivation being to transition synchronization of
the vm_page's busy field from the global page queues lock to the per-object
lock.
2004-10-17 22:33:40 +00:00
Alan Cox
950d5f7a99 Remove unnecessary check for curthread == NULL. 2004-10-17 20:29:28 +00:00
Peter Wemm
a7bc3102c4 Put on my peril sensitive sunglasses and add a flags field to the internal
sysctl routines and state.  Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.

I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.

With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort.  Things like netstat, ps, etc have a long way to go.

This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.
2004-10-11 22:04:16 +00:00
Brian Feldman
55fc8c1146 In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by:	ru
2004-10-09 18:51:32 +00:00
Brian Feldman
ab14a3f7aa Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab.  In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively.  In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used through out.  In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before.  There are additional errors in gross
overestimation of padding, it seems, that would cause a large kegs
(nee zones) to be allocated when small ones would do.  Unless I have
analyzed this incorrectly, it is not directly harmful.
2004-10-08 20:19:29 +00:00
David Schultz
f6bcadc4fc Don't look for swap blocks in objects that aren't swap-backed.
I expect that this will fix the following panic, reported by Jun:
	swap_pager_isswapped: failed to locate all swap meta blocks

MT5 candidate
2004-09-24 16:04:20 +00:00
Poul-Henning Kamp
891822a853 XXX mark two places where we do not hold a threadcount on the dev when
frobbing the cdevsw.

In both cases we examine only the cdevsw and it is a good question if we
weren't better off copying those properties into the cdev in the first
place.  This question will be revisited.
2004-09-24 08:32:36 +00:00
Poul-Henning Kamp
751fdd08fe Use dev_re[fl]thread() to maintain a ref on the device driver while
we call the ->d_mmap function.
2004-09-24 05:59:11 +00:00
David Schultz
8daa8c602a The zone from which proc structures are allocated is marked
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called.  Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code.  I have retained
vm_proc_dispose(), since I consider its disuse a bug.
2004-09-19 18:34:17 +00:00
Poul-Henning Kamp
7ce1979be6 Add new a function isa_dma_init() which returns an errno when it fails
and which takes a M_WAITOK/M_NOWAIT flag argument.

Add compatibility isa_dmainit() macro which whines loudly if
isa_dma_init() fails.

Problem uncovered by:	tegge
2004-09-15 12:09:50 +00:00
Alan Cox
5e4bdb57cb System maps are prohibited from mapping vnode-backed objects. Take
advantage of this restriction to avoid acquiring and releasing Giant when
wiring pages within a system map.

In collaboration with: tegge@
2004-09-11 18:49:59 +00:00
Poul-Henning Kamp
1a31a6c3b2 add KASSERTS 2004-09-07 07:32:40 +00:00
Alan Cox
b102e653ad Enable debug.mpsafevm by default on amd64 and i386. This enables copy-on-
write and zero-fill faults to run without holding Giant.  It is still
possible to disable Giant-free operation by setting debug.mpsafevm to 0 in
loader.conf.
2004-09-04 05:51:54 +00:00
Alan Cox
94ddc7076d Push Giant deep into vm_forkproc(), acquiring it only if the process has
mapped System V shared memory segments (see shmfork_myhook()) or requires
the allocation of an ldt (see vm_fault_wire()).
2004-09-03 05:11:32 +00:00
Scott Long
9923b511ed Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined.  Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c).  This
is a possible MT5 candidate.
2004-09-02 18:59:15 +00:00
Alan Cox
c2296e999c Remove dead code. 2004-09-01 19:58:37 +00:00
Alan Cox
1a95d74419 In vm_fault_unwire() eliminate the acquisition and release of Giant in the
case of non-kernel pmaps.
2004-09-01 19:18:59 +00:00
Julian Elischer
2630e4c90c Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after:	2 days
2004-09-01 02:11:28 +00:00
Alan Cox
9b98b79683 Move the acquisition and release of the lock on the object at the head of
the shadow chain outside of the loop in vm_object_madvise(), reducing the
number of times that this lock is acquired and released.
2004-08-29 20:14:10 +00:00
Ian Dowse
632ca524c3 Prevent vm_page_zero_idle_wakeup() from attempting to wake up the
page zeroing thread before it has been created. It was possible for
calls to free() very early in the boot process to panic here because
the sleep queues were not yet initialised. Specifically, sysinit_add()
running at SI_SUB_KLD would trigger this if the array of pointers
became big enough to require uma_large_alloc() allocations.

Submitted by:	peter
2004-08-29 01:02:33 +00:00
Marcel Moolenaar
2bbb534b56 Move the cow field between wire_count and hold_count. This is the
position that is 64-bit aligned and makes sure that the valid and
dirty fields are also 64-bit aligned. This means that if PAGE_SIZE
is 32K, the size of the vm_page structure is only increased by 8
bytes instead of 16 bytes. More importantly, the vm_page structure
is either 120 or 128 bytes on ia64. These are "interesting" sizes.
2004-08-22 20:52:23 +00:00
Alan Cox
3268a1bf75 In the previous revision, I failed to condition an early release of Giant
in vm_fault() on debug_mpsafevm.  If debug_mpsafevm was not set, the result
was an assertion failure early in the boot process.

Reported by: green@
2004-08-22 00:08:43 +00:00
Alan Cox
b99e61353f Further reduce the use of Giant by vm_fault(): Giant is held only when
manipulating a vnode, e.g., calling vput().  This reduces contention for
Giant during many copy-on-write faults, resulting in some additional
speedup on SMPs.

Note: debug_mpsafevm must be enabled for this optimization to take effect.
2004-08-21 19:20:21 +00:00
Alan Cox
0cb507cb20 Acquire and release Giant around a call to VOP_BMAP(). (This is a
prerequisite to any further reduction in Giant's use by vm_fault().)
2004-08-19 02:37:12 +00:00
Alan Cox
c1fbc251cd - Introduce and use a new tunable "debug.mpsafevm". At present, setting
"debug.mpsafevm" results in (almost) Giant-free execution of zero-fill
   page faults.  (Giant is held only briefly, just long enough to determine
   if there is a vnode backing the faulting address.)

   Also, condition the acquisition and release of Giant around calls to
   pmap_remove() on "debug.mpsafevm".

   The effect on performance is significant.  On my dual Opteron, I see a
   3.6% reduction in "buildworld" time.

 - Use atomic operations to update several counters in vm_fault().
2004-08-16 06:16:12 +00:00
Brian Feldman
7c938963a4 Rather than bringing back all of the changes to make VM map deletion
wait for system wires to disappear, do so (much more trivially) by
instead only checking for system wires of user maps and not kernel maps.

Alternative by:	tor
Reviewed by:	alc
2004-08-16 03:11:09 +00:00
Alan Cox
6eaee3fee4 Remove spl calls. 2004-08-14 18:57:41 +00:00
Alan Cox
0164e05781 Replace the linear search in vm_map_findspace() with an O(log n)
algorithm built into the map entry splay tree.  This replaces the
first_free hint in struct vm_map with two fields in vm_map_entry:
adj_free, the amount of free space following a map entry, and
max_free, the maximum amount of free space in the entry's subtree.
These fields make it possible to find a first-fit free region of a
given size in one pass down the tree, so O(log n) amortized using
splay trees.

This significantly reduces the overhead in vm_map_findspace() for
applications that mmap() many hundreds or thousands of regions, and
has a negligible slowdown (0.1%) on buildworld.  See, for example, the
discussion of a micro-benchmark titled "Some mmap observations
compared to Linux 2.6/OpenBSD" on -hackers in late October 2003.

OpenBSD adopted this approach in March 2002, and NetBSD added it in
November 2003, both with Red-Black trees.

Submitted by: Mark W. Krentel
2004-08-13 08:06:34 +00:00
Tor Egge
19dc560756 The vm map lock is needed in vm_fault() after the page has been found,
to avoid later changes before pmap_enter() and vm_fault_prefault()
has completed.

Simplify deadlock avoidance by not blocking on vm map relookup.

In collaboration with: alc
2004-08-12 20:14:49 +00:00
Brian Feldman
c5f60ffccf Re-delete the comment from r1.352. 2004-08-12 17:22:28 +00:00
Brian Feldman
0ada205ee6 Back out all behavioral chnages. 2004-08-10 14:42:48 +00:00
Brian Feldman
9689d5e5ee Revamp VM map wiring.
* Allow no-fault wiring/unwiring to succeed for consistency;
  however, the wired count remains at zero, so it's a special case.

* Fix issues inside vm_map_wire() and vm_map_unwire() where the
  exact state of user wiring (one or zero) and system wiring
  (zero or more) could be confused; for example, system unwiring
  could succeed in removing a user wire, instead of being an
  error.

* Require all mappings to be unwired before they are deleted.
  When VM space is still wired upon deletion, it will be waited
  upon for the following unwire.  This makes vslock(9) work
  rather than allowing kernel-locked memory to be deleted
  out from underneath of its consumer as it would before.
2004-08-09 19:52:29 +00:00
Alan Cox
8673599662 Make two changes to vm_fault().
1. Move a comment to its proper place, updating it.  (Except for white-
   space, this comment had been unchanged since revision 1.1!)
2. Remove spl calls.
2004-08-09 18:46:39 +00:00
Alan Cox
91491c3566 Remove a stale comment from vm_map_lookup() that pertains to share maps.
(The last vestiges of the share map code were removed in revisions 1.153
and 1.159.)
2004-08-09 18:15:46 +00:00
Alan Cox
eebf3286a6 Make two changes to vm_fault().
1. Retain the map lock until after the calls to pmap_enter() and
   vm_fault_prefault().
2. Remove a stale comment.  Submitted by: tegge@
2004-08-09 06:01:46 +00:00
Poul-Henning Kamp
5721c9c76a Tag all geom classes in the tree with a version number. 2004-08-08 07:57:53 +00:00
Alan Cox
e0b47a134b Remove dead code. A vm_map's first_free is never NULL (even if the map is
full).

(This is preparation for an O(log n) implementation of vm_map_findspace().)

Submitted by:	Mark W. Krentel
2004-08-07 05:58:31 +00:00
Robert Watson
3659f747f1 Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.
2004-08-06 21:52:38 +00:00
Brian Feldman
28775a6130 Turn on the new contigmalloc(9) by default. There should not actually
be a reason to use the old contigmalloc(9), but if desired, it the
vm.old_contigmalloc setting can be tuned/sysctld back to 0 for now.
2004-08-05 21:54:11 +00:00
Poul-Henning Kamp
23fc1a90ea Remove a product specific workaround for wrong modes when mmap(2)'ing
devices.  They have had plenty of time to adjust now.
2004-08-05 07:04:33 +00:00
Alan Cox
684a62b7bf - Push down the acquisition and release of Giant into pmap_enter_quick()
on those architectures without pmap locking.
 - Eliminate the acquisition and release of Giant in vm_map_pmap_enter().
2004-08-04 22:03:16 +00:00
Doug Rabson
c413d99c4e In dev_pager_updatefake, m->valid is typically 0 on entry. It
should be set to VM_PAGE_BITS_ALL before returning, to ensure that
neither vm_pager_get_pages nor vm_fault calls vm_page_zero_invalid
after dev_pager_getpages has returned.

Submitted by: tegge
2004-08-04 08:58:58 +00:00
Alan Cox
21c125453c Eliminate the acquisition and release of Giant around the call to
pmap_mincore() in mincore(2).  Either pmap locking exists (alpha, amd64,
i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).
2004-08-02 03:31:05 +00:00
Brian Feldman
b23f72e98a * Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
  or not.
* Allow uma_zone constructors and initialation functions to return either
  success or error.  Almost all of the ones in the tree currently return
  success unconditionally, but mbuf is a notable exception: the packet
  zone constructor wants to be able to fail if it cannot suballocate an
  mbuf cluster, and the mbuf allocators want to be able to fail in general
  in a MAC kernel if the MAC mbuf initializer fails.  This fixes the
  panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
  the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.
2004-08-02 00:18:36 +00:00
Alan Cox
9bb0e06861 - Push down the acquisition and release of Giant into pmap_protect() on
those architectures without pmap locking.
 - Eliminate the acquisition and release of Giant from vm_map_protect().

(Translation: mprotect(2) runs to completion without touching Giant on
alpha, amd64, i386 and ia64.)
2004-07-30 20:38:30 +00:00
Alan Cox
9be60284a6 Giant is no longer required by vm_waitproc() and vmspace_exitfree().
Eliminate it acquisition and release around vm_waitproc() in kern_wait().
2004-07-30 20:31:02 +00:00
Doug Rabson
92bab635d3 Fix a memory leak in the device pager which is exposed by the NVIDIA
OpenGL driver.

Submitted by: nvidia (possibly also tegge)
2004-07-30 11:09:18 +00:00
Doug Rabson
874f013517 Fix handling of msync(2) for character special files.
Submitted by: nvidia
2004-07-30 11:08:02 +00:00
Maxime Henrion
12c649749c Get rid of another lockmgr(9) consumer by using sx locks for the user
maps.  We always acquire the sx lock exclusively here, but we can't
use a mutex because we want to be able to sleep while holding the
lock.  This is completely equivalent to what we were doing with the
lockmgr(9) locks before.

Approved by:	alc
2004-07-30 09:10:28 +00:00
Alan Cox
a087914310 Advance the state of pmap locking on alpha, amd64, and i386.
- Enable recursion on the page queues lock.  This allows calls to
   vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
   queues lock held.  Such calls are made to allocate page table pages
   and pv entries.
 - The previous change enables a partial reversion of vm/vm_page.c
   revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
   now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
 - Add partial locking to pmap_copy().  (As a side-effect, pmap_copy()
   should now be faster on i386 SMP because it no longer generates IPIs
   for TLB shootdown on the other processors.)
 - Complete the locking of pmap_enter() and pmap_enter_quick().  (As of now,
   all changes to a user-level pmap on alpha, amd64, and i386 are performed
   with appropriate locking.)
2004-07-29 18:56:31 +00:00
Bosko Milekic
244f45548a Rework the way slab header storage space is calculated in UMA.
- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
  being allocated if the amount of calculated wasted space is less
  than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
  case).  If the amount of wasted space is >= UMA_MAX_WASTE, then
  UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
  separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
  (so that the offpage slab header zone(s) can be sized accordingly).
  The algorithm used to calculate this replaces the old calculation
  (which only happened to work coincidentally).  We now iterate over
  possible object sizes, starting from the smallest one, until we
  determine that wastedspace calculated in zone_small_init() might
  end up being greater than UMA_MAX_WASTE, at which point we use the
  found object size to compute the maximum possible ipers.  The
  reason this works is because:
      - wastedspace versus objectsize is a see-saw function with
        local minima all equal to zero and local maxima growing
        directly proportioned to objectsize.  This implies that
        for objects up to or equal a certain objectsize, the see-saw
        remains entirely below UMA_MAX_WASTE, so for those objectsizes
        it is impossible to ever go OFFPAGE for slab headers.
      - ipers (items-per-slab) versus objectsize is an inversely
        proportional function which falls off very quickly (very large
        for small objectsizes).
      - To determine the maximum ipers we'll ever need from OFFPAGE
        slab headers we first find the largest objectsize for which
        we are guaranteed to not go offpage for and use it to compute
        ipers (as though we were offpage).  Since the only objectsizes
        allowed to go offpage are bigger than the found objectsize,
        and since ipers vs objectsize is inversely proportional (and
        monotonically decreasing), then we are guaranteed that the
        ipers computed is always >= what we will ever need in offpage
        slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
  padded) size of each freelist index so that offset calculations are
  fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
2004-07-29 15:25:40 +00:00
Alan Cox
56e0670fdc Correct a very old error in both vm_object_madvise() (originating in
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset.  Consequently, both may
operate on the wrong pages.

Quoting Matt, "This could be responsible for all of the sporatic madvise
oddness that has been reported over the years."

Reviewed by:	Matt Dillon
2004-07-28 18:23:08 +00:00
Alan Cox
1a276a3f91 - Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit().  (Giant is acquired only if the vmspace
   contains shm segments.)
 - Eliminate the acquisition of Giant from proc_rwmem().
 - Reduce the scope of Giant in exit1(), uncovering the destruction of the
   address space.
2004-07-27 03:53:41 +00:00
Alan Cox
5122b74809 For years, kmem_alloc_pageable() has been misused. Now that the last of
these misuses has been corrected, remove it before new ones appear, such as
arm/arm/pmap.c revision 1.8.
2004-07-25 20:08:59 +00:00
Alan Cox
9b45f81502 Remove spl calls. 2004-07-25 19:28:10 +00:00
Alan Cox
57a21aba93 Make the code and comments for vm_object_coalesce() consistent. 2004-07-25 07:48:47 +00:00
Alan Cox
51ab6c2890 Simplify vmspace initialization. The bcopy() of fields from the old
vmspace to the new vmspace in vmspace_exec() is mostly wasted effort.  With
one exception, vm_swrss, the copied fields are immediately overwritten.
Instead, initialize these fields to zero in vmspace_alloc(), eliminating a
bcopy() from vmspace_exec() and a bzero() from vmspace_fork().
2004-07-24 07:40:35 +00:00
Alan Cox
5285558ac2 - Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable().  The difference between these is that an errant
   memory access to the zone will be detected sooner with
   kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

 1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
 2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
 3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case.  However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects.  To remedy this, we make the following changes:

 - Add a mutex type argument to VM_OBJECT_LOCK_INIT().
 - Use the mutex type argument to assign distinct types to special
   vm objects such as the kernel object, kmem object, and UMA objects.
 - Define a static swap zone object for use by UMA.  (Only static
   objects are assigned a special mutex type.)
2004-07-22 19:44:49 +00:00
Brian Feldman
d951b75210 Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking
being incomplete, it currently has to know how to drop and pick back
up the vm_object's mutex if it has to sleep and drop the page queue
mutex.  The problem with this is that if the page is busy, while we
are sleeping, the page can be freed and object disappear.  When trying
to lock m->object, we'd get a stale or NULL pointer and crash.

The object is now cached, but this makes the assumption that
the object is referenced in some manner and will not itself
disappear while it is unlocked.  Since this only happens if
the object is locked, I had to remove an assumption earlier in
contigmalloc() that reversed the order of locking the object and
doing vm_page_sleep_if_busy(), not the normal order.
2004-07-21 23:56:09 +00:00
Peter Wemm
5476633aed Semi-gratuitous change. Move two refcount operations to their own lines
rather than be buried inside an if (expression).  And now that the if
expression is the same in both exit paths, use the same ordering.
2004-07-21 05:08:10 +00:00
Peter Wemm
3f25cbddc2 Move the initialization and teardown of pmaps to the vmspace zone's
init and fini handlers.  Our vm system removes all userland mappings at
exit prior to calling pmap_release.  It just so happens that we might
as well reuse the pmap for the next process since the userland slate
has already been wiped clean.

However.  There is a functional benefit to this as well.  For platforms
that share userland and kernel context in the same pmap, it means that
the kernel portion of a pmap remains valid after the vmspace has been
freed (process exit) and while it is in uma's cache.  This is significant
for i386 SMP systems with kernel context borrowing because it avoids
a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.

Tested on:  amd64, i386, sparc64, alpha
Glanced at by:  alc
2004-07-21 00:29:21 +00:00
Brian Feldman
757cd67065 Remove extraneous locks on the VM free page queue mutex; it is not
meant to be recursed upon, and could cauuse a deadlock inside the
new contigmalloc (vm.old_contigmalloc=0) code.

Submitted by:	alc
2004-07-19 23:29:36 +00:00
Alan Cox
e832aafc51 - Eliminate the pte object from the pmap. Instead, page table pages are
allocated as "no object" pages.  Similar changes were made to the amd64
   and i386 pmap last year.  The primary reason being that maintaining
   a pte object leads to lock order violations.  A secondary reason being
   that the pte object is redundant, i.e., the page table itself can be
   used to lookup page table pages.  (Historical note: The pte object
   predates our ability to allocate "no object" pages.  Thus, the pte
   object was a necessary evil.)
 - Unconditionally check the vm object lock's status in vm_page_remove().
   Previously, this assertion could not be made on Alpha due to its use
   of a pte object.
2004-07-19 18:12:04 +00:00
Brian Feldman
0c3c862e21 Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.
2004-07-19 15:05:24 +00:00
Brian Feldman
4362fada8f Reimplement contigmalloc(9) with an algorithm which stands a greatly-
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.

The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less.  The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block.  If this is not enough, the subsequent passes
page out any unwired memory.  To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.

The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed.  Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.

There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used.  Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month.  It is safe to switch at
run-time to see the difference it makes.

A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig().  These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.

When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats.  Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well.  This invalidates the BUGS section of the
contigmalloc(9) manpage.
2004-07-19 06:21:27 +00:00
Alan Cox
3e36afbe27 Remove the GIANT_REQUIRED preceding pmap_remove() in
vm_pageout_map_deactivate_pages().
2004-07-18 04:38:11 +00:00
Alan Cox
3d2e54c317 Push down the acquisition and release of the page queues lock into
pmap_protect() and pmap_remove().  In general, they require the lock in
order to modify a page's pv list or flags.  In some cases, however,
pmap_protect() can avoid acquiring the lock.
2004-07-15 18:00:43 +00:00
Alan Cox
26354d4c08 Remove an unused and unimplemented sysctl. (For the record, it was marked
as unimplemented in revision 1.129 nearly six years ago.)
2004-07-12 17:45:37 +00:00
Alan Cox
790bdd0f2e Increase the scope of the page queues lock in vm_page_alloc() to cover
a diagnostic check that accesses the cache queue count.
2004-07-10 22:12:49 +00:00
Alan Cox
fd2d354908 Micro-optimize vmspace for 64-bit architectures: Colocate vm_refcnt and
vm_exitingcnt so that alignment does not result in wasted space.
2004-07-06 17:35:10 +00:00
Bruce M Simpson
9bd86a9861 Properly brucify a string by outdenting it. 2004-07-06 02:27:30 +00:00
Bosko Milekic
0d0837ee6d Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently).  This is to be used temporarily
if wierd deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock.  If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us wether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones.  The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.
2004-07-04 16:07:44 +00:00
Brian Feldman
7a708c3626 Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case.  M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.
2004-07-04 15:59:25 +00:00
Brian Feldman
cf107c1d1a Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined.  While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS.  Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by:	Bernd Walter <ticso@cicely5.cicely.de>
2004-07-03 18:11:41 +00:00
John Baldwin
0c0b25ae91 Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
  determine if a thread about to be added to a run queue should be
  preempted to directly.  If it is not safe to preempt or if the new
  thread does not have a high enough priority, then the function returns
  false and sched_add() adds the thread to the run queue.  If the thread
  should be preempted to but the current thread is in a nested critical
  section, then the flag TDF_OWEPREEMPT is set and the thread is added
  to the run queue.  Otherwise, mi_switch() is called immediately and the
  thread is never added to the run queue since it is switch to directly.
  When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
  then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
  setrunqueue() now does all the correct work.  This also removes the
  do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
  supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
  chance to run if the architecture supports native preemption since
  the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
  architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
  preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by:	scottl (with his re@ hat)
2004-07-02 20:21:44 +00:00
John Baldwin
bf0acc273a - Change mi_switch() and sched_switch() to accept an optional thread to
switch to.  If a non-NULL thread pointer is passed in, then the CPU will
  switch to that thread directly rather than calling choosethread() to pick
  a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
  TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
  requiring all callers of mi_switch() to know to do this if they can be
  called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
  the middle of the function prototypes and up above into their own
  section.
2004-07-02 19:09:50 +00:00
John Baldwin
d202e0cccc - Don't use a variable to point to the user area that we only use once.
Just use p2->p_uarea directly instead.
- Remove an old and mostly bogus assertion regarding p2->p_sigacts.
- Use RANGEOF macro ala fork1() to clean up bzero/bcopy of p_stats.
2004-07-02 03:45:07 +00:00
Tor Egge
9174ca7ba3 Initialize result->backing_object_offset before linking result onto the list of
vm objects shadowing source in vm_object_shadow().  This closes a race where
vm_object_collapse() could be called with a partially uninitialized object
argument causing symptoms that looked like hardware problems, e.g.  signal 6,
10, 11 or a /bin/sh busy-waiting for a nonexistant child process.
2004-06-28 20:26:35 +00:00
Andrew Gallatin
b351299ca3 Use MIN() macro rather than ulmin() inline, and fix stray tab
that snuck in with my last commit.

Submitted by: green
2004-06-28 19:58:39 +00:00
Andrew Gallatin
1dad8fe1ed Fix alpha - the use of min() on longs was loosing the high bits and
returning wrong answers, leading to strange values vm2->vm_{s,t,d}size.
2004-06-28 19:15:40 +00:00
David Schultz
17d9d0d049 Update a stale comment. The heuristic to swap processes out based on
the number of pages already paged out was broken in rev 1.10 and
removed in rev 1.11.
2004-06-27 01:58:12 +00:00
Alan Cox
d9dd6bfb56 Remove an unused field from the vmspace structure. 2004-06-26 19:16:35 +00:00
Brian Feldman
2a7be1b6d1 Correct the tracking of various bits of the process's vmspace and vm_map
when not propogated on fork (due to minherit(2)).  Consistency checks
otherwise fail when the vm_map is freed and it appears to have not been
emptied completely, causing an INVARIANTS panic in vm_map_zdtor().

PR:		kern/68017
Submitted by:	Mark W. Krentel <krentel@dreamscape.com>
Reviewed by:	alc
2004-06-24 22:43:46 +00:00