Commit Graph

1345 Commits

Author SHA1 Message Date
Andrew R. Reiter
d4d6aee5a0 - Fix a round down bogon in uma_zone_set_max().
Submitted by: jeff@
2002-04-25 06:24:40 +00:00
Alan Cox
a569838764 Reintroduce locking on accesses to vm_object_list. 2002-04-20 07:23:22 +00:00
Alan Cox
92de35b0ce o Move the acquisition of Giant from vm_fault() to the point
after initialization in vm_fault1().
 o Fix some style problems in vm_fault1().
2002-04-19 04:20:31 +00:00
Alan Cox
ff8f4ebe22 Add a comment documenting a race condition in vm_fault(): Specifically, a
modification is made to the vm_map while only a read lock is held.
2002-04-18 03:55:50 +00:00
Alan Cox
6139043b1f o Call vm_map_growstack() from vm_fault() if vm_map_lookup() has failed
due to conditions that suggest the possible need for stack growth.
   This has two beneficial effects: (1) we can
   now remove calls to vm_map_growstack() from the MD trap handlers and (2)
   simple page faults are faster because we no longer unnecessarily perform
   vm_map_growstack() on every page fault.
 o Remove vm_map_growstack() from the i386's trap_pfault().
 o Remove the acquisition and release of Giant from i386's trap_pfault().
   (vm_fault() still acquires it.)
2002-04-18 03:28:27 +00:00
Peter Wemm
334f706177 Do not free the vmspace until p->p_vmspace is set to null. Otherwise
statclock can access it in the tail end of statclock_process() at an
unfortunate time.  This bit me several times on an SMP alpha (UP2000)
and the problem went away with this change.  I'm not sure why it doesn't
break x86 as well.  Maybe it's because the clocks are much faster
on alpha (HZ=1024 by default).
2002-04-17 05:26:42 +00:00
Alan Cox
b208d0633f Remove an unused option, VM_FAULT_HOLD, to vm_fault(). 2002-04-17 02:23:57 +00:00
Peter Wemm
1a87a0da66 Pass vm_page_t instead of physical addresses to pmap_zero_page[_area]()
and pmap_copy_page().  This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap.  (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances.  Leaving this to pmap itself gives more flexibilitly.)

Reviewed by:	jake
Tested on:	i386, ia64 and (I believe) sparc64. (my alpha was hosed)
2002-04-15 16:00:03 +00:00
Jeff Roberson
5300d9dda2 Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone.  Fix this by doing the allocation
seperately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do.  We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.
2002-04-14 13:47:10 +00:00
Jeff Roberson
0da47b2fc6 Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx. 2002-04-14 12:39:38 +00:00
Jeff Roberson
af7f9b97b6 Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did.  The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code.  This was needed to make the blocking
work properly anyway.
2002-04-14 01:56:25 +00:00
Jeff Roberson
bce9779110 Remember to unlock the zone if the fill count is too high.
Pointed out by:	pete, jake, jhb
2002-04-10 01:52:50 +00:00
Jeff Roberson
1d4cb54ba8 Quiet witness warnings about acquiring several zone locks. In the case that
this happens it is OK.
2002-04-08 21:08:17 +00:00
Jeff Roberson
86bbae32f4 Add a mechanism to disable buckets when the v_free_count drops below
v_free_min.  This should help performance in memory starved situations.
2002-04-08 06:20:34 +00:00
Jeff Roberson
605cbd6a08 Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with giant.

Pointy hat to:	jeff
Noticed by:	jake
2002-04-08 05:13:48 +00:00
Jeff Roberson
9c2cd7e5a9 Implement uma_zdestroy(). It's prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a seperate hash_free() for use
in uma_zdestroy.
2002-04-08 04:48:58 +00:00
Jeff Roberson
a553d4b8eb Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations.  Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket.  This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations.  This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.
2002-04-08 02:42:55 +00:00
Jeff Roberson
c235bfa551 Spelling correction; s/seperate/separate/g
Submitted by:	eric
2002-04-07 22:56:48 +00:00
Jeff Roberson
fedfeee018 There should be no remaining references to these two files in the tree. If
there are, it is an error.  vm_zone has been superseded by uma.
2002-04-07 22:51:18 +00:00
Jeff Roberson
d0b06acbe1 This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extreme low memory situations occured.  This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by:	kris
2002-04-07 22:47:36 +00:00
Alan Cox
aa4d062142 o Eliminate the use of grow_stack() and useracc() from sendsig(), osendsig(),
and osf1_sendsig().
 o Eliminate the prototype for the MD grow_stack() now that it has been removed
   from all platforms.
2002-04-05 00:52:15 +00:00
Matthew Dillon
80f5c8bf42 Embed a struct vmmeter in the per-cpu structure and add a macro,
PCPU_LAZY_INC() which increments elements in it for cases where we
can afford the occassional inaccuracy.  Use of per-cpu stats counters
avoids significant cache stalls in various critical paths that would
otherwise severely limit our cpu scaleability.

Adjust all sysctl's accessing cnt.* elements to now use a procedure
which aggregates the requested field for all cpus and for the global
vmmeter.

The global vmmeter is retained, since some stats counters, like v_free_min,
cannot be made per-cpu.  Also, this allows us to convert counters from
the global vmmeter to the per-cpu vmmeter in a piecemeal fashion, so
have at it!
2002-04-04 21:38:47 +00:00
John Baldwin
6008862bc2 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
Jake Burkholder
48f9a59443 Fix a long standing 32bit-ism. Don't assume that the size of a chunk of
memory in phys_avail will fit in 'int', use vm_size_t.  This fixes booting
on sparc64 machines with more than 2 gigs of ram.

Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
2002-04-03 06:57:52 +00:00
Alfred Perlstein
157d7b3538 fix comment typo, s/neccisary/necessary/g 2002-04-02 21:25:12 +00:00
John Baldwin
44731cab3b Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
Jeff Roberson
f22a4b62f5 Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag.  Remove the dup_list and dup_ok code from subr_witness.  Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface.  The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by:	jhb
2002-03-27 09:23:41 +00:00
Alan Cox
433b72aa12 Remove an unused prototype. 2002-03-26 05:30:59 +00:00
Jeff Roberson
f4af24d55d Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler.  This hadn't caused
problems yet because Giant is held all the time.

Reported by:	kkenn
2002-03-24 10:56:11 +00:00
Jeff Roberson
736ee5907f Add uma_zone_set_max() to add enforced limits to non vm obj backed zones. 2002-03-20 05:28:34 +00:00
Jeff Roberson
670d17b5c0 Remove references to vm_zone.h and switch over to the new uma API. 2002-03-20 04:02:59 +00:00
Alfred Perlstein
11caded34f Remove __P. 2002-03-19 22:20:14 +00:00
Jeff Roberson
9eb6e51923 Quit a warning introduced by UMA. This only occurs on machines where
vm_size_t != unsigned long.

Reviewed by:	phk
2002-03-19 11:49:10 +00:00
Peter Wemm
30171114b3 Fix a gcc-3.1+ warning.
warning: deprecated use of label at end of compound statement

ie: you cannot do this anymore:
switch(foo) {
....

default:
}
2002-03-19 11:02:06 +00:00
Jeff Roberson
8355f576a9 This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by:	arch@
2002-03-19 09:11:49 +00:00
Brian Feldman
25adb370be Back out the modification of vm_map locks from lockmgr to sx locks. The
best path forward now is likely to change the lockmgr locks to simple
sleep mutexes, then see if any extra contention it generates is greater
than removed overhead of managing local locking state information,
cost of extra calls into lockmgr, etc.

Additionally, making the vm_map lock a mutex and respecting it properly
will put us much closer to not needing Giant magic in vm.
2002-03-18 15:08:09 +00:00
Alan Cox
9f0567f557 Remove vm_object_count: It's unused, incorrectly maintained and duplicates
information maintained by the zone allocator.
2002-03-17 18:37:37 +00:00
Alan Cox
5ee9fe6ba1 Undo part of revision 1.57: Now that (o)sendsig() doesn't call useracc(),
the motivation for saving and restoring the map->hint in useracc() is gone.
(The same tests that motivated this change in revision 1.57 now show that
there is no performance loss from removing it.)  This was really a hack and
some day we would have had to add new synchronization here on map->hint
to maintain it.
2002-03-17 07:01:42 +00:00
Alan Cox
2f6c16e1e8 Acquire a read lock on the map inside of vm_map_check_protection() rather
than expecting the caller to do so.  This (1) eliminates duplicated code in
kernacc() and useracc() and (2) fixes missing synchronization in munmap().
2002-03-17 03:19:31 +00:00
Jake Burkholder
ac59490b5e Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/
pmap_qremove.  pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus.  If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb.  This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings.  This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem.  This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.

Reviewed by:	peter
2002-03-17 00:56:41 +00:00
Kirk McKusick
0d2af52141 Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
2002-03-15 18:49:47 +00:00
Brian Feldman
9cb574590e Document faultstate.lookup_still_valid more than none.
Requested by:	alfred
2002-03-14 02:10:14 +00:00
Brian Feldman
0e0af8ecda Rename SI_SUB_MUTEX to SI_SUB_MTX_POOL to make the name at all accurate.
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.

After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system.  To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.

This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe.  One more
user of lockmgr down, many more to go :)
2002-03-13 23:48:08 +00:00
Eivind Eklund
a128794977 - Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
  while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by:	alc
2002-03-10 21:52:48 +00:00
Tor Egge
ff91d7800f Revert change in revision 1.53 and add a small comment to protect
the revived code.

vm pages newly allocated are marked busy (PG_BUSY), thus calling
vm_page_delete before the pages has been freed or unbusied will
cause a deadlock since vm_page_object_page_remove will wait for the
busy flag to be cleared.  This can be triggered by calling malloc
with size > PAGE_SIZE and the M_NOWAIT flag on systems low on
physical free memory.

A kernel module that reproduces the problem, written by Logan Gabriel
<logan@mail.2cactus.com>, can be found in the freebsd-hackers mail
archive (12 Apr 2001).  The problem was recently noticed again by
Archie Cobbs <archie@dellroad.org>.

Reviewed by:	dillon
2002-03-09 16:24:27 +00:00
Matthew Dillon
8c5dffe8ca Fix a bug in the vm_map_clean() procedure. msync()ing an area of memory
that has just been mapped MAP_ANON|MAP_NOSYNC and has not yet been accessed
will panic the machine.

MFC after:	1 day
2002-03-07 03:54:56 +00:00
Matthew Dillon
b9b7a4be90 Add a sequential iteration optimization to vm_object_page_clean(). This
moderately improves msync's and VM object flushing for objects containing
randomly dirtied pages (fsync(), msync(), filesystem update daemon),
and improves cpu use for small-ranged sequential msync()s in the face of
very large mmap()ings from O(N) to O(1) as might be performed by a database.

A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two
committed optimizations are turned on by default).  0 will turn off both
optimizations.

This code has already been tested under stable and is one in a series of
memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart
conditions that will be coming down the pipe.

MFC after:	3 days
2002-03-06 02:42:56 +00:00
Eivind Eklund
f52bd684f3 * Move bswlist declaration and initialization from kern/vfs_bio.c to
vm/vm_pager.c, which is the only place it is used.
* Make the QUEUE_* definitions and bufqueues local to vfs_bio.c.
* constify buf_wmesg.
2002-03-05 18:20:58 +00:00
Alan Cox
2be21c5e68 o Create vm_pageq_enqueue() to encapsulate code that is duplicated time
and again in vm_page.c and vm_pageq.c.
 o Delete unusused prototypes.  (Mainly a result of the earlier renaming
   of various functions from vm_page_*() to vm_pageq_*().)
2002-03-04 18:55:26 +00:00
Alan Cox
64190c7a2f Call vm_pageq_remove_nowakeup() rather than duplicating it. 2002-03-03 22:36:14 +00:00
Alan Cox
5714577006 Remove some long dead code. 2002-03-02 22:21:42 +00:00
John Baldwin
fdcc1cc09f Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required.  However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.
2002-02-27 19:18:10 +00:00
John Baldwin
a854ed9893 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
Mike Silbersack
7f3a40933b Fix a horribly suboptimal algorithm in the vm_daemon.
In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes.  Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations.  The algorithm has been changed so that
each page will be checked only 16 times.

Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period.  With this change
in place, the vm_daemon completes in less than a second.  Any system
with hundreds of processes sharing pages should benefit from this change.

Note that the vm_daemon is only run when the system is under extreme
memory pressure.  It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.

Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.

PR:		33542, 20393
Reviewed by:	dillon
MFC after:	1 week
2002-02-27 18:03:02 +00:00
Peter Wemm
d1693e1701 Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels.  Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely.  Userland programs still crashed.
2002-02-27 09:51:33 +00:00
Peter Wemm
bd1e3a0f89 Jake further reduced IPI shootdowns on sparc64 in loops by using ranged
shootdowns in a couple of key places.  Do the same for i386.  This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead.  This will help for PAE down the road.

Obtained from:	jake (MI code, suggestions for MD part)
2002-02-27 02:14:58 +00:00
Peter Wemm
dd50331c0e Remove unused variable (td) 2002-02-26 01:01:37 +00:00
Poul-Henning Kamp
57c10583aa GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED. 2002-02-22 09:26:35 +00:00
Tor Egge
d2760948fe Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold
count that would otherwise be on one of the free queues.  This eliminates a
panic when broken programs unmap memory that still has pending IO from raw
devices.

Reviewed by:	dillon, alc
2002-02-19 23:19:30 +00:00
Mike Silbersack
0c9e47230a Add one more comment to the OOM changes so that future readers of
the code may better understand the code.

Suggested by:	dillon
MFC after:	1 week
2002-02-19 18:50:49 +00:00
Mike Silbersack
ef6020d187 Changes to make the OOM killer much more effective:
- Allow the OOM killer to target processes currently locked in
  memory.  These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
  waiting for their page fault to complete.  In order for the OOM
  killer to work well, the killed process and other system processes
  waiting on memory must be allowed to wakeup first.

Reviewed by:	dillon
MFC after:	1 week
2002-02-19 18:34:02 +00:00
Bruce Evans
1e92845e1b Garbage-collect options ACPI_NO_ENABLE_ON_BOOT, AML_DEBUG, BLEED,
DEVICE_SYSCTLS, KEY, LOUTB, NFS_MUIDHASHSIZ, NFS_UIDHASHSIZ, PCI_QUIET
and SIMPLELOCK_DEBUG.
2002-02-15 13:16:11 +00:00
Julian Elischer
2c1007663f In a threaded world, differnt priorirites become properties of
different entities.  Make it so.

Reviewed by:	jhb@freebsd.org (john baldwin)
2002-02-11 20:37:54 +00:00
Julian Elischer
079b7badea Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
2002-02-07 20:58:47 +00:00
Alfred Perlstein
582ec34cd8 Fix a race with free'ing vmspaces at process exit when vmspaces are
shared.

Also introduce vm_endcopy instead of using pointer tricks when
initializing new vmspaces.

The race occured because of how the reference was utilized:
  test vmspace reference,
  possibly block,
  decrement reference

When sharing a vmspace between multiple processes it was possible
for two processes exiting at the same time to test the reference
count, possibly block and neither one free because they wouldn't
see the other's update.

Submitted by: green
2002-02-05 21:23:05 +00:00
Matthew Dillon
027df6bdd7 GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after:		0 days
2002-01-31 18:39:44 +00:00
David Malone
d2979f90e7 Remove a parameter name from a prototype. 2002-01-25 21:33:10 +00:00
Bruce Evans
e50f5c2e8d Don't declare vm_swapout() in the NO_SWAPPING case when it is not defined.
Fixed some style bugs.
2002-01-17 16:46:26 +00:00
Alfred Perlstein
a4db49537b Replace ffind_* with fget calls.
Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().
2002-01-14 00:13:45 +00:00
Alfred Perlstein
426da3bcfb SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
John Baldwin
c86b6ff551 Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex releease and swi schedule,
respectively when that switch is not safe.  Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer.  This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs.  Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called.  (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha.  I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine.  PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken.  Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by:	peter
Tested on:	i386, alpha
2002-01-05 08:47:13 +00:00
Matthew Dillon
23b590188f Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code.  Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release
2001-12-20 22:42:27 +00:00
Matthew Dillon
3ebeaf5984 This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

	* An edge case with shared R+W mmap()'s and truncate whereby
	  the system would inappropriately clear the dirty bits on
	  still-dirty data.  (applicable to all filesystems)

	  THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
	  see vm/vm_page.c line 1641

	* The straddle case for VM pages and buffer cache buffers when
	  truncating.  (applicable to NFS client side)

	* Possible SMP database corruption due to vm_pager_unmap_page()
	  not clearing the TLB for the other cpu's.  (applicable to NFS
	  client side but could effect all filesystems).  Note: not
	  considered serious since the corruption occurs beyond the file
	  EOF.

	* When flusing a dirty buffer due to B_CACHE getting cleared,
	  we were accidently setting B_CACHE again (that is, bwrite() sets
	  B_CACHE), when we really want it to stay clear after the write
	  is complete.  This resulted in a corrupt buffer.  (applicable
	  to all filesystems but probably only triggered by NFS)

	* We have to call vtruncbuf() when ftruncate()ing to remove
	  any buffer cache buffers.  This is still tentitive, I may
	  be able to remove it due to the second bug fix.  (applicable
	  to NFS client side)

	* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
	  to set n_size before calling nfs_vinvalbuf or the NFS code
	  may recursively vnode_pager_setsize() to the original value
	  before the truncate.  This is what was causing the user mmap
	  bus faults in the nfs tester program.  (applicable to NFS
	  client side)

	* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
	  by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after:	1 week
2001-12-14 01:16:57 +00:00
Luigi Rizzo
60363fb9f7 vm/vm_kern.c: rate limit (to once per second) diagnostic printf when
you run out of mbuf address space.

kern/subr_mbuf.c: print a warning message when mb_alloc fails, again
	rate-limited to at most once per second. This covers other
	cases of mbuf allocation failures. Probably it also overlaps the
	one handled in vm/vm_kern.c, so maybe the latter should go away.

This warning will let us gradually remove the printf that are scattered
across most network drivers to report mbuf allocation failures.
Those are potentially dangerous, in that they are not rate-limited and
can easily cause systems to panic.

Unless there is disagreement (which does not seem to be the case
judging from the discussion on -net so far), and because this is
sort of a safety bugfix, I plan to commit a similar change to STABLE
during the weekend (it affects kern/uipc_mbuf.c there).

Discussed-with: jlemon, silby and -net
2001-12-01 00:21:30 +00:00
Jonathan Lemon
4584bbf555 When laying out objects in a ZONE_INTERRUPT zone, allow them to cross
a page boundary, since we've already allocated all our contiguous kva
space up front.  This eliminates some memory wastage, and allows us to
actually reach the # of objects were specified in the zinit() call.

Reviewed by: peter, dillon
2001-11-17 00:40:48 +00:00
Matthew Dillon
fe8e0238cc Fix deadlock introduced in 1.73 (Jan 1998). The paging-in-progress count
on a vnode-backed object must be incremented *after* obtaining the vnode
lock.  If it is bumped before obtaining the vnode lock we can deadlock
against vtruncbuf().

Submitted by:	peter, ps
MFC after:	3 days
2001-11-09 21:34:45 +00:00
Matthew Dillon
33c6774151 Adjust vnode_pager_input_smlfs() to not attempt to BMAP blocks beyond the
file EOF.  This works around a bug in the ISOFS (CDRom) BMAP code which
returns bogus values for requests beyond the file EOF rather then returning
an error, resulting in either corrupt data being mmap()'d beyond the file EOF
or resulting in a seg-fault on the last page of a mmap()'d file (mmap()s of
CDRom files).

Reported by: peter / Yahoo
MFC after:	3 days
2001-11-05 18:58:47 +00:00
Matthew Dillon
e302698320 Don't let pmap_object_init_pt() exhaust all available free pages
(allocating pv entries w/ zalloci) when called in a loop due to
an madvise().  It is possible to completely exhaust the free page list and
cause a system panic when an expected allocation fails.
2001-10-31 03:06:33 +00:00
Matthew Dillon
7a5a635273 Move recently added procedure which was incorrectly placed within an
#ifdef DDB block.
2001-10-26 16:27:54 +00:00
Matthew Dillon
245df27cee Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  Improves looping case by 500%.

Optimize ffs_sync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held.  Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after:	1 week
2001-10-26 00:08:05 +00:00
Matthew Dillon
57601bcb5d Syntax cleanup and documentation, no operational changes.
MFC after:	1 day
2001-10-21 06:12:06 +00:00
Ian Dowse
0eb6ce3169 Move the code that computes the system load average from vm_meter.c
to kern_synch.c in preparation for adding some jitter to the
inter-sample time.

Note that the "vm.loadavg" sysctl still lives in vm_meter.c which
isn't the right place, but it is appropriate for the current (bad)
name of that sysctl.

Suggested by:	jhb (some time ago)
Reviewed by:	bde
2001-10-20 13:10:43 +00:00
Matthew Dillon
b386828956 contigmalloc1() could cause the vm_page_zero_count to become incorrect.
Properly track the count.

Submitted by:	mark tinguely <tinguely@web.cs.ndsu.nodak.edu>
2001-10-17 17:34:34 +00:00
Tor Egge
d6844b6bf6 Don't use an uninitialized field reserved for callers in the bio structure
passed to swap_pager_strategy().  Instead, use a field reserved for drivers
and initialize it before usage.

Reviewed by:	dillon
2001-10-15 23:02:54 +00:00
Tor Egge
30105b9ec4 Don't remove all mappings of a swapped out process if the vm map contained
wired entries.  vm_fault_unwire() depends on the mapping being intact.

Reviewed by:	dillon
2001-10-14 20:51:14 +00:00
Tor Egge
e7673b8424 Fix locking violations during page wiring:
- vm map entries are not valid after the map has been unlocked.

 - An exclusive lock on the map is needed before calling
   vm_map_simplify_entry().

Fix cleanup after page wiring failure to unwire all pages that had been
successfully wired before the failure was detected.

Reviewed by:	dillon
2001-10-14 20:47:08 +00:00
Matthew Dillon
33bd457d91 Makes contigalloc[1]() create the vm_map / underlying wired pages in the
kernel map and object in a manner that contigfree() is actually able to
free.  Previously contigfree() freed up the KVA space but could not
unwire & free the underlying VM pages due to mismatched pageability between
the map entry and the VM pages.

Submitted by:	Thomas Moestl <tmoestl@gmx.net>
Testing by: mark tinguely <tinguely@web.cs.ndsu.nodak.edu>
MFC after:	3 days
2001-10-13 04:23:37 +00:00
Matthew Dillon
00a6f47f13 Finally fix the VM bug where a file whos EOF occurs in the middle of a page
would sometimes prevent a dirty page from being cleaned, even when synced,
resulting in the dirty page being re-flushed to disk every 30-60 seconds or
so, forever.  The problem is that when the filesystem flushes a page to
its backing file it typically does not clear dirty bits representing areas
of the page that are beyond the file EOF.  If the file is also mmap()'d and
a fault is taken, vm_fault (properly, is required to) set the vm_page_t->dirty
bits to VM_PAGE_BITS_ALL.  This combination could leave us with an uncleanable,
unfreeable page.

The solution is to have the vnode_pager detect the edge case and manually
clear the dirty bits representing areas beyond the file EOF.  The filesystem
does the rest and the page comes up clean after the write completes.

MFC after:	3 days
2001-10-12 18:17:34 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
John Baldwin
61d80e90a9 Add missing includes of sys/ktr.h. 2001-10-11 17:53:43 +00:00
Paul Saab
cbc89bfbfe Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by:	peter
MFC after:	2 weeks
2001-10-10 23:06:54 +00:00
Ian Dowse
564bfabecb Remove the SSLEEP case from the load average computation. This has
been a no-op for as long as our CVS history goes back. Processes in
state SSLEEP could only be counted if p_slptime == 0, but immediately
before loadav() is called, schedcpu() has just incremented p_slptime
on all SSLEEP processes.
2001-10-04 22:33:31 +00:00
Robert Watson
8c5d4fe829 o Modify access control checks in mmap() to use securelevel_gt() instead
of direct variable access.

Obtained from:	TrustedBSD Project
2001-09-26 20:29:39 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
Peter Wemm
eb30c1c0b9 Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area.  KSE touched cpu_wait() which had the same change
replicated five ways for each platform.  Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86.  The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by:	jake, tmm, dillon
2001-09-10 04:28:58 +00:00
John Baldwin
29fdb744d1 Process priority is locked by the sched_lock, not the proc lock. 2001-09-01 20:16:30 +00:00
Matthew Dillon
7feaf028be make swapon() MPSAFE (will adjust syscalls.master later) 2001-08-31 22:15:37 +00:00
Matthew Dillon
6a33d53c48 mark obreak() and ovadvise() as being MPSAFE 2001-08-31 22:10:03 +00:00
Matthew Dillon
d2c60af81a Cleanup 2001-08-31 01:26:30 +00:00
Peter Wemm
3516c025ff Implement idle zeroing of pages. I've been tinkering with this
on and off since John Dyson left his work-in-progress.

It is off by default for now.  sysctl vm.zeroidle_enable=1 to turn it on.

There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we wont preempt otherwise.

This is basically Matt's algorithm [with hysteresis] with an idle process
to call it in a similar way it used to be called from the idle loop.

I cleaned up the includes a fair bit here too.
2001-08-25 05:00:44 +00:00
Matthew Dillon
676274db9b Remove support for the badly broken MAP_INHERIT (from -current only). 2001-08-24 19:29:56 +00:00
Matthew Dillon
219d632c15 Move most of the kernel submap initialization code, including the
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independant area.  i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.

Reviewed by:    freebsd-smp, jake
2001-08-22 04:07:27 +00:00
Matthew Dillon
0b76df7146 KASSERT if vm_page_t->wire_count overflows. 2001-08-22 04:01:56 +00:00
Matthew Dillon
2f9e4e8025 Limit the amount of KVM reserved for the buffer cache and for swap-meta
information.  The default limits only effect machines with > 1GB of ram
and can be overriden with two new kernel conf variables VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and
kern.maxbcache.  This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid tripups where a sysad
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.

Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating.  The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.
2001-08-20 00:41:12 +00:00
John Baldwin
02cd7c3cf2 - Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
  The only caller outside of M_ASLEEP was the ata driver, which called both
  asleep() and await() with spl-raised, so there was no need for the
  asleep() and await() pair.  M_ASLEEP was unused.

Reviewed by:	jasone, peter
2001-08-10 06:56:12 +00:00
John Baldwin
8ec48c6dbf - Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
  The only caller outside of M_ASLEEP was the ata driver, which called both
  asleep() and await() with spl-raised, so there was no need for the
  asleep() and await() pair.  M_ASLEEP was unused.

Reviewed by:	jasone, peter
2001-08-10 06:37:05 +00:00
Thomas Moestl
59fa485c3e Add a missing semicolon to unbreak the kernel build with INVARIANTS
(which was unfortunately turned off in the confguration I used for the
last test build).

Spotted by:	jake
Pointy hat to:	tmm
2001-08-05 03:55:02 +00:00
John Baldwin
bd8e0d5871 Whitespace fixes. 2001-08-04 20:49:29 +00:00
Thomas Moestl
b4c53a8111 Add a zdestroy() function to the zone allocator. This is needed for the
unload case of modules that use their own zones.
It has been tested with the nfs module.
2001-08-04 20:17:05 +00:00
Alfred Perlstein
61ce6eeee3 Fixups for the initial allocation by dillon:
1) allocate fewer buckets
  2) when failing to allocate swap zone, keep reducing the zone by
     a third rather than a half in order to reduce the chance of
     allocating way too little.

I also moved around some code for readability.

Suggested by: dillon
Reviewed by: dillon
2001-08-02 07:54:58 +00:00
Jake Burkholder
3a9b5daf48 Oops. Last commit to vm_object.c should have got these files too.
Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with:	dillon
2001-07-31 04:09:52 +00:00
Jake Burkholder
b06805ad34 Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with:	dillon
2001-07-31 04:03:53 +00:00
Ian Dowse
a4821e444e Permit direct swapping to NFS regular files using swapon(2). We
already allow this for NFS swap configured via BOOTP, so it is
known to work fine.

For many diskless configurations is is more flexible to have the
client set up swapping itself; it can recreate a sparse swap file
to save on server space for example, and it works with a non-NFS
root filesystem such as an in-kernel filesystem image.
2001-07-28 20:18:38 +00:00
Assar Westerlund
d3e5863fa9 make vm_page_select_cache static
Requested by:	bde
2001-07-23 12:34:31 +00:00
Assar Westerlund
0379d76358 (vm_page_select_cache): add prototype 2001-07-21 17:08:15 +00:00
Benno Rice
1f246456a5 The i386-specific includes in this file were "fixed" by bracketing them with
#ifndef __alpha__.  Fix this for the rest of the world by turning it into
#ifdef __i386__.

Reviewed by:	obrien
2001-07-15 04:11:51 +00:00
Dag-Erling Smørgrav
bf3009895e Fix missing newline and terminator at the end of the vm.zone sysctl. 2001-07-09 03:37:33 +00:00
Matt Jacob
f343cf2135 Apply field bandages to the includes so compiles happen on alpha. 2001-07-05 06:13:44 +00:00
Matthew Dillon
7197571105 Move vm_page_zero_idle() from machine-dependant sections to a
machine-independant source file, vm/vm_zeroidle.c.  It was exactly the
same for all platforms and updating them all was getting annoying.
2001-07-05 01:32:42 +00:00
Matthew Dillon
6d03d577a5 Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc...  vm_pageq.c relates to finding free pages and aquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)
2001-07-04 23:27:09 +00:00
Matthew Dillon
1b40f8c036 Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent.  Normalize prototypes and function declarations to be ANSI
only (half already were).  And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)
2001-07-04 20:15:18 +00:00
Matthew Dillon
54d9214595 whitespace / register cleanup 2001-07-04 19:00:13 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
John Baldwin
b62b9b648b Fix a XXX comment by moving the initialization of the number of pbuf's
for the vnode pager to a new vnode pager init method instead of making it
a hack in getpages().
2001-07-03 07:35:56 +00:00
John Baldwin
6d541bf1ae - Protect all accesses to nsw_[rw]count{,_{,a}sync} with the pbuf mutex.
- Don't drop the vm mutex while grabbing the pbuf mutex to manipulate
  said variables.
2001-06-22 21:12:19 +00:00
Bosko Milekic
08442f8a82 Introduce numerous SMP friendly changes to the mbuf allocator. Namely,
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:

 o Reduce contention for SMP by offering per-CPU pools and locks.
 o Better use of data cache due to per-CPU pools.
 o Much less code cache pollution due to excessively large allocation macros.
 o Framework for `grouping' objects from same page together so as to be able
   to possibly free wired-down pages back to the system if they are no longer
   needed by the network stacks.

 Additional things changed with this addition:

  - Moved some mbuf specific declarations and initializations from
    sys/conf/param.c into mbuf-specific code where they belong.
  - m_getclr() has been renamed to m_get_clrd() because the old name is really
    confusing. m_getclr() HAS been preserved though and is defined to the new
    name. No tree sweep has been done "to change the interface," as the old
    name will continue to be supported and is not depracated. The change was
    merely done because m_getclr() sounds too much like "m_get a cluster."
  - TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
    systat(1) (see TODO below).
  - Fixed systat(1) to display number of "free mbufs" based on new per-CPU
    stat structures.
  - Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
    per-CPU stat structures. All infos are fetched via sysctl.

 TODO (in order of priority):

  - Re-enable mbtypes statistics in both netstat(1) and systat(1) after
    introducing an SMP friendly way to collect the mbtypes stats under the
    already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
    seems too costly for a mere stat update, especially when other locks are
    already present).
  - Optionally have systat(1) display not only "total free mbufs" but also
    "total free mbufs per CPU pool."
  - Fix minor length-fetching issues in netstat(1) related to recently
    re-enabled option to read mbuf stats from a core file.
  - Move reference counters at least for mbuf clusters into an unused portion
    of the cluster itself, to save space and need to allocate a counter.
  - Look into introducing resource freeing possibly from a kproc.

Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/
2001-06-22 06:35:32 +00:00
John Baldwin
ad6c5bbede Don't lock around swap_pager_swap_init() that is only called once during
the pagedaemon's startup code since it calls malloc which results in lock
order reversals.
2001-06-20 23:34:06 +00:00
John Baldwin
69a78d4666 Put the scheduler, vmdaemon, and pagedaemon kthreads back under Giant for
now.  The proc locking isn't actually safe yet and won't be until the proc
locking is finished.
2001-06-20 00:48:20 +00:00
Matthew Dillon
ef6a93ef81 Cleanup the tabbing 2001-06-11 19:17:05 +00:00
Matthew Dillon
ff2b5645b5 Two fixes to the out-of-swap process termination code. First, start killing
processes a little earlier to avoid a deadlock.  Second, when calculating
the 'largest process' do not just count RSS.  Instead count the RSS + SWAP
used by the process.  Without this the code tended to kill small
inconsequential processes like, oh, sshd, rather then one of the many
'eatmem 200MB' I run on a whim :-).  This fix has been extensively tested on
-stable and somewhat tested on -current and will be MFCd in a few days.

Shamed into fixing this by: ps
2001-06-09 18:06:58 +00:00
Thomas Moestl
5c5c8fa826 Change the way information about swap devices is exported to be more
canonical: define a versioned struct xswdev, and add a sysctl node
handler that allows the user to get this structure for a certain device
index by specifying this index as last element of the MIB.
This new node handler, vm.swap_info, replaces the old vm.nswapdev
and vm.swapdevX.* (where X was the index) sysctls.
2001-06-01 22:53:10 +00:00
Thomas Moestl
d279178df7 Clean up the code exporting interrupt statistics via sysctl a bit:
- move the sysctl code to kern_intr.c
- do not use INTRCNT_COUNT, but rather eintrcnt - intrcnt to determine
  the length of the intrcnt array
- move the declarations of intrnames, eintrnames, intrcnt and eintrcnt
  from machine-dependent include files to sys/interrupt.h
- remove the hw.nintr sysctl, it is not needed.
- fix various style bugs

Requested by:	bde
Reviewed by:	bde (some time ago)
2001-06-01 13:23:28 +00:00
John Baldwin
342a1480aa Don't hold the VM lock across VOP's and other things that can sleep. 2001-05-29 16:58:25 +00:00
John Baldwin
190609dd48 Stick VM syscalls back under Giant if the BLEED option is not defined. 2001-05-24 18:04:29 +00:00
Matthew Dillon
ac8f990bde This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
John Baldwin
e6b961ffbd - Assert Giant is held in the vnode pager methods.
- Lock the VM while walking down a vm_object's backing_object list in
  vnode_pager_lock().
2001-05-23 22:51:23 +00:00
John Baldwin
3614c6fcbb - Add in several asserts of vm_mtx.
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon.  Also,
  lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon.  When process
  limits are locked we will be only need Giant for swapout_procs().
2001-05-23 22:48:28 +00:00
John Baldwin
60517fd1f7 - Assert that the vm lock is held for all of _vm_object_allocate().
- Restore the previous order of setting up a new vm_object.  The previous
  had a small bug where we zero'd out the flags after we set the
  OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
  places.
- Add in some #ifdef objlocks code to lock individual vm objects when
  vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command.  If DDB
  blocked on the lock, that would be worse than having an inconsistent
  allproc list.
2001-05-23 22:42:10 +00:00
John Baldwin
21c641b2a9 - Add lots of vm_mtx assertions.
- Add a few KTR tracepoints to track the addition and removal of
  vm_map_entry's and the creation adn free'ing of vmspace's.
- Adjust a few portions of code so that we update the process' vmspace
  pointer to its new vmspace before freeing the old vmspace.
2001-05-23 22:38:00 +00:00
John Baldwin
3a2189d451 - Lock the VM around the pmap_swapin_proc() call in faultin().
- Don't lock Giant in the scheduler() function except for when calling
  faultin().
- In swapout_procs(), lock the VM before the proccess to avoid a lock order
  violation.
- In swapout_procs(), release the allproc lock before calling swapout().
  We restart the process scan after swapping out a process.
- In swapout_procs(), un #if 0 the code to bump the vmspace reference count
  and lock the process' vm structures.  This bug was introduced by me and
  could result in the vmspace being free'd out from under a running
  process.
- Fix an old bug where the vmspace reference count was not free'd if we
  failed the swap_idle_threshold2 test.
2001-05-23 22:35:45 +00:00
John Baldwin
b608320d4a - Fix the sw_alloc_interlock to actually lock itself when the lock is
acquired.
- Assert Giant is held in the strategy, getpages, and putpages methods and
  the getchainbuf, flushchainbuf, and waitchainbuf functions.
- Always call flushchainbuf() w/o the VM lock.
2001-05-23 22:31:15 +00:00
John Baldwin
6d556da5c2 Assert Giant is held for the device pager alloc and getpages methods since
we call the mmap method of the cdevsw of the device we are mmap'ing.
2001-05-23 22:27:52 +00:00
John Baldwin
e4ca250d4b - Obtain Giant in mmap() syscall while messing with file descriptors and
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
  mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
  pager routines that need Giant or into other VM routines that need
  Giant.
- Replace code that used a goto to jump around the else branch of a test
  to use an else branch instead.
2001-05-23 22:17:43 +00:00
John Baldwin
bb10bb4978 Acquire Giant around vm_map_remove() inside of the obreak() syscall for
vm_object_terminate().
2001-05-23 22:13:10 +00:00
John Baldwin
576f0c5fa4 Take a more conservative approach and still lock Giant around VM faults
for now.
2001-05-23 22:09:18 +00:00
John Baldwin
c52f090cfb Set the phys_pager_alloc_lock to 1 when it is acquired so that it is
actually locked.
2001-05-23 19:52:23 +00:00
Alfred Perlstein
c5e62505ad aquire Giant when playing with the buffercache and doing IO.
use msleep against the vm mutex while waiting for a page IO to complete.
2001-05-23 10:28:11 +00:00
Alfred Perlstein
240e0fdd93 aquire vm mutex in swp_pager_async_iodone. Don't call swp_pager_async_iodone
with the mutex held.
2001-05-22 19:01:26 +00:00
John Baldwin
86e92ee7e1 Remove duplicate include and sort includes. 2001-05-22 07:21:46 +00:00
John Baldwin
7d4ad42de5 Sort includes. 2001-05-22 07:01:11 +00:00
John Baldwin
12635f9c89 Unlock the VM lock at the end of munlock() instead of locking it again. 2001-05-22 06:07:36 +00:00
John Baldwin
874468957d Sort includes from previous commit. 2001-05-22 05:35:45 +00:00
John Baldwin
4edf4a58e6 Sort includes. 2001-05-22 00:56:25 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
John Baldwin
ea7549540f - Use a timeout for the tsleep in scheduler() instead of having vmmeter()
wakeup proc0 by hand to enforce the timeout.
- When swapping out a process, keep the process locked via the proc lock
  from the first checks up until we clear PS_INMEM and set PS_SWAPPING in
  swapout().  The swapout() function now must be called with the proc lock
  held and releases it before returning.
- Comment out the code to attempt to lock a process' VM structures before
  swapping out.  It is broken in that it releases the lock after obtaining
  it.  If it does grab the lock, it needs to hand it off to swapout()
  instead of releasing it.  This can be revisisted when the VM is locked
  as this is a valid test to perform.  It also causes a lock order reversal
  for the time being, which is the immediate cause for temporarily
  disabling it.
2001-05-18 00:08:38 +00:00
John Baldwin
1c58e4e550 During the code to pick a process to kill when memory is exhausted, keep
the process in question locked as soon as we find it and determine it to
be eligible until we actually kill it.  To avoid deadlock, we don't block
on the process lock but skip any process that is already locked during our
search.
2001-05-17 22:49:03 +00:00
John Baldwin
c96d52a913 - Use PROC_LOCK_ASSERT instead of a direct mtx_assert.
- Don't hold Giant in the swapper daemon while we walk the list of
  processes looking for a process to swap back in.
- Don't bother grabbing the sched_lock while checking a process' sleep
  time in swapout_procs() to ensure that a process has been idle for at
  least swap_idle_threshold2 before swapping it out.  If we lose the race
  we just let a process stay in memory until the next call of
  swapout_procs().
- Remove some unneeded spl's, sched_lock does all the locking needed in
  this case.
2001-05-15 22:20:44 +00:00
Poul-Henning Kamp
a468031ce8 Actually biofinish(struct bio *, struct devstat *, int error) is more general
than the bioerror().

Most of this patch is generated by scripts.
2001-05-06 20:00:03 +00:00
Mark Murray
559034b748 Putting sys/lockmgr.h in here allows us to depollute userland includes
a bit.
OK'ed by:	bde
2001-05-03 11:33:51 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
Greg Lehey
60fb0ce365 Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
Alfred Perlstein
93c7ba9f09 Address a number of problems with sysctl_vm_zone().
The zone allocator's locks should be leaflocks, meaning that they
should never be held when entering into another subsystem, however
the sysctl grabs the zone global mutex and individual zone mutexes
while holding the lock it calls SYSCTL_OUT which recurses into the
VM subsystem in order to wire user memory to do a safe copy.  This
can block and cause lock order reversals.

To fix this:
  lock zone global.
  get a count of the number of zones.
  unlock global.
  allocate temporary storage.
  format and SYSCTL_OUT the banner.
  lock global.
  traverse list.
    make sure we haven't looped more than the initial count taken
      to avoid overflowing the allocated buffer.
    lock each nodes.
    read values and format into buffer.
    unlock individual node.
  unlock global.
  format and SYSCTL_OUT the rest of the data.
  free storage.
  return.

Other problems included not checking for errors when doing sysctl out
of the column header.  Fixed.

Inconsistant termination of the copied string. Fixed.

Objected to by: des (for not using sbuf)

Since the output is not variable length and I'm actually over
allocating signifigantly and I'd like to get this fixed now, I'll
work on the sbuf convertion at a later date.  I would not object
to someone else taking it upon themselves to convert it to sbuf.
I hold no MAINTIANER rights to this code (for now).
2001-04-27 22:24:45 +00:00
Greg Lehey
d98dc34f52 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
Alfred Perlstein
d8d5fa8805 vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()
2001-04-19 06:18:23 +00:00
Alfred Perlstein
a9fa2c05fc Protect pager object creation with sx locks.
Protect pager object list manipulation with a mutex.

It doesn't look possible to combine them under a single sx lock because
creation may block and we can't have the object list manipulation block
on anything other than a mutex because of interrupt requests.
2001-04-18 20:24:16 +00:00
Alfred Perlstein
305dd591ee Fix the botched rev 1.59 where I made it such that without INVARIANTS
the map is never locked.

Submitted by: tegge
2001-04-18 05:30:24 +00:00
Poul-Henning Kamp
f84e29a06c This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
Alfred Perlstein
cc64b484dd use TAILQ_FOREACH, fix a comment's location 2001-04-15 10:22:04 +00:00
Alfred Perlstein
971dd34298 if/panic -> KASSERT 2001-04-13 11:15:40 +00:00
Alfred Perlstein
2a758ebe58 protect pbufs and associated counts with a mutex 2001-04-13 10:23:32 +00:00
Alfred Perlstein
493607117e use %p for pointer printf, include sys/systm.h for printf proto 2001-04-13 10:22:14 +00:00
Alfred Perlstein
7d26b6a450 Use a macro wrapper over printf along with KASSERT to reduce the amount
of code here.
2001-04-13 08:07:37 +00:00
Alfred Perlstein
b28cb1ca07 remove truncated part from commment 2001-04-12 21:50:03 +00:00
John Baldwin
1005a129e5 Convert the allproc and proctree locks from lockmgr locks to sx locks. 2001-03-28 11:52:56 +00:00
John Baldwin
f34fa851e0 Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>
2001-03-28 09:17:56 +00:00
Thomas Moestl
368d2edce4 Export intrnames and intrcnt as sysctls (hw.nintr, hw.intrnames and
hw.intrcnt).

Approved by:	rwatson
2001-03-23 03:45:17 +00:00
Matthew Dillon
b823bbd6be Fix a lock reversal problem in the VM subsystem related to threaded
programs.   There is a case during a fork() which can cause a deadlock.

From Tor -
The workaround that consists of setting a flag in the vm map that
indicates that a fork is in progress and using that mark in the page
fault handling to force a revalidation failure.  That change will only
affect (pessimize) page fault handling during fork for threaded
(linuxthreads style) applications and applications using aio_*().

Submited by: tegge
2001-03-14 06:48:53 +00:00
Matthew Dillon
1a484d28dd Temporarily remove the vm_map_simplify() call from vm_map_insert(). The
call is correct, but it interferes with the massive hack called
vm_map_growstack().  The call will be returned after our stack handling
code is fixed.

Reported by: tegge
2001-03-14 06:09:42 +00:00
Ian Dowse
d30344bdfa When creating a shadow vm_object in vmspace_fork(), only one
reference count was transferred to the new object, but both the
new and the old map entries had pointers to the new object.
Correct this by transferring the second reference.

This fixes a panic that can occur when mmap(2) is used with the
MAP_INHERIT flag.

PR:		i386/25603
Reviewed by:	dillon, alc
2001-03-09 18:25:54 +00:00
John Baldwin
136d8f42b9 Unrevert the pmap_map() changes. They weren't broken on x86.
Sense beaten into me by:	peter
2001-03-07 05:29:21 +00:00
John Baldwin
4a01ebd482 Back out the pmap_map() change for now, it isn't completely stable on the
i386.
2001-03-07 01:04:17 +00:00
John Baldwin
968950e5d1 - Rework pmap_map() to take advantage of direct-mapped segments on
supported architectures such as the alpha.  This allows us to save
  on kernel virtual address space, TLB entries, and (on the ia64) VHPT
  entries.  pmap_map() now modifies the passed in virtual address on
  architectures that do not support direct-mapped segments to point to
  the next available virtual address.  It also returns the actual
  address that the request was mapped to.
- On the IA64 don't use a special zone of PV entries needed for early
  calls to pmap_kenter() during pmap_init().  This gets us in trouble
  because we end up trying to use the zone allocator before it is
  initialized.  Instead, with the pmap_map() change, the number of needed
  PV entries is small enough that we can get by with a static pool that is
  used until pmap_init() is complete.

Submitted by:		dfr
Debugging help:		peter
Tested by:		me
2001-03-06 06:06:42 +00:00
Alfred Perlstein
8125b1e66e Simplify vm_object_deallocate(), by decrementing the refcount first.
This allows some of the conditionals to be combined.
2001-03-04 20:25:23 +00:00
Andrew Gallatin
c909b97167 Allocate vm_page_array and vm_page_buckets from the end of the biggest chunk
of memory, rather than from the start.

This fixes problems allocating bouncebuffers on alphas where there is only
1 chunk of memory (unlike PCs where there is generally at least one small
chunk and a large chunk).  Having 1 chunk had been fatal, because these
structures take over 13MB on a machine with 1GB of ram. This doesn't leave
much room for other structures and bounce buffers if they're at the front.

Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch
Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>
2001-03-01 19:21:24 +00:00
Matthew Dillon
5bf53acb74 If we intend to make the page writable without requiring another fault,
make sure that PG_NOSYNC is properly set.  Previously we only set it
for a write-fault, but this can occur on a read-fault too.
(will be MFCd prior to 4.3 freeze)
2001-02-28 04:26:43 +00:00
Robert Watson
edfa785a8e Introduce per-swap area accounting in the VM system, and export
this information via the vm.nswapdev sysctl (number of swap areas)
and vm.swapdevX nodes (where X is the device), which contain the MIBs
dev, blocks, used, and flags.  These changes are required to allow
top and other userland swap-monitoring utilities to run without
setgid kmem.

Submitted by:	Thomas Moestl <tmoestl@gmx.net>
Reviewed by:	freebsd-audit
2001-02-23 18:46:21 +00:00
Dag-Erling Smørgrav
2f9564de0f Fix formatting bugs introduced in sysctl_vm_zone() by the previous commit.
Also, if SYSCTL_OUT() returns a non-zero value, stop at once.
2001-02-22 14:44:39 +00:00
Jake Burkholder
d5a08a6065 Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
Bosko Milekic
9ed346bab0 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
Poul-Henning Kamp
fc2ffbe604 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
Matthew Dillon
4e71e795a1 This commit represents work mainly submitted by Tor and slightly modified
by myself.  It solves a serious vm_map corruption problem that can occur
with the buffer cache when block sizes > 64K are used.  This code has been
heavily tested in -stable but only tested somewhat on -current.  An MFC
will occur in a few days.  My additions include the vm_map_simplify_entry()
and minor buffer cache boundry case fix.

Make the buffer cache use a system map for buffer cache KVM rather then a
normal map.

Ensure that VM objects are not allocated for system maps.  There were cases
where a buffer map could wind up with a backing VM object -- normally
harmless, but this could also result in the buffer cache blocking in places
where it assumes no blocking will occur, possibly resulting in corrupted
maps.

Fix a minor boundry case in the buffer cache size limit is reached that
could result in non-optimal code.

Add vm_map_simplify_entry() calls to prevent 'creeping proliferation'
of vm_map_entry's in the buffer cache's vm_map.  Previously only a simple
linear optimization was made.  (The buffer vm_map typically has only a
handful of vm_map_entry's.  This stabilizes it at that level permanently).

PR: 20609
Submitted by: (Tor Egge) tegge
2001-02-04 06:19:28 +00:00
John Baldwin
45ece682fd - Doh, lock faultin() with proc lock in scheduler().
- Lock p_swtime with sched_lock in scheduler() as well.
2001-01-25 01:38:09 +00:00
Jason Evans
1b367556b5 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
John Baldwin
69b4045657 Argh, I didn't get this test right when I converted it. Break this up
into two separate if's instead of nested if's.  Also, reorder things
slightly to avoid unnecessary mutex operations.
2001-01-24 12:23:17 +00:00
John Baldwin
8606d88043 - Catch up to proc flag changes.
- Minimal proc locking.
- Use queue macros.
2001-01-24 11:28:36 +00:00
John Baldwin
e2181d41d0 Add mtx_assert()'s to verify that kmem_alloc() and kmem_free() are called
with Giant held.
2001-01-24 11:27:29 +00:00
John Baldwin
5074aecd6c - Catch up to proc flag changes.
- Proc locking in a few places.
- faultin() now must be called with the proc lock held.
- Split up swappable() into a couple of tests so that it can be locke in
  swapout_procs().
- Use queue macros.
2001-01-24 11:25:56 +00:00
John Baldwin
b939335607 - Catch up to proc flag changes. 2001-01-24 11:20:05 +00:00
John Baldwin
0f68b6595a Add missing include. 2001-01-24 06:54:24 +00:00
Hajimu UMEMOTO
5d22597f3a Add mibs to hold the number of forks since boot. New mibs are:
vm.stats.vm.v_forks
	vm.stats.vm.v_vforks
	vm.stats.vm.v_rforks
	vm.stats.vm.v_kthreads
	vm.stats.vm.v_forkpages
	vm.stats.vm.v_vforkpages
	vm.stats.vm.v_rforkpages
	vm.stats.vm.v_kthreadpages

Submitted by:	Paul Herman <pherman@frenchfries.net>
Reviewed by:	alfred
2001-01-23 14:32:01 +00:00
Jake Burkholder
43be6e2fa2 Sigh. atomic_add_int takes a pointer, not an integer.
Pointy-hat-to:	des
2001-01-23 03:40:27 +00:00
Dag-Erling Smørgrav
ac2e223868 Use atomic operations to update the stat counters. 2001-01-23 01:11:11 +00:00
Dag-Erling Smørgrav
e78411656b Call vm_zone_init() at the appropriate time.
Reviewed by:	jasone, jhb
2001-01-22 07:02:42 +00:00
Dag-Erling Smørgrav
0b0dfb6b07 Give this code a major facelift:
- replace the simplelock in struct vm_zone with a mutex.

 - use a proper SLIST rather than a hand-rolled job for the zone list.

 - add a subsystem lock that protects the zone list and the statistics
   counters.

 - merge _zalloc() into zalloc() and _zfree() into zfree(), and
   move them below _zget() so there's no need for a prototype.

 - add two initialization functions: one which initializes the
   subsystem mutex and the zone list, and one that currently doesn't
   do anything.

 - zap zerror(); use KASSERTs instead.

 - dike out half of sysctl_vm_zone(), which was mostly trying to do
   manually what the snprintf() call could do better.

Reviewed by:	jhb, jasone
2001-01-22 07:01:50 +00:00
Dag-Erling Smørgrav
a3ea6d41b9 First step towards an MP-safe zone allocator:
- have zalloc() and zfree() always lock the vm_zone.
 - remove zalloci() and zfreei(), which are now redundant.

Reviewed by:	bmilekic, jasone
2001-01-21 22:23:11 +00:00
Alfred Perlstein
030f23696c fix comment which was outdated 3 years ago
remove useless assignment
purge entire file of 'register' keyword
2000-12-29 13:49:05 +00:00
Alfred Perlstein
6e4f51d1ac clean up kmem_suballoc():
remove useless assignment
remove 'register' variables
2000-12-29 13:05:22 +00:00
Assar Westerlund
68d74cd044 Make zalloc and zfree non-inline functions. This avoids having to
have the code calling these be compiled with the same setting for
INVARIANTS and SMP.

Reviewed by:	dillon
2000-12-27 02:54:37 +00:00
Matthew Dillon
2b6b0df712 This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution.  However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box).  The new
algorithm limits the number of pages laundered in the first pageout daemon
pass.  If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace.  This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads.  It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis.  At the moment it is globalized.
2000-12-26 19:41:38 +00:00
Poul-Henning Kamp
065b25803d Fix floppy drives on machines with lots of RAM.
The fix works by reverting the ordering of free memory so that the
chances of contig_malloc() succeeding increases.

PR:		23291
Submitted by:	Andrew Atrens <atrens@nortel.ca>
2000-12-18 20:12:13 +00:00
Seigo Tanimura
21cd6e6232 - If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.

This is just a temporary fix. Better solutions include:
(suggested by:	dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by:	alfred, dillon
2000-12-13 10:01:00 +00:00
Jake Burkholder
c0c2557090 - Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr.  Also provides macros for the flags
  pased to specify shared, exclusive or release which map to the
  lockmgr flags.  This is so that the use of lockmgr can be easily
  replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
2000-12-13 00:17:05 +00:00
Matthew Dillon
02fa91d35e Be less conservative with a recently added KASSERT. Certain edge
cases with file fragments and read-write mmap's can lead to a situation
    where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by
    an mmap and only the fragment (representing a non-page-aligned end of
    file) synced via a filesystem buffer.  A correct solution that
    guarentees consistent m->dirty for the file EOF case is being
    worked on.  In the mean time we can't be so conservative in the
    KASSERT.
2000-12-11 07:52:47 +00:00
David Malone
7cc0979fd6 Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
Alfred Perlstein
b5861b3450 Really fix phys_pager:
Backout the previous delta (rev 1.4), it didn't make any difference.

If the requested handle is NULL then don't add it to the list of
objects, to be found by handle.

The problem is that when asking for a NULL handle you are implying
you want a new object.  Because objects with NULL handles were
being added to the list, any further requests for phys backed
objects with NULL handles would return a reference to the initial
NULL handle object after finding it on the list.

Basically one couldn't have more than one phys backed object without
a handle in the entire system without this fix.  If you did more
than one shared memory allocation using the phys pager it would
give you your initial allocation again.
2000-12-06 21:52:23 +00:00
Alfred Perlstein
54019b0afe need to adjust allocation size to properly deal with non PAGE_SIZE
allocations, specifically with allocations < PAGE_SIZE when the code
doesn't work properly
2000-12-05 22:22:24 +00:00
Bruce Evans
03b67a395f Backed out previous commit. Don't depend on namespace pollution in
<sys/buf.h>.
2000-12-02 12:03:58 +00:00
John Baldwin
c5a44a6af6 Protect p_stat with sched_lock. 2000-12-02 06:09:44 +00:00
John Baldwin
c8a6b0011c Protect p_stat with sched_lock. 2000-12-02 03:29:33 +00:00
Alfred Perlstein
82625cf321 remove unneded sys/ucred.h includes 2000-11-30 18:52:32 +00:00
Jake Burkholder
553629ebc9 Protect the following with a lockmgr lock:
allproc
	zombproc
	pidhashtbl
	proc.p_list
	proc.p_hash
	nextpid

Reviewed by:	jhb
Obtained from:	BSD/OS and netbsd
2000-11-22 07:42:04 +00:00
Robert Watson
cee313c431 o Export dmmax ("Maximum size of a swap block") using SYSCTL_INT.
This removes a reason that systat requires setgid kmem.  More to
  come.
2000-11-20 00:39:04 +00:00
Matthew Dillon
936524aa02 Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
Matthew Dillon
ef0646f9d8 Add the splvm()'s suggested in PR 20609 to protect vm_pager_page_unswapped().
The remainder of the PR is still open.

PR: kern/20609 (partial fix)
2000-11-18 21:11:23 +00:00
Matthew Dillon
279d722604 This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
    array.  However, with the advent of rfork() the file descriptor table
    could be shared between processes.  This patch closes over a dozen
    serious race conditions related to one thread manipulating the table
    (e.g. closing or dup()ing a descriptor) while another is blocked in
    an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>
2000-11-18 21:01:04 +00:00
Tor Egge
028fe6ec24 Clear the MAP_ENTRY_USER_WIRED flag from cloned vm_map entries.
PR:		2840
2000-11-02 21:38:18 +00:00
Poul-Henning Kamp
9f69a4578a Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing
the offending inline function (BUF_KERNPROC) on it being #included
already.

I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).

Remove consequently unneeded #includes of <sys/proc.h>
2000-10-29 14:54:55 +00:00
John Baldwin
915cf38b11 - Catch a machine/mutex.h -> sys/mutex.h I somehow missed.
- Close a small race condition.  The sched_lock mutex protects
  p->p_stat as well as the run queues.  Another CPU could change p_stat
  of the process while we are waiting for the lock, and we would end up
  scheduling a process that isn't runnable.
2000-10-25 00:04:16 +00:00
Paul Saab
c794ceb56a Implement write combining for crashdumps. This is useful when
write caching is disabled on both SCSI and IDE disks where large
memory dumps could take up to an hour to complete.

Taking an i386 scsi based system with 512MB of ram and timing (in
seconds) how long it took to complete a dump, the following results
were obtained:

Before:				After:
	WCE           TIME		WCE           TIME
	------------------		------------------
	1	141.820972		1	 15.600111
	0	797.265072		0	 65.480465

Obtained from:	Yahoo!
Reviewed by:	peter
2000-10-17 10:05:49 +00:00
Matthew Dillon
64bcb9c815 The swap bitmap allocator was not calculating the bitmap size properly
in the face of non-stripe-aligned swap areas.  The bug could cause a
      panic during boot.

      Refuse to configure a swap area that is too large (67 GB or so)

      Properly document the power-of-2 requirement for SWB_NPAGES.

      The patch is slightly different then the one Tor enclosed in the P.R.,
      but accomplishes the same thing.

PR: kern/20273
Submitted by: Tor.Egge@fast.no
2000-10-13 16:44:34 +00:00
Jason Evans
9722d88fba For lockmgr mutex protection, use an array of mutexes that are allocated
and initialized during boot.  This avoids bloating sizeof(struct lock).
As a side effect, it is no longer necessary to enforce the assumtion that
lockinit()/lockdestroy() calls are paired, so the LK_VALID flag has been
removed.

Idea taken from:	BSD/OS.
2000-10-12 22:37:28 +00:00
David Malone
a77c9610f2 If a process is over its resource limit for datasize, still allow
it to lower its memory usage. This was mentioned on the mailing
lists ages ago, and I've lost the name of the person who brought
it up.

Reviewed by:	alc
2000-10-06 13:03:50 +00:00
Jason Evans
a18b1f1d4d Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
John Baldwin
7ab37af1ed - Add a new process flag P_NOLOAD that marks a process that should be
ignored during load average calcuations.
- Set this flag for the idle processes and the softinterrupt process.
2000-09-15 22:00:23 +00:00
Boris Popov
9ff5ce6baf Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by:	mckusick, dillon
2000-09-12 09:49:08 +00:00
Jason Evans
0384fff8c5 Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
David E. O'Brien
ee2fd03885 Make the arguments match the functionality of the functions. 2000-08-26 04:51:39 +00:00
Peter Wemm
bb663856f8 Minor cleanups:
- remove unused variables (fix warnings)
 - use a more consistant ansi style rather than a mixture
 - remove dead #if 0 code and declarations
2000-07-28 22:03:08 +00:00
Kirk McKusick
3592b7155c Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2000-07-26 23:07:01 +00:00
Kirk McKusick
f2a2857bb3 Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
Alfred Perlstein
4c8545c13d #elsif -> #elif
Noticed by: green
2000-07-11 09:41:29 +00:00
John Baldwin
9701cd40b4 Support for unsigned integer and long sysctl variables. Update the
SYSCTL_LONG macro to be consistent with other integer sysctl variables
and require an initial value instead of assuming 0.  Update several
sysctl variables to use the unsigned types.

PR:		15251
Submitted by:	Kelly Yancey <kbyanc@posi.net>
2000-07-05 07:46:41 +00:00
Poul-Henning Kamp
77978ab8bc Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.
Pointed out by:	bde
2000-07-04 11:25:35 +00:00
John Baldwin
9a20f99adf Replace the PQ_*CACHE options with a single PQ_CACHESIZE option that you
set equal to the number of kilobytes in your cache.  The old options are
still supported for backwards compatibility.

Submitted by:	Kelly Yancey <kbyanc@posi.net>
2000-07-04 08:55:18 +00:00
Kirk McKusick
c904bbbdd8 Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).
2000-07-04 04:32:40 +00:00
Poul-Henning Kamp
82d9ae4e32 Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:
Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

        -sysctl_vm_zone SYSCTL_HANDLER_ARGS
        +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
2000-07-03 09:35:31 +00:00
Mark Murray
2589f2499d Nifty idea from Jeroen van Gelderen; don't call a routine to check if
we are using the /dev/zero device, just check a flag (supplied by
/dev/zero).
Reviewed by:	dfr
2000-06-25 09:44:32 +00:00
Jeffrey Hsu
b10c4e187f Add missing increment of allocation counter. 2000-06-05 06:34:41 +00:00
Matthew Dillon
8b03c8ed5e This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it.  It implements a new
    PG_UNMANAGED flag that has slightly different characteristics
    from PG_FICTICIOUS.

    A new sysctl, kern.ipc.shm_use_phys has been added to enable the
    use of physically-backed sysv shared memory rather then swap-backed.
    Physically backed shm segments are not tracked with PV entries,
    allowing programs which use a large shm segment as a rendezvous
    point to operate without eating an insane amount of KVM in the
    PV entry management.  Read: Oracle.

    Peter's OBJT_PHYS object will also allow us to eventually implement
    page-table sharing and/or 4MB physical page support for such segments.
    We're half way there.
2000-05-29 22:40:54 +00:00
Doug Rabson
1536418a84 Brucify the pmap_enter_temporary() changes. 2000-05-29 19:21:01 +00:00