constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)
Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..
best path forward now is likely to change the lockmgr locks to simple
sleep mutexes, then see if any extra contention it generates is greater
than removed overhead of managing local locking state information,
cost of extra calls into lockmgr, etc.
Additionally, making the vm_map lock a mutex and respecting it properly
will put us much closer to not needing Giant magic in vm.
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.
After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.
This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.
Reviewed by: alc
In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.
Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.
Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.
Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.
PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week
- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.
Reviewed by: dillon
MFC after: 1 week
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.
Hopefully MFC: before the 4.5 release
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.
vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and aquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
processes a little earlier to avoid a deadlock. Second, when calculating
the 'largest process' do not just count RSS. Instead count the RSS + SWAP
used by the process. Without this the code tended to kill small
inconsequential processes like, oh, sshd, rather then one of the many
'eatmem 200MB' I run on a whim :-). This fix has been extensively tested on
-stable and somewhat tested on -current and will be MFCd in a few days.
Shamed into fixing this by: ps
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also,
lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon. When process
limits are locked we will be only need Giant for swapout_procs().
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
the process in question locked as soon as we find it and determine it to
be eligible until we actually kill it. To avoid deadlock, we don't block
on the process lock but skip any process that is already locked during our
search.
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.
- Reject swapon(2) upon failure of swap_zone allocation.
This is just a temporary fix. Better solutions include:
(suggested by: dillon)
o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.
Reviewed by: alfred, dillon
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
cases with file fragments and read-write mmap's can lead to a situation
where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by
an mmap and only the fragment (representing a non-page-aligned end of
file) synced via a filesystem buffer. A correct solution that
guarentees consistent m->dirty for the file EOF case is being
worked on. In the mean time we can't be so conservative in the
KASSERT.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.
A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.
Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.
This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.
Reviewed by: dillon
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.
Inspired by: John Dyson's 1998 patches.
Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).
Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).
Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.
Reviewed by: dillon
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.
This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.
Discussed with: grog, mch, peter, phk
Reviewed by: peter
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.
Reviewed by: dillon
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.
Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().
add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.
One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
is the preparation step for moving pmap storage out of vmspace proper.
Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>
PQ_FREE. There is little operational difference other then the kernel
being a few kilobytes smaller and the code being more readable.
* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)
PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guarenteed to remain
tracked throughout a page's life ( if it isn't already ).
When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.
Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.
Removed call to vm_object_collapse(), which can block. This was being
called without the pageout code holding any sort of reference on the
vm_object or vm_page_t structures being manipulated. Since this code
can block, it was possible for other kernel code to shred the state
the pageout code was assuming remained intact.
Fixed potential blocking condition in vm_pageout_page_free() ( which
could cause a deadlock in a low-memory situation ).
Currently there is a hack in-place to deal with clean filesystem meta-data
polluting the inactive page queue. John doesn't like the hack, and neither
do I.
Revamped and commented a portion of the pageout loop.
Added protection against potential memory deadlocks with OBJT_VNODE
when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL
which causes VOP_ISLOCKED() to return a less informed answer.
remove vm_pager_sync() -- none of the pagers use it any more ( the old
swapper used to. The new one does not ).
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.
Reviewed by: Bruce Evans <bde@zeta.org.au>
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.
With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappyness."
improve tuning on larger systems. (A couple of the VM tuning params for
small systems were so badly chosen that the system could hang under load.)
The broken tuning was originaly my fault.
have declined due to code-rot over time. The swap pager rundown code
has been clean-up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overheadla.)
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.
The system should be much more stable, but all subtile bugs aren't fixed yet.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
is apparently useful for large shell systems, or systems with long running
idle processes. To enable the feature:
sysctl -w vm.swap_idle_enabled=1
Please note that some of the other vm sysctl variables have been renamed
to be more accurate.
Submitted by: Much of it from Matt Dillon <dillon@best.net>
the maximum pageout rate:
sysctl -w vm.vm_maxlaunder=n
1 < n < inf.
If paging heavily on large systems, it is likely that a performance
improvement can be achieved by increasing the parameter. On a large
system, the parm is 32, but numbers as large as 128 can make a big
difference. If paging is expensive, you might try decreasing the
number to 1-8.
sysctl -w vm.defer_swap_pageouts=1
Causes the system to resist the use of swap space. In low memory
conditions, performance will decrease.
sysctl -w vm.disable_swap_pageouts=1
Causes the system to mostly disable the use of swap space. In
low memory conditions, the system will likely start killing
processes.
the system is out of memory. The daemon does a minimal amount of work that
increases as the system becomes more likely to run out of memory and page in/out.
The default tuning is fairly low in background CPU usage, and sysctl variables
have been added to enable flexable operation. This is an experimental feature
that will likely be changed and improved over time.
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.
performance issues.
1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
performance to LRU or worse when RSS limiting takes effect. Also,
make an end condition in the active queue scan more efficient in the
case where pages are removed from the active queue as a side effect
of a pmap operation.
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much more safe (eliminates the possiblity of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup
2) Create a new entry point into pmap: pmap_ts_referenced, eliminates
the need to scan the pv lists twice in many cases. Perhaps there
is alot more to do here to work on minimizing pv list manipulation
3) Minor improvements to vm_pageout including the use of pmap_ts_ref.
4) Major changes and code improvement to pmap. This code has had
several serious bugs in page table page manipulation. In order
to simplify the problem, and hopefully solve it for once and all,
page table pages are no longer "managed" with the pv list stuff.
Page table pages are only (mapped and held/wired) or
(free and unused) now. Page table pages are never inactive,
active or cached. These changes have probably fixed the
hold count problems, but if they haven't, then the code is
simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I
have taken a first (of probably many) steps. Please tell me
if you have any ideas.
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.
case where blocking can occur, thereby giving other process's a chance
to modify the queue where a page resides. This could cause numerous
process and system failures.
a condition when blocking can occur, and the daemon did not check properly
for a page remaining on the expected queue. Additionally, the inactive
target was being set much too large for small memory machines. It is now
being calculated based upon the amount of user memory available on every
pageout daemon run. Another problem was that if memory was very low, the
pageout daemon could fail repeatedly to traverse the inactive queue.
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.
to help diagnose a problem on wcarchive (where the kernel stack was
sometimes not present), but is useful in its own right since swapping
actually reduces performance on some systems (such as wcarchive).
Note: swapping in this context means making the U pages pageable and has
nothing to do with generic VM paging, which is unaffected by this option.
Reviewed by: <dyson>
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.