from vm_map_pageable(). At the point they called, vm_map_pageable()
holds a read (or shared) lock on the map. The purpose
of vm_map_{clear,set}_recursive() is to disable/enable repeated
write (or exclusive) lock requests by the same process.
vm_map always failed because vm_map_lookup() looked at
"vm_map_entry->wired_count" instead of "(vm_map_entry->eflags &
MAP_ENTRY_USER_WIRED)". The effect was that many page
wiring operations by sysctl were (silently) failing.
multiplexed underlying swap devices (/dev/drum). The only thing it did
was to allow root to open /dev/drum, but not do anything with it.
Various utilities used to grovel around in here, but Matt has written
a much nicer (and clean) front-end to this for libkvm, and nothing uses
the old system any more.
The VM system was calling VOP_STRATEGY() on the vp of the first underlying
swap device (not the /dev/drum one, the first real device), and using
the VOP system to indirectly (and only) call swstrategy() to choose
an underlying device and enqueue it on that device. I have changed it
to avoid diverting through the VOP system and to call the only possible
target directly, saving a little bit of time and some complexity.
In all, nothing much changes, except some scaffolding to support the
roundabout way of calling swstrategy() is gone.
Matt gave me the ok to do this some time ago, and I apologize for taking
so long to get around to it.
instead of duplicating the code. (2) If a wired page is passed
to vm_page_free_toq, panic instead of printing a friendly warning.
(If we don't panic here, we'll just panic later in vm_page_unwire
obscuring the problem.)
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.
Reviewed by: dillon
"rw" argument, rather than hijacking B_{READ|WRITE}.
Fix two bugs (physio & cam) resulting by the confusion caused by this.
Submitted by: Tor.Egge@fast.no
Reviewed by: alc, ken (partly)
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
hexdump -C < /dev/drum
by simply refusing to do I/O from userland.
a panic. I'm not sure we even need /dev/drum anymore, it seems
to have been broken for a long time thi
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.
This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.
Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
spaces which cross a segment boundry in the page table. pmap_kextract()
is not designed for access to the user space portion of the page
table and cannot handle the null-page-directory-entry case.
The fix is to have vm_fault_quick() return a success or failure which
is then used to avoid calling pmap_kextract().
syncs the entire underlying file rather then just the requested range,
resulting in huge inefficiencies when the VM system is articulated in
a certain way. The VOP_FSYNC was also found to massively reduce NFS
performance in certain cases.
Change MADV_DONTNEED and MADV_FREE to call vm_page_dontneed() instead
of vm_page_deactivate(). Using vm_page_deactivate() causes all
inactive and cache pages to be recycled before the dontneed/free page
is recycled, effectively flushing our entire VM inactive & cache
queues continuously even if only a few pages are being actively MADV
free'd and reused (such as occurs with a sequential scan of a
memory-mapped file).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
from the vnode. (The changeover is undergoing final testing and
will be committed soon).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
underlying physical sector size when aligning I/O transfer sizes.
It cannot assume 512 bytes.
We assume the underlying sector size is a power of 2. If it isn't,
mmap() will break badly anyway (in the same way mmap broke with NFS
when NFS tried to cache piecemeal write ranges in buffers, before
we enforced read-buffer-before-write-piecemeal for NFS).
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
Swap space can be freed from an interrupt and so swap reservation and
freeing must occur at splvm.
Add swap_pager_reserve() code to support a new swap pre-reservation
capability for the VN device.
Generally cleanup the swap code by simplifying the swp_pager_meta_build()
static function and consolidating the SWAPBLK_NONE test from a bit test
to an absolute compare. The bit test was left over from a rejected
swap allocation scheme that was not ultimately committed. A few other
minor cleanups were also made.
Reorganize the swap strategy code, again for VN support, to not
reallocate swap when writing as this messes up pre-reservation and
can fragment I/O unnecessarily as VN-baesd disk is messed around with.
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
current process from the exclusive lock prior to initiating I/O.
This fixes a panic related to swap-backed VN disks
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.
Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().
add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.
Make the alias list a SLIST.
Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.
Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.
Make the revoke syscalls use vcount() instead of VALIASED.
Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.
vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.
Print the devicename in specfs/vprint().
Remove a couple of stale LFS vnode flags.
Remove unimplemented/unused LK_DRAINED;
creation of /dev/drum via calling swapon. However, the make_dev has a
bogus (insofar that it hasn't been added yet) cdevsw, so later we end
up crashing with a null pointer dereference on the swap vp's specinfo.
The specinfo points to a dev_t with a major of 254 (uninitialized), and
we get a crash on its d_strategy being called.
The simple solution to this is to call cdevsw_add before the make_dev
is ever used. This fixes the panic which occurred upon swapping.
Diskslice/label code not yet handled.
Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)
Add the correct hook for devfs to kern_conf.c
The net result of this excercise is that a lot less files depends on DEVFS,
and devtoname() gets more sensible output in many cases.
A few drivers had minor additional cleanups performed relating to cdevsw
registration.
A few drivers don't register a cdevsw{} anymore, but only use make_dev().
The lock structure cannot be the first element of the vm_map
because this can result in livelock between two or more system
processes trying to kmem_alloc_wait.
Remove semicolons or add "do { } while (0)" as necessary
to enable the use of these macros in arbitrary statements.
(There are no functional changes.)
Submitted by: dillon
A complete rewrite by dillon and myself to separate
the implementation of behaviors that effect the vm_map_entry
from those that effect the vm_object.
A result of this change is that madvise(..., MADV_FREE);
is much cheaper.
This setting is also acceptable for Celerons and Pentium Pros
with less than 1MB L2 caches.
Note: PQ_L2_SIZE is a misnomer. The correct number of colors is
a function of the cache's degree of associativity as well as its size.
Submitted by: bde and alc
Now that behaviors are stored in the vm_map_entry rather than
the vm_object, it's no longer necessary to instantiate a vm_object
just to hold the behavior.
Reviewed by: dillon
Remove the initialization of PQ_NONE's cnt and lcnt. They aren't
used.
vm_page_insert:
Remove an unnecessary dereference.
vm_page_wire:
Remove the one and only (and thus pointless) reference
to PQ_NONE's lcnt.
When creating new processes (or performing exec), the new page
directory is initialized too early. The kernel might grow before
p_vmspace is initialized for the new process. Since pmap_growkernel
doesn't yet know about the new page directory, it isn't updated, and
subsequent use causes a failure.
The fix is (1) to clear p_vmspace early, to stop pmap_growkernel
from stomping on memory, and (2) to defer part of the initialization
of new page directories until p_vmspace is initialized.
PR: kern/12378
Submitted by: tegge
Reviewed by: dfr
vm_map.c:
Don't set OBJ_ONEMAPPING on arbitrary vm objects. Only default
and swap type vm objects should have it set. vm_object_deallocate
already handles these cases.
vm_object.c:
If OBJ_ONEMAPPING isn't already clear in vm_object_shadow,
we are in trouble. Instead of clearing it, make it
an assertion that it is already clear.
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.
One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
creating a new entry. vm_map_stack and vm_map_growstack can panic when
a new entry isn't created. Fixed vm_map_stack and vm_map_growstack.
Also, when extending the stack, always set the protection to VM_PROT_ALL.
Insure that device mappings get MAP_PREFAULT(_PARTIAL) set,
so that 4M page mappings are used when possible.
Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.
cdevsw_add() will print an message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.
Reformat and initialize correctly all "struct cdevsw".
Initialize the d_maj and d_bmaj fields.
The d_reset field was not removed, although it is never used.
I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.
Vinum and i4b not modified, patches emailed to respective authors.
Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).
Remove a redundant test in vm_uiomove (vm_map.c).
Make two changes to vm_object_coalesce:
1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)
2. Free any swap space allocated to removed pages.
It never makes sense to specify MAP_COPY_NEEDED without also specifying
MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.
Reviewed by: David Greenman <dg@root.com>
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1:
s/suser/suser_xxx/
2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with
later.
There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c
Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
1. Don't bother checking object->ref_count == 1 in order to set
OBJ_ONEMAPPING. It's a waste of time. If object->ref_count == 1,
vm_map_entry_delete will "run-down" the object and its pages.
2. If object->ref_count == 1, ignore OBJ_ONEMAPPING. Wait for
vm_map_entry_delete to "run-down" the object and its pages.
Otherwise, we're calling two different procedures to delete
the object's pages.
Note: "vmstat -s" will once again show a non-zero value
for "pages freed by exiting processes".
Remove more (redundant) map timestamp increments from properly
synchronized routines. (Changed: vm_map_entry_link, vm_map_entry_unlink,
and vm_map_pageable.)
Micro-optimize vm_map_entry_link and vm_map_entry_unlink, eliminating
unnecessary dereferences. At the same time, converted them from macros
to inline functions.
address) so that the first 16MB of physical memory is allocated
last rather than first. On large-memory machines, this avoids
the exhaustion of low physical memory before isa_dmainit has run.
block (VM_WAIT) holding the map lock. This is bad. For example, a subsequent
kmem_malloc by an interrupt handler on the same map may find the lock held
and panic in the lockmgr.
In general, vm_map_simplify_entry should be performed INSIDE
the loop that traverses the map, not outside. (Changed:
vm_map_inherit, vm_map_pageable.)
vm_fault_unwire doesn't acquire the map lock (or block holding
it). Thus, vm_map_set/clear_recursive shouldn't be called.
(Changed: vm_map_user_pageable, vm_map_pageable.)
The old VN device broke in -4.x when the definition of B_PAGING
changed. This patch fixes this plus implements additional capabilities.
The new VN device can be backed by a file ( as per normal ), or it can
be directly backed by swap.
Due to dependencies in VM include files (on opt_xxx options) the new
vn device cannot be a module yet. This will be fixed in a later commit.
This commit delimitted by tags {PRE,POST}_MATT_VNDEV
1. The size of vm_object::memq is vm_object::resident_page_count,
not vm_object::size.
2. The "size > 4" test sometimes results in the traversal of a ~1000 page
memq in order to locate ~10 pages.
lock) until it actually needs to modify the vm_map.
Note: it is legal to modify vm_map::hint without holding a write lock.
Submitted by: "Richard Seaman, Jr." <dick@tar.com> with minor changes
by myself.
the read lock around the subyte operations in mincore. After the lock is
reacquired, use the map's timestamp to determine if we need to restart
the scan.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
To prevent a deadlock, if we are extremely low on memory, force synchronous
operation by the VOP_PUTPAGES in vnode_pager_putpages.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).
Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson
is the preparation step for moving pmap storage out of vmspace proper.
Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>
be in progress at any given moment.
Add two swap tuneables to sysctl:
vm.swap_async_max: 4
vm.swap_cluster_max: 16
Recommended values are a cluster size of 8 or 16 pages. async_max is
about right for 1-4 swap devices. Reduce to 2 if swap is eating too much
bandwidth, or even 1 if swap is both eating too much bandwidth and sitting
on a slow network (10BaseT).
The defaults work well across a broad range of configurations and should
normally be left alone.
Unlock vnode before messing with map to avoid deadlock between map and
vnode ( e.g. with exec_map and underlying program binary vnode ). Solves
a deadlock that most often occurs during a large -j# buildworld reported
by three people.
been made but the code has been reorganized and documented to make
it more readable, reduce the size of the code, and optimize the branch
path caching capabilities that most modern processors have.
free swap space out from under a busy page. This is not legal because
the swap may be reallocated and I/O issued while I/O is still in
progress on the same swap page from the madvise()'d object. This bug
could only occur under extreme paging conditions but might not cause
an error until much later. As a side-benefit, madvise() is now even
smaller.
possible without actually unmapping it from the process.
As of now, I declare madvise() on OBJT_DEFAULT/OBJT_SWAP objects to be
'working and complete'.
OBJ_ONEMAPPING in the case where an object is extended by an
additional vm_map_entry must be allocated.
In vm_object_madvise(), remove calll to vm_page_cache() in MADV_FREE
case in order to avoid a page fault on page reuse. However, we still
mark the page as clean and destroy any swap backing store.
Submitted by: Alan Cox <alc@cs.rice.edu>
because there was a concensus on current in regards to leaving bss r+w+x
instead of r+w. This is in order to maintain reasonable compatibility
with existing JIT compilers (e.g. kaffe) and possibly other programs.
no major operational changes were made. The three core object->memq loops
were moved into a single inline procedure and various operational
characteristics of the collapse function were documented.
PQ_FREE. There is little operational difference other then the kernel
being a few kilobytes smaller and the code being more readable.
* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)
PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guarenteed to remain
tracked throughout a page's life ( if it isn't already ).
When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.
Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.
rather then VM_PROT_ALL. obreak, on the otherhand, uses VM_PROT_ALL.
This prevents vm_map_insert() from being able to coalesce the heap and
creates an extra map entry. Since current architectures ignore
VM_PROT_EXECUTE anyway, and since not having VM_PROT_EXECUTE on data/bss
may provide protection in the future, obreak now uses read+write rather
then all (r+w+x).
This is an optimization, not a bug fix.
Submitted by: Alan Cox <alc@cs.rice.edu>
Since paging is in progress, page scan in vm_page_qcollapse() must be
protected at atleast splbio() to prevent pages from being ripped out from
under the scan.
The vm_map_insert()/vm_object_coalesce() optimization has been extended
to include OBJT_SWAP objects as well as OBJT_DEFAULT objects. This is
possible because it costs nothing to extend an OBJT_SWAP object with
the new swapper. We can't do this with the old swapper. The old swapper
used a linear array that would have had to have been reallocated, costing
time as well as a potential low-memory deadlock.
in vm_map_simplify_entry. Basically, once you've verified that
the objects in the adjacent vm_map_entry's are the same, either
NULL or the same vm_object, there's no point in checking that the
objects have the same behavior.
Obtained from: Alan Cox <alc@cs.rice.edu>
Checked by: "Richard Seaman, Jr." <dick@tar.com>
Fix the following problem:
As the code stands now, growing any stack, and not just the process's
main stack, modifies vm->vm_ssize. This is inconsistent with the code
earlier in the same procedure.
This changes the definitions of a few items so that structures are the
same whether or not the option itself is enabled. This allows
people to enable and disable the option without recompilng the world.
As the author says:
|I ran into a problem pulling out the VM_STACK option. I was aware of this
|when I first did the work, but then forgot about it. The VM_STACK stuff
|has some code changes in the i386 branch. There need to be corresponding
|changes in the alpha branch before it can come out completely.
what is done:
|
|1) Pull the VM_STACK option out of the header files it appears in. This
|really shouldn't affect anything that executes with or without the rest
|of the VM_STACK patches. The vm_map_entry will then always have one
|extra element (avail_ssize). It just won't be used if the VM_STACK
|option is not turned on.
|
|I've also pulled the option out of vm_map.c. This shouldn't harm anything,
|since the routines that are enabled as a result are not called unless
|the VM_STACK option is enabled elsewhere.
|
|2) Add what appears to be appropriate code the the alpha branch, still
|protected behind the VM_STACK switch. I don't have an alpha machine,
|so we would need to get some testers with alpha machines to try it out.
|
|Once there is some testing, we can consider making the change permanent
|for both i386 and alpha.
|
[..]
|
|Once the alpha code is adequately tested, we can pull VM_STACK out
|everywhere.
|
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
vm_page_rename(), but never pulled the page off PQ_CACHE if it was on
PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was
added in -4.x to test for this... and got hit.
In -4.x, vm_page_rename() automatically dirties the page. This commit
also has it deal with the PQ_CACHE case, deactivating the page in that
case.
values. The 'int' return value for the procedure was never used and
not well defined in any case when there are mixed errors on pages, so
it has been removed. vm_pager_put_pages() and associated vm_pager
functions now return void.
swap blocks are now in PAGE_SIZE'd increments instead of DEV_BSIZE'd
increments. We still convert to DEV_BSIZE'd increments for the
backing store I/O, but everything else is in PAGE_SIZE increments.
vm_pager.h
Added argument to getpbuf() and relpbuf() to allow each subsystem to
specify a different hard limit on the number of simultanious physical
bufferes that said subsystem may allocate. Without this feature, one
subsystem ( e.g. the vfs clustering code ) could hog *ALL* the pbufs,
causing a deadlock in the pager in a low memory situation.
Same for trypbuf().
Removed call to vm_object_collapse(), which can block. This was being
called without the pageout code holding any sort of reference on the
vm_object or vm_page_t structures being manipulated. Since this code
can block, it was possible for other kernel code to shred the state
the pageout code was assuming remained intact.
Fixed potential blocking condition in vm_pageout_page_free() ( which
could cause a deadlock in a low-memory situation ).
Currently there is a hack in-place to deal with clean filesystem meta-data
polluting the inactive page queue. John doesn't like the hack, and neither
do I.
Revamped and commented a portion of the pageout loop.
Added protection against potential memory deadlocks with OBJT_VNODE
when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL
which causes VOP_ISLOCKED() to return a less informed answer.
remove vm_pager_sync() -- none of the pagers use it any more ( the old
swapper used to. The new one does not ).
reducing the size of vm_page_t.
SWAPBLK_NONE and SWAPBLK_MASK are defined here. These actually are
more generalized then their names imply, but their placement is somewhat
of a legacy issue from a prior test version of this code that put
the swapblk in the vm_page_t structure. That test code was eventually
thrown away. The legacy remains.
Added vm_page_flash() inline. Similar to vm_page_wakeup() except that
it does not clear PG_BUSY ( one assumes that PG_BUSY is already clear ).
Used by a number of routines to wakeup waiters.
Collapsed some of the code in inline calls to make other inline calls.
GCC will optimize this well and it reduces duplication.
vm_page_free() and vm_page_free_zero() inlines added to convert to
the proper vm_page_free_toq() call.
vm_page_sleep_busy() inline added, replacing vm_page_sleep() ( which has
been removed ). This implements a much more optimizable page-waiting
function.
pointers per entry ). The table has been changed to a singly linked
list of vm_page_t pointers. The table has been doubled in size, but
the entries only take half the space so a net-zero change in memory use.
The hash function has been changed, hopefully for the better. The
combination of the larger hash table size of changed function should
keep the chain length down to a reasonable number (0-3, average 1).
vm_object->page_hint has been removed. This 'optimization' was not
only never needed, but costs as much as a hash chain link to implement.
While having page_hint in vm_object might result in better locality
of reference, the cost is not worth the space in vm_object or the
extra instructions in my view.
vm_page_alloc*() functions have been inlined and call a generalized
non-inlined vm_page_alloc_toq() which combines the standard alloc
and zero-page alloc functions together, reducing code size and the L1
cache footprint. Some reordering has been done... not much. The
delinking code should be faster ( because unlinking a doubly-linked list
requires four memory ops and unlinking a singly linked list only requires
two ), and we get a hash consistancy check for free.
vm_page_rename() now automatically sets the page's dirty bits.
vm_page_alloc() does not try to manually inline freeing a cache page.
Instead, it now properly calls vm_page_free(m) ... vm_page_free() is
really too complex to manually inline.
vm_await(), supporting asleep(), has been added.
of most of the swap-pager-specific fields, the removal of the id,
and the removal of paging_offset.
A new inline, vm_object_pip_wakeupn() has been added to subtract an
arbitrary number n from the paging_in_progress count and then wakeup
waiters as necessary. n may be 0, resulting in a 'flash'.
object->paging_offset has been removed - it was used to optimize a
single OBJT_SWAP collapse case yet introduced massive confusion throughout
vm_object.c. The optimization was inconsequential except for the
claim that it didn't have to allocate any memory. The optimization
has been removed.
madvise() has been fixed. The old madvise() could be made to operate
on shared objects which is a big no-no. The new one is much more careful
in what it modifies. MADV_FREE was totally broken and has now been fixed.
vm_page_rename() now automatically dirties a page, so explicit dirtying
of the page prior to calling vm_page_rename() has been removed.
about conversions of objects to OBJT_SWAP, it is done automatically
now.
Replaced manually inserted code with inline calls for busy waiting on
pages, which also incidently fixes a potential PG_BUSY race due to
the code not running at splvm().
vm_objects no longer have a paging_offset field ( see vm/vm_object.c )
instead to properly handle any waiters.
Added comments, added support for M_ASLEEP. Generally treat M_ flags
as flags instead of constants to compare against.
and the swap_pager has been completely replaced.
The new swap pager uses the new blist radix-tree based bitmap allocator
for low level swap allocation and deallocation. The new allocator
is effectively O(5) while the old one was O(N), and the new allocator
allocates all required memory at init time rather then at allocate
memory on the fly at run time.
Swap metadata is allocated in clusters and stored in a hash table,
eliminating linearly allocated structures.
Many, many features have been rewritten or added. Swap space is now
reallocated on the fly providing a poor-mans auto defragmentation of
swap space. Swap space that is no longer needed is freed on a timely
basis so no garbage collection is necessary.
Swap I/O is marked B_ASYNC and NFS has been fixed to do the right
thing with it, so NFS-based paging now has around 10x the performance
as it did before ( previously NFS enforced synchronous I/O for paging ).
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
shared signal handling when there is shared signal handling being
used.
This removes the main objection to making the shared signal handling
a standard ability in rfork() and friends and 'unconditionalising'
this code. (i.e. the allocation of an extra 328 bytes per process).
Signal handling information remains in the U area until such a time as
it's reference count would be incremented to > 1. At that point a new
struct is malloc'd and maintained in KVM so that it can be shared between
the processes (threads) using it.
A function to check the reference count and move the struct back to the U
area when it drops back to 1 is also supplied. Signal information is
therefore now swapable for all processes that are not sharing that
information with other processes. THis should addres the concerns raised
by Garrett and others.
Submitted by: "Richard Seaman, Jr." <dick@tar.com>
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.
The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.
Submitted by: Richard Seaman <dick@tar.com>
"dying daemons" problem. (I thought this code was introduced in rev.1.80,
but it just relaxed the condition.)
Also, kill related "suggest more swap space" warning (also introduced in
1.80). It was confusing, to say the least...
Requested by: msmith
Not objected by: dg
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)
Code to allow Linux Threads to run under FreeBSD.
By default not enabled
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garret)
This is not yet a 'real' option but will be within some number of hours.
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>
almost always causes this panic for the curproc != pageproc case.
This case apparently doesn't happen in normal operation, but it
happens when vm_page_alloc_contig() is called when there is a memory
hogging application that hasn't already been paged out.
PR: 8632
Reviewed by: info@opensound.com (Dev Mazumdar), dg
Broken in: rev.1.89 (1998/02/23)
truncated to 32 bits.
* Change the calling convention of the device mmap entry point to
pass a vm_offset_t instead of an int for the offset allowing
devices with a larger memory map than (1<<32) to be supported
on the alpha (/dev/mem is one such).
These changes are required to allow the X server to mmap the various
I/O regions used for device port and memory access on the alpha.
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.
when bdevsw[] became sparse. We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.
Removed a redundant `major(dev) < nblkdev' test instead of updating it.
Don't follow a garbage bdevsw pointer for attempts to swap on empty
regular files. This case currently can't happen. Swapping on regular
files is ifdefed out in swapon() and isn't attempted for empty files
in nfs_mountroot().
needs to be called prior to freeing remaining pages in the object so that
the device pager has an opportunity to grab its "fake" pages. Also, in
the case of wired pages, the page must be made busy prior to calling
vm_page_remove. This is a difference from 2.2.x that I overlooked when
I brought these changes forward.
legitimately wired pages. Currently we print a diagnostic when this
happens, but this will be removed soon when it will be common for this
to occur with zero-copy TCP/IP buffers.
1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.
simple-lock.
The reviewer raises the following caveat: "I believe these changes
open a non-critical race condition when adding memory to the pool
for the zone. I think what will happen is that you could have two
threads that are simultaneously adding additional memory when the
pool runs out. This appears to not be a problem, however, since
the re-aquisition of the lock will protect the list pointers."
The submitter agrees that the race is non-critical, and points out
that it already existed for the non-SMP case. He suggests that
perhaps a sleep lock (using the lock manager) should be used to
close that race. This might be worth revisiting after 3.0 is
released.
Reviewed by: dg (David Greenman)
Submitted by: tegge (Tor Egge)
expected. This bug caused builds of Modula-3 to fail in mysterious
ways on SMP kernels. More precisely, such builds failed on systems
with kern.fast_vfork equal to 0, the default and only supported
value for SMP kernels.
PR: kern/7468
Submitted by: tegge (Tor Egge)
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.
NetBSD uses strcmp() to avoid this ugly global.
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.
Reviewed by: Bruce Evans <bde@zeta.org.au>
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.
With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
code still left in there. The macros it describes disapeared some-
time since 4.4BSD lite.
PR: 7246
Reviewed by: phk
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).
Use slightly less bogus casts for passing pointers to ddb command
functions.
There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
printf() of "Out of mbuf clusters - adjust NMBCLUSTERS or increase
maxusers" so that the message is more informative and so that it will
appear in the kernel message buffer.
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)
the page offset. If a large file offset was passed in, a large negative
array index could be generated which could cause page faults etc at worst
and file corruption at the least. (Pages are allocated within file
space on page alignment boundaries, so a file offset being passed in here
is harmless to DTRT. The case where this was happening has already been
fixed though, this is in case it happens again).
Reviewed by: dyson
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1) When freeing pages, it is a good idea to protect them off.
(This is probably gratuitious, but good form.)
2) Allow collapsing pages in the backing object that are
PQ_CACHE. This will improve memory utilization.
3) Correct the collapse code so that pages that were on the
cache queue are moved to the inactive queue. This is
done when pages are marked dirty (so that those pages
will be properly paged out instead of freed), so that
cached pages will not be paradoxically marked dirty.
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
put alot of it's context into a data structure. This allows
significant shortening of its codepath, and will significantly
decrease it's cache footprint.
Also, add some stats to vmmeter. Note that you'll have to
rebuild/recompile vmstat, systat, etc... Otherwise, you'll
get "very interesting" paging stats.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappyness."
These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown
for local media FS's.
See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation
details for generic *_{get|put}pages for local media FS's. Support
is trivial to add for any FS that formerly relied on the default
behaviour of the vnode_pager in in EOPNOTSUPP cases (just copy the
ffs_getpages() code for the FS in question's *_{get|put}pages).
Obviously, it would be better if each local media FS implemented a
more optimal method, instead of calling an exported interface from
the /sys/vm/vnode_pager.c, but this is a necessary first step in
getting the FS's to a point where they can be supplied with better
implementations on a case-by-case basis.
Obviously, the cd9660_putpages() can be rather trivial (since it
is a read-only FS type 8-)).
A slight (temporary) modification is made to print a diagnostic message
in the case where the underlying filesystem attempts to engage in the
previous behaviour. Failure is likely to be ungraceful.
Submitted by: terry@freebsd.org (Terry Lambert)
improve tuning on larger systems. (A couple of the VM tuning params for
small systems were so badly chosen that the system could hang under load.)
The broken tuning was originaly my fault.
have declined due to code-rot over time. The swap pager rundown code
has been clean-up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overheadla.)
in a way identically as before.) I had problems with the system properly
handling the number of vnodes when there is alot of system memory, and the
default VM_KMEM_SIZE. Two new options "VM_KMEM_SIZE_SCALE" and
"VM_KMEM_SIZE_MAX" have been added to support better auto-sizing for systems
with greater than 128MB.
Add some accouting for vm_zone memory allocations, and provide properly
for vm_zone allocations out of the kmem_map. Also move the vm_zone
allocation stats to the VM OID tree from the KERN OID tree.
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
This introduce an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)
LFS is temporarily disabled, and will be re-enabled tomorrow.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.
The system should be much more stable, but all subtile bugs aren't fixed yet.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.