swap blocks are now in PAGE_SIZE'd increments instead of DEV_BSIZE'd
increments. We still convert to DEV_BSIZE'd increments for the
backing store I/O, but everything else is in PAGE_SIZE increments.
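A minimal sketch of the unit conversion described above. The constant values (512-byte device blocks, 4096-byte pages) and all sketch_* names are assumptions for illustration, not taken from the commit.

```c
#include <stdint.h>

/* Assumed sizes for illustration only; real values are platform defined. */
#define SKETCH_DEV_BSIZE        512u
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_BLOCKS_PER_PAGE  (SKETCH_PAGE_SIZE / SKETCH_DEV_BSIZE)

/* Convert a swap block index (page-sized units) to a device block number. */
uint64_t
sketch_swapblk_to_devblk(uint64_t swapblk)
{
	return (swapblk * SKETCH_BLOCKS_PER_PAGE);
}

/* Convert a device block number back to the swap block that contains it. */
uint64_t
sketch_devblk_to_swapblk(uint64_t devblk)
{
	return (devblk / SKETCH_BLOCKS_PER_PAGE);
}
```

Only the I/O path would need the first conversion; everything else can stay in page-sized units.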
vm_pager.h
Added argument to getpbuf() and relpbuf() to allow each subsystem to
specify a different hard limit on the number of simultaneous physical
buffers that said subsystem may allocate. Without this feature, one
subsystem ( e.g. the vfs clustering code ) could hog *ALL* the pbufs,
causing a deadlock in the pager in a low memory situation.
Same for trypbuf().
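The per-subsystem limit can be pictured as a quota counter that the caller passes in by pointer. The sketch below is a user-space model under that assumption; the pool size, the sketch_* names, and the non-blocking behaviour are hypothetical and do not reproduce the kernel routines.

```c
#include <stddef.h>

struct sketch_pbuf {
	int in_use;
};

#define SKETCH_NPBUF 8
static struct sketch_pbuf sketch_pool[SKETCH_NPBUF];

/*
 * Try to take a buffer from the shared pool.  *pfreecnt is the calling
 * subsystem's remaining quota; NULL means "no per-subsystem limit".
 * Returns NULL if the quota or the pool is exhausted.
 */
struct sketch_pbuf *
sketch_trypbuf(int *pfreecnt)
{
	int i;

	if (pfreecnt != NULL && *pfreecnt <= 0)
		return (NULL);
	for (i = 0; i < SKETCH_NPBUF; i++) {
		if (!sketch_pool[i].in_use) {
			sketch_pool[i].in_use = 1;
			if (pfreecnt != NULL)
				--*pfreecnt;
			return (&sketch_pool[i]);
		}
	}
	return (NULL);
}

/* Return a buffer to the pool and give the quota slot back. */
void
sketch_relpbuf(struct sketch_pbuf *bp, int *pfreecnt)
{
	bp->in_use = 0;
	if (pfreecnt != NULL)
		++*pfreecnt;
}
```

With a quota of 2, a third request from the same subsystem fails even though the pool still has free buffers, which leaves the remainder for other users such as the pager.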
Removed call to vm_object_collapse(), which can block. This was being
called without the pageout code holding any sort of reference on the
vm_object or vm_page_t structures being manipulated. Since this code
can block, it was possible for other kernel code to shred the state
the pageout code was assuming remained intact.
Fixed potential blocking condition in vm_pageout_page_free() ( which
could cause a deadlock in a low-memory situation ).
Currently there is a hack in place to deal with clean filesystem meta-data
polluting the inactive page queue. John doesn't like the hack, and neither
do I.
Revamped and commented a portion of the pageout loop.
Added protection against potential memory deadlocks with OBJT_VNODE
when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL
which causes VOP_ISLOCKED() to return a less informed answer.
remove vm_pager_sync() -- none of the pagers use it any more ( the old
swapper used to. The new one does not ).
reducing the size of vm_page_t.
SWAPBLK_NONE and SWAPBLK_MASK are defined here. These actually are
more generalized than their names imply, but their placement is somewhat
of a legacy issue from a prior test version of this code that put
the swapblk in the vm_page_t structure. That test code was eventually
thrown away. The legacy remains.
Added vm_page_flash() inline. Similar to vm_page_wakeup() except that
it does not clear PG_BUSY ( one assumes that PG_BUSY is already clear ).
Used by a number of routines to wakeup waiters.
Collapsed some of the code in inline calls to make other inline calls.
GCC will optimize this well and it reduces duplication.
vm_page_free() and vm_page_free_zero() inlines added to convert to
the proper vm_page_free_toq() call.
vm_page_sleep_busy() inline added, replacing vm_page_sleep() ( which has
been removed ). This implements a much more optimizable page-waiting
function.
pointers per entry ). The table has been changed to a singly linked
list of vm_page_t pointers. The table has been doubled in size, but
the entries only take half the space so a net-zero change in memory use.
The hash function has been changed, hopefully for the better. The
combination of the larger hash table size and changed function should
keep the chain length down to a reasonable number (0-3, average 1).
vm_object->page_hint has been removed. This 'optimization' was not
only never needed, but costs as much as a hash chain link to implement.
While having page_hint in vm_object might result in better locality
of reference, the cost is not worth the space in vm_object or the
extra instructions in my view.
vm_page_alloc*() functions have been inlined and call a generalized
non-inlined vm_page_alloc_toq() which combines the standard alloc
and zero-page alloc functions together, reducing code size and the L1
cache footprint. Some reordering has been done... not much. The
delinking code should be faster ( because unlinking a doubly-linked list
requires four memory ops and unlinking a singly linked list only requires
two ), and we get a hash consistency check for free.
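A minimal sketch of unlinking from a singly linked hash chain, showing where the consistency check comes from: the chain must be walked to find the predecessor, so a missing entry is detected along the way. The structure layout, table size, and hash function here are hypothetical.

```c
#include <stddef.h>

struct sketch_page {
	struct sketch_page *hnext;	/* next page on the same hash chain */
	unsigned long key;		/* stands in for (object, pindex) */
};

#define SKETCH_HASH_SIZE 16		/* power of two, assumed */
static struct sketch_page *sketch_buckets[SKETCH_HASH_SIZE];

static unsigned
sketch_hash(unsigned long key)
{
	return ((unsigned)(key & (SKETCH_HASH_SIZE - 1)));
}

/* Insert at the head of the chain. */
void
sketch_insert(struct sketch_page *m)
{
	struct sketch_page **bucket = &sketch_buckets[sketch_hash(m->key)];

	m->hnext = *bucket;
	*bucket = m;
}

/*
 * Remove m from its chain.  Returns 1 on success, 0 if m was not found
 * on the chain it hashes to (a hash consistency failure).
 */
int
sketch_remove(struct sketch_page *m)
{
	struct sketch_page **pp = &sketch_buckets[sketch_hash(m->key)];

	while (*pp != NULL && *pp != m)
		pp = &(*pp)->hnext;
	if (*pp == NULL)
		return (0);
	*pp = m->hnext;		/* one store to unlink */
	m->hnext = NULL;
	return (1);
}
```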
vm_page_rename() now automatically sets the page's dirty bits.
vm_page_alloc() does not try to manually inline freeing a cache page.
Instead, it now properly calls vm_page_free(m) ... vm_page_free() is
really too complex to manually inline.
vm_await(), supporting asleep(), has been added.
of most of the swap-pager-specific fields, the removal of the id,
and the removal of paging_offset.
A new inline, vm_object_pip_wakeupn() has been added to subtract an
arbitrary number n from the paging_in_progress count and then wakeup
waiters as necessary. n may be 0, resulting in a 'flash'.
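A user-space sketch of the behaviour described for vm_object_pip_wakeupn(). The structure, the want_wakeup field, and the wakeup condition (count reaches zero while someone is waiting) are assumptions; the wakeups counter stands in for an actual wakeup() call.

```c
struct sketch_object {
	int paging_in_progress;
	int want_wakeup;	/* a waiter has asked to be woken */
	int wakeups;		/* counts simulated wakeup() calls */
};

/*
 * Subtract n from the paging-in-progress count and wake waiters if the
 * object has gone idle.  n may be 0, which only re-checks the condition
 * (the 'flash' case).
 */
void
sketch_pip_wakeupn(struct sketch_object *obj, int n)
{
	obj->paging_in_progress -= n;
	if (obj->want_wakeup && obj->paging_in_progress == 0) {
		obj->want_wakeup = 0;
		obj->wakeups++;
	}
}
```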
object->paging_offset has been removed - it was used to optimize a
single OBJT_SWAP collapse case yet introduced massive confusion throughout
vm_object.c. The optimization was inconsequential except for the
claim that it didn't have to allocate any memory. The optimization
has been removed.
madvise() has been fixed. The old madvise() could be made to operate
on shared objects which is a big no-no. The new one is much more careful
in what it modifies. MADV_FREE was totally broken and has now been fixed.
vm_page_rename() now automatically dirties a page, so explicit dirtying
of the page prior to calling vm_page_rename() has been removed.
about conversions of objects to OBJT_SWAP, it is done automatically
now.
Replaced manually inserted code with inline calls for busy waiting on
pages, which also incidentally fixes a potential PG_BUSY race due to
the code not running at splvm().
vm_objects no longer have a paging_offset field ( see vm/vm_object.c )
instead to properly handle any waiters.
Added comments, added support for M_ASLEEP. Generally treat M_ flags
as flags instead of constants to compare against.
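Why the M_ values need to be tested as flags can be shown with a small sketch. The numeric values below are hypothetical and chosen only for illustration; the point is that an equality test stops matching once a second flag such as M_ASLEEP is OR'd in.

```c
/* Hypothetical values, for illustration only. */
#define SKETCH_M_WAITOK	0x0000
#define SKETCH_M_NOWAIT	0x0001
#define SKETCH_M_ASLEEP	0x0004

/* Old style: compares against a constant, breaks with combined flags. */
int
sketch_may_block_old(int flags)
{
	return (flags != SKETCH_M_NOWAIT);
}

/* New style: tests the bit, so extra flags do not change the answer. */
int
sketch_may_block(int flags)
{
	return ((flags & SKETCH_M_NOWAIT) == 0);
}
```

With SKETCH_M_NOWAIT | SKETCH_M_ASLEEP the old test wrongly reports that the caller may block, while the bit test still reports that it may not.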
and the swap_pager has been completely replaced.
The new swap pager uses the new blist radix-tree based bitmap allocator
for low level swap allocation and deallocation. The new allocator
is effectively O(5) while the old one was O(N), and the new allocator
allocates all required memory at init time rather than allocating
memory on the fly at run time.
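The idea of a bitmap allocator with summary information, and with all storage fixed at init time, can be sketched as follows. This is a greatly simplified two-level model, not the blist code: the real allocator is a radix tree with more levels and supports multi-block allocations. All names and sizes are hypothetical.

```c
#include <stdint.h>

#define SKETCH_LEAVES	8
#define SKETCH_BITS	32
#define SKETCH_NONE	(-1)

struct sketch_blist {
	uint32_t leaf[SKETCH_LEAVES];	/* bit set = block is free */
	uint32_t hint;			/* bit i set = leaf i has free blocks */
};

/* Mark every block free.  No further memory is ever allocated. */
void
sketch_blist_init(struct sketch_blist *bl)
{
	int i;

	for (i = 0; i < SKETCH_LEAVES; i++)
		bl->leaf[i] = 0xffffffffu;
	bl->hint = (1u << SKETCH_LEAVES) - 1;
}

/* Allocate the lowest free block, or return SKETCH_NONE. */
int
sketch_blist_alloc(struct sketch_blist *bl)
{
	int i, b;

	for (i = 0; i < SKETCH_LEAVES; i++) {
		if ((bl->hint & (1u << i)) == 0)
			continue;	/* summary says leaf is full: skip */
		for (b = 0; b < SKETCH_BITS; b++) {
			if (bl->leaf[i] & (1u << b)) {
				bl->leaf[i] &= ~(1u << b);
				if (bl->leaf[i] == 0)
					bl->hint &= ~(1u << i);
				return (i * SKETCH_BITS + b);
			}
		}
	}
	return (SKETCH_NONE);
}

/* Free a previously allocated block. */
void
sketch_blist_free(struct sketch_blist *bl, int blk)
{
	int i = blk / SKETCH_BITS;
	int b = blk % SKETCH_BITS;

	bl->leaf[i] |= 1u << b;
	bl->hint |= 1u << i;
}
```

The summary word lets the search skip full leaves without scanning them, which is the property a deeper radix tree generalizes.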
Swap metadata is allocated in clusters and stored in a hash table,
eliminating linearly allocated structures.
Many, many features have been rewritten or added. Swap space is now
reallocated on the fly providing a poor man's auto defragmentation of
swap space. Swap space that is no longer needed is freed on a timely
basis so no garbage collection is necessary.
Swap I/O is marked B_ASYNC and NFS has been fixed to do the right
thing with it, so NFS-based paging now has around 10x the performance
it had before ( previously NFS enforced synchronous I/O for paging ).
B_DELWRI and B_CACHE flags, fixing a bug that showed up with NFS.
Also, in a number of cases manually inserted code has been removed
and replaced with an inline function call, giving us better functional
isolation in the source.
descriptor-passing messages was calling sorflush() without checking
to see if the descriptor was actually a socket. This can cause a
crash by exiting programs that use the mechanism under certain
circumstances.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough with regard to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
Change from lkm to kld
Add field plexsdno to sd struct
Add flag VF_NEWBORN to drive, sd, plex and volume structs, indicating
that the object has just been created.
Add object types for raw (unattached) plexes and subdisks
Remove definitions of VOLNO, PLEXNO and SDNO (now functions Volno,
Plexno and Sdno)
Move revive parameters from struct plex to struct sd.
struct plex:
maintain a count of the number of inaccessible subdisks.
remove defective and unmapped regions.
Debug flags: make an enum (previously #define)
Set default revive block size to 64kB (was 32 kB)
Previously, accidentally starting the wrong version could corrupt
the RAID5 configuration.
Add functions Volno, Plexno and Sdno to replace the old defines
VOLNO, PLEXNO and SDNO.
Change from lkm to kld
Serious rewrite. No longer call set_<foo>_state to set the state
based only on other objects; instead, add functions
update_<foo>_state, which determine what the state should be by
themselves. This allows the set_<foo>_state functions to shrink
enough to be almost intelligible.
Remove flags setstate_recurse and setstate_recursing.
Remove plex defective regions and unmapped regions, which were
maintained but not used.
Change code to allow daemon to perform operations formerly kludged
into an interrupt context. Remove the DIRTYCONFIG kludge.
Change from lkm to kld
Remove #ifdefs for FreeBSD 2.x
vinumstrategy:
Support anonymous (`raw') subdisks and plexes.
Change code to allow daemon to perform operations formerly kludged
into an interrupt context. Remove the DIRTYCONFIG kludge.
No longer set B_ORDERED for reviving subdisks. I suspect this
wouldn't work correctly, and it should be done in a different manner
in vinumrevive.c
sdio: set subdisk state correctly on error
start to remove code that doesn't make any sense any more.
Remove #ifdefs for FreeBSD 2.x
Change from lkm to kld
correct type of `flags' in calls to set_drive_state.
set_drive_parms: handle anonymous drives correctly (remove them)
drive VOP functions: use the PID of the original opener to fool the
lock manager.
open_drive: be quiet about failures (they're normal when scanning the
partitions).
close_drive: lock drive before closing.
remove_drive: lock drive before deallocating.
read_drive_label: set drive up when all is OK
check_drive:
Complete rewrite. Offload most of the code to the new
vinum_scandisk
format_config:
use snprintf and %qd options to make it much less emetic.
Remove old supporting functions.
vinum_scandisk:
Moved here from vinum.c
Almost complete rewrite, incorporating much of what was check_drive.
We still don't have a general way to find the drives on a system, so
get the user to supply the names via the `read' command. For each
device, try each possible compatibility slice name (there's a danger
of finding both /dev/da1h and /dev/da0s1h otherwise). Sort the
partitions found in reverse order of last update time and read them
in, setting the `update' parameter to parse_config and descendants.
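The ordering step can be sketched with qsort(). This reads "reverse order of last update time" as most recently updated first, which is an assumption; the structure and field names are hypothetical and the real drive structure is not reproduced.

```c
#include <stdlib.h>

struct sketch_drive {
	const char *devname;	/* e.g. "/dev/da1h" */
	long last_update;	/* time of last config update, in seconds */
};

/* Comparator: larger (newer) update times sort first. */
static int
sketch_cmp_newest_first(const void *a, const void *b)
{
	const struct sketch_drive *da = a;
	const struct sketch_drive *db = b;

	if (da->last_update > db->last_update)
		return (-1);
	if (da->last_update < db->last_update)
		return (1);
	return (0);
}

/* Sort the candidate drives so the newest configuration is read first. */
void
sketch_sort_drives(struct sketch_drive *drives, size_t n)
{
	qsort(drives, n, sizeof(drives[0]), sketch_cmp_newest_first);
}
```

Reading the newest configuration first means later, older copies only need to be consulted for objects that are not yet known, which fits the `update' parameter described above.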
save_config: rename to daemon_save_config, since the function is now
called by the daemon. Create a new function save_config which queues
the request with the daemon.
daemon_save_config: some mods to allow for the unfamiliar
environment.
Change from lkm to kld
Remove BROKEN_GDB kludge (it's not needed with klds)
Add code for interfacing with daemon
Modify device minor number encoding, use selector functions which also
permit anonymous plexes and subdisks.
Remove code for 2.x support.
Change messages to omit obvious words like 'plex' and 'subdisk'.
give_plex_to_volume: invalidate subdisks being given to a plex which
is part of a volume with other plexes.
give_sd_to_plex: keep track of plex size in all cases
lock drives before closing them, to keep the daemon from getting
confused.
config_drive: handle partition type errors more gracefully
config_subdisk: set subdisk state correctly
find_drive, find_drive_by_dev, find_subdisk, find_plex, find_volume:
set VF_NEWBORN flag when a new object is created
config_drive:
Handle partition_status returns more cleverly.
Replace the device name in some cases where it got overwritten.
config_subdisk:
add parameter `update'. If the object already exists, exit without
any changes.
Set state correctly.
config_plex, config_volume:
add parameter `update'. If the object already exists, exit without
any changes.
parse_config:
move read function to vinum_scandisk.
add parameter `update' to pass to config_<object>.
remove_<object>_entry:
print a message when the object is removed.
update_plex_config:
Start defusing this function, which will go away some time.
Remove calls to update_volume_config.
Make size 64 bits
Change from lkm to kld
Remove BROKEN_GDB kludge (it's not needed with klds)
Add code for interfacing with daemon
Modify manner of determining when module is idle
Modify device minor number encoding, use selector functions which also
permit anonymous plexes and subdisks.
Remove code for 2.x support.
Move vinum_scandisk to vinumio.c
Remove myproc kludge
Keep track of open volumes by flag, not by pid (the pids caused some
problems with the lock manager).
free_vinum:
Remove unmapped and defective regions from plexes.
Wait for daemon to stop before returning
vinumopen:
Don't refuse an open if the volume is already open.