Remove the vnode and dev_t fields and replace them with a void *.
Introduce separate strategy functions for devices and regular (NFS)
vnodes.
For devices we don't need the vnode v_numoutput stuff.
Add a generic swaponsomething() function to add a swap device and
split the remainder of swaponvp() into swaponvp() and swapondev(),
both of which call this backend.
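A minimal userland sketch of the resulting shape, using hypothetical
names (struct swdev, sw_id, add_swap_backend, dev_strategy,
vn_strategy); it models only the dispatch pattern, not the kernel's
actual interfaces.

    /*
     * Hedged sketch: per-device strategy dispatch.  Every identifier
     * here is an illustrative stand-in, not the kernel's.
     */
    #include <stdio.h>

    struct swdev {
        void    *sw_id;                         /* device or vnode cookie */
        void    (*sw_strategy)(struct swdev *); /* chosen I/O path */
    };

    /* Device path: no vnode v_numoutput bookkeeping needed. */
    static void
    dev_strategy(struct swdev *sp)
    {
        printf("I/O via device cookie %p\n", sp->sw_id);
    }

    /* Vnode (e.g. NFS) path: this is where the vnode accounting lives. */
    static void
    vn_strategy(struct swdev *sp)
    {
        printf("I/O via vnode cookie %p\n", sp->sw_id);
    }

    /* Common backend, analogous to swaponsomething(): record id + strategy. */
    static struct swdev
    add_swap_backend(void *id, void (*strategy)(struct swdev *))
    {
        struct swdev sd = { .sw_id = id, .sw_strategy = strategy };
        return (sd);
    }

    int
    main(void)
    {
        int dev_cookie, vnode_cookie;
        struct swdev a = add_swap_backend(&dev_cookie, dev_strategy);   /* swapondev() */
        struct swdev b = add_swap_backend(&vnode_cookie, vn_strategy);  /* swaponvp()  */

        a.sw_strategy(&a);
        b.sw_strategy(&b);
        return (0);
    }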
reacquire the "first" object's lock while a backing object's lock is held.
Since this is a lock-order reversal, vm_fault() uses trylock to acquire
the first object's lock, skipping the sequential access optimization in
the unlikely event that the trylock fails.
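The pattern can be modeled in userland with pthreads; the lock names
and the skipped "optimization" below are purely illustrative.

    /*
     * Hedged sketch: trylock to dodge a lock-order reversal.  The normal
     * order is first_object -> backing_object; here the backing object's
     * lock is already held, so the first object's lock is only trylock'd
     * and the optimization is skipped if that fails.
     */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t first_object = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t backing_object = PTHREAD_MUTEX_INITIALIZER;

    static void
    fault_with_backing_locked(void)
    {
        pthread_mutex_lock(&backing_object);

        if (pthread_mutex_trylock(&first_object) == 0) {
            /* Both locks held: run the sequential-access path. */
            printf("optimization applied\n");
            pthread_mutex_unlock(&first_object);
        } else {
            /* Contended: skip the optimization rather than risk deadlock. */
            printf("optimization skipped\n");
        }

        pthread_mutex_unlock(&backing_object);
    }

    int
    main(void)
    {
        fault_with_backing_locked();
        return (0);
    }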
in struct vm_page are defined as u_int for 16K pages and u_long
for 32K pages, with the implied assumption that long will be at
least 64 bits wide on platforms where we support 32K pages.
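A compile-time restatement of that width assumption; the type name
vm_page_bits_t and the constants are assumed here for illustration,
and the check deliberately fails to build where long is only 32 bits
wide and 32K pages are selected.

    /*
     * Hedged sketch: one bit per DEV_BSIZE chunk of a page has to fit
     * in the chosen integer type.  Names and values are illustrative.
     */
    #include <limits.h>

    #define DEV_BSIZE   512
    #define PAGE_SIZE   32768               /* pretend 32K pages */

    #if PAGE_SIZE == 32768
    typedef unsigned long   vm_page_bits_t; /* needs >= 64 bits */
    #else
    typedef unsigned int    vm_page_bits_t; /* 16K pages: 32 bits suffice */
    #endif

    /* Breaks the build if long is narrower than PAGE_SIZE/DEV_BSIZE bits. */
    typedef char vm_page_bits_wide_enough[
        (sizeof(vm_page_bits_t) * CHAR_BIT >= PAGE_SIZE / DEV_BSIZE) ? 1 : -1];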
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating a garbage cachefree value for pcpu caches that
essentially don't exist for at least the 'zones' zone, and it so
happened that we were reading from an unmapped location.
Confirmed to fix crash: wilko
Helped debug: wilko, gallatin
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)
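A hedged sketch of that computation; UMA_SLAB_SIZE and the uma_slab
layout below are simplified stand-ins for the real ones.

    /*
     * Hedged sketch: items-per-slab for a small zone.  The slab header
     * consumes part of the slab and each item carries one byte of
     * free-list linkage, hence the (size + 1) divisor.  Values are
     * illustrative only.
     */
    #include <assert.h>
    #include <stdio.h>

    #define UMA_SLAB_SIZE   4096

    struct uma_slab {                   /* simplified stand-in header */
        void    *us_next;
        int      us_freecount;
    };

    static int
    zone_small_init(size_t size)
    {
        int ipers;

        ipers = (UMA_SLAB_SIZE - sizeof(struct uma_slab)) / (size + 1);
        assert(ipers > 0);              /* the KASSERT: zero ipers is fatal */
        return (ipers);
    }

    int
    main(void)
    {
        printf("ipers for 256-byte items: %d\n", zone_small_init(256));
        return (0);
    }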
- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition also ensures that
we don't set up the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
- All those diffs to syscalls.master for each architecture *are*
necessary. This needed clarification; the stub code generation for
mlockall() was disabled, which would prevent applications from
linking to this API (suggested by mux)
- Giant has been quashed. It is no longer held by the code, as
the required locking has been pushed down within vm_map.c.
- Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
to express their intention explicitly (see the sketch below).
- Inspected with vmstat, top, and the vm pager sysctl stats.
Paging-in activity occurs correctly under a test harness.
- The RES size for a process may appear to be greater than its SIZE.
This is believed to be due to mappings of the same shared library
page being wired twice. Further exploration is needed.
- Believed to back out of allocations and locks correctly
(tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).
PR: kern/43426, standards/54223
Reviewed by: jake, alc
Approved by: jake (mentor)
MFC after: 2 weeks
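A toy model of what the two wiring flags mean to a caller; the flag
names mirror the kernel's, but their values, the range walker and the
toy map below are illustrative stand-ins.

    /*
     * Hedged sketch: HOLESOK vs. NOHOLES over a toy list of mapped
     * ranges with a hole between 0x3000 and 0x4000.
     */
    #include <stdio.h>

    #define VM_MAP_WIRE_HOLESOK 0x1     /* unmapped gaps are acceptable */
    #define VM_MAP_WIRE_NOHOLES 0x2     /* any gap fails the request */

    struct range { unsigned long start, end; };

    static const struct range map[] = {
        { 0x1000, 0x3000 },
        { 0x4000, 0x6000 },
    };

    static int
    wire_range(unsigned long start, unsigned long end, int flags)
    {
        unsigned long next = start;

        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
            if (map[i].end <= start || map[i].start >= end)
                continue;               /* entry outside the request */
            if (map[i].start > next && (flags & VM_MAP_WIRE_NOHOLES))
                return (-1);            /* hole found, caller said no */
            /* ... wire the overlapping part here ... */
            if (map[i].end > next)
                next = map[i].end;
        }
        if (next < end && (flags & VM_MAP_WIRE_NOHOLES))
            return (-1);                /* trailing hole */
        return (0);
    }

    int
    main(void)
    {
        printf("NOHOLES: %d\n", wire_range(0x1000, 0x6000, VM_MAP_WIRE_NOHOLES));
        printf("HOLESOK: %d\n", wire_range(0x1000, 0x6000, VM_MAP_WIRE_HOLESOK));
        return (0);
    }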
From alc:
Move pageable pipe memory to a separate kernel submap to avoid awkward
vm map interlocking issues. (Bad explanation provided by me.)
From me:
Rework pipespace accounting code to handle this new layout, and adjust
our default values to account for the fact that we now have a solid
limit on allocations.
Also, remove the "maxpipes" limit, as it no longer has a purpose.
(The limit on kva usage solves the problem of having too many pipes.)
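A userland sketch of why the dedicated, fixed-size submap makes a
separate pipe count limit redundant: the address-space budget itself
bounds the number of pipes. PIPE_MAP_SIZE and pipe_space_alloc() are
hypothetical names.

    /*
     * Hedged sketch: a fixed kva-style budget acting as the only limit.
     * Sizes and names are illustrative.
     */
    #include <stdio.h>

    #define PIPE_MAP_SIZE   (8UL * 1024 * 1024)     /* toy submap budget */

    static unsigned long pipe_map_used;

    /* Succeeds only while the submap has room; no pipe counter needed. */
    static int
    pipe_space_alloc(unsigned long size)
    {
        if (pipe_map_used + size > PIPE_MAP_SIZE)
            return (-1);                            /* submap exhausted */
        pipe_map_used += size;
        return (0);
    }

    int
    main(void)
    {
        int n = 0;

        while (pipe_space_alloc(64UL * 1024) == 0)  /* 64K per pipe buffer */
            n++;
        printf("pipes until the submap filled: %d\n", n);
        return (0);
    }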
Eliminate a lot of checks that requests are not cross-device; these
are unnecessary with the new layout. We know a sequential request
cannot possibly be cross-device because there is a reserved page
between the devices.
Remove a couple of comments which are no longer relevant.
to not get any cross-device I/O requests. (The unallocated first page
protecting BSD labels already gave us this, but that hack may go away
at some point in time).
Remove the check for cross-device I/O requests in swap_pager_strategy.
Move the repeated statistics updating into flushchainbuf().
swapbkva. Swapbkva mappings are explicitly managed using pmap_qenter(),
not on-demand by vm_fault(), making kmem_alloc_nofault() more appropriate.
Submitted by: tegge
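The distinction can be modeled in userland as "reserve address space
first, populate mappings explicitly later"; the mmap/mprotect calls
below are only a POSIX analogy for kmem_alloc_nofault() plus
pmap_qenter(), not the kernel interfaces.

    /*
     * Hedged analogy: reserve a range that must never be touched unless
     * explicitly populated, then enable one piece of it on purpose.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define RESERVATION (1UL << 20)     /* 1 MB of reserved address space */
    #define CHUNK       4096

    int
    main(void)
    {
        /* Reserve only: any access here is a hard fault. */
        char *base = mmap(NULL, RESERVATION, PROT_NONE,
            MAP_PRIVATE | MAP_ANON, -1, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return (1);
        }

        /* Explicitly populate one chunk, analogous to pmap_qenter(). */
        if (mprotect(base, CHUNK, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect");
            return (1);
        }
        memset(base, 0, CHUNK);         /* safe: this chunk was populated */

        printf("reserved %lu bytes, populated %d\n", RESERVATION, CHUNK);
        munmap(base, RESERVATION);
        return (0);
    }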
Use ->bio_children to count child buffers, rather than abuse the
bio_caller1 pointer.
Expand the relevant bits of waitchainbuf() inline; this clarifies
the code a little bit.
striping to a per-device round-robin algorithm.
Because we do not attempt to retain previous swap allocations on
page-out, a newly added swap device almost instantly takes its 1/N
share of the I/O load, but it takes somewhat longer for it to assume
its 1/N share of the pages if there is plenty of space on the other
devices.
Change the 8G total swapspace limitation to 8G per device instead
by using a per-device blist rather than one global blist. This
reduces the memory footprint by 75% (typically a couple hundred
kilobytes) for the common case with one swap device but NSWAPDEV=4.
Remove the compile-time constant limit on the number of swap devices;
there is no limit now. Instead of a fixed-size array, store the
per-device swdevt structures in a TAILQ (see the sketch below).
Total swap space is still addressed by a 32 bit page number and
therefore the upper limit is now 2^44 bytes = 16TB (for i386).
We still do not allocate the first page of each device in order to
give some amount of protection to any bsdlabel at the start of the
device.
A new device is appended after the existing devices in the swap space,
no attempt is made to fill in holes left behind by swapoff (this can
trivially be changed should it ever become a problem).
The sysctl vm.nswapdev now reflects the number of currently configured
swap devices.
Rename vm_swap_size to swap_pager_avail for consistency with other
exported names.
Change argument type for vm_proc_swapin_all() and swap_pager_isswapped()
to be a struct swdevt pointer rather than an index.
Not changed: we are still using blists to manage the free space,
but since the swapspace is no longer fragmented by the striping,
different resource managers might fare better.
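A sketch of the round-robin pick over a TAILQ of swap devices; struct
swdevt is reduced here to the fields needed for the illustration,
swp_pager_pick() is a hypothetical name, and the real code also
consults each device's blist for free space.

    /*
     * Hedged sketch: rotate a cursor through the device list, wrapping
     * to the head when the tail is passed.
     */
    #include <stdio.h>
    #include <sys/queue.h>

    struct swdevt {
        const char              *sw_name;
        TAILQ_ENTRY(swdevt)      sw_list;
    };

    static TAILQ_HEAD(, swdevt) swtailq = TAILQ_HEAD_INITIALIZER(swtailq);
    static struct swdevt *swdevhd;              /* round-robin cursor */

    static struct swdevt *
    swp_pager_pick(void)
    {
        struct swdevt *sp;

        sp = swdevhd;
        if (sp == NULL)
            sp = TAILQ_FIRST(&swtailq);
        if (sp == NULL)
            return (NULL);                      /* no swap configured */
        swdevhd = TAILQ_NEXT(sp, sw_list);      /* NULL here means wrap next time */
        return (sp);
    }

    int
    main(void)
    {
        struct swdevt a = { .sw_name = "da0s1b" };
        struct swdevt b = { .sw_name = "da1s1b" };

        TAILQ_INSERT_TAIL(&swtailq, &a, sw_list);
        TAILQ_INSERT_TAIL(&swtailq, &b, sw_list);

        for (int i = 0; i < 5; i++)
            printf("next device: %s\n", swp_pager_pick()->sw_name);
        return (0);
    }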
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.
In collaboration with: tegge
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.
in sync with the backend machdep code. When cpu_thread_init() does not
have the same idea of KSTACK_PAGES as the thing that created the kstack,
all hell breaks loose.
Bad alc! no cookie! :-)