on and off since John Dyson left his work-in-progress.
It is off by default for now; use sysctl vm.zeroidle_enable=1 to turn it on.
There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we won't preempt otherwise.
This is basically Matt's algorithm [with hysteresis], with an idle process
calling it in much the same way it used to be called from the idle loop.
I cleaned up the includes a fair bit here too.
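For illustration, a minimal sketch of such an idle-priority loop, assuming
hypothetical helpers vm_page_zero_needed() and vm_page_zero_one() and an
illustrative batch size (these are not the actual symbols in the commit):

    static int zeroidle_enable = 0;     /* toggled via vm.zeroidle_enable */
    #define ZIDLE_BATCH 16              /* pages to zero before yielding */

    static void
    vm_pagezero_loop(void)
    {
        int n;

        for (;;) {
            if (!zeroidle_enable || !vm_page_zero_needed()) {
                /* Nothing to do; sleep and re-check later. */
                tsleep(&zeroidle_enable, PPAUSE, "pgzero", hz);
                continue;
            }
            /*
             * Zero only a small batch per pass.  With no kernel
             * preemption yet, we must give the CPU back explicitly
             * or nothing else would run.
             */
            for (n = 0; n < ZIDLE_BATCH; n++)
                if (!vm_page_zero_one())
                    break;
            tsleep(&zeroidle_enable, PPAUSE, "pgzero", 1);  /* yield */
        }
    }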
timeout callwheel and buffer cache, out of the platform-specific areas
and into the machine-independent area. i386 and alpha are adjusted here;
other CPUs can be fixed piecemeal.
Reviewed by: freebsd-smp, jake
information. The default limits only affect machines with > 1GB of RAM
and can be overridden with two new kernel conf variables, VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with the loader variables kern.maxswzone and
kern.maxbcache. This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid trip-ups where a sysadmin
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.
Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating. The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.
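For example (the values here are illustrative only), the compile-time caps
can be set in the kernel config file, or the same limits tuned at boot via
/boot/loader.conf:

    # kernel configuration file
    options     VM_SWZONE_SIZE_MAX=(32*1024*1024)
    options     VM_BCACHE_SIZE_MAX=(100*1024*1024)

    # /boot/loader.conf
    kern.maxswzone="33554432"
    kern.maxbcache="104857600"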
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with the spl raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
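For illustration, the shape of the conversion in such a caller (the channel,
priority, wait message, and timeout are illustrative): rather than priming a
sleep with asleep() and blocking later in await(), the caller now blocks in a
single tsleep() call:

    /* was: asleep(chan, PRIBIO, "atawai", timo) followed by await() */
    error = tsleep(chan, PRIBIO, "atawai", timo);
    if (error != 0 && error != EWOULDBLOCK)
        return (error);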
1) allocate fewer buckets
2) when failing to allocate the swap zone, keep reducing the zone by
a third rather than a half in order to reduce the chance of
allocating way too little (a sketch follows below).
I also moved around some code for readability.
Suggested by: dillon
Reviewed by: dillon
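A simplified sketch of the retry loop in (2); the allocator call and symbol
names here are placeholders:

    n = desired_swblocks;
    do {
        swap_zone = swzone_create(n);   /* placeholder for the real zone allocator */
        if (swap_zone != NULL)
            break;
        /*
         * Back off by a third rather than a half, rounding up so the
         * loop always makes progress and a few failures do not leave
         * the zone drastically undersized.
         */
        n -= (n + 2) / 3;
    } while (n > 0);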
already allow this for NFS swap configured via BOOTP, so it is
known to work fine.
For many diskless configurations it is more flexible to have the
client set up swapping itself; it can recreate a sparse swap file
to save on server space, for example, and it works with a non-NFS
root filesystem such as an in-kernel filesystem image.
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.
vm_page.c relates to vm_page_t manipulation, including high-level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.
(kernel size also reduced by 50-100K, but that isn't the prime intent)
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctls on those subsystems which the authors believe can
operate without Giant.
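As a sketch of the idea (the macro name and sysctl knob below are
illustrative, not necessarily the macros actually added), a subsystem can
assert its Giant requirement through a knob that is turned off once its own
locking is trusted:

    static int vm_needs_giant = 1;      /* flipped via a sysctl once the
                                           subsystem's own locks suffice */

    #define VM_GIANT_REQUIRED do {                              \
        if (vm_needs_giant)                                     \
            mtx_assert(&Giant, MA_OWNED);                       \
    } while (0)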
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:
o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution, now that the excessively large allocation macros are gone.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.
Additional things changed with this addition:
- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not deprecated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics display in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All of the information is fetched via sysctl.
TODO (in order of priority):
- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters, at least for mbuf clusters, into an unused portion
of the cluster itself, to save space and avoid having to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.
Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/
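For illustration, a sketch of the per-CPU pool idea (the structure layout
and function names are placeholders, not the allocator's actual interfaces):

    struct mb_pcpu_list {
        struct mtx   mb_lock;   /* protects this CPU's free list */
        struct mbuf *mb_free;   /* per-CPU free mbufs */
        u_int        mb_cnt;
    };

    static struct mb_pcpu_list mb_list[MAXCPU];

    static struct mbuf *
    mb_alloc_pcpu(void)
    {
        struct mb_pcpu_list *pl;
        struct mbuf *m;

        /* Common case: take from this CPU's pool under its own lock. */
        pl = &mb_list[PCPU_GET(cpuid)];
        mtx_lock(&pl->mb_lock);
        if ((m = pl->mb_free) != NULL) {
            pl->mb_free = m->m_next;
            pl->mb_cnt--;
        }
        mtx_unlock(&pl->mb_lock);

        /* Fall back to a shared pool only when the local list is empty. */
        if (m == NULL)
            m = mb_alloc_general();     /* placeholder fallback */
        return (m);
    }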
processes a little earlier to avoid a deadlock. Second, when calculating
the 'largest process', do not just count RSS; instead count the RSS + swap
used by the process. Without this the code tended to kill small,
inconsequential processes like, oh, sshd, rather than one of the many
'eatmem 200MB' processes I run on a whim :-). This fix has been extensively
tested on -stable and somewhat tested on -current and will be MFCd in a few days.
Shamed into fixing this by: ps
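The selection criterion roughly becomes the following (sketch only; the
surrounding scan loop is omitted and the counting helpers are assumed to
return the process' resident and swapped-out page counts):

    size = vmspace_resident_count(p->p_vmspace) +
        vmspace_swap_count(p->p_vmspace);
    if (size > bigsize) {
        bigproc = p;        /* largest RSS + swap seen so far */
        bigsize = size;
    }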
canonical: define a versioned struct xswdev, and add a sysctl node
handler that allows the user to get this structure for a certain device
index by specifying this index as the last element of the MIB.
This new node handler, vm.swap_info, replaces the old vm.nswapdev
and vm.swapdevX.* (where X was the index) sysctls.
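From userland the node is walked by appending the device index to the MIB;
a hedged sketch (error handling is trimmed, and the header that provides
struct xswdev is assumed to be the installed swap pager header):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <vm/swap_pager.h>  /* struct xswdev; install location assumed */
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct xswdev xsw;
        int mib[16], n;
        size_t miblen, len;

        miblen = sizeof(mib) / sizeof(mib[0]) - 1;
        if (sysctlnametomib("vm.swap_info", mib, &miblen) != 0)
            return (1);
        for (n = 0; ; n++) {
            mib[miblen] = n;    /* device index as the last MIB element */
            len = sizeof(xsw);
            if (sysctl(mib, miblen + 1, &xsw, &len, NULL, 0) != 0)
                break;          /* no more swap devices */
            printf("swap device %d: %jd blocks, %jd used\n",
                n, (intmax_t)xsw.xsw_nblks, (intmax_t)xsw.xsw_used);
        }
        return (0);
    }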
- move the sysctl code to kern_intr.c
- do not use INTRCNT_COUNT, but rather eintrcnt - intrcnt to determine
the length of the intrcnt array
- move the declarations of intrnames, eintrnames, intrcnt and eintrcnt
from machine-dependent include files to sys/interrupt.h
- remove the hw.nintr sysctl; it is not needed.
- fix various style bugs
Requested by: bde
Reviewed by: bde (some time ago)
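A sketch of what the sysctl handler looks like with that change (intrcnt and
eintrcnt being the arrays now declared in sys/interrupt.h):

    static int
    sysctl_intrcnt(SYSCTL_HANDLER_ARGS)
    {
        /* The length comes from the symbols, not a fixed count. */
        return (sysctl_handle_opaque(oidp, intrcnt,
            (char *)eintrcnt - (char *)intrcnt, req));
    }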
Tor created a while ago, removes the raw I/O piece (which has cache coherency
problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
the buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
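From the application side only the open flag changes; a hedged userland
sketch (the path and buffer size are illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t
    read_uncached(const char *path, char *buf, size_t len)
    {
        ssize_t n;
        int fd;

        /*
         * O_DIRECT data still passes through the buffer cache for now,
         * but the buffers are released once the I/O completes instead
         * of lingering and evicting other cached data.
         */
        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return (-1);
        n = read(fd, buf, len);
        close(fd);
        return (n);
    }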
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also,
lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon. When process
limits are locked, we will only need Giant for swapout_procs().
- Restore the previous order of setting up a new vm_object. The order
being replaced had a small bug where we zeroed out the flags after we set
the OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
places.
- Add in some #ifdef objlocks code to lock individual vm objects when
vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command. If DDB
blocked on the lock, that would be worse than having an inconsistent
allproc list.
- Add a few KTR tracepoints to track the addition and removal of
vm_map_entry's and the creation and freeing of vmspaces.
- Adjust a few portions of code so that we update the process' vmspace
pointer to its new vmspace before freeing the old vmspace.
- Don't lock Giant in the scheduler() function except for when calling
faultin().
- In swapout_procs(), lock the VM before the process to avoid a lock order
violation.
- In swapout_procs(), release the allproc lock before calling swapout().
We restart the process scan after swapping out a process.
- In swapout_procs(), un-#if 0 the code to bump the vmspace reference count
and lock the process' vm structures. This bug was introduced by me and
could result in the vmspace being freed out from under a running
process.
- Fix an old bug where the vmspace reference was not released if we
failed the swap_idle_threshold2 test (sketched below).
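A simplified sketch of the reference pattern in those last two items
(locking detail trimmed; swappable() and idle_longer_than() are stand-ins
for the real checks):

    vm = p->p_vmspace;
    ++vm->vm_refcnt;            /* keep the vmspace from being freed under us */
    PROC_UNLOCK(p);
    if (!swappable(p) ||
        !idle_longer_than(p, swap_idle_threshold2)) {   /* stand-in check */
        vmspace_free(vm);       /* previously leaked on this path */
        continue;
    }
    swapout(p);
    vmspace_free(vm);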
acquired.
- Assert Giant is held in the strategy, getpages, and putpages methods and
the getchainbuf, flushchainbuf, and waitchainbuf functions.
- Always call flushchainbuf() w/o the VM lock.
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
pager routines that need Giant or into other VM routines that need
Giant.
- Replace code that used a goto to jump around what should have been an
else branch with an actual else branch.