Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also,
lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon. When process
limits are locked we will be only need Giant for swapout_procs().
- Restore the previous order of setting up a new vm_object. The previous
had a small bug where we zero'd out the flags after we set the
OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
places.
- Add in some #ifdef objlocks code to lock individual vm objects when
vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command. If DDB
blocked on the lock, that would be worse than having an inconsistent
allproc list.
- Add a few KTR tracepoints to track the addition and removal of
vm_map_entry's and the creation adn free'ing of vmspace's.
- Adjust a few portions of code so that we update the process' vmspace
pointer to its new vmspace before freeing the old vmspace.
- Don't lock Giant in the scheduler() function except for when calling
faultin().
- In swapout_procs(), lock the VM before the proccess to avoid a lock order
violation.
- In swapout_procs(), release the allproc lock before calling swapout().
We restart the process scan after swapping out a process.
- In swapout_procs(), un #if 0 the code to bump the vmspace reference count
and lock the process' vm structures. This bug was introduced by me and
could result in the vmspace being free'd out from under a running
process.
- Fix an old bug where the vmspace reference count was not free'd if we
failed the swap_idle_threshold2 test.
acquired.
- Assert Giant is held in the strategy, getpages, and putpages methods and
the getchainbuf, flushchainbuf, and waitchainbuf functions.
- Always call flushchainbuf() w/o the VM lock.
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
pager routines that need Giant or into other VM routines that need
Giant.
- Replace code that used a goto to jump around the else branch of a test
to use an else branch instead.
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
wakeup proc0 by hand to enforce the timeout.
- When swapping out a process, keep the process locked via the proc lock
from the first checks up until we clear PS_INMEM and set PS_SWAPPING in
swapout(). The swapout() function now must be called with the proc lock
held and releases it before returning.
- Comment out the code to attempt to lock a process' VM structures before
swapping out. It is broken in that it releases the lock after obtaining
it. If it does grab the lock, it needs to hand it off to swapout()
instead of releasing it. This can be revisisted when the VM is locked
as this is a valid test to perform. It also causes a lock order reversal
for the time being, which is the immediate cause for temporarily
disabling it.
the process in question locked as soon as we find it and determine it to
be eligible until we actually kill it. To avoid deadlock, we don't block
on the process lock but skip any process that is already locked during our
search.
- Don't hold Giant in the swapper daemon while we walk the list of
processes looking for a process to swap back in.
- Don't bother grabbing the sched_lock while checking a process' sleep
time in swapout_procs() to ensure that a process has been idle for at
least swap_idle_threshold2 before swapping it out. If we lose the race
we just let a process stay in memory until the next call of
swapout_procs().
- Remove some unneeded spl's, sched_lock does all the locking needed in
this case.
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
The zone allocator's locks should be leaflocks, meaning that they
should never be held when entering into another subsystem, however
the sysctl grabs the zone global mutex and individual zone mutexes
while holding the lock it calls SYSCTL_OUT which recurses into the
VM subsystem in order to wire user memory to do a safe copy. This
can block and cause lock order reversals.
To fix this:
lock zone global.
get a count of the number of zones.
unlock global.
allocate temporary storage.
format and SYSCTL_OUT the banner.
lock global.
traverse list.
make sure we haven't looped more than the initial count taken
to avoid overflowing the allocated buffer.
lock each nodes.
read values and format into buffer.
unlock individual node.
unlock global.
format and SYSCTL_OUT the rest of the data.
free storage.
return.
Other problems included not checking for errors when doing sysctl out
of the column header. Fixed.
Inconsistant termination of the copied string. Fixed.
Objected to by: des (for not using sbuf)
Since the output is not variable length and I'm actually over
allocating signifigantly and I'd like to get this fixed now, I'll
work on the sbuf convertion at a later date. I would not object
to someone else taking it upon themselves to convert it to sbuf.
I hold no MAINTIANER rights to this code (for now).
Protect pager object list manipulation with a mutex.
It doesn't look possible to combine them under a single sx lock because
creation may block and we can't have the object list manipulation block
on anything other than a mutex because of interrupt requests.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
programs. There is a case during a fork() which can cause a deadlock.
From Tor -
The workaround that consists of setting a flag in the vm map that
indicates that a fork is in progress and using that mark in the page
fault handling to force a revalidation failure. That change will only
affect (pessimize) page fault handling during fork for threaded
(linuxthreads style) applications and applications using aio_*().
Submited by: tegge
call is correct, but it interferes with the massive hack called
vm_map_growstack(). The call will be returned after our stack handling
code is fixed.
Reported by: tegge
reference count was transferred to the new object, but both the
new and the old map entries had pointers to the new object.
Correct this by transferring the second reference.
This fixes a panic that can occur when mmap(2) is used with the
MAP_INHERIT flag.
PR: i386/25603
Reviewed by: dillon, alc
supported architectures such as the alpha. This allows us to save
on kernel virtual address space, TLB entries, and (on the ia64) VHPT
entries. pmap_map() now modifies the passed in virtual address on
architectures that do not support direct-mapped segments to point to
the next available virtual address. It also returns the actual
address that the request was mapped to.
- On the IA64 don't use a special zone of PV entries needed for early
calls to pmap_kenter() during pmap_init(). This gets us in trouble
because we end up trying to use the zone allocator before it is
initialized. Instead, with the pmap_map() change, the number of needed
PV entries is small enough that we can get by with a static pool that is
used until pmap_init() is complete.
Submitted by: dfr
Debugging help: peter
Tested by: me