Specifically, since the delete-behind heuristic is never applied to a
device-backed object, there is no point in checking whether each of the
object's pages is fictitious. (Only device-backed objects have
fictitious pages.)
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.
Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
1. Rewrite the backward scan. Specifically, reverse the order in which
pages are allocated so that upon failure it is never necessary to
free pages that were just allocated. Moreover, any allocated pages
can be put to use. This makes the backward scan behave just like the
forward scan.
2. Eliminate an explicit, unsynchronized check for low memory before
calling vm_page_alloc(). It serves no useful purpose. It is, in
effect, optimizing the uncommon case at the expense of the common
case.
Approved by: re (hrs)
MFC after: 3 weeks
vm_fault_additional_pages() that was introduced in revision 1.47. Then
as now, it is unnecessary because dev_pager_haspage() returns zero for
both the number of pages to read ahead and read behind, producing the
same exact behavior by vm_fault_additional_pages() as the special case
handling.
Approved by: re (rwatson)
tracks the total number of reactivated pages. (We have not been
counting reactivations by vm_fault() since revision 1.46.)
Correct a comment in vm_fault_additional_pages().
Approved by: re (kensmith)
MFC after: 1 week
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
given a specific value.
Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.
Reviewed by: alc, bde
Approved by: jeff (mentor)
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.
Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
vm_map_pmap_enter() unless the caller is madvise(MADV_WILLNEED). With
the exception of calls to vm_map_pmap_enter() from
madvise(MADV_WILLNEED), vm_fault_prefault() and vm_map_pmap_enter()
are both used to create speculative mappings. Thus, always
reactivating cached pages is a mistake. In principle, cached pages
should only be reactivated by an actual access. Otherwise, the
following misbehavior can occur. On a hard fault for a text page the
clustering algorithm fetches not only the required page but also
several of the adjacent pages. Now, suppose that one or more of the
adjacent pages are never accessed. Ultimately, these unused pages
become cached pages through the efforts of the page daemon. However,
the next activation of the executable reactivates and maps these
unused pages. Consequently, they are never replaced. In effect, they
become pinned in memory.
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
The problem is this: vm_fault_additional_pages() calls vm_pager_has_page(),
which calls vnode_pager_haspage(). Now when VOP_BMAP() returns an error (eg.
EOPNOTSUPP), vnode_pager_haspage() returns TRUE without initializing 'before'
and 'after' arguments, so we have some accidental values there. This bascially
was causing this condition to be meet:
if ((rahead + rbehind) >
((cnt.v_free_count + cnt.v_cache_count) - cnt.v_free_reserved)) {
pagedaemon_wakeup();
[...]
}
(we have some random values in rahead and rbehind variables)
I'm not entirely sure this is the right fix, maybe we should just return FALSE
in vnode_pager_haspage() when VOP_BMAP() fails?
alc@ knows about this problem, maybe he will be able to come up with a better
fix if this is not the right one.
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.
Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.
Reviewed by: alc,jhb,peter,ps
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
fs.first_object->flags & OBJ_NEEDGIANT test that was missed in an earlier
revision. This fixes mutex assertion failures in the debug.mpsafevm=0
case.
Reported by: ps
MFC after: 3 days
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.
Sponsored By: Isilon Systems, Inc.
flag and busy field with the global page queues lock to synchronizing their
access with the containing object's lock. Specifically, acquire the
containing object's lock before reading the page's PG_BUSY flag and busy
field in vm_fault().
Reviewed by: tegge@
on entry and it assumes the responsibility for releasing the page queues
lock if it must sleep.
Remove a bogus comment from pmap_enter_quick().
Using the first change, modify vm_map_pmap_enter() so that the page queues
lock is acquired and released once, rather than each time that a page
is mapped.
In such cases, the busying of the page and the unlocking of the
containing object by vm_map_pmap_enter() and vm_fault_prefault() is
unnecessary overhead. To eliminate this overhead, this change
modifies pmap_enter_quick() so that it expects the object to be locked
on entry and it assumes the responsibility for busying the page and
unlocking the object if it must sleep. Note: alpha, amd64, i386 and
ia64 are the only implementations optimized by this change; arm,
powerpc, and sparc64 still conservatively busy the page and unlock the
object within every pmap_enter_quick() call.
Additionally, this change is the first case where we synchronize
access to the page's PG_BUSY flag and busy field using the containing
object's lock rather than the global page queues lock. (Modifications
to the page's PG_BUSY flag and busy field have asserted both locks for
several weeks, enabling an incremental transition.)
manipulating a vnode, e.g., calling vput(). This reduces contention for
Giant during many copy-on-write faults, resulting in some additional
speedup on SMPs.
Note: debug_mpsafevm must be enabled for this optimization to take effect.
"debug.mpsafevm" results in (almost) Giant-free execution of zero-fill
page faults. (Giant is held only briefly, just long enough to determine
if there is a vnode backing the faulting address.)
Also, condition the acquisition and release of Giant around calls to
pmap_remove() on "debug.mpsafevm".
The effect on performance is significant. On my dual Opteron, I see a
3.6% reduction in "buildworld" time.
- Use atomic operations to update several counters in vm_fault().
to avoid later changes before pmap_enter() and vm_fault_prefault()
has completed.
Simplify deadlock avoidance by not blocking on vm map relookup.
In collaboration with: alc
1. Move a comment to its proper place, updating it. (Except for white-
space, this comment had been unchanged since revision 1.1!)
2. Remove spl calls.
being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic
reported an unexpected wired count. Rather than attempting to fix
PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
fictitious pages. Specifically, fictitious pages will never be
completely unwired. Therefore, we can keep a fictitious page's wired
count forever set to one and thereby avoid the use of
PHYS_TO_VM_PAGE() when we know that we're working with a fictitious
page, just not which one.
In collaboration with: green@, tegge@
PR: kern/29915
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.
Reviewed by: tegge@
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)
panic "vm_page_cache: caching a dirty page, ...": Access to the page must
be restricted or removed before calling vm_page_cache(). This race
condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
Reviewed by: tegge
MFC after: 7 days
every page. If the source entry was read-only, one or more wired pages
could be in backing objects.
- vm_fault_copy_entry() should not set the PG_WRITEABLE flag on the page
unless the destination entry is, in fact, writeable.
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.
A small helper function pmap_is_prefaultable() is added. This function
encapsulate the few lines of pmap_prefault() that actually vary from
machine to machine. Note: pmap_is_prefaultable() and pmap_mincore() have
much in common. Going forward, it's worth considering their merger.
reacquire the "first" object's lock while a backing object's lock is held.
Since this is a lock-order reversal, vm_fault() uses trylock to acquire
the first object's lock, skipping the sequential access optimization in
the unlikely event that the trylock fails.
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc
Tested by: ken
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
o Move pmap_pageable() outside of Giant in vm_fault_unwire().
(pmap_pageable() is a no-op on all supported architectures.)
o Remove the acquisition and release of Giant from mlock().
and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete()
only if operating on the kernel_object or the kmem_object.
o Remove GIANT_REQUIRED from vm_map_remove().
o Remove the acquisition and release of Giant from munmap().
due to conditions that suggest the possible need for stack growth.
This has two beneficial effects: (1) we can
now remove calls to vm_map_growstack() from the MD trap handlers and (2)
simple page faults are faster because we no longer unnecessarily perform
vm_map_growstack() on every page fault.
o Remove vm_map_growstack() from the i386's trap_pfault().
o Remove the acquisition and release of Giant from i386's trap_pfault().
(vm_fault() still acquires it.)
best path forward now is likely to change the lockmgr locks to simple
sleep mutexes, then see if any extra contention it generates is greater
than removed overhead of managing local locking state information,
cost of extra calls into lockmgr, etc.
Additionally, making the vm_map lock a mutex and respecting it properly
will put us much closer to not needing Giant magic in vm.
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.
After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.
This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.
Reviewed by: alc
- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.
Reviewed by: dillon
MFC after: 1 week
on a vnode-backed object must be incremented *after* obtaining the vnode
lock. If it is bumped before obtaining the vnode lock we can deadlock
against vtruncbuf().
Submitted by: peter, ps
MFC after: 3 days
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.
vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and aquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
programs. There is a case during a fork() which can cause a deadlock.
From Tor -
The workaround that consists of setting a flag in the vm map that
indicates that a fork is in progress and using that mark in the page
fault handling to force a revalidation failure. That change will only
affect (pessimize) page fault handling during fork for threaded
(linuxthreads style) applications and applications using aio_*().
Submited by: tegge
make sure that PG_NOSYNC is properly set. Previously we only set it
for a write-fault, but this can occur on a read-fault too.
(will be MFCd prior to 4.3 freeze)
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.
A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather then swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.
Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
it's shadow pv_table entry.
Inspired by: John Dyson's 1998 patches.
Also:
Eliminate pv_table as a seperate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a seperate set of
structions that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).
Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).
Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.
Reviewed by: dillon
madvise().
This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.
MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.
Reviewed by: julian, alc, dg
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.
This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.
Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Unlock vnode before messing with map to avoid deadlock between map and
vnode ( e.g. with exec_map and underlying program binary vnode ). Solves
a deadlock that most often occurs during a large -j# buildworld reported
by three people.
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.
Reviewed by: Bruce Evans <bde@zeta.org.au>
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.
With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
put alot of it's context into a data structure. This allows
significant shortening of its codepath, and will significantly
decrease it's cache footprint.
Also, add some stats to vmmeter. Note that you'll have to
rebuild/recompile vmstat, systat, etc... Otherwise, you'll
get "very interesting" paging stats.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappyness."
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.
The system should be much more stable, but all subtile bugs aren't fixed yet.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.
PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson
The typo was detected once apon a time with the -Wunused compile option.
The result was that a block of code for implementing
madvise(.. MADV_SEQUENTIAL..) behavior was "dead" and unused, probably
negating the effect of activating the option.
Reviewed by: dyson
by Alan Cox <alc@cs.rice.edu>, and his description of the problem.
The bug was primarily in procfs_mem, but the mistake likely happened
due to the lack of vm system support for the operation. I added
better support for selective marking of page dirty flags so that
vm_map_pageable(wiring) will not cause this problem again.
The code in procfs_mem is now less bogus (but maybe still a little
so.)
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.
that we do allow mlock to span unallocated regions (of course, not
mlocking them.) We also allow mlocking of RO regions (which the old
code couldn't.) The restriction there is that once a RO region is
wired (mlocked), it cannot be debugged (or EVER written to.)
Under normal usage, the new mlock code will be a significant improvement
over our old stuff.
scheme. Additionally, add the capability for checking for unexpected
kernel page faults. The maximum amount of kva space for buffers hasn't
been decreased from where it is, but it will now be possible to do so.
This scheme manages the kva space similar to the buffers themselves. If
there isn't enough kva space because of usage or fragementation, buffers
will be reclaimed until a buffer allocation is successful. This scheme
should be very resistant to fragmentation problems until/if the LFS code
is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS
is not likely to use such a scheme.
Now there should be NO problem allocating buffers up to MAXPHYS.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.
performance issues.
1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.
reserving "cached" pages before waking up the pageout daemon. This will reserve
the faulted page, and keep the system from thrashing itself to death given
this condition.
queue in vm_fault.
Move the PG_BUSY in vm_fault to the correct place.
Remove redundant/unnecessary code in pmap.c.
Properly block on rundown of page table pages, if they are busy.
I think that the VM system is in pretty good shape now, and the following
individuals (among others, in no particular order) have helped with this
recent bunch of bugs, thanks! If I left anyone out, I apologize!
Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard,
Marc Fournier.
operations don't work with FICTITIOUS pages.) Also, close a window
between PG_MANAGED and pmap_enter that can mess up the accounting of
the managed flag. This problem could likely cause a hold_count error
for page table pages.
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.
problem. BY MISTAKE, the vm_page_unqueue (or equiv) was removed from the
vm_fault code. Really bad things appear to happen if a page is on a queue
while it is being faulted.
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.