Instead of checking each page for PG_UNMANAGED, perform a one-time
check whether the object is OBJT_PHYS. (PG_UNMANAGED pages only
belong to OBJT_PHYS objects.)
Specifically, since the delete-behind heuristic is never applied to a
device-backed object, there is no point in checking whether each of the
object's pages is fictitious. (Only device-backed objects have
fictitious pages.)
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.
Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
1. Rewrite the backward scan. Specifically, reverse the order in which
pages are allocated so that upon failure it is never necessary to
free pages that were just allocated. Moreover, any allocated pages
can be put to use. This makes the backward scan behave just like the
forward scan.
2. Eliminate an explicit, unsynchronized check for low memory before
calling vm_page_alloc(). It serves no useful purpose. It is, in
effect, optimizing the uncommon case at the expense of the common
case.
Approved by: re (hrs)
MFC after: 3 weeks
vm_fault_additional_pages() that was introduced in revision 1.47. Then
as now, it is unnecessary because dev_pager_haspage() returns zero for
both the number of pages to read ahead and read behind, producing the
same exact behavior by vm_fault_additional_pages() as the special case
handling.
Approved by: re (rwatson)
tracks the total number of reactivated pages. (We have not been
counting reactivations by vm_fault() since revision 1.46.)
Correct a comment in vm_fault_additional_pages().
Approved by: re (kensmith)
MFC after: 1 week
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
given a specific value.
Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.
Reviewed by: alc, bde
Approved by: jeff (mentor)
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.
Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
vm_map_pmap_enter() unless the caller is madvise(MADV_WILLNEED). With
the exception of calls to vm_map_pmap_enter() from
madvise(MADV_WILLNEED), vm_fault_prefault() and vm_map_pmap_enter()
are both used to create speculative mappings. Thus, always
reactivating cached pages is a mistake. In principle, cached pages
should only be reactivated by an actual access. Otherwise, the
following misbehavior can occur. On a hard fault for a text page the
clustering algorithm fetches not only the required page but also
several of the adjacent pages. Now, suppose that one or more of the
adjacent pages are never accessed. Ultimately, these unused pages
become cached pages through the efforts of the page daemon. However,
the next activation of the executable reactivates and maps these
unused pages. Consequently, they are never replaced. In effect, they
become pinned in memory.
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
The problem is this: vm_fault_additional_pages() calls vm_pager_has_page(),
which calls vnode_pager_haspage(). Now when VOP_BMAP() returns an error (eg.
EOPNOTSUPP), vnode_pager_haspage() returns TRUE without initializing 'before'
and 'after' arguments, so we have some accidental values there. This bascially
was causing this condition to be meet:
if ((rahead + rbehind) >
((cnt.v_free_count + cnt.v_cache_count) - cnt.v_free_reserved)) {
pagedaemon_wakeup();
[...]
}
(we have some random values in rahead and rbehind variables)
I'm not entirely sure this is the right fix, maybe we should just return FALSE
in vnode_pager_haspage() when VOP_BMAP() fails?
alc@ knows about this problem, maybe he will be able to come up with a better
fix if this is not the right one.
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.
Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.
Reviewed by: alc,jhb,peter,ps
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
fs.first_object->flags & OBJ_NEEDGIANT test that was missed in an earlier
revision. This fixes mutex assertion failures in the debug.mpsafevm=0
case.
Reported by: ps
MFC after: 3 days
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.
Sponsored By: Isilon Systems, Inc.
flag and busy field with the global page queues lock to synchronizing their
access with the containing object's lock. Specifically, acquire the
containing object's lock before reading the page's PG_BUSY flag and busy
field in vm_fault().
Reviewed by: tegge@
on entry and it assumes the responsibility for releasing the page queues
lock if it must sleep.
Remove a bogus comment from pmap_enter_quick().
Using the first change, modify vm_map_pmap_enter() so that the page queues
lock is acquired and released once, rather than each time that a page
is mapped.
In such cases, the busying of the page and the unlocking of the
containing object by vm_map_pmap_enter() and vm_fault_prefault() is
unnecessary overhead. To eliminate this overhead, this change
modifies pmap_enter_quick() so that it expects the object to be locked
on entry and it assumes the responsibility for busying the page and
unlocking the object if it must sleep. Note: alpha, amd64, i386 and
ia64 are the only implementations optimized by this change; arm,
powerpc, and sparc64 still conservatively busy the page and unlock the
object within every pmap_enter_quick() call.
Additionally, this change is the first case where we synchronize
access to the page's PG_BUSY flag and busy field using the containing
object's lock rather than the global page queues lock. (Modifications
to the page's PG_BUSY flag and busy field have asserted both locks for
several weeks, enabling an incremental transition.)
manipulating a vnode, e.g., calling vput(). This reduces contention for
Giant during many copy-on-write faults, resulting in some additional
speedup on SMPs.
Note: debug_mpsafevm must be enabled for this optimization to take effect.
"debug.mpsafevm" results in (almost) Giant-free execution of zero-fill
page faults. (Giant is held only briefly, just long enough to determine
if there is a vnode backing the faulting address.)
Also, condition the acquisition and release of Giant around calls to
pmap_remove() on "debug.mpsafevm".
The effect on performance is significant. On my dual Opteron, I see a
3.6% reduction in "buildworld" time.
- Use atomic operations to update several counters in vm_fault().
to avoid later changes before pmap_enter() and vm_fault_prefault()
has completed.
Simplify deadlock avoidance by not blocking on vm map relookup.
In collaboration with: alc
1. Move a comment to its proper place, updating it. (Except for white-
space, this comment had been unchanged since revision 1.1!)
2. Remove spl calls.
being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic
reported an unexpected wired count. Rather than attempting to fix
PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
fictitious pages. Specifically, fictitious pages will never be
completely unwired. Therefore, we can keep a fictitious page's wired
count forever set to one and thereby avoid the use of
PHYS_TO_VM_PAGE() when we know that we're working with a fictitious
page, just not which one.
In collaboration with: green@, tegge@
PR: kern/29915
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.
Reviewed by: tegge@
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)
panic "vm_page_cache: caching a dirty page, ...": Access to the page must
be restricted or removed before calling vm_page_cache(). This race
condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
Reviewed by: tegge
MFC after: 7 days
every page. If the source entry was read-only, one or more wired pages
could be in backing objects.
- vm_fault_copy_entry() should not set the PG_WRITEABLE flag on the page
unless the destination entry is, in fact, writeable.
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.
A small helper function pmap_is_prefaultable() is added. This function
encapsulate the few lines of pmap_prefault() that actually vary from
machine to machine. Note: pmap_is_prefaultable() and pmap_mincore() have
much in common. Going forward, it's worth considering their merger.
reacquire the "first" object's lock while a backing object's lock is held.
Since this is a lock-order reversal, vm_fault() uses trylock to acquire
the first object's lock, skipping the sequential access optimization in
the unlikely event that the trylock fails.
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc
Tested by: ken
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
o Move pmap_pageable() outside of Giant in vm_fault_unwire().
(pmap_pageable() is a no-op on all supported architectures.)
o Remove the acquisition and release of Giant from mlock().
and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete()
only if operating on the kernel_object or the kmem_object.
o Remove GIANT_REQUIRED from vm_map_remove().
o Remove the acquisition and release of Giant from munmap().