cache: vm_object_page_remove() should convert any cached pages that
fall within the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.
Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.
Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)
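
A minimal user-space sketch of the range conversion described above,
assuming a toy object that tracks the page indices of its cached pages
in a fixed array. The names toy_object and toy_cache_free() are
hypothetical, and the convention that end == 0 means "no upper bound"
is an assumption made for this illustration, not a statement about the
kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CACHED 16

/* Hypothetical toy object: page indices of its cached pages. */
struct toy_object {
	unsigned long cached[TOY_MAX_CACHED];
	size_t ncached;
	size_t nfreed;		/* pages converted from cached to free */
};

/*
 * Convert cached pages whose index lies in [start, end) to free pages.
 * Assumption for this sketch: end == 0 means "no upper bound", so
 * toy_cache_free(obj, 0, 0) converts every cached page.
 */
static void
toy_cache_free(struct toy_object *obj, unsigned long start, unsigned long end)
{
	size_t i, kept = 0;

	assert(end == 0 || start <= end);
	for (i = 0; i < obj->ncached; i++) {
		unsigned long idx = obj->cached[i];

		if (idx >= start && (end == 0 || idx < end))
			obj->nfreed++;			/* page leaves the cache */
		else
			obj->cached[kept++] = idx;	/* page stays cached */
	}
	obj->ncached = kept;
}
```

In this model, truncating a file to five pages corresponds to
toy_cache_free(obj, 5, 0), so no stale cached page beyond the new end
can reappear if the file is later regrown.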
ways:
(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.
This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.
Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.
(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.
Finally, as a result of this change, long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.
Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to
vm_phys_alloc_pages() and vm_phys_free_pages_locked() to
vm_phys_free_pages(). Add comments regarding the need for the free page
queues lock to be held by callers to these functions. No functional
changes.
Approved by: re (hrs)
vm_page_cowfault(). Initially, if vm_page_cowfault() sleeps, the given
page is wired, preventing it from being recycled. However, when
transmission of the page completes, the page is unwired and returned to
the page queues. At that point, the page is not in any special state
that prevents it from being recycled. Consequently, vm_page_cowfault()
should verify that the page is still held by the same vm object before
retrying the replacement of the page. Note: The containing object is,
however, safe from being recycled by virtue of having a non-zero
paging-in-progress count.
While I'm here, add some assertions and comments.
Approved by: re (rwatson)
MFC After: 3 weeks
This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).
The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.
This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.
Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.
Approved by: re
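
For readers unfamiliar with binary buddy systems, the core address
arithmetic can be sketched as follows. This is a generic illustration,
not the allocator's code: the function names are hypothetical, and the
"twist" (clustering of direct-map allocations) is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * In a binary buddy system, a block of 2^order pages starting at page
 * frame number pfn (aligned to 2^order) has exactly one buddy: the
 * equally sized block whose address differs only in bit 'order'.
 */
static uint64_t
toy_buddy_of(uint64_t pfn, unsigned order)
{
	assert(order < 64);
	assert((pfn & ((UINT64_C(1) << order) - 1)) == 0);	/* aligned */
	return (pfn ^ (UINT64_C(1) << order));
}

/* Start of the 2^(order+1) block formed when a block and its buddy merge. */
static uint64_t
toy_merged_start(uint64_t pfn, unsigned order)
{
	assert(order < 63);
	return (pfn & ~((UINT64_C(1) << (order + 1)) - 1));
}
```

When a block is freed and its buddy is also free, the two merge into
one block of the next order. Repeated merging is what lets a buddy
allocator rebuild the large contiguous runs that superpages and
contigmalloc(9) need.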
In particular:
- Add an explanatory table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some useless comments
Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)
We no longer assume sched_lock protection for some of them and use the
distributed-loads method for vmmeter (counters distributed across CPUs).
Reviewed by: alc, bde
Approved by: jeff (mentor)
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.
Requested by: alc
Approved by: jeff (mentor)
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.
Contributed by: Attilio Rao <attilio@FreeBSD.org>
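
As a rough illustration of counters that are updated atomically rather
than under a scheduler lock, here is a sketch using C11 <stdatomic.h>.
The kernel uses its own atomic primitives; the names toy_v_swtch,
toy_cnt_inc() and toy_cnt_read() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one statistics counter. */
static atomic_uint toy_v_swtch;

/* Increment without holding any lock; the update itself is atomic. */
static void
toy_cnt_inc(atomic_uint *c)
{
	assert(c != NULL);
	atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

/* Read a snapshot; concurrent increments may land just after the load. */
static unsigned
toy_cnt_read(atomic_uint *c)
{
	assert(c != NULL);
	return (atomic_load_explicit(c, memory_order_relaxed));
}
```

Relaxed ordering is enough in this sketch because the counter is purely
statistical and no other data is published through it.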
VM_PHYSSEG_SPARSE depending on whether the physical address space is
densely or sparsely populated with memory. The effect of this
definition is to determine which of two implementations of
vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy
implementation is obtained by defining VM_PHYSSEG_DENSE, and a new
implementation that trades off time for space is obtained by defining
VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and
sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64
allows the entirety of my Itanium 2's memory to be used. Previously,
only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on
sparc64 allows USIIIi-based systems to boot without crashing.
This change is a combination of Nathan Whitehorn's patch and my own
work in perforce.
Discussed with: kmacy, marius, Nathan Whitehorn
PR: 112194
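
The time-for-space trade-off between the two lookup schemes can be
sketched as follows. This is a simplified model, not the kernel's
PHYS_TO_VM_PAGE(): the types and function names are hypothetical and a
4 KB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KB pages */

struct toy_page { int dummy; };

/* Dense: one array spans the whole address range, holes included. */
static struct toy_page *
toy_dense_lookup(struct toy_page *array, uint64_t first_pa, uint64_t pa)
{
	assert(pa >= first_pa);
	return (&array[(pa - first_pa) >> TOY_PAGE_SHIFT]);
}

struct toy_seg {
	uint64_t start, end;		/* physical range [start, end) */
	struct toy_page *first;		/* structures for this range only */
};

/* Sparse: search the segments; structures exist only for real memory. */
static struct toy_page *
toy_sparse_lookup(const struct toy_seg *segs, size_t nsegs, uint64_t pa)
{
	for (size_t i = 0; i < nsegs; i++)
		if (pa >= segs[i].start && pa < segs[i].end)
			return (&segs[i].first[(pa - segs[i].start) >>
			    TOY_PAGE_SHIFT]);
	return (NULL);			/* pa falls in a hole */
}
```

The dense lookup is a single subtraction and shift but needs page
structures for every hole. The sparse lookup pays for a short search
and needs structures only for memory that actually exists.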
immediately flag any page that is allocated to an OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.
Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().
Remove vm_page_unmanage(). It is no longer used.
vm_page_free_toq() to account for recent changes that allow
vm_page_free_toq() to be called on some pages without the page queues lock
being held, specifically, pages that are not contained in a vm object and
not a member of a page queue. (Examples of such pages include page table
pages, pv entry pages, and uma small alloc pages.)
is actually being added to the hold queue, not the free queue. At the same
time, avoid unnecessary tests to wake up threads waiting for free memory
and the idle thread that zeroes free pages. (These tests will be performed
later when the page finally moves from the hold queue to the free queue.)
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep. In this case, inlining the test actually makes the
kernel smaller.
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.
Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock.
Reviewed by: tegge
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().
synchronized by the lock on the object containing the page.
Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.
Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().
Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.
The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.
Reviewed by: marcel@ (ia64)
vm_page_startup(). As a result, we now only look up the tunable once
instead of looking it up once for every physical page of memory in the
system. This cuts out a delay of about 1 second in boot on x86
systems. The delay is apparently much larger and more noticeable on sun4v.
Reported by: kmacy
MFC after: 1 week
These pages are allocated from the direct map, and were not previously
tracked. This included the vm_page_array and the early UMA bootstrap
pages.
Reviewed by: peter
via the debug.minidump sysctl and tunable.
Traditional dumps store all physical memory. This was once a good thing
when machines had a maximum of 64M of ram and 1GB of kvm. These days,
machines often have many gigabytes of ram and a smaller amount of kvm.
libkvm+kgdb don't have a way to access physical ram that is not mapped
into kvm at the time of the crash dump, so the extra ram being dumped
is mostly wasted.
Minidumps invert the process. Instead of dumping physical memory in
order to guarantee that all of kvm's backing is dumped, minidumps
dump only memory that is actively mapped into kvm.
amd64 has a direct map region that things like UMA use. Obviously we
cannot dump all of the direct map region because that is effectively
an old style all-physical-memory dump. Instead, introduce a bitmap
and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that
allow certain critical direct map pages to be included in the dump.
uma_machdep.c's allocator is the intended consumer.
Dumps are a custom format. At the very beginning of the file is a header,
then a copy of the message buffer, then the bitmap of pages present in
the dump, then the final level of the kvm page table trees (2MB mappings
are expanded into 4K page mappings), then the sparse physical pages
according to the bitmap. libkvm can now conveniently access the kvm
page table entries.
Booting my test 8GB machine, forcing it into ddb and forcing a dump
leads to a 48MB minidump. While this is a best case, I expect minidumps
to be in the 100MB-500MB range. Obviously, they are never larger than
physical memory.
Minidumps are on by default. It would only be necessary to turn them off
in order to debug corrupt kernel page table management, as that would
mess up minidumps as well.
Both minidumps and regular dumps are supported on the same machine.
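
The page bitmap and its two helper routines can be modeled in user
space as below. This is only a sketch: the toy_ names, the fixed
1024-page limit, the 4 KB page size and the extra query helper are
assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumption: 4 KB pages */
#define TOY_NPAGES	1024	/* assumption: tiny physical memory */
#define TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per physical page: set means "include this page in the dump". */
static unsigned long toy_dump_bitmap[TOY_NPAGES / TOY_WORD_BITS];

static void
toy_dump_add_page(uint64_t pa)
{
	uint64_t pfn = pa >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_NPAGES);
	toy_dump_bitmap[pfn / TOY_WORD_BITS] |= 1UL << (pfn % TOY_WORD_BITS);
}

static void
toy_dump_drop_page(uint64_t pa)
{
	uint64_t pfn = pa >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_NPAGES);
	toy_dump_bitmap[pfn / TOY_WORD_BITS] &= ~(1UL << (pfn % TOY_WORD_BITS));
}

/* Query helper (not mentioned in the log; added for this sketch). */
static int
toy_dump_has_page(uint64_t pa)
{
	uint64_t pfn = pa >> TOY_PAGE_SHIFT;

	assert(pfn < TOY_NPAGES);
	return ((toy_dump_bitmap[pfn / TOY_WORD_BITS] >>
	    (pfn % TOY_WORD_BITS)) & 1);
}
```

An allocator that hands out direct-map pages would call the add routine
when it takes a page and the drop routine when it releases it, so the
dump writer only has to walk the bitmap.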
and it does not have plenty of free pages, it tries to free pages in the
cache queue. Unfortunately, freeing a cached page requires locking the
object that owns the page. However, in the context of allocating pages we
may not be able to lock the object and thus can only TRY to lock the
object. If the lock attempt fails, the cached page cannot be freed and is
activated to move it out of the way so that we may try to free other
cached pages.
If all pages in the cache belong to objects that are currently locked the
cache queue can be emptied without freeing a single page. This scenario caused
two problems:
1) vm_page_alloc always failed allocation when it tried freeing pages from
the cache queue and failed to do so. However if there are more than
cnt.v_interrupt_free_min pages on the free list it should return pages
when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause
resource exhaustion deadlocks.
2) Threads that need to allocate pages spend a lot of time cleaning up the
page queue without really getting anything done while the pagedaemon
needs to work overtime to refill the cache.
This change fixes the first problem. (1)
Reviewed by: tegge@
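
The rule stated in (1) can be sketched as a threshold check on the free
page count. Only the VM_ALLOC_SYSTEM rule comes from the text above;
the thresholds used here for the normal and interrupt classes, and all
toy_ names, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_req { TOY_ALLOC_NORMAL, TOY_ALLOC_SYSTEM, TOY_ALLOC_INTERRUPT };

/* Hypothetical subset of the page counters. */
struct toy_cnt {
	unsigned free_count;		/* pages on the free list */
	unsigned free_reserved;		/* floor for normal requests */
	unsigned interrupt_free_min;	/* floor for system requests */
};

/* May a request of the given class take a page from the free list? */
static bool
toy_may_take_free_page(const struct toy_cnt *c, enum toy_req req)
{
	assert(c != NULL);
	switch (req) {
	case TOY_ALLOC_NORMAL:		/* assumed threshold */
		return (c->free_count > c->free_reserved);
	case TOY_ALLOC_SYSTEM:		/* rule from problem (1) */
		return (c->free_count > c->interrupt_free_min);
	case TOY_ALLOC_INTERRUPT:	/* assumed: any free page will do */
		return (c->free_count > 0);
	}
	return (false);
}
```

The point of the fix is that a failed attempt to reclaim cached pages
must fall through to this check instead of failing the allocation
outright.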
- provide an interface (macros) to the page coloring part of the VM system;
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)
MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values, as was the
case without the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)
Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.
Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)
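
To make the idea of deriving coloring parameters from the cache size
concrete, here is a textbook-style sketch. The formulas below are a
common way to compute page colors and are an assumption; they are not
taken from this change, and the function names are hypothetical.

```c
#include <assert.h>

/*
 * Number of page colors for a physically indexed cache: the number of
 * pages that fit in one way of the cache (at least one).
 */
static unsigned
toy_num_colors(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
{
	unsigned n;

	assert(ways > 0 && page_bytes > 0);
	n = cache_bytes / (ways * page_bytes);
	return (n > 0 ? n : 1);
}

/* Preferred color of the page at index pindex within an object. */
static unsigned
toy_page_color(unsigned long pindex, unsigned object_color, unsigned ncolors)
{
	assert(ncolors > 0);
	return ((unsigned)((pindex + object_color) % ncolors));
}
```

With a cache size of zero, as a dummy detection function might report,
the sketch degrades to a single color, which amounts to no coloring.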
by the zero-copy sockets method, and written to before the transmission
completes, we need to destroy all of the existing mappings to the page,
not just the one that we fault on. Otherwise, the mappings will no longer
be to the same page and changes made through one of the mappings will not
be visible through the others.
Observed by: tegge
If a copy-on-write fault occurs on the page, the new copy should inherit
a part of the original page's wire count.
Submitted by: tegge
MFC after: 1 week
is inserted.
- In vm_page_remove() drop the backing vnode when the last page
is removed.
- Don't check the vnode to see if it must be reclaimed on every
call to vm_page_free_toq() as we only check it now when it is
actually required. This saves us two lock operations per call.
Sponsored by: Isilon Systems, Inc.
queue and (possibly) unlocking the containing object from
vm_page_alloc() to vm_page_select_cache(). Recent optimizations to
vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and
pmap_enter_quick() have resulted in panic()s because vm_page_alloc()
mistakenly unlocked objects that had not been locked by
vm_page_select_cache().
Reported by: Peter Holm and Kris Kennaway
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.
This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.
that indicates that the caller does not want a page with its busy flag set.
In many places, the global page queues lock is acquired and released just
to clear the busy flag on a just allocated page. Both the allocation of
the page and the clearing of the busy flag occur while the containing vm
object is locked. So, the busy flag might as well never be set.
errors are in rarely executed paths.
1. Each time the retry_alloc path is taken, the PG_BUSY must be set again.
Otherwise vm_page_remove() panics.
2. There is no need to set PG_BUSY on the newly allocated page before
freeing it. The page already has PG_BUSY set by vm_page_alloc().
Setting it again could cause an assertion failure.
MFC after: 2 weeks
vm_page_io_finish(). The motivation being to transition synchronization of
the vm_page's busy field from the global page queues lock to the per-object
lock.
and which takes a M_WAITOK/M_NOWAIT flag argument.
Add compatibility isa_dmainit() macro which whines loudly if
isa_dma_init() fails.
Problem uncovered by: tegge
- Enable recursion on the page queues lock. This allows calls to
vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
queues lock held. Such calls are made to allocate page table pages
and pv entries.
- The previous change enables a partial reversion of vm/vm_page.c
revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
- Add partial locking to pmap_copy(). (As a side-effect, pmap_copy()
should now be faster on i386 SMP because it no longer generates IPIs
for TLB shootdown on the other processors.)
- Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now,
all changes to a user-level pmap on alpha, amd64, and i386 are performed
with appropriate locking.)
being incomplete, it currently has to know how to drop and pick back
up the vm_object's mutex if it has to sleep and drop the page queue
mutex. The problem with this is that if the page is busy, while we
are sleeping, the page can be freed and the object can disappear. When
trying to lock m->object, we'd get a stale or NULL pointer and crash.
The object is now cached, but this makes the assumption that
the object is referenced in some manner and will not itself
disappear while it is unlocked. Since this only happens if
the object is locked, I had to remove an assumption earlier in
contigmalloc() that reversed the order of locking the object and
doing vm_page_sleep_if_busy(), not the normal order.
allocated as "no object" pages. Similar changes were made to the amd64
and i386 pmap last year. The primary reason being that maintaining
a pte object leads to lock order violations. A secondary reason being
that the pte object is redundant, i.e., the page table itself can be
used to lookup page table pages. (Historical note: The pte object
predates our ability to allocate "no object" pages. Thus, the pte
object was a necessary evil.)
- Unconditionally check the vm object lock's status in vm_page_remove().
Previously, this assertion could not be made on Alpha due to its use
of a pte object.
being that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic
reported an unexpected wired count. Rather than attempting to fix
PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
fictitious pages. Specifically, fictitious pages will never be
completely unwired. Therefore, we can keep a fictitious page's wired
count forever set to one and thereby avoid the use of
PHYS_TO_VM_PAGE() when we know that we're working with a fictitious
page, just not which one.
In collaboration with: green@, tegge@
PR: kern/29915
Some of the conditions that caused vm_page_select_cache() to deactivate a
page were wrong. For example, deactivating an unmanaged or wired page is a
nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged or
wired page, it would have looped forever. Now, we assert that the page is
neither unmanaged nor wired.
caller to vm_page_grab(). Although this gives VM_ALLOC_ZERO a
different meaning for vm_page_grab() than for vm_page_alloc(), I feel
such change is necessary to accomplish other goals. Specifically, I
want to make the PG_ZERO flag immutable between the time it is
allocated by vm_page_alloc() and freed by vm_page_free() or
vm_page_free_zero() to avoid locking overheads. Once we gave up on
the ability to automatically recognize a zeroed page upon entry to
vm_page_free(), the ability to mutate the PG_ZERO flag became useless.
Instead, I would like to say that "Once a page becomes valid, its
PG_ZERO flag must be ignored."
vm_page_free() is called. The problem with holding this lock is that it is
a spin lock and vm_page_free() may attempt the acquisition of a different
default-type lock.
could result in a panic "vm_page_cache: caching a dirty page, ...":
Access to the page must be restricted or removed before calling
vm_page_cache(). This race condition is identical in nature to that
which was addressed by vm_pageout.c's revision 1.251.
- Simplify the code surrounding the fix to this same race condition
in vm_pageout.c's revision 1.251. There should be no behavioral
change.
Reviewed by: tegge
MFC after: 7 days
free pages queue. This is presently needed by contigmalloc1().
- Move a sanity check against attempted double allocation of two pages
to the same vm object offset from vm_page_alloc() to vm_page_insert().
This provides better protection because double allocation could occur
through a direct call to vm_page_insert(), such as that by
vm_page_rename().
- Modify contigmalloc1() to hold the mutex synchronizing access to the
free pages queue while it scans vm_page_array in search of free pages.
- Correct a potential leak of pages by contigmalloc1() that I introduced
in revision 1.20: We must convert all cache queue pages to free pages
before we begin removing free pages from the free queue. Otherwise,
if we have to restart the scan because we are unable to acquire the
vm object lock that is necessary to convert a cache queue page to a
free page, we leak those free pages already removed from the free queue.
vm object hasn't changed, the desired page will be at or near the root
of the vm object's splay tree, making vm_page_lookup() cheap. (The only
lock required for vm_page_lookup() is already held.) If, however, the
vm object has changed and retry was requested, eliminating the generation
check also eliminates a pointless acquisition and release of the page
queues lock.
This guard page would have trapped the problems with the MFC of the PAE
support to RELENG_4 at an earlier point in the sequence of events.
Submitted by: tegge
vm_pageout_scan(). Rationale: I don't like leaving a busy page in the
cache queue with neither the vm object nor the vm page queues lock held.
- Assert that the page is active in vm_pageout_page_stats().
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.
color in vm_page_alloc(). (This also has small performance benefits.)
- Eliminate vm_page_select_free(); vm_page_alloc() might as well
call vm_pageq_find() directly.
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.
- Remove the Giant required from vm_page_free_toq(). (Any locking
errors will be caught by vm_page_remove().)
This remedies a panic that occurred when kmem_malloc(NOWAIT) performed
without Giant failed to allocate the necessary pages.
Reported by: phk
called without Giant; and obj_alloc() in turn calls vm_page_alloc()
without Giant. This causes an assertion failure in vm_page_alloc().
Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up
some assertions.
- Weaken the assertion in vm_page_lookup() to require Giant only
if the vm_object isn't locked.
- Remove an assertion from vm_page_alloc() that duplicates a check
performed in vm_page_lookup().
In collaboration with: gallatin, jake, jeff
where physical addresses are larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
The objective being to eliminate some cases of page queues locking.
(See, for example, vm/vm_fault.c revision 1.160.)
Reviewed by: tegge
(Also, pointed out by tegge that I changed vm_fault.c before changing
vm_page.c. Oops.)
requests when the number of free pages is below the reserved threshold.
Previously, VM_ALLOC_ZERO was only honored when the number of free pages
was above the reserved threshold. Honoring it in all cases generally
makes sense, does no harm, and simplifies the code.
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
because it's no longer used. (See revision 1.215.)
- Fix a harmless bug: the number of vm_page structures allocated wasn't
properly adjusted when uma_bootstrap() was introduced. Consequently,
we were allocating 30 unused vm_page structures.
- Wrap a long line.
vm_page_alloc not to insert this page into an object. The pindex is
still used for colorization.
- Rework vm_page_select_* to accept a color instead of an object and
pindex to work with VM_PAGE_NOOBJ.
- Document other VM_ALLOC_ flags.
Reviewed by: peter, jake
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.
Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.
Submitted by: alc (this is Alan's code)
Approved by: re
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags.
o In both vm_page_wire() and vm_page_allocate() add a comment saying
that setting PG_MAPPED does not belong there.
to return a wired page.
o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because
Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical
section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only
VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.)
o Assert that the page queues mutex is held in vm_page_wire()
on Alpha, just like the other platforms.
o Assert that the page queues lock is held in vm_page_unwire().
o Make vm_page_lock_queues() and vm_page_unlock_queues() visible
to kernel loadable modules.
queue lock (revision 1.33 of vm/vm_page.c removed them).
o Make the free queue lock a spin lock because it's sometimes acquired
inside of a critical section.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c: Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
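The ip_output.c change above sizes fragments in whole pages. A minimal sketch of that arithmetic follows, assuming the payload is simply rounded down to a page multiple when the MTU leaves room for at least one page. The helper name frag_payload_len() and its parameters are hypothetical and are not taken from the committed code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the committed ip_output() code): choose the
 * data length of each fragment so that it is a whole number of pages
 * whenever the MTU leaves room for at least one page.
 * All arguments are in bytes.
 */
static size_t
frag_payload_len(size_t mtu, size_t hdr_len, size_t page_size)
{
    size_t room;

    assert(page_size > 0 && mtu > hdr_len);
    room = mtu - hdr_len;               /* space left for data */
    if (room >= page_size)
        return (room / page_size) * page_size;  /* round down to pages */
    /* Too small for a page: fall back to the usual 8-byte multiple. */
    return room & ~(size_t)7;
}
```

With a 9000-byte jumbo MTU, a 20-byte header and 4096-byte pages, this yields 8192 bytes of data per fragment, which is what lets a receiver consider flipping whole pages.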
and pmap_copy_page(). This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap. (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances. Leaving this to pmap itself gives more flexibility.)
Reviewed by: jake
Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
memory in phys_avail will fit in 'int', use vm_size_t. This fixes booting
on sparc64 machines with more than 2 gigs of ram.
Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
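To illustrate why a 32-bit 'int' is too small here, the sketch below sums a phys_avail-style array of [start, end) pairs into a 64-bit total and checks whether the result would still fit in a signed 32-bit integer. The function names and the zero-terminated layout are assumptions made for the example; it is not the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the sizes of [start, end) physical ranges stored as consecutive
 * pairs.  The list is assumed to end at the first pair whose end
 * address is 0.
 */
static uint64_t
total_avail_bytes(const uint64_t *avail)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; avail[i + 1] != 0; i += 2)
        total += avail[i + 1] - avail[i];
    return total;
}

/* Nonzero if the byte count can be represented in a signed 32-bit int. */
static int
fits_in_int32(uint64_t bytes)
{
    return bytes <= (uint64_t)INT32_MAX;
}
```

Two ranges that together cover just under 4 GB give a total that fails fits_in_int32(), which is the situation a wider type such as vm_size_t avoids.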
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.
Reviewed by: alc
and again in vm_page.c and vm_pageq.c.
o Delete unused prototypes. (Mainly a result of the earlier renaming
of various functions from vm_page_*() to vm_pageq_*().)
count that would otherwise be on one of the free queues. This eliminates a
panic when broken programs unmap memory that still has pending IO from raw
devices.
Reviewed by: dillon, alc
- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.
Reviewed by: dillon
MFC after: 1 week
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.
* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)
THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641
* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)
* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other CPUs. (applicable to NFS
client side but could affect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.
* When flushing a dirty buffer due to B_CACHE getting cleared,
we were accidentally setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)
* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentative, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)
* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)
* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).
Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
on and off since John Dyson left his work-in-progress.
It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on.
There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we won't preempt otherwise.
This is basically Matt's algorithm [with hysteresis] with an idle process
to call it in a similar way it used to be called from the idle loop.
I cleaned up the includes a fair bit here too.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.
vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.
(kernel size also reduced by 50-100K, but that isn't the prime intent)
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
supported architectures such as the alpha. This allows us to save
on kernel virtual address space, TLB entries, and (on the ia64) VHPT
entries. pmap_map() now modifies the passed in virtual address on
architectures that do not support direct-mapped segments to point to
the next available virtual address. It also returns the actual
address that the request was mapped to.
- On the IA64 don't use a special zone of PV entries needed for early
calls to pmap_kenter() during pmap_init(). This gets us in trouble
because we end up trying to use the zone allocator before it is
initialized. Instead, with the pmap_map() change, the number of needed
PV entries is small enough that we can get by with a static pool that is
used until pmap_init() is complete.
Submitted by: dfr
Debugging help: peter
Tested by: me
of memory, rather than from the start.
This fixes problems allocating bouncebuffers on alphas where there is only
1 chunk of memory (unlike PCs where there is generally at least one small
chunk and a large chunk). Having 1 chunk had been fatal, because these
structures take over 13MB on a machine with 1GB of ram. This doesn't leave
much room for other structures and bounce buffers if they're at the front.
Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch
Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then successive passes will be run without any
limit.
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, among other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
The fix works by reverting the ordering of free memory so that the
chances of contig_malloc() succeeding increases.
PR: 23291
Submitted by: Andrew Atrens <atrens@nortel.ca>
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
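The ~PAGE_MASK cast fix described above can be shown with a small sketch. It assumes a 4096-byte page and a mask defined as a plain int. The SK_ names and the three helper functions are made up for this example and are not the kernel's definitions. With an int mask the inverted value is negative and sign-extends to 64 bits, so the result happens to be right. With an unsigned 32-bit mask the upper half of the offset is silently cleared. Widening the mask before inverting it is correct regardless of its signedness.

```c
#include <stdint.h>

#define SK_PAGE_SIZE 4096               /* assumed page size */
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1) /* type int, like a bare constant */

/* Works only because ~SK_PAGE_MASK is a negative int that sign-extends. */
static uint64_t
trunc_int_mask(uint64_t off)
{
    return off & ~SK_PAGE_MASK;
}

/* Broken: the inverted 32-bit unsigned mask zero-extends, losing bits 32-63. */
static uint64_t
trunc_unsigned_mask(uint64_t off)
{
    return off & ~(unsigned int)SK_PAGE_MASK;
}

/* Robust: extend the mask to 64 bits first, then invert it. */
static uint64_t
trunc_widened_mask(uint64_t off)
{
    return off & ~(uint64_t)SK_PAGE_MASK;
}
```

For the offset 0x123456789ABC the first and third helpers return 0x123456789000, while the second returns 0x56789000.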
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTITIOUS.
A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather than swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.
Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.
Inspired by: John Dyson's 1998 patches.
Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).
Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).
Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.
Reviewed by: dillon
madvise().
This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.
MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.
Reviewed by: julian, alc, dg
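A minimal usage sketch for MAP_NOSYNC follows. It maps a scratch file with MAP_SHARED | MAP_NOSYNC, stores a string through the mapping, and reads it back with pread() to show that the mapping stays coherent with the file. The function name nosync_roundtrip() and the /tmp path are assumptions for the example. On systems that do not define MAP_NOSYNC the sketch falls back to plain MAP_SHARED, so it then demonstrates only the coherency part.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0    /* flag unavailable here: plain MAP_SHARED fallback */
#endif

/* Returns 0 on success and leaves a NUL-terminated copy of msg in out. */
static int
nosync_roundtrip(const char *msg, char *out, size_t outlen)
{
    char path[] = "/tmp/nosync-XXXXXX";
    const size_t len = 4096;
    ssize_t n;
    char *p;
    int fd, rc = -1;

    if (outlen == 0 || (fd = mkstemp(path)) < 0)
        return -1;
    unlink(path);                       /* file lives only as long as fd */
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    strncpy(p, msg, len - 1);           /* store through the mapping */
    n = pread(fd, out, outlen - 1, 0);  /* read back through the file */
    if (n >= 0) {
        out[n] = '\0';
        rc = 0;
    }
    munmap(p, len);
out:
    close(fd);
    return rc;
}
```

The dirty page is still written back eventually, for example when the system pages it out, but the update daemon does not flush it periodically.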
instead of duplicating the code. (2) If a wired page is passed
to vm_page_free_toq, panic instead of printing a friendly warning.
(If we don't panic here, we'll just panic later in vm_page_unwire
obscuring the problem.)
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.
Reviewed by: dillon
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.
Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().
add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.
Remove the initialization of PQ_NONE's cnt and lcnt. They aren't
used.
vm_page_insert:
Remove an unnecessary dereference.
vm_page_wire:
Remove the one and only (and thus pointless) reference
to PQ_NONE's lcnt.
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c
Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
address) so that the first 16MB of physical memory is allocated
last rather than first. On large-memory machines, this avoids
the exhaustion of low physical memory before isa_dmainit has run.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).
Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson
been made but the code has been reorganized and documented to make
it more readable, reduce the size of the code, and optimize the branch
path caching capabilities that most modern processors have.
PQ_FREE. There is little operational difference other than the kernel
being a few kilobytes smaller and the code being more readable.
* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)
PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guaranteed to remain
tracked throughout a page's life ( if it isn't already ).
When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.
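The head/tail discipline described above can be sketched with a small doubly linked queue. Pre-zeroed pages are appended at the tail and other pages are prepended at the head, so a request that prefers a zeroed page looks at the tail and any other request takes the head. The types and function names (struct fpage, struct freeq, freeq_insert, freeq_remove) are invented for this illustration and do not mirror the kernel's structures.

```c
#include <stdbool.h>
#include <stddef.h>

struct fpage {
    struct fpage *prev, *next;
    bool zeroed;                /* stands in for PG_ZERO */
    int id;
};

struct freeq {
    struct fpage *head, *tail;
};

/* Free a page: zeroed pages go to the tail, all others to the head. */
static void
freeq_insert(struct freeq *q, struct fpage *p)
{
    p->prev = p->next = NULL;
    if (q->head == NULL) {
        q->head = q->tail = p;
    } else if (p->zeroed) {
        p->prev = q->tail;
        q->tail->next = p;
        q->tail = p;
    } else {
        p->next = q->head;
        q->head->prev = p;
        q->head = p;
    }
}

/*
 * Allocate a page: take the tail when a zeroed page is preferred, else
 * the head.  The caller must still check p->zeroed, since the tail holds
 * a non-zeroed page when no zeroed pages remain.
 */
static struct fpage *
freeq_remove(struct freeq *q, bool prefer_zero)
{
    struct fpage *p = prefer_zero ? q->tail : q->head;

    if (p == NULL)
        return NULL;
    if (p->prev != NULL)
        p->prev->next = p->next;
    else
        q->head = p->next;
    if (p->next != NULL)
        p->next->prev = p->prev;
    else
        q->tail = p->prev;
    p->prev = p->next = NULL;
    return p;
}
```

Keeping both kinds of page on one queue is what allows the separate PQ_ZERO queue to go away: the zero-page preference reduces to choosing an end of the list.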
Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.
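One way to read the 1/3 and 1/2 thresholds is as a two-state switch: idle zeroing starts once the zeroed-page count drops below a third of the free pages and stops once it reaches half. The sketch below is written under that assumption. The struct and function names are hypothetical, and the real vm_page_zero_idle() may track its state differently.

```c
#include <stdbool.h>

struct zidle {
    bool zeroing;       /* true while idle zeroing is switched on */
};

/*
 * Decide whether the idle loop should zero another page, given the
 * current number of pre-zeroed pages and of free pages.
 */
static bool
zidle_should_zero(struct zidle *z, unsigned zero_count, unsigned free_count)
{
    unsigned lo = free_count / 3;       /* start zeroing below this */
    unsigned hi = free_count / 2;       /* stop zeroing at this level */

    if (z->zeroing) {
        if (zero_count >= hi)
            z->zeroing = false;
    } else if (zero_count < lo) {
        z->zeroing = true;
    }
    return z->zeroing;
}
```

Between the two thresholds the previous decision is kept, so an allocation pattern that hovers around a single cut-off does not toggle zeroing on and off.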
vm_page_rename(), but never pulled the page off PQ_CACHE if it was on
PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was
added in -4.x to test for this... and got hit.
In -4.x, vm_page_rename() automatically dirties the page. This commit
also has it deal with the PQ_CACHE case, deactivating the page in that
case.
pointers per entry ). The table has been changed to a singly linked
list of vm_page_t pointers. The table has been doubled in size, but
the entries only take half the space so a net-zero change in memory use.
The hash function has been changed, hopefully for the better. The
combination of the larger hash table size and changed function should
keep the chain length down to a reasonable number (0-3, average 1).
vm_object->page_hint has been removed. This 'optimization' was not
only never needed, but costs as much as a hash chain link to implement.
While having page_hint in vm_object might result in better locality
of reference, the cost is not worth the space in vm_object or the
extra instructions in my view.
vm_page_alloc*() functions have been inlined and call a generalized
non-inlined vm_page_alloc_toq() which combines the standard alloc
and zero-page alloc functions together, reducing code size and the L1
cache footprint. Some reordering has been done... not much. The
delinking code should be faster ( because unlinking a doubly-linked list
requires four memory ops and unlinking a singly linked list only requires
two ), and we get a hash consistency check for free.
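The singly linked hash chains can be sketched as follows. Each bucket is a single pointer, insertion pushes onto the head of the chain, and removal walks the chain through a pointer-to-pointer, which doubles as a consistency check because a page missing from its chain is reported. The bucket count, the hash function and all names here are placeholders chosen for the example, not the ones used in vm_page.c.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BUCKETS 64         /* power of two; size is an assumption */

struct hpage {
    struct hpage *hnext;        /* single link per entry */
    const void *object;
    uint64_t pindex;
};

static struct hpage *buckets[HASH_BUCKETS];

/* Placeholder hash mixing the object address and the page index. */
static size_t
page_hash(const void *object, uint64_t pindex)
{
    uintptr_t o = (uintptr_t)object;

    return (size_t)((o >> 4) ^ pindex ^ (pindex >> 6)) & (HASH_BUCKETS - 1);
}

static void
hash_insert(struct hpage *m)
{
    struct hpage **b = &buckets[page_hash(m->object, m->pindex)];

    m->hnext = *b;              /* push onto the head of the chain */
    *b = m;
}

static struct hpage *
hash_lookup(const void *object, uint64_t pindex)
{
    struct hpage *m;

    for (m = buckets[page_hash(object, pindex)]; m != NULL; m = m->hnext)
        if (m->object == object && m->pindex == pindex)
            return m;
    return NULL;
}

/* Returns 0 on success, -1 if the page was not found in its chain. */
static int
hash_remove(struct hpage *m)
{
    struct hpage **pp = &buckets[page_hash(m->object, m->pindex)];

    while (*pp != NULL && *pp != m)
        pp = &(*pp)->hnext;
    if (*pp == NULL)
        return -1;              /* consistency check failed */
    *pp = m->hnext;             /* one load and one store to unlink */
    m->hnext = NULL;
    return 0;
}
```

With chains that average about one entry, the walk in hash_remove() is short and the unlink itself touches only the predecessor's link.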
vm_page_rename() now automatically sets the page's dirty bits.
vm_page_alloc() does not try to manually inline freeing a cache page.
Instead, it now properly calls vm_page_free(m) ... vm_page_free() is
really too complex to manually inline.
vm_await(), supporting asleep(), has been added.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.
Reviewed by: Bruce Evans <bde@zeta.org.au>
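The idea of non-interruptible flag updates can be sketched with C11 atomics, used here as a stand-in for the machine-dependent atomic operations the kernel has. Each set or clear is a single atomic read-modify-write, so an interrupt arriving between the load and the store cannot lose an update. The struct, the flag values and the function names are illustrative assumptions.

```c
#include <stdatomic.h>

#define SK_PG_BUSY   0x0001u    /* illustrative flag values only */
#define SK_PG_WANTED 0x0004u

struct vpage {
    atomic_uint flags;
};

/* Set bits with one atomic OR instead of a separate load and store. */
static void
page_flag_set(struct vpage *m, unsigned bits)
{
    atomic_fetch_or(&m->flags, bits);
}

/* Clear bits with one atomic AND of the complement. */
static void
page_flag_clear(struct vpage *m, unsigned bits)
{
    atomic_fetch_and(&m->flags, ~bits);
}
```

Setting SK_PG_BUSY | SK_PG_WANTED and then clearing SK_PG_BUSY leaves only SK_PG_WANTED set, whatever interleaving occurs with other updates to the same word.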
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
in line with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
the page offset. If a large file offset was passed in, a large negative
array index could be generated which could cause page faults etc at worst
and file corruption at the least. (Pages are allocated within file
space on page alignment boundaries, so a file offset being passed in here
is harmless to DTRT. The case where this was happening has already been
fixed though, this is in case it happens again).
Reviewed by: dyson
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, make page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappiness."
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.
The system should be much more stable, but not all subtle bugs are fixed yet.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
the pageout daemon wasn't always being woken up appropriately when the
(cache + free) queues were depleted.
Submitted by: David S. Miller <davem@jenolan.rutgers.edu>
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
problem of allocating contiguous buffer memory in general, but
make it much more likely to work at boot-up time. The best
chance for an LKM-type load of a sound driver is immediately
after the mount of the root filesystem.
This appears to work for a 64K allocation on an 8MB system.
`show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff
to the ends of the vm source files.
Changed printf() to db_printf(), `indent' to db_indent, and iprintf()
to db_iprintf() in ddb commands. Moved db_indent and db_iprintf()
from vm to ddb.
vm_page.c:
Don't use __pure. Staticized.
db_output.c:
Reduced page width from 80 to 79 to inhibit double spacing for long
lines (there are still some problems if words are printed across
column 79).
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.
performance issues.
1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
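
The saving described in item 6 can be sketched as below. This is a simplified user-space illustration, not the pmap code: the `pv_sketch` structure, the `tc_bit_sketch` name, and the bit values are assumptions made for the example. The point is only that testing and clearing in one pass visits each mapping once instead of twice.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values, not the real PG_M/PG_A encodings. */
#define PTE_MODIFIED   0x1u
#define PTE_REFERENCED 0x2u

/* Simplified stand-in for a pv entry: one mapping of a physical page. */
struct pv_sketch {
	unsigned int      pte;   /* stand-in for the mapping's PTE bits */
	struct pv_sketch *next;
};

/*
 * Test and clear `bit` across every mapping in a single traversal.
 * Returns 1 if the bit was set in at least one mapping, else 0.
 */
int
tc_bit_sketch(struct pv_sketch *head, unsigned int bit)
{
	int was_set = 0;

	assert(bit != 0);
	for (struct pv_sketch *pv = head; pv != NULL; pv = pv->next) {
		if (pv->pte & bit) {
			was_set = 1;
			pv->pte &= ~bit;
		}
	}
	return was_set;
}
```

A caller that previously wrote a test followed by a clear would call the combined routine once and act on its return value.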
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much safer (eliminates the possibility of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.
Added some sysctls that help with VM system tuning.
New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.
vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.
Please give me feedback on the performance results
associated with these.
2) Enable/disable swapping.
vm.swapping_enabled = 1, default.
vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.
The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".
Each of these can be changed "on the fly."
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup
2) Create a new entry point into pmap: pmap_ts_referenced, eliminates
the need to scan the pv lists twice in many cases. Perhaps there
is a lot more to do here to work on minimizing pv list manipulation.
3) Minor improvements to vm_pageout including the use of pmap_ts_ref.
4) Major changes and code improvement to pmap. This code has had
several serious bugs in page table page manipulation. In order
to simplify the problem, and hopefully solve it once and for all,
page table pages are no longer "managed" with the pv list stuff.
Page table pages are only (mapped and held/wired) or
(free and unused) now. Page table pages are never inactive,
active or cached. These changes have probably fixed the
hold count problems, but if they haven't, then the code is
simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I
have taken a first (of probably many) steps. Please tell me
if you have any ideas.
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.
queue in vm_fault.
Move the PG_BUSY in vm_fault to the correct place.
Remove redundant/unnecessary code in pmap.c.
Properly block on rundown of page table pages, if they are busy.
I think that the VM system is in pretty good shape now, and the following
individuals (among others, in no particular order) have helped with this
recent bunch of bugs, thanks! If I left anyone out, I apologize!
Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard,
Marc Fournier.
some problems with the page-table page management code, since it can't
deal with the notion of page-table pages being paged out or in transit.
Also, clean up some stylistic issues per some suggestions from
Stephen McKay.
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existing condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.
way to avoid crossing a 64K DMA boundary was to specify an alignment
greater than the size even when the alignment didn't matter, and for
sizes larger than a page, this reduced the chance of finding enough
contiguous pages. E.g., allocations of 8K not crossing a 64K boundary
previously had to be allocated on 8K boundaries; now they can be
allocated on any 4K boundary except (64 * n + 60)K.
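
The arithmetic in the example above can be checked with a small user-space sketch. The helper names and the `KB` constant are illustrative assumptions, not part of the kernel interface; the helper only tests whether a range straddles a multiple of the boundary.

```c
#include <assert.h>

#define KB 1024UL

/*
 * Hypothetical helper (not a kernel function): returns 1 if the range
 * [start, start + size) crosses a multiple of `boundary`, else 0.
 * `boundary` must be a power of two and `size` nonzero.
 */
int
crosses_boundary(unsigned long start, unsigned long size,
    unsigned long boundary)
{
	assert(size != 0);
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	/* First and last byte must fall in the same boundary-sized window. */
	return ((start ^ (start + size - 1)) & ~(boundary - 1)) != 0;
}

/*
 * Counts the 4K-aligned start addresses within one 64K window at which
 * an 8K allocation does not cross the 64K boundary.
 */
int
count_valid_4k_starts(void)
{
	int n = 0;

	for (unsigned long start = 0; start < 64 * KB; start += 4 * KB)
		if (!crosses_boundary(start, 8 * KB, 64 * KB))
			n++;
	return n;
}
```

Of the sixteen 4K-aligned starts in a 64K window, only the one at offset 60K is rejected, whereas forcing 8K alignment would have left just eight candidates.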
Fixed bugs in vm_page_alloc_contig():
- the last page wasn't allocated for sizes smaller than a page.
- failures of kmem_alloc_pageable() weren't handled.
Mutated vm_page_alloc_contig() to create a more convenient interface
named contigmalloc(). This is the same as the one in 1.1.5 except
it has `low' and `high' args, and the `alignment' and `boundary'
args are multipliers instead of masks.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
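
One way to see why indices help is the sketch below. It is an assumption-laden illustration, not the vm_page.c code: the numeric values of the PQ_ constants, the `_sketch` structures, and the singly linked lists are all made up for the example. With a single index stored in the page, a routine can go straight to the right queue by array lookup instead of testing four flag bits in turn.

```c
#include <assert.h>
#include <stddef.h>

/* Queue indices named after the commit text; the values are assumptions. */
enum { PQ_NONE, PQ_FREE, PQ_INACTIVE, PQ_ACTIVE, PQ_CACHE, PQ_COUNT };

struct page_sketch {
	int                 queue;  /* one index instead of four flag bits */
	struct page_sketch *next;
};

struct pq_sketch {
	struct page_sketch *head;
	int                 count;
};

struct pq_sketch page_queues[PQ_COUNT];

/* Put a page that is on no queue at the head of the given queue. */
void
page_enqueue_sketch(struct page_sketch *m, int queue)
{
	assert(queue > PQ_NONE && queue < PQ_COUNT);
	assert(m->queue == PQ_NONE);
	m->next = page_queues[queue].head;
	page_queues[queue].head = m;
	page_queues[queue].count++;
	m->queue = queue;
}

/* Remove a page from whichever queue currently holds it. */
void
page_unqueue_sketch(struct page_sketch *m)
{
	struct pq_sketch *pq;
	struct page_sketch **pp;

	if (m->queue == PQ_NONE)
		return;
	pq = &page_queues[m->queue];  /* direct index, no chain of flag tests */
	for (pp = &pq->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == m) {
			*pp = m->next;
			pq->count--;
			break;
		}
	}
	m->next = NULL;
	m->queue = PQ_NONE;
}
```

The same index can also select per-queue counters, which is why several routines collapse into table lookups once the flags are gone.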
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.
Staticized some functions.
__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #13 and #14
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)
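
The "type indexes the pager" idea from item 2 can be sketched as a table of operation vectors. This is a minimal illustration under stated assumptions: all names ending in `_sk` are hypothetical, only two object types are modelled, and `haspage` here returns a plain yes/no rather than the clustering information the real interface provides.

```c
#include <assert.h>
#include <stddef.h>

/* Object types in the style of the commit; names and values are assumptions. */
enum objtype_sk { OBJT_DEFAULT_SK, OBJT_VNODE_SK, OBJT_NTYPES_SK };

struct object_sk {
	enum objtype_sk type;  /* selects the pager; no separate pager struct */
	unsigned long   size;  /* object size in pages */
};

struct pagerops_sk {
	int (*haspage)(const struct object_sk *obj, unsigned long pindex);
};

static int
default_haspage_sk(const struct object_sk *obj, unsigned long pindex)
{
	(void)obj;
	(void)pindex;
	return 0;  /* anonymous memory with no backing store yet */
}

static int
vnode_haspage_sk(const struct object_sk *obj, unsigned long pindex)
{
	return pindex < obj->size;
}

static const struct pagerops_sk defaultops_sk = { default_haspage_sk };
static const struct pagerops_sk vnodeops_sk = { vnode_haspage_sk };

/* The object type is the index into the table of pager operations. */
static const struct pagerops_sk *const pagertab_sk[OBJT_NTYPES_SK] = {
	[OBJT_DEFAULT_SK] = &defaultops_sk,
	[OBJT_VNODE_SK] = &vnodeops_sk,
};

int
pager_has_page_sk(const struct object_sk *obj, unsigned long pindex)
{
	assert(obj->type < OBJT_NTYPES_SK);
	return pagertab_sk[obj->type]->haspage(obj, pindex);
}
```

Callers pass the object itself, so there is no second structure to allocate or keep in sync with the object.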
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).
(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().
(various FS sync)
Sync out modified vnode_pager backed pages.
ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.
vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.
New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.
Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding
if the caller is "important" in vm_page_alloc(). Also established a new
low threshold for non-interrupt allocations via cnt.v_interrupt_free_min.
vm_pageout.c:
Various algorithmic cleanup. Some calculations simplified. Initialize
cnt.v_interrupt_free_min to 2 pages.
Submitted by: John Dyson
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.
vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).
vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.
kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.
vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.
vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.
vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).
vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.
Submitted by: John Dyson
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)
Submitted by: John Dyson
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
...(this commit): moved initialization of 'start' to make it more clear
that it is initialized properly (also in vm_page_alloc_contig).