The i386 pmap module uses a special area of kernel virtual memory for mapping
of page tables pages when it needs to modify another process's virtual
address space. It's called the 'alternate page table map'. There is only one
of them and it's expected that only one process will be using it at once and
that the operation is atomic.
When the merged VM/buffer cache was implemented over a year ago, it became
necessary to rundown VM pages at I/O completion. The unfortunate and
unforeseen side effect of this is that pmap functions are now called at bio
interrupt time. If there happend to be a process using the alternate page
table map when this I/O completion occurred, it was possible for a different
process's address space to be switched into the alternate page table map -
leaving the current pmap process with the wrong address space mapped when
the interrupt completed. This resulted in BAD things happening like pages
being mapped or removed from the wrong address space, etc.. Since a very
common case of a process modifying another process's address space is during
fork when the kernel stack is inserted, one of the most common manifestations
of this bug was the kernel stack not being mapped properly, resulting in a
silent hang or reboot. This made it VERY difficult to troubleshoot this bug
(I've been trying to figure out the cause of this for >6 months). Fortunately,
the set of conditions that must be true before this problem occurs is
sufficiently rare enough that most people never saw the bug occur. As I/O
rates increase, however, so does the frequency of the crashes. This problem
used to kill wcarchive about every 10 days, but in more recent times when
the traffic exceeded >100GB/day, the machine could barely manage 6 hours of
uptime.
The fix is to make certain that no process has the pages mapped that are
involved in the I/O, before the I/O is started. The pages are made busy, so
no process will be able to map them, either, until the I/O has finished.
This side-steps the issue by still allowing the pmap functions to be called
at interrupt time, but also assuring that the alternate page table map won't
be switched.
Unfortunately, this appears to not be the only cause of this problem. :-(
Reviewed by: dyson
All new code is "#ifdef PC98"ed so this should make no difference to
PC/AT (and its clones) users.
Ok'd by: core
Submitted by: FreeBSD(98) development team
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:
More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.
residing in a buffer that had been dirtied by a process was being
handled incorrectly. The pages were mistakenly placed into the
cache queue. This would likely have the effect of mmaped page modifications
being lost when I/O system calls were being used simultaneously to
the same locations in a file.
Submitted by: davidg
queue type is not set to QUEUE_NONE. This appears to have
caused a hang bug that has been lurking.
2) Fix bugs that brelse'ing locked buffers do not "free" them, but the
code assumes so. This can cause hangs when LFS is used.
3) Use malloced memory for directories when applicable. The amount
of malloced memory is seriously limited, but should decrease the
amount of memory used by an average directory to 1/4 - 1/2 previous.
This capability is fully tunable. (Note that there is no config
parameter, and might never be.)
4) Bias slightly the buffer cache usage towards non-VMIO buffers. Since
the data in VMIO buffers is not lost when the buffer is reclaimed, this
will help performance. This is adjustable also.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.
metadata and VBLK type devices. The code is currently mostly disabled,
and a work-around has been added to disabled attempted clustered writes
for VBLK type device buffers. Clustered write of meta-data is currently
a work in progress.
"getblk" hang. The B_WANTED flag was being cleared gratuitously,
also the optimization of gbincore for ignoring the B_INVAL flag was
incorrect. There is no place in the code where buffers are on the
hash list that are B_INVAL and not B_BUSY.
Move a lot of variables home to their own code (In good time before xmas :-)
Introduce the string descrition of format.
Add a couple more functions to poke into these marvels, while I try to
decide what the correct interface should look like.
Next is adding vars on the fly, and sysctl looking at them too.
Removed a tine bit of defunct and #ifdefed notused code in swapgeneric.
1) Make cluster buffer list be a non-malloced chain. This eliminates
yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs
clustering had an efficiency problem that more bawrites were happening
than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent
at the request of David Greenman.
Reviewed by: davidg (partially)
valid bytes, we must also clear the B_DONE flag. Some filesystems
depend on this (incl NFS) and is probably the cause of the biodone
error and subsequent crash. Anyway this change needs to be made.
Prototypes are located in <sys/sysproto.h>.
Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.
Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.
In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.
1) "obj" was't initialized properly, resulting in an important vm_page_lookup
always failing (resulting in a panic).
2) busy pages could be put on the cache queue or freed (resulting in a panic).
support for EXT2FS. Note that the Sig-11 problems appear to be caused by
this, but there is still probably an underlying VM problem that let this
clustering bug cause vnode objects to appear to be corrupted.
The direct manifestation of this bug would have been severely mis-read
files. It is possible that processes would Sig-11 on very damaged
input files and might explain the mysterious differences in system
behaviour when phk's malloc is being used.
Better performance -- more aggressive read-ahead
under certain circumstanses.
Mods to support clustering on small
( < PAGE_SIZE) block size filesystems (e.g. ext2fs,
msdosfs.)
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..
NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..
certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)
The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.
different types of panics/inconsistencies with NFS clients.
Cleared PG_WANTED where appropriate.
Added checks for buffer busy in allocbuf and biodone.
Reviewed by: John Dyson
VBLK vnodes isn't adequate since all NFS nodes aren't locked, either. The
result is a race condition that would lead to duplicate buffers at the
same block offset.
Submitted by: John Dyson
made a change to NFS that caused buffers at EOF to be variable size. This
had the undesired side-effect of breaking delayed writes on NFS. This
fixes it.
Submitted by: John Dyson
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.
1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.
Reviewed by: David Greenman
Submitted by: John Dyson