2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.
1) Files weren't properly synced on filesystems other than UFS. In some
cases, this led to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmapped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.
Reviewed by: David Greenman
Submitted by: John Dyson
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().
Submitted by: Poul-Henning Kamp
when the single user shell was terminated. These changes disallow mounting
or R/W upgrading filesystems that are dirty unless the "-f" (force) option
is used with mount. /etc/rc has been modified to abort the startup if
one or more non-nfs partitions fail to mount.
Reviewed by: Poul-Henning Kamp, Rod Grimes
I ran into another manifestation of the problem reported in PR 211 and
fixed it. Try this:
as non-root:
cd /tmp; mkdir x y x/z
as root:
chown root /tmp/x/z
as non-root:
cd /tmp/x; mv z ../y # EACCES as expected
as root:
cd /tmp/x; mv z ../y # EINVAL NOT as expected
This is because ufs_rename() sets IN_RENAME and fails to clear it.
Reviewed by: davidg
Submitted by: bde
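
The failure mode described above for ufs_rename() can be sketched with a
toy model. This is only an illustration, not the kernel code: the struct,
the flag value and both function names are hypothetical. It assumes that a
rename refuses to start (EINVAL) while the in-progress flag is set, and that
the buggy version leaves the flag set when it bails out on a permission error.

```c
#include <errno.h>

#define TOY_IN_RENAME 0x0010    /* hypothetical flag value */

struct toy_inode {
    int i_flag;     /* in-core flags */
    int may_move;   /* nonzero if the caller is allowed to move it */
};

/* Buggy variant: the error return leaks the in-progress flag. */
int
toy_rename_buggy(struct toy_inode *ip)
{
    if (ip->i_flag & TOY_IN_RENAME)
        return EINVAL;              /* looks like a rename in progress */
    ip->i_flag |= TOY_IN_RENAME;
    if (!ip->may_move)
        return EACCES;              /* flag is never cleared */
    ip->i_flag &= ~TOY_IN_RENAME;
    return 0;
}

/* Fixed variant: every exit path clears the flag it set. */
int
toy_rename_fixed(struct toy_inode *ip)
{
    int error = 0;

    if (ip->i_flag & TOY_IN_RENAME)
        return EINVAL;
    ip->i_flag |= TOY_IN_RENAME;
    if (!ip->may_move)
        error = EACCES;
    ip->i_flag &= ~TOY_IN_RENAME;   /* cleared on success and on failure */
    return error;
}
```

With the buggy variant, a denied attempt followed by a permitted one yields
EACCES and then EINVAL, matching the sequence in the report; the fixed
variant yields EACCES and then success.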
(2GB). If this limit is not imposed, then filesystem corruption will
ensue when files larger than 2GB are created. This is temporary,
and the underlying limitation will be removed later.
don't lock the vnode - it doesn't appear to ever be necessary for VCHR
vnode/inodes. This fixes a bug introduced in the previous commit that
caused tty timestamps to act strange (causing 'w' and 'finger' to show
the tty wasn't idle when it may have been for hours).
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).
(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().
(various FS sync)
Sync out modified vnode_pager backed pages.
ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.
vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.
Submitted by: John Dyson
during the FS sync. The system would appear to hang momentarily
if there was a large backlog of I/O. This is because the vnode
remains locked during the output - preventing normal character
I/O. The problem was exacerbated by the FFS contiguous block
allocation fixes and a semi-broken disksort(). The inode/date
will still be synced during a normal FS dismount and whenever
the inode is changed for other reasons.
allow Q_SYNC regardless of "target" uid, we allow it with -1;
fix bug that caused all ops to refer to user quotas, not group.
Submitted by: Mike Karels
if all free blocks are in the same bucket (i.e. NRPOS == 1).
Else a free block is chosen, possibly from a different cylinder,
even if the block succeeding bpref was free ...
Submitted by: se
mapping from numbers to names is messy for backwards compatibility.
E.g., for driver "sd", unit "0":
slice 0: omit the slice number for compatibility; names are sd0[a-h].
slice 1: omit the partition letter 'c' because the whole disk device
shouldn't have anything to do with partitions; sd0 is the
only name.
slices 2-31: subtract 1 from slice number to compensate for the
compatibility slice 0; names are sd0s[1-30][a-h].
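
The numbering rules above can be written out as a small function. This is
a minimal sketch, not the driver code: slice_name() and its parameters are
hypothetical, partitions are passed as 0..7 for 'a'..'h', and the partition
argument is simply ignored for slice 1 because the whole-disk device has
only one name.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format the device name for (driver, unit, slice,
 * partition) following the compatibility rules listed above.
 * Returns 0 on success, -1 on out-of-range input or a short buffer.
 */
int
slice_name(char *buf, size_t len, const char *drv, int unit, int slice,
    int part)
{
    int n;

    if (slice < 0 || slice > 31 || part < 0 || part > 7)
        return -1;
    if (slice == 0)         /* compatibility slice: no slice number */
        n = snprintf(buf, len, "%s%d%c", drv, unit, 'a' + part);
    else if (slice == 1)    /* whole disk: no slice, no partition letter */
        n = snprintf(buf, len, "%s%d", drv, unit);
    else                    /* real slices: external number is slice - 1 */
        n = snprintf(buf, len, "%s%ds%d%c", drv, unit, slice - 1,
            'a' + part);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For driver "sd", unit 0, this gives sd0a for slice 0 partition 'a', sd0 for
slice 1, and sd0s1h for slice 2 partition 'h'.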
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c:
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c:
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c, vm_map.c:
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c:
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h:
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c:
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c:
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c:
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c:
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
ordering that can prove fatal during large batches of deletes, but this
is much better than it was. I probably won't be putting much more time
into this until Seltzer releases her new version of LFS which has
fragment support. This should be available just before USENIX.
timestamps for an atomic operation such as rename() on a local file
system to be identical.
Uniformize yet another idempotency ifdef. The comment nesting was
bogus.
Allow chown() to return success if the gid isn't changed even if
the gid is not the caller's. Such gids are normal for files created
in world-writable directories such as /tmp. This "fixes" annoying
error messages for mv'ing files created in /tmp to another file
system. mv still preserves the foreign gid of /tmp, but now does
it silently.
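
The relaxed gid rule can be sketched as follows. This is a simplified model,
not the ufs code: toy_chown_gid_check() and its parameters are hypothetical,
it looks only at the group half of chown(), and it assumes that a real gid
change is permitted only for root or for a caller who belongs to the new
group.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical check for the gid argument of chown().
 * Returns 0 if the request is acceptable, EPERM otherwise.
 */
int
toy_chown_gid_check(unsigned file_gid, unsigned new_gid, bool is_root,
    bool caller_in_new_gid)
{
    if (new_gid == file_gid)
        return 0;       /* gid unchanged: succeed even for a foreign gid */
    if (is_root || caller_in_new_gid)
        return 0;       /* an actual change needs privilege or membership */
    return EPERM;
}
```

Under this model, a file that inherited a foreign gid from /tmp can be
re-chowned to the same gid without an error, while an attempt to assign a
different foreign gid still fails.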
buffering scheme and make it more in tune with FreeBSD's vfs_bio
implementation. The filesystem seems fairly stable, but I wouldn't recommend
it to anyone not willing to experience problems. This is very green code and
has the limitation that YOU CAN ONLY HAVE ONE LFS PARTITION MOUNTED AT A TIME.
What LFS is good for:
Non-fsynced writes    FASTER THAN FFS
Large deletions       Incredibly fast
Reads are a little bit slower than FFS right now, but that is a function of
how under-optimized this code is. LFS should in theory perform at least as
well as FFS under fsync (iozone) type loads, and this is what I'm currently
working on.
Reviewed by: Justin Gibbs
Submitted by: John Dyson