1. When a directory is renamed to an existing (empty) directory,
it is possible for the target vnode to become the source vnode
underneath you (because another process may complete the same
rename). It was assumed that this can't happen, and the bogus
errno EINVAL was returned. This was fairly harmless.
Fix: return ENOENT instead, as if the source directory was renamed
a little earlier.
2. The same metamorphosis is possible for non-directories. It was
assumed that this can't happen, and the code for handling "just
removing a link name" happened to be used. This would have worked
except for fatal bugs in the link name removal - the link name was
assumed to still be there, and a null pointer was followed.
Fix: check the result of relookup(). This fixes PR 1930.
Notes:
(a) POSIX seems to say that removing link names shall have no effect.
BSD (4.4Lite2 at least) does something reasonable instead.
(b) The relookup() may find a file unrelated to the original.
Removing this isn't correct. Consider 3 existing files A, B and
C, and concurrent renames: AB = rename(A, B), another AB, and
CA = rename("c", "a"). If rename() is atomic, then only the
following results are possible:
AB, AB (fails), CA: A = original C, B = original A, C = gone
AB, CA, AB: A = gone, B = original C, C = gone
CA, AB, AB (fails): A = gone, B = original C, C = gone
but ufs_rename() can give:
A,AB,CA,B (sorta): A = gone, B = original A, C = gone
This usually doesn't matter, since getting into a race is usually
an error.
---
These fixes should be in 2.1.6 and 2.2.
ufs_read() and ufs_write().
Found by: looking at warnings for comparing the result of lblktosize()
(which is usually daddr_t = long) with file sizes (which are u_quad_t
for ufs). File sizes should probably be off_t's to avoid warnings
when the are compared with file offsets, so the fixed lblktosize()
casts to off_t instead of u_quad_t.
Added definition of smalllblksize(). It is the same as the old
lblksize() and is more efficient for small block numbers on 32-bit
machines.
Use smalllblktosize() instead of its expansion in blksize() and
dblksize(). This keeps the line length short and makes it more
obvious that the shift can't overflow.
It is needed for implementation details but very little of it is
needed for the interface. Include it in the few places that didn't
already include it.
Include <sys/ioccom.h> in <sys/disklabel.h> (as already in
<sys/diskslice.h>) so that all the disk-related headers are almost
self-sufficient.
/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};
The correct names of the fields are tv_sec and tv_nsec.
Reminded by: James Drobina <jdrobina@infinet.com>
is that it doesn't say _what_ did it! (the core dumped console message
is very useful for listing the process name and pid). This adds similar
information.
the file access time update on reads and can be useful in reducing
filesystem overhead in cases where the access time is not important (like
Usenet news spools).
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.
This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.
Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.
Reviewed by: davidg
process won't possibly block before filling in the fsnode pointer (v_data)
which might be dereferenced during a sync since the vnode is put on the
mnt_vnodelist by getnewvnode.
Pointed out by Matt Day <mday@artisoft.com>
and B_READ before writing. This was was fatal. They also broke the
clearing of B_INVAL before doing i/o. This didn't actually matter.
Submitted by: mostly by joerg
be called with the directory referenced, and this reference will
be dropped iff relookup() fails, so the value returned must not be
ignored.
Reviewed by: davidg
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.
This fixes PR943.
ffs/ffs_vfsops.c:
ffs_statfs() multiplied by (100 - minfree) as part of calculating the
minfree percentage (complemented in 100%), so with the standard minfree
of 8, it was broken for file systems of size >= 1TB/92 = 11GB. Use the
standard freespace() macro instead. This also fixes a rounding bug (the
"Avail" count was sometimes 1 too small).
ffs/* (not fixed):
The freespace() macro multiplies by minfree, so with the standard
minfree of 8, it is broken for file systems of size >= 1TB/8 = 128GB.
This bug is more serious since it affects block allocation.
ffs/ffs_alloc.c (not fixed):
Ordinary users are sometimes allowed to allocate 1 (partial) block
too many so that the "Avail" count goes negative. E.g., if there is
1 fragment available and the file is fairly large, one more full
block is allocated.
df/df.c:
ufs_df() used/uses essentially the same code as ffs_statfs(), so it
had/has the same bugs.
ufs_df() gratuitously replaced "Avail" counts of < 0 by 0, so it
gave different results for non-mounted file systems in this case.
is possible to boot a kernel with an empty in-core MFS image, and have
it load the image from floppy directly. This is admittedly a hack and
would be better replaced by a self-loading ram-disk.
What was happening, was that the main mfs loop was sleeping, and when it was
being awoken by a wakeup when it was supposed to process some IO requests.
The problem was that if it was being woken out of the tsleep() by a signal
at shutdown, it was going straight into dounmount() without servicing any
pending IO requests, causing dounmount() to fail because there were busy
buffers (and they could not be "processed" because the processing loop was
trying to unmount rather than dispatching into mfs_doio()).
This (dare I say it :-) appears to be a layering problem....
1) Make cluster buffer list be a non-malloced chain. This eliminates
yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs
clustering had an efficiency problem that more bawrites were happening
than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent
at the request of David Greenman.
Note that the reallocblocks code is disabled pending rewrite for
the cluster buffer list changes.
structs and prototypes for syscalls.
Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.
The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.
earlier discussions with DG, and a recent email exchange with SEF, I
decided to allow UFS to run wide-open on an experimental basis. We
will probably support eventually multiple async modes, and this is
the fastest the we can expect. Just use the -o async flag on the
UFS mount. Good luck...
dangerous than the original MNT_ASYNC. There might be some minor
security considerations due to data writes not being posted as promptly
as before. Meta-data operations are still not quite as fast as Linux,
but streaming I/O is still higher.
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.
Obtained from: partially from 4.4BSD-lite2
fix a change where a shortcut resulted in teh wrong answer..
e.g.
touch a
touch b
mv a b
resulted in b being removed and a being moved to b
in the shortcut..
touch a
ln a b
mv a b
the wrong link was removed..
leaving a instead of b, giving a different result to when
both files were separate.
bp->b_flags has been broken for many years:
a) they didn't set B_BUSY for doing i/o. This has been fatal since
1995/07/25 when biodone() started checking that B_BUSY is set.
b) they didn't set B_INVAL for releasing the buffer. This at best
just put a useless buffer in the LRU queue for a little while.
Fix a couple of spelling errors and complete a couple of function
pointer declarations.
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..
NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..
certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)
The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.
disksort is called at non-interrupt time and can be actively traversing
the list when that happens, there is a very small window of vulnerability.
Close it by protecting disksort with splbio().
wrong vp's ops vector being used by changing the VOP_LINK's argument order.
The special-case hack doesn't go far enough and breaks the generic
bypass routine used in some non-leaf filesystems. Pointed out by Kirk
McKusick.
proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of
changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.
1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.
Reviewed by: David Greenman
Submitted by: John Dyson
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().
Submitted by: Poul-Henning Kamp
when the single user shell was terminated. These changes disallow mounting
or R/W upgrading filesystems that are dirty unless "-f" (force) option
is used with mount. /etc/rc has been modified to abort the startup if
one or more non-nfs partitions fail to mount.
Reviewed by: Poul-Henning Kamp, Rod Grimes
I ran into another manifestation of the problem reported in PR 211 and
fixed it. Try this:
as non-root:
cd /tmp; mkdir x y x/z
as root:
chown root /tmp/x/z
as non-root:
cd /tmp/x; mv z ../y # EACCES as expected
as root:
cd /tmp/x; mv z ../y # EINVAL NOT as expected
This is because ufs_rename() sets IN_RENAME and fails to clear it.
Reviewed by: davidg
Submitted by: bde
(2GB). If this limit is not imposed, then filesystem corruption will
ensue when files larger than 2GB are created. This is temporary,
and the underlying limitation will be removed later.
don't lock the vnode - it doesn't appear to ever be necessary for VCHR
vnode/inodes. This fixes a bug introduced in the previous commit that
caused tty timestamps to act strange (causing 'w' and 'finger' to show
the tty wasn't idle when it may have been for hours).
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).
(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().
(various FS sync)
Sync out modified vnode_pager backed pages.
ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.
vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.
Submitted by: John Dyson
during the FS sync. The system would appear to hang momentarily
if there was a large backlog of I/O. This is because the vnode
remains locked during the output - preventing normal character
I/O. The problem was exacerbated by the FFS contiguous block
allocation fixes and a semi-broken disksort(). The inode/date
will still be synced during a normal FS dismount and whenever
the inode is changed for other reasons.
allow Q_SYNC regardless of "target" uid, we allow it with -1;
fix bug that caused all ops to refer to user quotas, not group.
Submitted by: Mike Karels
if all free blocks are in the same bucket (i.e. NRPOS == 1).
Else a free block is choosen, possibly from a different cylinder,
even if the block succeeding bpref was free ...
Submitted by: se
mapping from numbers to names is messy for backwards compatibility.
E.g., for driver "sd", unit "0":
slice 0: omit the slice number for compatibility; names are sd0[a-h].
slice 1: omit the partition letter 'c' because the whole disk device
shouldn't have anything to do with partitions; sd0 is the
only name.
slices 2-31: subtract 1 from slice number to compensate for the
compatibility slice 0; names are sd0s[1-30][a-h].
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
ordering that can prove fatal during large batches of deletes, but this
is much better than it was. I probably won't be putting much more time
into this until Seltzer releases her new version of LFS which has
fragment support. This should be availible just before USENIX.
timestamps for an atomic operation such as rename() on a local file
system to be identical.
Uniformize yet another idempotency ifdef. The comment nesting was
bogus.
Allow chown() to return success if the gid isn't changed even if
the gid is not the caller's. Such gids are normal for files created
in world-writable directories sucj as /tmp. This "fixes" annoying
error messages for mv'ing files created in /tmp to another file
system. mv still preserves the foreign gid of /tmp, but now does
it silently.
buffering scheme and make it more in tune with FreeBSD's vfs_bio
implementation. The filesystem seems fairly stable, but I wouldn't recommend
it to anyone not willing to experience problems. This is very green code and
has the limitation that YOU CAN ONLY HAVE ONE LFS PARTITION MOUNTED AT A TIME.
What LFS is good for:
Non fsynced writes FASTER THAN FFS
Large deletions Increadibly fast
Reads are a little bit slower than FFS right now, but that is a factor of
how under optimized this code is. LFS should in theory perform at least as
well as FFS under fsync (iozone) type loads, and this is what I'm currently
working on.
Reviewed by: Justin Gibbs
Submitted by: John Dyson
Obtained from:
This is part of a bug fix from Kirk McKusick to work around problems in FFS
related to the blkno of a 64bit offset not fitting into an int. Note the
proper solution would be to deal with 64bit block numbers, but doing this
would require sweeping changes; some other day perhaps.
Submitted by: Marshall Kirk McKusick
For it to be useful, you must stick your disklabel on the partition which
starts where the MBR says FreeBSD lives. If you don't do that, you might
get a bad day.
Oh, that probably also means that putting swap there is a bad idea...
- Make a number of filesystems work again when they are statically compiled
(blush)
- FIFOs are no longer optional; ``options FIFO'' removed from distributed
config files.
machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.
cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.
vfs_conf.c:
Removed unused rootfs global.
vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.
ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.
ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().
vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.
use of timeout_t -> timeout_func_t in aha1542 and aha1742 drivers.
2) fix a bug in the portalfs that was uncovered by better prototyping -
specifically, the time must be converted from timeval to timespec
before storing in va_atime.
3) fixed/added some miscellaneous prototypes
- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.
NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.
use it in NFS. This is required both for diskless support and for POSIX
compliance. Note: the support in NFS is only for the local node.
Submitted by: based on work originally done by Yuval Yurom