filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependancies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running. It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race. This patch solves the problem.
Approved-by: mckusick
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.
Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.
With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.
Reviewed by: mckusick
idea either) in ufs_extattr_rm.
o More completely fill out the local_aio structure when writing out the
zero'd extended attribute in ufs_extattr_rm -- previoulsy, this worked
fine, but probably should not have. This corrects extraneous warnings
about inconsistent inodes following file deletion.
Reviewed by: jedgar
ufs_extattr_rm.
o Make both reporting locations report the function name where the
inconsistency is discovered, as well as the inode number in question.
Reviewed by: jedgar
attribute read--the offset is required to be 0 by an earlier check,
meaning that it will always be within the scope of the attribute data.
This change should have no impact on executed code paths other than
removing the unnecessary check: please report if any new failures
start to occur as a result.
Obtained from: TrustedBSD Project
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
1) Be more tolerant of missing snapshot files by only trying to decrement
their reference count if they are registered as active.
2) Fix for snapshots of filesystems with block sizes larger than 8K
(from Ollivier Robert <roberto@eurocontrol.fr>).
3) Fix to avoid losing last block in snapshot file when calculating blocks
that need to be copied (from Don Coleman <coleman@coleman.org>).
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
by ensuring that newly allocated blocks are zerod. The
race can occur even in the case where the write covers
the entire block.
Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.
Submitted by: Paul Saab <ps@yahoo-inc.com>
Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.
Reviewed by: Paul Saab <ps@yahoo-inc.com>
Obtained from: Ethan Solomita <ethan@geocast.com>
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.
ufs_vnops.c:
1) i_ino was confused with i_number, so the inode number passed to
VFS_VGET() was usually wrong (usually 0U).
2) ip was dereferenced after vgone() freed it, so the inode number
passed to VFS_VGET() was sometimes not even wrong.
Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have
space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the
inode number, so inode number 0U gives a way out of bounds array
index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba()
doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?)
from the disk for the invalid inode number 0U; ufs_mknod() returns
a wrong vnode, but most callers just vput() it; the correct vnode is
eventually obtained by an implicit VFS_VGET() just like it used to be.
Bug (2) usually doesn't happen.
<sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
the offending inline function (BUF_KERNPROC) on it being #included
already.
I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).
Remove consequently unneeded #includes of <sys/proc.h>
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.
Reviewed by: eivind
Obtained from: TrustedBSD Project
description:
How it works:
--
Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.
File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:
fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);
printf("new file is %d\n", (int)st.st_ino);
Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:
fd = open("5", O_RDWR);
The creation permissions depend entirely if you have write access to the
root directory of the filesystem.
To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.
--
What this entails:
* patching conf/files and conf/options to include IFS as a new compile
option (and since ifs depends upon FFS, include the FFS routines)
* An entry in i386/conf/NOTES indicating IFS exists and where to go for
an explanation
* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
routines require (ffs_mount() and ffs_reload())
* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
routines. IFS replaces some of the vfsops, and a handful of vnops -
most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
Any other directory operation is marked as invalid.
What this results in:
* an IFS partition's create permissions are controlled by the perm/ownership of
the root mount point, just like a normal directory
* Each inode has perm and ownership too
* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
completely seperate filesystem here
* Softupdates doesn't work with IFS, and really I don't think it needs it.
Besides, fsck's are FAST. (Try it :-)
* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
this particular inode, and unravelling THAT code isn't trivial. Therefore,
useful inodes start at 3.
Enjoy, and feedback is definitely appreciated!
it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic
due to a NULL pointer dereference.
Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>
up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended
attributes will eventually occur, as it was in my commit tree also.
No implementation change here, only a comment.
call, which should be the last thing down to a per-mount extattr
management structure, after ufs_extattr_stop() on the file system.
This currently has the effect only of destroying the per-mount lock
on extended attributes, and clearing appropriate flags.
o Remove inappropriate invocation in ufs_extattr_vnode_inactive().
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.
From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.
If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.
Reviewed in general by: mckusick
on directories.
o Allow privileged processes in jail() to create inodes with the
setgid bit set even if they are not a member of the group denoted
by the file creation gid. This occurs due to inherited gid's from
parent directories on file creation, allowing a user to create a
file with a gid that is not in the creating process's credentials.
Obtained from: TrustedBSD Project
and UFS file flags. Here's what the comment says, for reference:
Privileged processes in jail() are permitted to modify
arbitrary user flags on files, but are not permitted
to modify system flags.
In other words, privilege does allow a process in jail to modify user
flags for objects that the process does not own, but privilege will
not permit the setting of system flags on the file.
Obtained from: TrustedBSD Project
remove the setuid/setgid bits by virtue of a change to a file with those
bits set, even if the process doesn't own the file, or isn't a group
member of the file's gid.
Obtained from: TrustedBSD Project
safe as suser() no longer sets ASU.
o Note that in some cases, the PRISON_ROOT flag is used even though no
process structure is passed, to indicate that if a process structure
(and hence jail) was available, it would be ok. In the long run,
the jail identifier should probably be moved to ucred, as the uidinfo
information was.
o Some uid 0 checks remain relating to the quota code, which I'll leave
for another day.
Reviewed by: phk, eivind
Obtained from: TrustedBSD Project
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.
VFS takes advantage of this flag to perform symantically correct usage
of vrele() instead of vput() if parent directory already unlocked.
If filesystem fails to track this flag then previous codepath in VFS left
unchanged.
Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will
be changed after some period of testing.
Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD
- In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP
if the caller tries to configure an attribute name that is
already configured
- Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to
indicate lock status of passed vnode. Apparently not a
problem, but worth fixing.
- For all writes, make use of IO_SYNC consistent. Really,
IO_UNIT and combining of VOP_WRITE's should happen, but I
don't have that tested. At least with this, it's
consistent usage. (pointed out by: bde)
- In ufs_extattr_get(), fixed nested locking of backing
vnode (fine due to recursive lock support, but make it
more consistent with other code)
- In ufs_extattr_get(), clean up return code to set uio_resid
more consistently with other pieces of code (worked fine,
this is just a cleanup)
- Fix ufs_extattr_rm(), which was broken--effectively a nop.
- Minor comment and whitespace fixes.
Obtained from: TrustedBSD Project