It can be used to delay mounting root partition to give a chance to GEOM
providers to show up.
Now, when there is no needed provider, vfs_rootmount() function will look
for it every second and if it can't be find in defined time, it'll ask
for root device name (before this change it was done immediately).
This will allow to boot from gmirror device in degraded mode.
too much kernel copying, but it is not the right way to do it, and it is
in the way for straightening out the buffer cache.
The right way is to pass the VM page array down through the struct
bio to the disk device driver and DMA directly in to/out off the
physical memory. Once the VM/buf thing is sorted out it is next on
the list.
Retire most of vnode method. ffs_getpages(). It is not clear if what is
left shouldn't be in the default implementation which we now fall back to.
Retire specfs_getpages() as well, as it has no users now.
and the previously malloc'ed snapshot lock.
Malloc struct snapdata instead of just the lock.
Replace snapshot fields in cdev with pointer to snapdata (saves 16 bytes).
While here, give the private readblock() function a vnode argument
in preparation for moving UFS to access GEOM directly.
preparation for integration of p4::phk_bufwork. In the future,
local filesystems will talk to GEOM directly and they will consequently
be able to issue BIO_DELETE directly. Since the removal of the fla
driver, BIO_DELETE has effectively been a no-op anyway.
field.
Replace three instances of longhaired initialization va_filerev fields.
Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.
be manipulated by prison root. In 4.x prison root can not manipulate
system flags, regardless of the security level. This behavior
should remain consistent to avoid any surprises which could lead
to security problems for system administrators which give out
privileged access to jails.
This commit changes suser_cred's flag argument from SUSER_ALLOWJAIL
to 0. This will prevent prison root from being able to manipulate
system flags on files.
This may be a MFC candidate for RELENG_5.
Discussed with: cperciva
Reviewed by: rwatson
Approved by: bmilekic (mentor)
PR: kern/70298
has only been partly initialized via newfs(8) so that it applies to both
UFS1 and UFS2.
Submitted by: "Xin LI" delphij at frontfree dot net
MFC: maybe?
address of the dirhash, rather than the first sizeof(struct dirhash
*) bytes of the structure (which, thankfully, seem to be constant).
Submitted by: Ted Unangst <tedu@zeitbombe.org>
MFC after: 2 weeks
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
and refuse initializing filesystems with a wrong version. This will
aid maintenance activites on the 5-stable branch.
s/vfs_mount/vfs_omount/
s/vfs_nmount/vfs_mount/
Name our filesystems mount function consistently.
Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.
Reorganize the root filesystem selection code.
Add local rootvp variables as needed.
Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.
Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:
MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.
Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
because UFS uses fixed-size directory blocks. When using this code with
other file systems, such as HFS+, the value of auio.uio_resid will need
to be taken into account.
local to a function. Remove a couple of blank lines in variable
declarations.
In one case, explicitly test against NULL rather than using a pointer
as a boolean directly.
lots of errors. Blind substitution of "dev_t foo" by "struct cdev *foo"
in comments usually just created an English syntax error (e.g.,
"struct cdev *changes"), but here it did less than that since the dev_t
is a user dev_t.
fixes was applicable to HEAD, originally it was thought this
should only be done in RELENG_4. Implement IO_INVAL in the vnode
op for writing by marking the buffer as "no cache". This fix
has already been applied to RELENG_4 as Rev. 1.65.2.15 of
ufs/ufs/ufs_readwrite.c.
Reviewed by: alc, tegge
fragment to zero the valid parts of a VM_IO buffer.
RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately.
Reviewed by: alc, tegge
things which compare /etc/fstab entries to results from
getfsstat(). The real way to fix this is to make 'ufs2'
a recognized filesystem (for real, no beating around the
bush).
This should fix things like 'umount -a -t ufs' now.
Appologies for the previous breakage.
libufs, which only works for Charlie root.
This change reverts the introduction of libufs and moves the
check into the kernel. Since the f_fstypename is the same
for both ufs and ufs2, we check fs_magic for presence of
ufs2 and copy "ufs2" explicitly instead.
Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.
Approved by: core, rwatson
WARNS=6. I don't change the WARNS level in the Makefile because I
didn't tested this on other archs.
The fs.h fix was suggested by: marcel
Reviewed by: md5(1)
group block locked. If filesystem has any active snapshots, bawrite
can come back trying to allocate new snapshot data block from the same
cylinder group and cause panic due to recursive lock attempt.
PR: 64206
Reviewed by: mckusick
Tested by: pjd
were a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.
Apply the same principle to the ffs/snapshot/softupdates code which have
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.
It's not pretty, but at least it's one less layer violated.
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.
Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
In ufs_lock, check for attempts to acquire shared locks on
snapshot files and change them to be exclusive locks. This
change eliminates deadlocks and machine lockups reported in
-current since most read requests started using shared lock
requests.
Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.
Suggested by: imp
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.
operators) in and near revs.1.169-1.170 (open mode bandaid). This
(or better a proper fix) should have been done before cloning the
bandaid to many other file systems.
- rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads
that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed
timestamp bugs in it. Removal of most of vfs_ioopt made it just and
optimization, and removal of the vm object reference calls made it less
than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as
part of cloning ffs_extwrite() although it was always less than an
optimization in ffs_extwrite().
- some comments, compound statements and vertical whitespace were vestiges
of dead code.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
and ffs_write(). These calls trace their origins to the dead vfs_ioopt
code, first appearing in revision 1.39 of ufs_readwrite.c.
Observed by: bde
Discussed with: tegge
Replace wrong check returned EFBIG with EOVERFLOW handling from POSIX:
36708 [EOVERFLOW] The file is a regular file, nbyte is greater than 0, the
starting position is before the end-of-file, and the starting position is
greater than or equal to the offset maximum established in the open file
description associated with fildes.
ffs_write:
Replace u_int64_t cast with uoff_t cast which is more natural for types
used.
ffs_write & ffs_read:
Remove uio_offset and uio_resid checks for negative values, the caller
supposed to do it already. Add comments about it.
Reviewed by: bde
Move diagnostic printf after vget. This might delay the debug
output some, but at least it keeps kernel from exploding if
DEBUG_VFS_LOCKS is in effect.
system super block after fsck has repaired the file system. The value of
fs_ronly was getting overwritten, which caused ffs_update() to attempt to
update inode timestamps even though the file system was still mounted
read-only.
This fixes the "giving up on N buffers" error that is triggered by running
fsck on the root file system and then rebooting without mounting the file
system read-write.
of newfs, to signify the newfs operation has not yet completed. Re-
write the superblock with the correct magic number once all of the
cylinder groups have been created to show the operation has finished.
Sponsored by: St. Bernard Software
accurate reporting of multi-terabyte filesystem sizes.
You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.
Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.
Discussed with: jeff
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.
The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.
This fixes the panic on shutdown people have reported.
those cylinder groups that have at least 75% of the average free
space per cylinder group for that file system are considered as
candidates for the creation of a new directory. The previous formula
for minbfree would set it to zero if the file system was more than
75% full, which allowed cylinder groups with no free space at all
to be chosen as candidates for directory creation, which resulted
in an expensive search for free blocks for each file that was
subsequently created in that directory.
Modify the calculation of minifree in the same way.
Decrease maxcontigdirs as the file system fills to decrease the
likelyhood that a cluster of directories will overflow the available
space in a cylinder group.
Reviewed by: mckusick
Tested by: kmarx@vicor.com
MFC after: 2 weeks
so make the code slightly more uniform. The vnode lock is acquired in
all cases and now the only difference between VCHR and other is we
call UFS_UPDATE instead of VOP_FSYNC().
- Slightly rewrite the fsync loop to be more lock friendly. We must
acquire the vnode interlock before dropping the mnt lock. We must
also check XLOCK to prevent vclean() races.
- Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean()
races.
- Use a local variable to store the results of the nvp == TAILQ_NEXT
test so that we do not access the vp after we've vrele()d it.
- Add an XXX comment about UFS_UPDATE() not being protected by any lock
here. I suspect that it should need the VOP lock.
we release the mntvnode_mtx.
- Call vgonel() directly instead of going through vrecycle() since we own
the interlock now.
- Remove a few cases where we locked the interlock just so that we could
call VOP_UNLOCK with interlock held.
mntvnode_mtx.
- Use a local variable to store the results of the test to see if the
next vnode on the mount list has changed. This is so that we no longer
acess the vnode after we vput() it.
in a list head instead of a pointer to the first element at the time of
the first call. These lists are subject to change, and getdirtybuf()
would refetch from the wrong list in some cases.
Spottedy by: tegge
Pointy hat to: me
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.
caller to acquire it. This permits drain_output() to be done atomically
with other operations as well as reducing the number of lock operations.
- Assert that the proper locks are held in drain_output().
- Change getdirtybuf() to accept a mutex as an argument. This mutex is used
to protect the vnode's buf list and the BKGRDWAIT flag. This lock is
dropped when we successfully acquire a buffer and held on return
otherwise. These semantics reduce the number of cumbersome cases in
calling code.
- Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this
mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF
case of interlocked_sleep().
- Change the return value of getdirtybuf() to be the resulting locked buffer
or NULL otherwise. This is for callers who pass in a list head that
requires a lock. It is necessary since the lock that protects the list
head must be dropped in getdirtybuf() so that we don't have a lock order
reversal with the buf queues lock in bremfree().
- Adjust all callers of getdirtybuf() to match the new semantics.
- Add a comment in indir_trunc() that points at unlocked access to a buf.
This may also be one of the last instances of incore() in the tree.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock the a back-ground
write will not be started without our knowledge, one may only be
completed while we're not looking. Rather than remove the code, Document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.