very slow process, especially for large file systems that is just
recovered from a crash.
Since the summary is already re-sync'ed every 30 second, we will
not lag behind too much after a crash. With this consideration
in mind, it is more reasonable to transfer the responsibility to
background fsck, to reduce the delay after a crash.
Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to
control this behavior. When set to nonzero, we will get the
"old" behavior, that the summary is computed immediately at mount
time.
Add five new sysctl variables to adjust ndir, nbfree, nifree,
nffree and numclusters respectively. Teach fsck_ffs about these
API, however, intentionally not to check the existence, since
kernels without these sysctls must have recomputed the summary
and hence no adjustments are necessary.
This change has eliminated the usual tens of minutes of delay of
mounting large dirty volumes.
Reviewed by: mckusick
MFC After: 1 week
it is now quite naturally protected by the ufsmount mutex.
- Use the ufs lock to protect various fields in struct fs, primarily the
cg summary needs protection to avoid allocation races. Several
functions have been slightly re-arranged to reduce the number of
lock operations.
- Adjust several functions (blkfree, freefile, etc.) to accept a
ufsmount as an argument so that we may access the ufs lock.
Sponsored By: Isilon Systems, Inc.
prematurely report that they were full and/or to panic the kernel
with the message ``ffs_clusteralloc: allocated out of group''.
Submitted by: Henry Whincup <henry@jot.to>
MFC after: 1 week
This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.
Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1
Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.
Details:
Add a geom consumer and a bufobj pointer to ufsmount.
Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.
In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.
When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().
If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.
Set bo_private to the GEOM consumer on the device bufobj.
Change the ffs_ops->strategy function to call g_vfs_strategy()
In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.
In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.
Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.
preparation for integration of p4::phk_bufwork. In the future,
local filesystems will talk to GEOM directly and they will consequently
be able to issue BIO_DELETE directly. Since the removal of the fla
driver, BIO_DELETE has effectively been a no-op anyway.
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
fragment to zero the valid parts of a VM_IO buffer.
RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately.
Reviewed by: alc, tegge
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.
Approved by: core, rwatson
group block locked. If filesystem has any active snapshots, bawrite
can come back trying to allocate new snapshot data block from the same
cylinder group and cause panic due to recursive lock attempt.
PR: 64206
Reviewed by: mckusick
Tested by: pjd
those cylinder groups that have at least 75% of the average free
space per cylinder group for that file system are considered as
candidates for the creation of a new directory. The previous formula
for minbfree would set it to zero if the file system was more than
75% full, which allowed cylinder groups with no free space at all
to be chosen as candidates for directory creation, which resulted
in an expensive search for free blocks for each file that was
subsequently created in that directory.
Modify the calculation of minifree in the same way.
Decrease maxcontigdirs as the file system fills to decrease the
likelyhood that a cluster of directories will overflow the available
space in a cylinder group.
Reviewed by: mckusick
Tested by: kmarx@vicor.com
MFC after: 2 weeks
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.
Reviwed by: arch
Not objected to by: mckusick
is being taken from panicing with either "freeing free block" or
"freeing free inode". The problem arises when the snapshot code
is scanning the filesystem looking for inodes with a reference
count of zero (e.g., unlinked but still open) so that it can
expunge them from its view. If it encounters a reclaimed vnode
and has to restart its scan, then it will panic if it encounters
and tries to free an inode that it has already processed. The fix
is to check each candidate inode to see if it has already been
processed before trying to delete it from the snapshot image.
Sponsored by: DARPA & NAI Labs.
values for the initial inode generation numbers in newfs and for
newly allocated inode generation numbers in the kernel.
Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org>
Sponsored by: DARPA & NAI Labs.
pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
this was causing filedesc work to be very painful.
In order to make this work split out sigio definitions to thier own header
(sigio.h) which is included from proc.h for the time being.
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).
The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.
Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:
if (ioflags & IO_SYNC)
flags |= BA_SYNC;
For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).
I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.
Sponsored by: DARPA & NAI Labs.
out of inodes in a cylinder group would fail to check for
free inodes in other cylinder groups. This bug was introduced
in the UFS2 code merge two days ago.
An inode is allocated by calling ffs_valloc which calls
ffs_hashalloc to do the filesystem scan. Ffs_hashalloc
walks around the cylinder groups calling its passed allocator
(ffs_nodealloccg in this case) until the allocator returns a
non-zero result. The bug is that ffs_hashalloc expects the
passed allocator function to return a 64-bit ufs2_daddr_t.
When allocating inodes, it calls ffs_nodealloccg which was
returning a 32-bit ino_t. The ffs_hashalloc code checked
a 64-bit return value and usually found random non-zero bits in
the high 32-bits so decided that the allocation had succeeded
(in this case in the only cylinder group that it checked).
When the result was passed back to ffs_valloc it looked at
only the bottom 32-bits, saw zero and declared the system
out of inodes. But ffs_hashalloc had really only checked
one cylinder group.
The fix is to change ffs_nodealloccg to return 64-bit results.
Sponsored by: DARPA & NAI Labs.
Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Reviewed by: Maxime Henrion <mux@freebsd.org>
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.
Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.
Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.
inode'' panic. This change corrects that problem by setting the
fs_active flag when the inode map changes to notify the snapshot
code that the cylinder group must be rescanned.
Submitted by: Robert Watson <rwatson@FreeBSD.org>
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and
adapting it for KSE.
Locks:
1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.
1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp);
/* increments reference count on a file */
struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */
struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */
struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.