This commit was generated by cvs2svn to compensate for changes in r36201,
which included commits to RCS files with non-trunk default branches.
Julian Elischer 1998-05-19 19:47:22 +00:00
commit 8d1c524575
6 changed files with 9244 additions and 0 deletions

@ -0,0 +1,251 @@
Introduction
This package constitutes the alpha distribution of the soft update
code updates for the fast filesystem.
Status
My `filesystem torture tests' (described below) run for days without
a hitch (no panics, hangs, filesystem corruption, or memory leaks).
However, I have had several panics reported to me by folks that
are field testing the code which I have not yet been able to
reproduce or fix. Although these panics are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks that
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. The stub resolves all the calls
into the soft update code and simply ignores the request to enable
them. Thus you will be able to ensure that your other hooks have
not broken anything and that your kernel is softdep-ready for those
that wish to use them. Please report problems back to me with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.
Performance
Running the Andrew Benchmarks yields the following raw data:

  Phase  Normal  Softdep  What it does
    1      3s      <1s    Creating directories
    2      8s       4s    Copying files
    3      6s       6s    Recursive directory stats
    4      8s       9s    Scanning each file
    5     25s      25s    Compilation

  Normal:  19.9u 29.2s 0:52.8 135+630io
  Softdep: 20.3u 28.5s 0:47.8 103+363io

Another interesting data point is my `filesystem torture tests'.
They consist of 1000 runs of the Andrew benchmarks, 1000 copy and
removes of /etc with randomly selected pauses of 0-60 seconds
between each copy and remove, and 500 find from / with randomly
selected pauses of 100 seconds between each run. The run of the
torture test compares as follows:
With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min
The upshot is roughly 43% less I/O and 27% shorter running time.
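Recomputing from the raw counts gives figures close to those quoted: about
43% fewer writes and about 27% less run time. The following is a minimal
sketch of that arithmetic. The function and variable names are made up for
illustration; only the numbers are taken from the torture-test results above.

```c
#include <stdio.h>

/* Percentage by which `after` is smaller than `before`. */
static double percent_reduction(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Raw figures copied from the torture-test results. */
static const double softdep_writes = 6.0 + 1113686.0;		/* sync + async */
static const double normal_writes = 1459147.0 + 487031.0;	/* sync + async */
static const double softdep_minutes = 19 * 60 + 50;		/* 19hr, 50min */
static const double normal_minutes = 27 * 60 + 15;		/* 27hr, 15min */

/* Print both reductions (hypothetical helper, not part of the package). */
static void print_torture_summary(void)
{
	printf("I/O reduction:      %.1f%%\n",
	    percent_reduction(normal_writes, softdep_writes));
	printf("run time reduction: %.1f%%\n",
	    percent_reduction(normal_minutes, softdep_minutes));
}
```

Calling print_torture_summary() from a small main() is enough to check the
two percentages against the summary line.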
Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:
With soft updates:
    labrat# time ./MAKEDEV std
    2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w
    labrat# ls | wc
    522 522 3317
Without soft updates:
    labrat# time ./MAKEDEV std
    2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w
    labrat# ls | wc
    522 522 3317
Of course, some of the system time is being pushed
to the syncer process, but that is a different story.
To show a benchmark designed to highlight the soft update code,
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level. Running a test with
a directory tree containing 28 directories holding 202 empty files
produces the following numbers:
With soft updates:
    tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
    rm:  0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)
Normal filesystem:
    tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
    rm:  0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)
The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly each directory block is written
once rather than once for each new directory entry. Effectively
what the update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first followed by the
directory block that references them. If there were data in the
files it would further ensure that the data blocks were written
before their inodes claimed them.
Copyright Restrictions
Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:
Redistributions in any form must be accompanied by information
on how to obtain complete source code for any accompanying
software that uses the this software. This source code must
either be included in the distribution or be available for
no more than the cost of distribution plus a nominal fee,
and must be freely redistributable under reasonable
conditions. For an executable file, complete source code
means the source code for all modules it contains. It does
not mean source code for modules or files that typically
accompany the operating system on which the executable file
runs, e.g., standard library modules or system header files.
The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.
Soft Dependency Operation
The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore it is
not used by default for any filesystems. It must be enabled on
a filesystem by filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates due to performance or reliability reasons,
you can simply run tunefs on it again to turn off the bit and
revert to normal operation. The additional dynamic memory load
placed on the kernel malloc arena is approximately equal to
the amount of memory used by vnodes plus inodes (for a system
with 1000 vnodes, the additional peak memory load is about 300K).
Kernel Changes
There are two new changes to the kernel functionality that are not
contained in the soft update files. The first is a `trickle
sync' facility running in the kernel as process 3. This trickle
sync process replaces the traditional `update' program (which should
be commented out of the /etc/rc startup script). When a vnode is
first written it is placed 30 seconds down on the trickle sync
queue. If it still exists and has dirty data when it reaches the
top of the queue, it is sync'ed. This approach evens out the load
on the underlying I/O system and avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller as there are
typically far fewer files than buffers. Although this can make the
I/O spiky when a big file times out, it is still much better than
the wholesale syncs that were happening before. It also adapts
much better to the soft update code where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk) reducing the amount of rollback that
is needed.
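The aging queue described above can be pictured as a wheel of one-second
slots. The sketch below is only an illustration of that idea and is not the
kernel's code: the structure names, the slot count, and the one-tick-per-second
assumption are all hypothetical. An item is queued some number of seconds
ahead (10, 15, or 30 in the text), and when its slot comes due it is written
only if it is still dirty.

```c
#include <stddef.h>

#define SYNCER_MAXDELAY 32	/* wheel slots, one per second (assumed) */

struct sync_item {		/* hypothetical stand-in for a vnode */
	struct sync_item *next;
	int dirty;		/* nonzero while it still has dirty data */
	int synced;		/* set when the sketch "writes" it */
};

struct sync_wheel {
	struct sync_item *slot[SYNCER_MAXDELAY];
	int now;		/* index of the slot due this second */
};

/* Queue an item `delay` seconds ahead (0 <= delay < SYNCER_MAXDELAY). */
static void wheel_add(struct sync_wheel *w, struct sync_item *it, int delay)
{
	int s = (w->now + delay) % SYNCER_MAXDELAY;

	it->next = w->slot[s];
	w->slot[s] = it;
}

/* Advance one second; returns how many items in the due slot were written. */
static int wheel_tick(struct sync_wheel *w)
{
	struct sync_item *it = w->slot[w->now];
	int written = 0;

	w->slot[w->now] = NULL;
	for (; it != NULL; it = it->next) {
		if (it->dirty) {	/* short-lived files are skipped */
			it->dirty = 0;
			it->synced = 1;
			written++;
		}
	}
	w->now = (w->now + 1) % SYNCER_MAXDELAY;
	return written;
}
```

Because each tick touches only the slot that is due, the write load is spread
over time, and an item whose dirty data disappears before its slot comes up
costs no I/O at all.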
The other main kernel change is to split the vnode freelist into
two separate lists: one for vnodes that are still being used to
identify buffers and the other for those vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.
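A minimal sketch of the two-list policy follows. It is an illustration under
stated assumptions, not the getnewvnode implementation: the types and function
names are hypothetical, and the lists are simple LIFO stacks where the kernel
would use its own queue macros and ordering.

```c
#include <stddef.h>

struct vn {			/* hypothetical stand-in for a vnode */
	struct vn *next;
	int nbufs;		/* buffers still identified by this vnode */
};

struct vn_freelists {
	struct vn *nobufs;	/* free vnodes identifying no buffers */
	struct vn *withbufs;	/* free vnodes still identifying buffers */
};

/* Put a free vnode on the list that matches its buffer count. */
static void vn_release(struct vn_freelists *fl, struct vn *vp)
{
	struct vn **head = (vp->nbufs == 0) ? &fl->nobufs : &fl->withbufs;

	vp->next = *head;
	*head = vp;
}

/* Prefer a vnode with no buffers; fall back to the other list. */
static struct vn *vn_get(struct vn_freelists *fl)
{
	struct vn **head = (fl->nobufs != NULL) ? &fl->nobufs : &fl->withbufs;
	struct vn *vp = *head;

	if (vp != NULL)
		*head = vp->next;
	return vp;
}
```

Taking from the buffer-free list first means a vnode that still names cached
buffers is recycled only when nothing cheaper is available.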
Packaging of Kernel Changes
The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files. Although there has been some
structural reorganization of the filesystem code to accommodate
gathering the information required by the soft update code,
the actual ordering of filesystem operations when soft updates
are disabled is unchanged.
The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, ufs/ufs/ufs_lookup.c,
ufs/ufs/ufs_vnops.c, and ufs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged version. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in helping get the code merged into your system.
Packaging of Utility Changes
The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:
    tunefs - add a flag to enable and disable soft updates
    mount  - print out whether soft updates are enabled and
             also statistics on number of sync and async writes
    fsck   - tighter checks on acceptable errors and a slightly
             different policy for what to put in lost+found on
             filesystems using soft updates
In addition you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck; however, it
would aid in my debugging if you do. Also, because of the time
lag between deleting a directory entry and the inode it
references, you will find a lot more files showing up in your
lost+found if you do not use the new version. Note that the
new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.
Operation
Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:
1) Comment out the update program in /etc/rc.
2) Run `tunefs -n enable' on one or more test filesystems.
3) Mount these filesystems and then type `mount' to ensure that
   they have been enabled for soft updates.
4) Copy the test directory to a softdep filesystem, chdir into
   it and run `./doit'. You may want to check out each of the
   three subtests individually first: doit1 - andrew benchmarks,
   doit2 - copy and removal of /etc, doit3 - find from /.


@ -0,0 +1,520 @@
/*
* Copyright 1997 Marshall Kirk McKusick. All Rights Reserved.
*
* The soft dependency code is derived from work done by Greg Ganger
* at the University of Michigan.
*
* The following are the copyrights and redistribution conditions that
* apply to this copy of the soft dependency software. For a license
* to use, redistribute or sell the soft dependency software under
* conditions other than those described here, please contact the
* author at one of the following addresses:
*
* Marshall Kirk McKusick mckusick@mckusick.com
* 1614 Oxford Street +1-510-843-9542
* Berkeley, CA 94709-1608
* USA
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. None of the names of McKusick, Ganger, or the University of Michigan
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
* 4. Redistributions in any form must be accompanied by information on
* how to obtain complete source code for any accompanying software
* that uses the this software. This source code must either be included
* in the distribution or be available for no more than the cost of
* distribution plus a nominal fee, and must be freely redistributable
* under reasonable conditions. For an executable file, complete
* source code means the source code for all modules it contains.
* It does not mean source code for modules or files that typically
* accompany the operating system on which the executable file runs,
* e.g., standard library modules or system header files.
*
* THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE FOR
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* @(#)softdep.h 9.1 (McKusick) 7/9/97
*/
#include <sys/queue.h>
/*
* Allocation dependencies are handled with undo/redo on the in-memory
* copy of the data. A particular data dependency is eliminated when
* it is ALLCOMPLETE: that is ATTACHED, DEPCOMPLETE, and COMPLETE.
*
* ATTACHED means that the data is not currently being written to
* disk. UNDONE means that the data has been rolled back to a safe
* state for writing to the disk. When the I/O completes, the data is
* restored to its current form and the state reverts to ATTACHED.
* The data must be locked throughout the rollback, I/O, and roll
* forward so that the rolled back information is never visible to
* user processes. The COMPLETE flag indicates that the item has been
* written. For example, a dependency that requires that an inode be
* written will be marked COMPLETE after the inode has been written
* to disk. The DEPCOMPLETE flag indicates that any other
* dependencies, such as the writing of a cylinder group map, have
* been completed. A dependency structure may be freed only when both it
* and its dependencies have completed and any rollbacks that are in
* progress have finished as indicated by the set of ALLCOMPLETE flags
* all being set. The two MKDIR flags indicate additional dependencies
* that must be done when creating a new directory. MKDIR_BODY is
* cleared when the directory data block containing the "." and ".."
* entries has been written. MKDIR_PARENT is cleared when the parent
* inode with the increased link count for ".." has been written. When
* both MKDIR flags have been cleared, the DEPCOMPLETE flag is set to
* indicate that the directory dependencies have been completed. The
* writing of the directory inode itself sets the COMPLETE flag which
* then allows the directory entry for the new directory to be written
* to disk. The RMDIR flag marks a dirrem structure as representing
* the removal of a directory rather than a file. When the removal
* dependencies are completed, additional work needs to be done
* (truncation of the "." and ".." entries, an additional decrement
* of the associated inode, and a decrement of the parent inode). The
* DIRCHG flag marks a diradd structure as representing the changing
* of an existing entry rather than the addition of a new one. When
* the update is complete the dirrem associated with the inode for
* the old name must be added to the worklist to do the necessary
* reference count decrement. The GOINGAWAY flag indicates that the
* data structure is frozen from further change until its dependencies
* have been completed and its resources freed after which it will be
* discarded. The IOSTARTED flag prevents multiple calls to the I/O
* start routine from doing multiple rollbacks. The ONWORKLIST flag
* shows whether the structure is currently linked onto a worklist.
*/
#define ATTACHED	0x0001
#define UNDONE		0x0002
#define COMPLETE	0x0004
#define DEPCOMPLETE	0x0008
#define MKDIR_PARENT	0x0010
#define MKDIR_BODY	0x0020
#define RMDIR		0x0040
#define DIRCHG		0x0080
#define GOINGAWAY	0x0100
#define IOSTARTED	0x0200
#define ONWORKLIST	0x8000
#define ALLCOMPLETE	(ATTACHED | COMPLETE | DEPCOMPLETE)
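/*
* Illustration only (not part of softdep.h): a minimal sketch of how the
* state bits described above might be tested and toggled. The flag values
* are copied from the definitions; the three helper functions are
* hypothetical names, and the rollback transition (clear ATTACHED, set
* UNDONE) is one reading of the comment, not a quote of ffs_softdep.c.
*/

```c
#define ATTACHED	0x0001
#define UNDONE		0x0002
#define COMPLETE	0x0004
#define DEPCOMPLETE	0x0008
#define ALLCOMPLETE	(ATTACHED | COMPLETE | DEPCOMPLETE)

/* Nonzero once every bit in ALLCOMPLETE is set in `state`. */
static int dep_is_done(unsigned short state)
{
	return (state & ALLCOMPLETE) == ALLCOMPLETE;
}

/* Roll back before a write: ATTACHED is replaced by UNDONE. */
static unsigned short dep_rollback(unsigned short state)
{
	return (unsigned short)((state & ~ATTACHED) | UNDONE);
}

/* Roll forward after the I/O completes: the state reverts to ATTACHED. */
static unsigned short dep_rollforward(unsigned short state)
{
	return (unsigned short)((state & ~UNDONE) | ATTACHED);
}
```

/*
* Testing against the whole mask, rather than any single bit, is what keeps
* a structure from being freed while a rollback is still in progress.
*/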
/*
* The workitem queue.
*
* It is sometimes useful and/or necessary to clean up certain dependencies
* in the background rather than during execution of an application process
* or interrupt service routine. To realize this, we append dependency
* structures corresponding to such tasks to a "workitem" queue. In a soft
* updates implementation, most pending workitems should not wait for more
* than a couple of seconds, so the filesystem syncer process awakens once
* per second to process the items on the queue.
*/
/* LIST_HEAD(workhead, worklist); -- declared in buf.h */
/*
* Each request can be linked onto a work queue through its worklist structure.
* To avoid the need for a pointer to the structure itself, this structure
* MUST be declared FIRST in each type in which it appears! If more than one
* worklist is needed in the structure, then a wk_data field must be added
* and the macros below changed to use it.
*/
struct worklist {
	LIST_ENTRY(worklist) wk_list;	/* list of work requests */
	unsigned short wk_type;	/* type of request */
	unsigned short wk_state;	/* state flags */
};
#define WK_DATA(wk) ((void *)(wk))
#define WK_PAGEDEP(wk) ((struct pagedep *)(wk))
#define WK_INODEDEP(wk) ((struct inodedep *)(wk))
#define WK_NEWBLK(wk) ((struct newblk *)(wk))
#define WK_BMSAFEMAP(wk) ((struct bmsafemap *)(wk))
#define WK_ALLOCDIRECT(wk) ((struct allocdirect *)(wk))
#define WK_INDIRDEP(wk) ((struct indirdep *)(wk))
#define WK_ALLOCINDIR(wk) ((struct allocindir *)(wk))
#define WK_FREEFRAG(wk) ((struct freefrag *)(wk))
#define WK_FREEBLKS(wk) ((struct freeblks *)(wk))
#define WK_FREEFILE(wk) ((struct freefile *)(wk))
#define WK_DIRADD(wk) ((struct diradd *)(wk))
#define WK_MKDIR(wk) ((struct mkdir *)(wk))
#define WK_DIRREM(wk) ((struct dirrem *)(wk))
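/*
* Illustration only (not part of softdep.h): why the worklist must come
* first. C guarantees that a pointer to a structure and a pointer to its
* first member convert to each other, so the WK_* macros can be plain
* casts. The _demo types below are simplified, hypothetical stand-ins with
* the list linkage left out.
*/

```c
#include <stddef.h>

struct worklist_demo {		/* simplified worklist: no list linkage */
	unsigned short wk_type;
	unsigned short wk_state;
};

struct pagedep_demo {
	struct worklist_demo pd_list;	/* MUST be first */
	long pd_lbn;
};

#define WK_PAGEDEP_DEMO(wk)	((struct pagedep_demo *)(wk))

/* Recover the enclosing structure from a pointer to its worklist. */
static struct pagedep_demo *demo_from_worklist(struct worklist_demo *wk)
{
	return WK_PAGEDEP_DEMO(wk);
}
```

/*
* If pd_list were moved below another field, the cast would silently yield
* a pointer into the middle of the structure.
*/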
/*
* Various types of lists
*/
LIST_HEAD(dirremhd, dirrem);
LIST_HEAD(diraddhd, diradd);
LIST_HEAD(newblkhd, newblk);
LIST_HEAD(inodedephd, inodedep);
LIST_HEAD(allocindirhd, allocindir);
LIST_HEAD(allocdirecthd, allocdirect);
TAILQ_HEAD(allocdirectlst, allocdirect);
/*
* The "pagedep" structure tracks the various dependencies related to
* a particular directory page. If a directory page has any dependencies,
* it will have a pagedep linked to its associated buffer. The
* pd_dirremhd list holds the list of dirrem requests which decrement
* inode reference counts. These requests are processed after the
* directory page with the corresponding zero'ed entries has been
* written. The pd_diraddhd list maintains the list of diradd requests
* which cannot be committed until their corresponding inode has been
* written to disk. Because a directory may have many new entries
* being created, several lists are maintained hashed on bits of the
* offset of the entry into the directory page to keep the lists from
* getting too long. Once a new directory entry has been cleared to
* be written, it is moved to the pd_pendinghd list. After the new
* entry has been written to disk it is removed from the pd_pendinghd
* list, any removed operations are done, and the dependency structure
* is freed.
*/
#define DAHASHSZ 6
#define DIRADDHASH(offset) (((offset) >> 2) % DAHASHSZ)
struct pagedep {
	struct worklist pd_list;	/* page buffer */
#	define pd_state pd_list.wk_state /* check for multiple I/O starts */
	LIST_ENTRY(pagedep) pd_hash;	/* hashed lookup */
	struct mount *pd_mnt;	/* associated mount point */
	ino_t pd_ino;	/* associated file */
	ufs_lbn_t pd_lbn;	/* block within file */
	struct dirremhd pd_dirremhd;	/* dirrem's waiting for page */
	struct diraddhd pd_diraddhd[DAHASHSZ];	/* diradd dir entry updates */
	struct diraddhd pd_pendinghd;	/* directory entries awaiting write */
};
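/*
* Illustration only (not part of softdep.h): how DIRADDHASH spreads entry
* offsets over the pd_diraddhd buckets. The two macros are copied from
* above; diradd_bucket_counts is a hypothetical helper. The shift by 2
* discards the low offset bits, which are zero if entries start on 4-byte
* boundaries (an assumption of this sketch).
*/

```c
#define DAHASHSZ 6
#define DIRADDHASH(offset) (((offset) >> 2) % DAHASHSZ)

/* Count how many of `n` entry offsets fall into each hash bucket. */
static void diradd_bucket_counts(const long *offsets, int n,
    int counts[DAHASHSZ])
{
	int i;

	for (i = 0; i < DAHASHSZ; i++)
		counts[i] = 0;
	for (i = 0; i < n; i++)
		counts[DIRADDHASH(offsets[i])]++;
}
```

/*
* For offsets 0, 4, 8, ... 24 the buckets fill in order 0 through 5 and
* then wrap to bucket 0, so seven entries leave no list longer than two.
*/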
/*
* The "inodedep" structure tracks the set of dependencies associated
* with an inode. Each block that is allocated is represented by an
* "allocdirect" structure (see below). It is linked onto the id_newinoupdt
* list until both its contents and its allocation in the cylinder
* group map have been written to disk. Once the dependencies have been
* satisfied, it is removed from the id_newinoupdt list and any followup
* actions such as releasing the previous block or fragment are placed
* on the id_inowait list. When an inode is updated (copied from the
* in-core inode structure to a disk buffer containing its on-disk
* copy), the "inodedep" structure is linked onto the buffer through
* its worklist. Thus it will be notified when the buffer is about
* to be written and when it is done. At the update time, all the
* elements on the id_newinoupdt list are moved to the id_inoupdt list
* since those changes are now relevant to the copy of the inode in the
* buffer. When the buffer containing the inode is written to disk, any
* updates listed on the id_inoupdt list are rolled back as they are
* not yet safe. Following the write, the changes are once again rolled
* forward and any actions on the id_inowait list are processed (since
* the previously allocated blocks are no longer claimed on the disk).
* The entries on the id_inoupdt and id_newinoupdt lists must be kept
* sorted by logical block number to speed the calculation of the size
* of the rolled back inode (see explanation in initiate_write_inodeblock).
*/
struct inodedep {
	struct worklist id_list;	/* buffer holding inode block */
#	define id_state id_list.wk_state /* inode dependency state */
	LIST_ENTRY(inodedep) id_hash;	/* hashed lookup */
	struct fs *id_fs;	/* associated filesystem */
	ino_t id_ino;	/* dependent inode */
	nlink_t id_nlinkdelta;	/* saved effective link count */
	struct dinode *id_savedino;	/* saved dinode contents */
	LIST_ENTRY(inodedep) id_deps;	/* bmsafemap's list of inodedep's */
	struct buf *id_buf;	/* related bmsafemap (if pending) */
	off_t id_savedsize;	/* file size saved during rollback */
	struct workhead id_pendinghd;	/* entries awaiting directory write */
	struct workhead id_inowait;	/* operations after inode written */
	struct allocdirectlst id_inoupdt;	/* updates before inode written */
	struct allocdirectlst id_newinoupdt;	/* updates when inode written */
};
/*
* A "newblk" structure is attached to a bmsafemap structure when a block
* or fragment is allocated from a cylinder group. Its state is set to
* DEPCOMPLETE when its cylinder group map is written. It is consumed by
* an associated allocdirect or allocindir allocation which will attach
* itself to the bmsafemap structure if the newblk's DEPCOMPLETE flag
* is not set (i.e., its cylinder group map has not been written).
*/
struct newblk {
	LIST_ENTRY(newblk) nb_hash;	/* hashed lookup */
	struct fs *nb_fs;	/* associated filesystem */
	ufs_daddr_t nb_newblkno;	/* allocated block number */
	int nb_state;	/* state of bitmap dependency */
	LIST_ENTRY(newblk) nb_deps;	/* bmsafemap's list of newblk's */
	struct bmsafemap *nb_bmsafemap;	/* associated bmsafemap */
};
/*
* A "bmsafemap" structure maintains a list of dependency structures
* that depend on the update of a particular cylinder group map.
* It has lists for newblks, allocdirects, allocindirs, and inodedeps.
* It is attached to the buffer of a cylinder group block when any of
* these things are allocated from the cylinder group. It is freed
* after the cylinder group map is written and the state of its
* dependencies is updated with DEPCOMPLETE to indicate that it has
* been processed.
*/
struct bmsafemap {
	struct worklist sm_list;	/* cylgrp buffer */
	struct buf *sm_buf;	/* associated buffer */
	struct allocdirecthd sm_allocdirecthd;	/* allocdirect deps */
	struct allocindirhd sm_allocindirhd;	/* allocindir deps */
	struct inodedephd sm_inodedephd;	/* inodedep deps */
	struct newblkhd sm_newblkhd;	/* newblk deps */
};
/*
* An "allocdirect" structure is attached to an "inodedep" when a new block
* or fragment is allocated and pointed to by the inode described by
* "inodedep". The worklist is linked to the buffer that holds the block.
* When the block is first allocated, it is linked to the bmsafemap
* structure associated with the buffer holding the cylinder group map
* from which it was allocated. When the cylinder group map is written
* to disk, ad_state has the DEPCOMPLETE flag set. When the block itself
* is written, the COMPLETE flag is set. Once both the cylinder group map
* and the data itself have been written, it is safe to write the inode
* that claims the block. If there was a previous fragment that had been
* allocated before the file was increased in size, the old fragment may
* be freed once the inode claiming the new block is written to disk.
* This ad_freefrag request is attached to the id_inowait list of the
* associated inodedep (pointed to by ad_inodedep) for processing after
* the inode is written.
*/
struct allocdirect {
	struct worklist ad_list;	/* buffer holding block */
#	define ad_state ad_list.wk_state /* block pointer state */
	TAILQ_ENTRY(allocdirect) ad_next;	/* inodedep's list of allocdirect's */
	ufs_lbn_t ad_lbn;	/* block within file */
	ufs_daddr_t ad_newblkno;	/* new value of block pointer */
	ufs_daddr_t ad_oldblkno;	/* old value of block pointer */
	long ad_newsize;	/* size of new block */
	long ad_oldsize;	/* size of old block */
	LIST_ENTRY(allocdirect) ad_deps;	/* bmsafemap's list of allocdirect's */
	struct buf *ad_buf;	/* cylgrp buffer (if pending) */
	struct inodedep *ad_inodedep;	/* associated inodedep */
	struct freefrag *ad_freefrag;	/* fragment to be freed (if any) */
};
/*
* A single "indirdep" structure manages all allocation dependencies for
* pointers in an indirect block. The up-to-date state of the indirect
* block is stored in ir_saveddata. The set of pointers that may be safely
* written to the disk is stored in ir_safecopy. The state field is used
* only to track whether the buffer is currently being written (in which
* case it is not safe to update ir_safecopy). ir_deplisthd contains the
* list of allocindir structures, one for each block that needs to be
* written to disk. Once the block and its bitmap allocation have been
* written the safecopy can be updated to reflect the allocation and the
* allocindir structure freed. If ir_state indicates that an I/O on the
* indirect block is in progress when ir_safecopy is to be updated, the
* update is deferred by placing the allocindir on the ir_donehd list.
* When the I/O on the indirect block completes, the entries on the
* ir_donehd list are processed by updating their corresponding ir_safecopy
* pointers and then freeing the allocindir structure.
*/
struct indirdep {
	struct worklist ir_list;	/* buffer holding indirect block */
#	define ir_state ir_list.wk_state /* indirect block pointer state */
	ufs_daddr_t *ir_saveddata;	/* buffer cache contents */
	struct buf *ir_savebp;	/* buffer holding safe copy */
	struct allocindirhd ir_donehd;	/* done waiting to update safecopy */
	struct allocindirhd ir_deplisthd;	/* allocindir deps for this block */
};
/*
* An "allocindir" structure is attached to an "indirdep" when a new block
* is allocated and pointed to by the indirect block described by the
* "indirdep". The worklist is linked to the buffer that holds the new block.
* When the block is first allocated, it is linked to the bmsafemap
* structure associated with the buffer holding the cylinder group map
* from which it was allocated. When the cylinder group map is written
* to disk, ai_state has the DEPCOMPLETE flag set. When the block itself
* is written, the COMPLETE flag is set. Once both the cylinder group map
* and the data itself have been written, it is safe to write the entry in
* the indirect block that claims the block; the "allocindir" dependency
* can then be freed as it is no longer applicable.
*/
struct allocindir {
	struct worklist ai_list;	/* buffer holding indirect block */
#	define ai_state ai_list.wk_state /* indirect block pointer state */
	LIST_ENTRY(allocindir) ai_next;	/* indirdep's list of allocindir's */
	int ai_offset;	/* pointer offset in indirect block */
	ufs_daddr_t ai_newblkno;	/* new block pointer value */
	ufs_daddr_t ai_oldblkno;	/* old block pointer value */
	struct freefrag *ai_freefrag;	/* block to be freed when complete */
	struct indirdep *ai_indirdep;	/* address of associated indirdep */
	LIST_ENTRY(allocindir) ai_deps;	/* bmsafemap's list of allocindir's */
	struct buf *ai_buf;	/* cylgrp buffer (if pending) */
};
/*
* A "freefrag" structure is attached to an "inodedep" when a previously
* allocated fragment is replaced with a larger fragment, rather than extended.
* The "freefrag" structure is constructed and attached when the replacement
* block is first allocated. It is processed after the inode claiming the
* bigger block that replaces it has been written to disk. Note that the
* ff_state field is used to store the uid, so may lose data. However,
* the uid is used only in printing an error message, so is not critical.
* Keeping it in a short keeps the data structure down to 32 bytes.
*/
struct freefrag {
	struct worklist ff_list;	/* id_inowait or delayed worklist */
#	define ff_state ff_list.wk_state /* owning user; should be uid_t */
	struct vnode *ff_devvp;	/* filesystem device vnode */
	struct fs *ff_fs;	/* addr of superblock */
	ufs_daddr_t ff_blkno;	/* fragment physical block number */
	long ff_fragsize;	/* size of fragment being deleted */
	ino_t ff_inum;	/* owning inode number */
};
/*
* A "freeblks" structure is attached to an "inodedep" when the
* corresponding file's length is reduced to zero. It records all
* the information needed to free the blocks of a file after its
* zero'ed inode has been written to disk.
*/
struct freeblks {
	struct worklist fb_list;	/* id_inowait or delayed worklist */
	ino_t fb_previousinum;	/* inode of previous owner of blocks */
	struct vnode *fb_devvp;	/* filesystem device vnode */
	struct fs *fb_fs;	/* addr of superblock */
	off_t fb_oldsize;	/* previous file size */
	off_t fb_newsize;	/* new file size */
	int fb_chkcnt;	/* used to check cnt of blks released */
	uid_t fb_uid;	/* uid of previous owner of blocks */
	ufs_daddr_t fb_dblks[NDADDR];	/* direct blk ptrs to deallocate */
	ufs_daddr_t fb_iblks[NIADDR];	/* indirect blk ptrs to deallocate */
};
/*
* A "freefile" structure is attached to an inode when its
* link count is reduced to zero. It marks the inode as free in
* the cylinder group map after the zero'ed inode has been written
* to disk and any associated blocks and fragments have been freed.
*/
struct freefile {
	struct worklist fx_list;	/* id_inowait or delayed worklist */
	mode_t fx_mode;	/* mode of inode */
	ino_t fx_oldinum;	/* inum of the unlinked file */
	struct vnode *fx_devvp;	/* filesystem device vnode */
	struct fs *fx_fs;	/* addr of superblock */
};
/*
* A "diradd" structure is linked to an "inodedep" id_inowait list when a
* new directory entry is allocated that references the inode described
* by "inodedep". When the inode itself is written (either the initial
* allocation for new inodes or with the increased link count for
* existing inodes), the COMPLETE flag is set in da_state. If the entry
* is for a newly allocated inode, the "inodedep" structure is associated
* with a bmsafemap which prevents the inode from being written to disk
* until the cylinder group has been updated. Thus the da_state COMPLETE
* flag cannot be set until the inode bitmap dependency has been removed.
* When creating a new file, it is safe to write the directory entry that
* claims the inode once the referenced inode has been written. Since
* writing the inode clears the bitmap dependencies, the DEPCOMPLETE flag
* in the diradd can be set unconditionally when creating a file. When
* creating a directory, there are two additional dependencies described by
* mkdir structures (see their description below). When these dependencies
* are resolved the DEPCOMPLETE flag is set in the diradd structure.
* If there are multiple links created to the same inode, there will be
* a separate diradd structure created for each link. The diradd is
* linked onto the pd_diraddhd list of the pagedep for the directory
* page that contains the entry. When a directory page is written,
* the pd_diraddhd list is traversed to roll back any entries that are
* not yet ready to be written to disk. If a directory entry is being
* changed (by rename) rather than added, the DIRCHG flag is set and
* the da_previous entry points to the entry that will be "removed"
* once the new entry has been committed. During rollback, entries
* with da_previous are replaced with the previous inode number rather
* than zero.
*
* The overlaying of da_pagedep and da_previous is done to keep the
* structure down to 32 bytes in size on a 32-bit machine. If a
* da_previous entry is present, the pointer to its pagedep is available
* in the associated dirrem entry. If the DIRCHG flag is set, the
* da_previous entry is valid; if not set the da_pagedep entry is valid.
* The DIRCHG flag never changes; it is set when the structure is created
* if appropriate and is never cleared.
*/
struct diradd {
struct worklist da_list; /* id_inowait and id_pendinghd list */
# define da_state da_list.wk_state /* state of the new directory entry */
LIST_ENTRY(diradd) da_pdlist; /* pagedep holding directory block */
doff_t da_offset; /* offset of new dir entry in dir blk */
ino_t da_newinum; /* inode number for the new dir entry */
union {
struct dirrem *dau_previous; /* entry being replaced in dir change */
struct pagedep *dau_pagedep; /* pagedep dependency for addition */
} da_un;
};
#define da_previous da_un.dau_previous
#define da_pagedep da_un.dau_pagedep
/*
* Two "mkdir" structures are needed to track the additional dependencies
* associated with creating a new directory entry. Normally a directory
* addition can be committed as soon as the newly referenced inode has been
* written to disk with its increased link count. When a directory is
* created there are two additional dependencies: writing the directory
* data block containing the "." and ".." entries (MKDIR_BODY) and writing
* the parent inode with the increased link count for ".." (MKDIR_PARENT).
* These additional dependencies are tracked by two mkdir structures that
* reference the associated "diradd" structure. When they have completed,
* they set the DEPCOMPLETE flag on the diradd so that it knows that its
* extra dependencies have been completed. The md_state field is used only
* to identify which type of dependency the mkdir structure is tracking.
* It is not used in the mainline code for any purpose other than consistency
* checking. All the mkdir structures in the system are linked together on
* a list. This list is needed so that a diradd can find its associated
* mkdir structures and deallocate them if it is prematurely freed (as for
* example if a mkdir is immediately followed by a rmdir of the same directory).
* Here, the free of the diradd must traverse the list to find the associated
* mkdir structures that reference it. The deletion would be faster if the
* diradd structure were simply augmented to have two pointers that referenced
* the associated mkdir's. However, this would increase the size of the diradd
* structure from 32 to 64 bytes to speed a very infrequent operation.
*/
struct mkdir {
struct worklist md_list; /* id_inowait or buffer holding dir */
# define md_state md_list.wk_state /* type: MKDIR_PARENT or MKDIR_BODY */
struct diradd *md_diradd; /* associated diradd */
LIST_ENTRY(mkdir) md_mkdirs; /* list of all mkdirs */
};
LIST_HEAD(mkdirlist, mkdir) mkdirlisthd;
/*
* A "dirrem" structure describes an operation to decrement the link
* count on an inode. The dirrem structure is attached to the pd_dirremhd
* list of the pagedep for the directory page that contains the entry.
* It is processed after the directory page with the deleted entry has
* been written to disk.
*
* The overlaying of dm_pagedep and dm_dirinum is done to keep the
* structure down to 32 bytes in size on a 32-bit machine. It works
* because they are never used concurrently.
*/
struct dirrem {
struct worklist dm_list; /* delayed worklist */
# define dm_state dm_list.wk_state /* state of the old directory entry */
LIST_ENTRY(dirrem) dm_next; /* pagedep's list of dirrem's */
struct mount *dm_mnt; /* associated mount point */
ino_t dm_oldinum; /* inum of the removed dir entry */
union {
struct pagedep *dmu_pagedep; /* pagedep dependency for remove */
ino_t dmu_dirinum; /* parent inode number (for rmdir) */
} dm_un;
};
#define dm_pagedep dm_un.dmu_pagedep
#define dm_dirinum dm_un.dmu_dirinum
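The overlay of da_pagedep and da_previous in diradd, selected by the
DIRCHG flag, can be sketched as follows. This is a minimal stand-alone
illustration under stated assumptions, not code from ffs_softdep.c: the
structures are trimmed stand-ins, the helper diradd_pagedep() is
hypothetical, and only the flag value and the field names are taken
from the header.

```c
#include <assert.h>
#include <stddef.h>

#define DIRCHG	0x0080		/* flag value as defined in softdep.h */

/* Trimmed stand-ins for the real structures (assumptions). */
struct pagedep { int pd_dummy; };
struct dirrem  { struct pagedep *dm_pagedep; };

struct diradd {
	unsigned short da_state;	/* stands in for da_list.wk_state */
	union {
		struct dirrem  *dau_previous;	/* valid if DIRCHG is set */
		struct pagedep *dau_pagedep;	/* valid if DIRCHG is clear */
	} da_un;
};
#define da_previous	da_un.dau_previous
#define da_pagedep	da_un.dau_pagedep

/*
 * Hypothetical helper: find the pagedep for a diradd. When DIRCHG is
 * set the union holds the dirrem, so the pagedep is reached through it.
 */
static struct pagedep *
diradd_pagedep(struct diradd *dap)
{
	assert(dap != NULL);
	if (dap->da_state & DIRCHG)
		return (dap->da_previous->dm_pagedep);
	return (dap->da_pagedep);
}
```

Because DIRCHG is set at creation and never cleared, a reader of the
union never has to guess which member is live.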

251
sys/ufs/ffs/README Normal file

@ -0,0 +1,251 @@
Introduction
This package constitutes the alpha distribution of the soft updates
code for the fast filesystem.
Status
My `filesystem torture tests' (described below) run for days without
a hitch (no panics, hangs, filesystem corruption, or memory leaks).
However, I have had several panics reported to me by folks that
are field testing the code which I have not yet been able to
reproduce or fix. Although these panics are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks that
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. The stub resolves all the calls
into the soft update code and simply ignores any request to enable
soft updates. Thus you will be able to ensure that your other hooks
have not broken anything and that your kernel is softdep-ready for
those that wish to use them. Please report problems back to me with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.
Performance
Running the Andrew Benchmarks yields the following raw data:
Phase   Normal  Softdep  What it does
  1       3s      <1s    Creating directories
  2       8s       4s    Copying files
  3       6s       6s    Recursive directory stats
  4       8s       9s    Scanning each file
  5      25s      25s    Compilation
Normal:   19.9u 29.2s 0:52.8 135+630io
Softdep:  20.3u 28.5s 0:47.8 103+363io
Another interesting datapoint is my `filesystem torture tests'.
They consist of 1000 runs of the Andrew benchmarks, 1000 copies and
removes of /etc with randomly selected pauses of 0-60 seconds
between each copy and remove, and 500 finds from / with randomly
selected pauses of 100 seconds between each run. The run of the
torture test compares as follows:
With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min
The upshot is about 43% fewer writes and a 27% shorter running time.
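The percentages can be rechecked from the raw counts above. The helper
below is only an illustrative sketch; percent_saved() is a hypothetical
name and is not part of the distribution.

```c
#include <assert.h>

/*
 * Percentage by which `soft' is smaller than `normal'.
 * Hypothetical helper, used only to recheck the figures above.
 */
static double
percent_saved(double normal, double soft)
{
	assert(normal > 0.0);
	return (100.0 * (normal - soft) / normal);
}
```

Total writes are 1,459,147 + 487,031 = 1,946,178 for the normal
filesystem and 6 + 1,113,686 = 1,113,692 with soft updates, so
percent_saved(1946178, 1113692) gives about 42.8. Run times are 1635
and 1190 minutes, so percent_saved(1635, 1190) gives about 27.2.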
Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:
With soft updates:
labrat# time ./MAKEDEV std
2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w
labrat# ls | wc
522 522 3317
Without soft updates:
labrat# time ./MAKEDEV std
2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w
labrat# ls | wc
522 522 3317
Of course, some of the system time is being pushed
to the syncer process, but that is a different story.
As a benchmark designed to highlight the soft update code,
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level. Running a test with
a directory tree containing 28 directories holding 202 empty files
produces the following numbers:
With soft updates:
tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
rm: 0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)
Normal filesystem:
tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
rm: 0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)
The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly each directory block is written
once rather than once for each new directory entry. Effectively
what the update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first followed by the
directory block that references them. If there were data in the
files it would further ensure that the data blocks were written
before their inodes claimed them.
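The effect of clustering on the number of inode block writes can be
illustrated with a toy count. This is a deliberately simplified model,
not a description of the real allocator: it assumes that inodes are
allocated contiguously and that an inode block holds 64 inodes (for
example 8K blocks and 128-byte on-disk inodes), and both function
names are hypothetical.

```c
#include <assert.h>

/* Inode block writes when every inode allocation is written at once. */
static unsigned
writes_per_inode(unsigned ninodes)
{
	return (ninodes);
}

/* Inode block writes when each block is written once after it fills. */
static unsigned
writes_clustered(unsigned ninodes, unsigned inodes_per_block)
{
	assert(inodes_per_block > 0);
	/* Round up: a partly filled last block still needs one write. */
	return ((ninodes + inodes_per_block - 1) / inodes_per_block);
}
```

For the 230 inodes of the test tree above (202 files plus 28
directories), the model gives 230 writes in the first case and 4 in
the second. The measured numbers differ because directory blocks and
cylinder group maps are written as well.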
Copyright Restrictions
Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:
Redistributions in any form must be accompanied by information
on how to obtain complete source code for any accompanying
software that uses this software. This source code must
either be included in the distribution or be available for
no more than the cost of distribution plus a nominal fee,
and must be freely redistributable under reasonable
conditions. For an executable file, complete source code
means the source code for all modules it contains. It does
not mean source code for modules or files that typically
accompany the operating system on which the executable file
runs, e.g., standard library modules or system header files.
The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.
Soft Dependency Operation
The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore it is
not used by default for any filesystems. It must be enabled on
a filesystem-by-filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates for performance or reliability reasons,
you can simply run tunefs on the filesystem again to turn off the
bit and revert to normal operation. The additional dynamic memory load
placed on the kernel malloc arena is approximately equal to
the amount of memory used by vnodes plus inodes (for a system
with 1000 vnodes, the additional peak memory load is about 300K).
Kernel Changes
There are two new changes to the kernel functionality that are not
contained in the soft update files. The first is a `trickle
sync' facility running in the kernel as process 3. This trickle
sync process replaces the traditional `update' program (which should
be commented out of the /etc/rc startup script). When a vnode is
first written it is placed 30 seconds down on the trickle sync
queue. If it still exists and has dirty data when it reaches the
top of the queue, it is sync'ed. This approach evens out the load
on the underlying I/O system and avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller as there are
typically far fewer files than buffers. Although this can make the
I/O spiky when a big file times out, it is still much better than
the wholesale syncs that were happening before. It also adapts
much better to the soft update code where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk) reducing the amount of rollback that
is needed.
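The aging scheme described above can be sketched as a wheel of
per-second slots. This is a minimal illustration under stated
assumptions and not the code in kern/vfs_subr.c: struct vn, struct
syncwheel, wheel_add(), wheel_tick() and the wheel size are all
hypothetical, and only the three delays come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define SYNCER_MAXDELAY	32	/* hypothetical wheel size, in seconds */

/* Aging delays taken from the text above, in seconds. */
enum { DELAY_INODE = 10, DELAY_DIR = 15, DELAY_FILE = 30 };

struct vn {				/* hypothetical stand-in for a vnode */
	int		 v_id;
	int		 v_dirty;	/* nonzero while it has dirty data */
	struct vn	*v_next;	/* next vnode in the same slot */
};

struct syncwheel {
	struct vn	*slot[SYNCER_MAXDELAY];
	int		 now;		/* slot processed by the last tick */
};

/* Queue vp to be examined `delay' seconds from now. */
static void
wheel_add(struct syncwheel *w, struct vn *vp, int delay)
{
	int i;

	assert(delay >= 1 && delay < SYNCER_MAXDELAY);
	i = (w->now + delay) % SYNCER_MAXDELAY;
	vp->v_next = w->slot[i];
	w->slot[i] = vp;
}

/*
 * One second passes: advance the wheel, empty the slot that is now
 * due, and "sync" every vnode in it that is still dirty.
 * Returns the number of vnodes synced.
 */
static int
wheel_tick(struct syncwheel *w)
{
	struct vn *vp, *next;
	int nsynced = 0;

	w->now = (w->now + 1) % SYNCER_MAXDELAY;
	vp = w->slot[w->now];
	w->slot[w->now] = NULL;
	for (; vp != NULL; vp = next) {
		next = vp->v_next;
		vp->v_next = NULL;
		if (vp->v_dirty) {
			vp->v_dirty = 0;	/* stand-in for the real write */
			nsynced++;
		}
	}
	return (nsynced);
}
```

A vnode whose dirty data disappears before its slot comes due (a
short-lived file) is simply skipped, which is how this kind of queue
avoids writing it at all.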
The other main kernel change is to split the vnode freelist into
two separate lists: one for vnodes that are still being used to
identify buffers and the other for those vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.
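The preference between the two free lists can be sketched as below.
This is an illustration only, assuming simple singly linked lists; the
structure and function names are hypothetical and do not reproduce the
real getnewvnode.

```c
#include <assert.h>
#include <stddef.h>

struct vn {				/* hypothetical stand-in for a vnode */
	int		 v_id;
	struct vn	*v_next;
};

struct vfreelists {
	struct vn	*vf_nobufs;	/* vnodes no longer identifying buffers */
	struct vn	*vf_withbufs;	/* vnodes still identifying buffers */
};

/* Remove and return the first vnode on a list, or NULL if it is empty. */
static struct vn *
list_pop(struct vn **head)
{
	struct vn *vp = *head;

	if (vp != NULL) {
		*head = vp->v_next;
		vp->v_next = NULL;
	}
	return (vp);
}

/* Prefer a vnode with no buffers; fall back to one that still has some. */
static struct vn *
pick_free_vnode(struct vfreelists *fl)
{
	struct vn *vp;

	if ((vp = list_pop(&fl->vf_nobufs)) != NULL)
		return (vp);
	return (list_pop(&fl->vf_withbufs));
}
```

Reusing a vnode from the first list costs nothing in cached data,
whereas taking one from the second list means giving up the buffers
it still identifies.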
Packaging of Kernel Changes
The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files. Although there has been some
structural reorganization of the filesystem code to accommodate
gathering the information required by the soft update code,
the actual ordering of filesystem operations when soft updates
are disabled is unchanged.
The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, ufs/ufs/ufs_lookup.c,
ufs/ufs/ufs_vnops.c, and ufs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged version. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in helping get the code merged into your system.
Packaging of Utility Changes
The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:
tunefs - add a flag to enable and disable soft updates
mount  - print out whether soft updates are enabled and
         also statistics on number of sync and async writes
fsck   - tighter checks on acceptable errors and a slightly
         different policy for what to put in lost+found on
         filesystems using soft updates
In addition you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck; however, it
would aid in my debugging if you do. Also, because of the time
lag between deleting a directory entry and the inode it
references, you will find a lot more files showing up in your
lost+found if you do not use the new version. Note that the
new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.
Operation
Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:
1) Comment out the update program in /etc/rc.
2) Run `tunefs -n enable' on one or more test filesystems.
3) Mount these filesystems and then type `mount' to ensure that
   they have been enabled for soft updates.
4) Copy the test directory to a softdep filesystem, chdir into
   it and run `./doit'. You may want to check out each of the
   three subtests individually first: doit1 - Andrew benchmarks,
   doit2 - copy and removal of /etc, doit3 - find from /.

3851
sys/ufs/ffs/ffs_softdep.c Normal file

File diff suppressed because it is too large.

520
sys/ufs/ffs/softdep.h Normal file

@ -0,0 +1,520 @@
/*
* Copyright 1997 Marshall Kirk McKusick. All Rights Reserved.
*
* The soft dependency code is derived from work done by Greg Ganger
* at the University of Michigan.
*
* The following are the copyrights and redistribution conditions that
* apply to this copy of the soft dependency software. For a license
* to use, redistribute or sell the soft dependency software under
* conditions other than those described here, please contact the
* author at one of the following addresses:
*
* Marshall Kirk McKusick mckusick@mckusick.com
* 1614 Oxford Street +1-510-843-9542
* Berkeley, CA 94709-1608
* USA
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. None of the names of McKusick, Ganger, or the University of Michigan
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
* 4. Redistributions in any form must be accompanied by information on
* how to obtain complete source code for any accompanying software
* that uses this software. This source code must either be included
* in the distribution or be available for no more than the cost of
* distribution plus a nominal fee, and must be freely redistributable
* under reasonable conditions. For an executable file, complete
* source code means the source code for all modules it contains.
* It does not mean source code for modules or files that typically
* accompany the operating system on which the executable file runs,
* e.g., standard library modules or system header files.
*
* THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE FOR
* ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* @(#)softdep.h 9.1 (McKusick) 7/9/97
*/
#include <sys/queue.h>
/*
* Allocation dependencies are handled with undo/redo on the in-memory
* copy of the data. A particular data dependency is eliminated when
* it is ALLCOMPLETE: that is ATTACHED, DEPCOMPLETE, and COMPLETE.
*
* ATTACHED means that the data is not currently being written to
* disk. UNDONE means that the data has been rolled back to a safe
* state for writing to the disk. When the I/O completes, the data is
* restored to its current form and the state reverts to ATTACHED.
* The data must be locked throughout the rollback, I/O, and roll
* forward so that the rolled back information is never visible to
* user processes. The COMPLETE flag indicates that the item has been
* written. For example, a dependency that requires that an inode be
* written will be marked COMPLETE after the inode has been written
* to disk. The DEPCOMPLETE flag indicates that any other dependencies,
* such as the writing of a cylinder group map, have been completed.
* A dependency structure may be freed only when both it
* and its dependencies have completed and any rollbacks that are in
* progress have finished as indicated by the set of ALLCOMPLETE flags
* all being set. The two MKDIR flags indicate additional dependencies
* that must be done when creating a new directory. MKDIR_BODY is
* cleared when the directory data block containing the "." and ".."
* entries has been written. MKDIR_PARENT is cleared when the parent
* inode with the increased link count for ".." has been written. When
* both MKDIR flags have been cleared, the DEPCOMPLETE flag is set to
* indicate that the directory dependencies have been completed. The
* writing of the directory inode itself sets the COMPLETE flag which
* then allows the directory entry for the new directory to be written
* to disk. The RMDIR flag marks a dirrem structure as representing
* the removal of a directory rather than a file. When the removal
* dependencies are completed, additional work needs to be done
* (truncation of the "." and ".." entries, an additional decrement
* of the associated inode, and a decrement of the parent inode). The
* DIRCHG flag marks a diradd structure as representing the changing
* of an existing entry rather than the addition of a new one. When
* the update is complete the dirrem associated with the inode for
* the old name must be added to the worklist to do the necessary
* reference count decrement. The GOINGAWAY flag indicates that the
* data structure is frozen from further change until its dependencies
* have been completed and its resources freed after which it will be
* discarded. The IOSTARTED flag prevents multiple calls to the I/O
* start routine from doing multiple rollbacks. The ONWORKLIST flag
* shows whether the structure is currently linked onto a worklist.
*/
#define ATTACHED 0x0001
#define UNDONE 0x0002
#define COMPLETE 0x0004
#define DEPCOMPLETE 0x0008
#define MKDIR_PARENT 0x0010
#define MKDIR_BODY 0x0020
#define RMDIR 0x0040
#define DIRCHG 0x0080
#define GOINGAWAY 0x0100
#define IOSTARTED 0x0200
#define ONWORKLIST 0x8000
#define ALLCOMPLETE (ATTACHED | COMPLETE | DEPCOMPLETE)
/*
* The workitem queue.
*
* It is sometimes useful and/or necessary to clean up certain dependencies
* in the background rather than during execution of an application process
* or interrupt service routine. To realize this, we append dependency
* structures corresponding to such tasks to a "workitem" queue. In a soft
* updates implementation, most pending workitems should not wait for more
* than a couple of seconds, so the filesystem syncer process awakens once
* per second to process the items on the queue.
*/
/* LIST_HEAD(workhead, worklist); -- declared in buf.h */
/*
* Each request can be linked onto a work queue through its worklist structure.
* To avoid the need for a pointer to the structure itself, this structure
* MUST be declared FIRST in each type in which it appears! If more than one
* worklist is needed in the structure, then a wk_data field must be added
* and the macros below changed to use it.
*/
struct worklist {
LIST_ENTRY(worklist) wk_list; /* list of work requests */
unsigned short wk_type; /* type of request */
unsigned short wk_state; /* state flags */
};
#define WK_DATA(wk) ((void *)(wk))
#define WK_PAGEDEP(wk) ((struct pagedep *)(wk))
#define WK_INODEDEP(wk) ((struct inodedep *)(wk))
#define WK_NEWBLK(wk) ((struct newblk *)(wk))
#define WK_BMSAFEMAP(wk) ((struct bmsafemap *)(wk))
#define WK_ALLOCDIRECT(wk) ((struct allocdirect *)(wk))
#define WK_INDIRDEP(wk) ((struct indirdep *)(wk))
#define WK_ALLOCINDIR(wk) ((struct allocindir *)(wk))
#define WK_FREEFRAG(wk) ((struct freefrag *)(wk))
#define WK_FREEBLKS(wk) ((struct freeblks *)(wk))
#define WK_FREEFILE(wk) ((struct freefile *)(wk))
#define WK_DIRADD(wk) ((struct diradd *)(wk))
#define WK_MKDIR(wk) ((struct mkdir *)(wk))
#define WK_DIRREM(wk) ((struct dirrem *)(wk))
/*
* Various types of lists
*/
LIST_HEAD(dirremhd, dirrem);
LIST_HEAD(diraddhd, diradd);
LIST_HEAD(newblkhd, newblk);
LIST_HEAD(inodedephd, inodedep);
LIST_HEAD(allocindirhd, allocindir);
LIST_HEAD(allocdirecthd, allocdirect);
TAILQ_HEAD(allocdirectlst, allocdirect);
/*
* The "pagedep" structure tracks the various dependencies related to
* a particular directory page. If a directory page has any dependencies,
* it will have a pagedep linked to its associated buffer. The
* pd_dirremhd list holds the list of dirrem requests which decrement
* inode reference counts. These requests are processed after the
* directory page with the corresponding zero'ed entries has been
* written. The pd_diraddhd list maintains the list of diradd requests
* which cannot be committed until their corresponding inode has been
* written to disk. Because a directory may have many new entries
* being created, several lists are maintained hashed on bits of the
* offset of the entry into the directory page to keep the lists from
* getting too long. Once a new directory entry has been cleared to
* be written, it is moved to the pd_pendinghd list. After the new
* entry has been written to disk it is removed from the pd_pendinghd
* list, any removed operations are done, and the dependency structure
* is freed.
*/
#define DAHASHSZ 6
#define DIRADDHASH(offset) (((offset) >> 2) % DAHASHSZ)
struct pagedep {
struct worklist pd_list; /* page buffer */
# define pd_state pd_list.wk_state /* check for multiple I/O starts */
LIST_ENTRY(pagedep) pd_hash; /* hashed lookup */
struct mount *pd_mnt; /* associated mount point */
ino_t pd_ino; /* associated file */
ufs_lbn_t pd_lbn; /* block within file */
struct dirremhd pd_dirremhd; /* dirrem's waiting for page */
struct diraddhd pd_diraddhd[DAHASHSZ]; /* diradd dir entry updates */
struct diraddhd pd_pendinghd; /* directory entries awaiting write */
};
/*
* The "inodedep" structure tracks the set of dependencies associated
* with an inode. Each block that is allocated is represented by an
* "allocdirect" structure (see below). It is linked onto the id_newinoupdt
* list until both its contents and its allocation in the cylinder
* group map have been written to disk. Once the dependencies have been
* satisfied, it is removed from the id_newinoupdt list and any followup
* actions such as releasing the previous block or fragment are placed
* on the id_inowait list. When an inode is updated (copied from the
* in-core inode structure to a disk buffer containing its on-disk
* copy), the "inodedep" structure is linked onto the buffer through
* its worklist. Thus it will be notified when the buffer is about
* to be written and when it is done. At the update time, all the
* elements on the id_newinoupdt list are moved to the id_inoupdt list
* since those changes are now relevant to the copy of the inode in the
* buffer. When the buffer containing the inode is written to disk, any
* updates listed on the id_inoupdt list are rolled back as they are
* not yet safe. Following the write, the changes are once again rolled
* forward and any actions on the id_inowait list are processed (since
* the previously allocated blocks are no longer claimed on the disk).
* The entries on the id_inoupdt and id_newinoupdt lists must be kept
* sorted by logical block number to speed the calculation of the size
* of the rolled back inode (see explanation in initiate_write_inodeblock).
*/
struct inodedep {
struct worklist id_list; /* buffer holding inode block */
# define id_state id_list.wk_state /* inode dependency state */
LIST_ENTRY(inodedep) id_hash; /* hashed lookup */
struct fs *id_fs; /* associated filesystem */
ino_t id_ino; /* dependent inode */
nlink_t id_nlinkdelta; /* saved effective link count */
struct dinode *id_savedino; /* saved dinode contents */
LIST_ENTRY(inodedep) id_deps; /* bmsafemap's list of inodedep's */
struct buf *id_buf; /* related bmsafemap (if pending) */
off_t id_savedsize; /* file size saved during rollback */
struct workhead id_pendinghd; /* entries awaiting directory write */
struct workhead id_inowait; /* operations after inode written */
struct allocdirectlst id_inoupdt; /* updates before inode written */
struct allocdirectlst id_newinoupdt; /* updates when inode written */
};
/*
* A "newblk" structure is attached to a bmsafemap structure when a block
* or fragment is allocated from a cylinder group. Its state is set to
* DEPCOMPLETE when its cylinder group map is written. It is consumed by
* an associated allocdirect or allocindir allocation, which will attach
* itself to the bmsafemap structure if the newblk's DEPCOMPLETE flag
* is not set (i.e., its cylinder group map has not been written).
*/
struct newblk {
LIST_ENTRY(newblk) nb_hash; /* hashed lookup */
struct fs *nb_fs; /* associated filesystem */
ufs_daddr_t nb_newblkno; /* allocated block number */
int nb_state; /* state of bitmap dependency */
LIST_ENTRY(newblk) nb_deps; /* bmsafemap's list of newblk's */
struct bmsafemap *nb_bmsafemap; /* associated bmsafemap */
};
/*
* A "bmsafemap" structure maintains a list of dependency structures
* that depend on the update of a particular cylinder group map.
* It has lists for newblks, allocdirects, allocindirs, and inodedeps.
* It is attached to the buffer of a cylinder group block when any of
* these things are allocated from the cylinder group. It is freed
* after the cylinder group map is written and the state of its
* dependencies is updated with DEPCOMPLETE to indicate that it has
* been processed.
*/
struct bmsafemap {
struct worklist sm_list; /* cylgrp buffer */
struct buf *sm_buf; /* associated buffer */
struct allocdirecthd sm_allocdirecthd; /* allocdirect deps */
struct allocindirhd sm_allocindirhd; /* allocindir deps */
struct inodedephd sm_inodedephd; /* inodedep deps */
struct newblkhd sm_newblkhd; /* newblk deps */
};
/*
* An "allocdirect" structure is attached to an "inodedep" when a new block
* or fragment is allocated and pointed to by the inode described by
* "inodedep". The worklist is linked to the buffer that holds the block.
* When the block is first allocated, it is linked to the bmsafemap
* structure associated with the buffer holding the cylinder group map
* from which it was allocated. When the cylinder group map is written
* to disk, ad_state has the DEPCOMPLETE flag set. When the block itself
* is written, the COMPLETE flag is set. Once both the cylinder group map
* and the data itself have been written, it is safe to write the inode
* that claims the block. If there was a previous fragment that had been
* allocated before the file was increased in size, the old fragment may
* be freed once the inode claiming the new block is written to disk.
* This ad_freefrag request is attached to the id_inowait list of the
* associated inodedep (pointed to by ad_inodedep) for processing after
* the inode is written.
*/
struct allocdirect {
struct worklist ad_list; /* buffer holding block */
# define ad_state ad_list.wk_state /* block pointer state */
TAILQ_ENTRY(allocdirect) ad_next; /* inodedep's list of allocdirect's */
ufs_lbn_t ad_lbn; /* block within file */
ufs_daddr_t ad_newblkno; /* new value of block pointer */
ufs_daddr_t ad_oldblkno; /* old value of block pointer */
long ad_newsize; /* size of new block */
long ad_oldsize; /* size of old block */
LIST_ENTRY(allocdirect) ad_deps; /* bmsafemap's list of allocdirect's */
struct buf *ad_buf; /* cylgrp buffer (if pending) */
struct inodedep *ad_inodedep; /* associated inodedep */
struct freefrag *ad_freefrag; /* fragment to be freed (if any) */
};
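/*
* Illustrative sketch only; it is not part of the soft update code. The
* rule described above for "allocdirect" (the inode may be written once
* both the cylinder group map and the block itself are on disk) amounts
* to a test of two state bits. The EX_ names, the flag values, and
* ex_pointer_is_safe() are assumptions made up for this example.
*
```c
#include <assert.h>

// Hypothetical flag values, chosen only for this sketch.
#define EX_COMPLETE    0x0004   // the block's own contents are on disk
#define EX_DEPCOMPLETE 0x0008   // the cylinder group map is on disk

// Returns nonzero once both writes have happened, that is, when the
// inode that claims the block may itself be written.
static int
ex_pointer_is_safe(int state)
{
	int both = EX_COMPLETE | EX_DEPCOMPLETE;

	return ((state & both) == both);
}
```
*
* Either flag alone is not enough: the test succeeds only when both
* bits are set in the state word.
*/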
/*
* A single "indirdep" structure manages all allocation dependencies for
* pointers in an indirect block. The up-to-date state of the indirect
* block is stored in ir_saveddata. The set of pointers that may be safely
* written to the disk (the "safe copy") is kept in the buffer referenced
* by ir_savebp. The state field is used only to track whether the buffer
* is currently being written (in which case it is not safe to update the
* safe copy). The ir_deplisthd list holds the allocindir structures, one
* for each block that needs to be written to disk. Once the block and its
* bitmap allocation have been written, the safe copy can be updated to
* reflect the allocation and the allocindir structure freed. If ir_state
* indicates that an I/O on the indirect block is in progress when the
* safe copy is to be updated, the update is deferred by placing the
* allocindir on the ir_donehd list. When the I/O on the indirect block
* completes, the entries on the ir_donehd list are processed by updating
* their corresponding safe copy pointers and then freeing the allocindir
* structure.
*/
struct indirdep {
struct worklist ir_list; /* buffer holding indirect block */
# define ir_state ir_list.wk_state /* indirect block pointer state */
ufs_daddr_t *ir_saveddata; /* buffer cache contents */
struct buf *ir_savebp; /* buffer holding safe copy */
struct allocindirhd ir_donehd; /* done waiting to update safecopy */
struct allocindirhd ir_deplisthd; /* allocindir deps for this block */
};
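/*
* Illustrative sketch only; it is not part of the soft update code. It
* models the deferral described above for "indirdep": a completed
* allocation updates the safe copy directly, unless the indirect block
* is being written, in which case the update is parked on a "done" list
* and applied when the write finishes. The ex_ types, the EX_IOBUSY
* flag, and the fixed-size arrays are assumptions made for brevity.
*
```c
#include <assert.h>

#define EX_NPTRS  4        // pointers per (toy) indirect block
#define EX_IOBUSY 0x0001   // hypothetical "write in progress" flag

struct ex_indirdep {
	int  state;               // EX_IOBUSY while the block is being written
	long safecopy[EX_NPTRS];  // pointers that may safely reach the disk
	int  done_off[EX_NPTRS];  // deferred updates: pointer offsets ...
	long done_blk[EX_NPTRS];  // ... and the new pointer values
	int  ndone;               // number of deferred updates
};

// A new block and its bitmap are on disk: publish the pointer, or defer
// the update if the indirect block is currently being written.
static void
ex_alloc_complete(struct ex_indirdep *ir, int off, long blkno)
{
	if (ir->state & EX_IOBUSY) {
		assert(ir->ndone < EX_NPTRS);
		ir->done_off[ir->ndone] = off;
		ir->done_blk[ir->ndone] = blkno;
		ir->ndone++;
		return;
	}
	ir->safecopy[off] = blkno;
}

// The write of the indirect block finished: apply the deferred updates.
static void
ex_write_done(struct ex_indirdep *ir)
{
	int i;

	ir->state &= ~EX_IOBUSY;
	for (i = 0; i < ir->ndone; i++)
		ir->safecopy[ir->done_off[i]] = ir->done_blk[i];
	ir->ndone = 0;
}
```
*/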
/*
* An "allocindir" structure is attached to an "indirdep" when a new block
* is allocated and pointed to by the indirect block described by the
* "indirdep". The worklist is linked to the buffer that holds the new block.
* When the block is first allocated, it is linked to the bmsafemap
* structure associated with the buffer holding the cylinder group map
* from which it was allocated. When the cylinder group map is written
* to disk, ai_state has the DEPCOMPLETE flag set. When the block itself
* is written, the COMPLETE flag is set. Once both the cylinder group map
* and the data itself have been written, it is safe to write the entry in
* the indirect block that claims the block; the "allocindir" dependency
* can then be freed as it is no longer applicable.
*/
struct allocindir {
struct worklist ai_list; /* buffer holding indirect block */
# define ai_state ai_list.wk_state /* indirect block pointer state */
LIST_ENTRY(allocindir) ai_next; /* indirdep's list of allocindir's */
int ai_offset; /* pointer offset in indirect block */
ufs_daddr_t ai_newblkno; /* new block pointer value */
ufs_daddr_t ai_oldblkno; /* old block pointer value */
struct freefrag *ai_freefrag; /* block to be freed when complete */
struct indirdep *ai_indirdep; /* address of associated indirdep */
LIST_ENTRY(allocindir) ai_deps; /* bmsafemap's list of allocindir's */
struct buf *ai_buf; /* cylgrp buffer (if pending) */
};
/*
* A "freefrag" structure is attached to an "inodedep" when a previously
* allocated fragment is replaced with a larger fragment, rather than extended.
* The "freefrag" structure is constructed and attached when the replacement
* block is first allocated. It is processed after the inode claiming the
* bigger block that replaces it has been written to disk. Note that the
* ff_state field is used to store the uid, so may lose data. However,
* the uid is used only in printing an error message, so is not critical.
* Keeping it in a short keeps the data structure down to 32 bytes.
*/
struct freefrag {
struct worklist ff_list; /* id_inowait or delayed worklist */
# define ff_state ff_list.wk_state /* owning user; should be uid_t */
struct vnode *ff_devvp; /* filesystem device vnode */
struct fs *ff_fs; /* addr of superblock */
ufs_daddr_t ff_blkno; /* fragment physical block number */
long ff_fragsize; /* size of fragment being deleted */
ino_t ff_inum; /* owning inode number */
};
/*
* A "freeblks" structure is attached to an "inodedep" when the
* corresponding file's length is reduced to zero. It records all
* the information needed to free the blocks of a file after its
* zeroed inode has been written to disk.
*/
struct freeblks {
struct worklist fb_list; /* id_inowait or delayed worklist */
ino_t fb_previousinum; /* inode of previous owner of blocks */
struct vnode *fb_devvp; /* filesystem device vnode */
struct fs *fb_fs; /* addr of superblock */
off_t fb_oldsize; /* previous file size */
off_t fb_newsize; /* new file size */
int fb_chkcnt; /* used to check cnt of blks released */
uid_t fb_uid; /* uid of previous owner of blocks */
ufs_daddr_t fb_dblks[NDADDR]; /* direct blk ptrs to deallocate */
ufs_daddr_t fb_iblks[NIADDR]; /* indirect blk ptrs to deallocate */
};
/*
* A "freefile" structure is attached to an inode when its
* link count is reduced to zero. It marks the inode as free in
* the cylinder group map after the zeroed inode has been written
* to disk and any associated blocks and fragments have been freed.
*/
struct freefile {
struct worklist fx_list; /* id_inowait or delayed worklist */
mode_t fx_mode; /* mode of inode */
ino_t fx_oldinum; /* inum of the unlinked file */
struct vnode *fx_devvp; /* filesystem device vnode */
struct fs *fx_fs; /* addr of superblock */
};
/*
* A "diradd" structure is linked to an "inodedep" id_inowait list when a
* new directory entry is allocated that references the inode described
* by "inodedep". When the inode itself is written (either the initial
* allocation for new inodes or with the increased link count for
* existing inodes), the COMPLETE flag is set in da_state. If the entry
* is for a newly allocated inode, the "inodedep" structure is associated
* with a bmsafemap which prevents the inode from being written to disk
* until the cylinder group has been updated. Thus the da_state COMPLETE
* flag cannot be set until the inode bitmap dependency has been removed.
* When creating a new file, it is safe to write the directory entry that
* claims the inode once the referenced inode has been written. Since
* writing the inode clears the bitmap dependencies, the DEPCOMPLETE flag
* in the diradd can be set unconditionally when creating a file. When
* creating a directory, there are two additional dependencies described by
* mkdir structures (see their description below). When these dependencies
* are resolved the DEPCOMPLETE flag is set in the diradd structure.
* If there are multiple links created to the same inode, there will be
* a separate diradd structure created for each link. The diradd is
* linked onto the pg_diraddhd list of the pagedep for the directory
* page that contains the entry. When a directory page is written,
* the pg_diraddhd list is traversed to roll back any entries that are
* not yet ready to be written to disk. If a directory entry is being
* changed (by rename) rather than added, the DIRCHG flag is set and
* the da_previous entry points to the entry that will be "removed"
* once the new entry has been committed. During rollback, entries
* with da_previous are replaced with the previous inode number rather
* than zero.
*
* The overlaying of da_pagedep and da_previous is done to keep the
* structure down to 32 bytes in size on a 32-bit machine. If a
* da_previous entry is present, the pointer to its pagedep is available
* in the associated dirrem entry. If the DIRCHG flag is set, the
* da_previous entry is valid; if not set the da_pagedep entry is valid.
* The DIRCHG flag never changes; it is set when the structure is created
* if appropriate and is never cleared.
*/
struct diradd {
struct worklist da_list; /* id_inowait and id_pendinghd list */
# define da_state da_list.wk_state /* state of the new directory entry */
LIST_ENTRY(diradd) da_pdlist; /* pagedep holding directory block */
doff_t da_offset; /* offset of new dir entry in dir blk */
ino_t da_newinum; /* inode number for the new dir entry */
union {
struct dirrem *dau_previous; /* entry being replaced in dir change */
struct pagedep *dau_pagedep; /* pagedep dependency for addition */
} da_un;
};
#define da_previous da_un.dau_previous
#define da_pagedep da_un.dau_pagedep
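/*
* Illustrative sketch only; it is not part of the soft update code. It
* shows the rollback choice described above for "diradd": when a
* directory page is written, an entry that is not yet ready is written
* as zero, or as the previous inode number if the entry is a change
* (DIRCHG). The sketch assumes an entry is ready once both COMPLETE and
* DEPCOMPLETE are set, and it stores the previous inode number directly
* in the structure rather than reaching it through da_previous. All ex_
* names and flag values are hypothetical.
*
```c
#include <assert.h>

// Hypothetical flag values, chosen only for this sketch.
#define EX_COMPLETE    0x0004   // referenced inode has been written
#define EX_DEPCOMPLETE 0x0008   // additional dependencies are resolved
#define EX_DIRCHG      0x0080   // entry replaces an existing one (rename)

struct ex_diradd {
	int           state;
	unsigned long newinum;    // inode number of the new entry
	unsigned long previnum;   // inode number being replaced (DIRCHG only)
};

// Inode number to place in the copy of the directory page that goes
// to disk.
static unsigned long
ex_inum_to_write(const struct ex_diradd *da)
{
	int ready = EX_COMPLETE | EX_DEPCOMPLETE;

	if ((da->state & ready) == ready)
		return (da->newinum);    // safe to commit the new entry
	if (da->state & EX_DIRCHG)
		return (da->previnum);   // roll back to the old entry
	return (0);                      // roll back to an empty slot
}
```
*/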
/*
* Two "mkdir" structures are needed to track the additional dependencies
* associated with creating a new directory entry. Normally a directory
* addition can be committed as soon as the newly referenced inode has been
* written to disk with its increased link count. When a directory is
* created there are two additional dependencies: writing the directory
* data block containing the "." and ".." entries (MKDIR_BODY) and writing
* the parent inode with the increased link count for ".." (MKDIR_PARENT).
* These additional dependencies are tracked by two mkdir structures that
* reference the associated "diradd" structure. When they have completed,
* they set the DEPCOMPLETE flag on the diradd so that it knows that its
* extra dependencies have been completed. The md_state field is used only
* to identify which type of dependency the mkdir structure is tracking.
* It is not used in the mainline code for any purpose other than consistency
* checking. All the mkdir structures in the system are linked together on
* a list. This list is needed so that a diradd can find its associated
* mkdir structures and deallocate them if it is prematurely freed (for
* example, if a mkdir is immediately followed by a rmdir of the same
* directory). Here, the free of the diradd must traverse the list to find
* the associated mkdir structures that reference it. The deletion would be
* faster if the diradd structure were simply augmented to have two pointers
* that referenced the associated mkdir's. However, this would increase the
* size of the diradd structure from 32 to 64 bytes to speed a very
* infrequent operation.
*/
struct mkdir {
struct worklist md_list; /* id_inowait or buffer holding dir */
# define md_state md_list.wk_state /* type: MKDIR_PARENT or MKDIR_BODY */
struct diradd *md_diradd; /* associated diradd */
LIST_ENTRY(mkdir) md_mkdirs; /* list of all mkdirs */
};
LIST_HEAD(mkdirlist, mkdir) mkdirlisthd;
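/*
* Illustrative sketch only; it is not part of the soft update code. It
* shows the list walk described above for "mkdir": when a diradd is
* freed early, every mkdir structure on the global list that references
* it is unlinked and released. A hand-rolled singly linked list stands in
* for the LIST_ macros, and all ex_ names are hypothetical.
*
```c
#include <assert.h>
#include <stdlib.h>

struct ex_da { int unused; };     // stand-in for a diradd

struct ex_mkdir {
	struct ex_da    *diradd;  // diradd this dependency belongs to
	struct ex_mkdir *next;    // next mkdir on the global list
};

// Push a new mkdir that references da onto the list.
static void
ex_add_mkdir(struct ex_mkdir **headp, struct ex_da *da)
{
	struct ex_mkdir *md = malloc(sizeof(*md));

	assert(md != NULL);
	md->diradd = da;
	md->next = *headp;
	*headp = md;
}

// Unlink and free every mkdir that references da. Returns the number
// of structures released.
static int
ex_release_mkdirs(struct ex_mkdir **headp, struct ex_da *da)
{
	struct ex_mkdir **pp = headp, *md;
	int n = 0;

	while ((md = *pp) != NULL) {
		if (md->diradd != da) {
			pp = &md->next;
			continue;
		}
		*pp = md->next;   // unlink before freeing
		free(md);
		n++;
	}
	return (n);
}
```
*
* The walk visits every mkdir in the system, which is the cost the
* comment above accepts in exchange for a smaller diradd.
*/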
/*
* A "dirrem" structure describes an operation to decrement the link
* count on an inode. The dirrem structure is attached to the pg_dirremhd
* list of the pagedep for the directory page that contains the entry.
* It is processed after the directory page with the deleted entry has
* been written to disk.
*
* The overlaying of dm_pagedep and dm_dirinum is done to keep the
* structure down to 32 bytes in size on a 32-bit machine. It works
* because they are never used concurrently.
*/
struct dirrem {
struct worklist dm_list; /* delayed worklist */
# define dm_state dm_list.wk_state /* state of the old directory entry */
LIST_ENTRY(dirrem) dm_next; /* pagedep's list of dirrem's */
struct mount *dm_mnt; /* associated mount point */
ino_t dm_oldinum; /* inum of the removed dir entry */
union {
struct pagedep *dmu_pagedep; /* pagedep dependency for remove */
ino_t dmu_dirinum; /* parent inode number (for rmdir) */
} dm_un;
};
#define dm_pagedep dm_un.dmu_pagedep
#define dm_dirinum dm_un.dmu_dirinum