75914bbe38
The problem is caused when a directory block is compacted. When this occurs, softdep_change_directoryentry_offset() is called to relocate each directory entry and adjust its matching diradd structure, if any, to match the new location of the entry. The bug is that while softdep_change_directoryentry_offset() correctly adjusts the offsets of the diradd structures on the pd_diraddhd[] lists (which are not yet ready to be committed to disk), it fails to adjust the offsets of the diradd structures on the pd_pendinghd list (which are ready to be committed to disk). This causes the dependency structures to be inconsistent with the buf contents. Now, if the compaction has moved a directory entry to the same offset as one of the diradd structures on the pd_pendinghd list *and* a syscall is done that tries to remove this directory entry before this directory block has been written to disk (which would empty pd_pendinghd), a sanity check in newdirrem() will call panic() when it notices that the inode number in the entry that it is to be removed doesn't match the inode number in the diradd structure with that offset of that entry. Reviewed by: Kirk McKusick <mckusick@McKusick.COM> Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com> |
||
---|---|---|
.. | ||
ffs_softdep.c | ||
README | ||
softdep.h |
Introduction This package constitutes the alpha distribution of the soft update code updates for the fast filesystem. For More information on what Soft Updates is, see: http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/ Status My `filesystem torture tests' (described below) run for days without a hitch (no panic's, hangs, filesystem corruption, or memory leaks). However, I have had several panic's reported to me by folks that are field testing the code which I have not yet been able to reproduce or fix. Although these panic's are rare and do not cause filesystem corruption, the code should only be put into production on systems where the system administrator is aware that it is being run, and knows how to turn it off if problems arise. Thus, you may hand out this code to others, but please ensure that this status message is included with any distributions. Please also include the file ffs_softdep.stub.c in any distributions so that folks that cannot abide by the need to redistribute source will not be left with a kernel that will not link. It will resolve all the calls into the soft update code and simply ignores the request to enable them. Thus you will be able to ensure that your other hooks have not broken anything and that your kernel is softdep-ready for those that wish to use them. Please report problems back to me with kernel backtraces of panics if possible. This is massively complex code, and people only have to have their filesystems hosed once or twice to avoid future changes like the plague. I want to find and fix as many bugs as soon as possible so as to get the code rock solid before it gets widely released. Please report any bugs that you uncover to mckusick@mckusick.com. Performance Running the Andrew Benchmarks yields the following raw data: Phase Normal Softdep What it does 1 3s <1s Creating directories 2 8s 4s Copying files 3 6s 6s Recursive directory stats 4 8s 9s Scanning each file 5 25s 25s Compilation Normal: 19.9u 29.2s 0:52.8 135+630io Softdep: 20.3u 28.5s 0:47.8 103+363io Another interesting datapoint are my `filesystem torture tests'. They consist of 1000 runs of the andrew benchmarks, 1000 copy and removes of /etc with randomly selected pauses of 0-60 seconds between each copy and remove, and 500 find from / with randomly selected pauses of 100 seconds between each run). The run of the torture test compares as follows: With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min The upshot is 42% less I/O and 28% shorter running time. Another interesting test point is a full MAKEDEV. Because it runs as a shell script, it becomes mostly limited by the execution speed of the machine on which it runs. Here are the numbers: With soft updates: labrat# time ./MAKEDEV std 2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w labrat# ls | wc 522 522 3317 Without soft updates: labrat# time ./MAKEDEV std 2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w labrat# ls | wc 522 522 3317 Of course, some of the system time is being pushed to the syncer process, but that is a different story. To show a benchmark designed to highlight the soft update code consider a tar of zero-sized files and an rm -rf of a directory tree that has at least 50 files or so at each level. Running a test with a directory tree containing 28 directories holding 202 empty files produces the following numbers: With soft updates: tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes) rm: 0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes) Normal filesystem: tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes) rm: 0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes) The large reduction in writes is because inodes are clustered, so most of a block gets allocated, then the whole block is written out once rather than having the same block written once for each inode allocated from it. Similarly each directory block is written once rather than once for each new directory entry. Effectively what the update code is doing is allocating a bunch of inodes and directory entries without writing anything, then ensuring that the block containing the inodes is written first followed by the directory block that references them. If there were data in the files it would further ensure that the data blocks were written before their inodes claimed them. Copyright Restrictions Please familiarize yourself with the copyright restrictions contained at the top of either the sys/ufs/ffs/softdep.h or sys/ufs/ffs/ffs_softdep.c file. The key provision is similar to the one used by the DB 2.0 package and goes as follows: Redistributions in any form must be accompanied by information on how to obtain complete source code for any accompanying software that uses the this software. This source code must either be included in the distribution or be available for no more than the cost of distribution plus a nominal fee, and must be freely redistributable under reasonable conditions. For an executable file, complete source code means the source code for all modules it contains. It does not mean source code for modules or files that typically accompany the operating system on which the executable file runs, e.g., standard library modules or system header files. The idea is to allow those of you freely redistributing your source to use it while retaining for myself the right to peddle it for money to the commercial UNIX vendors. Note that I have included a stub file ffs_softdep.c.stub that is freely redistributable so that you can put in all the necessary hooks to run the full soft updates code, but still allow vendors that want to maintain proprietary source to have a working system. I do plan to release the code with a `Berkeley style' copyright once I have peddled it around to the commercial vendors. If you have concerns about this copyright, feel free to contact me with them and we can try to resolve any difficulties. Soft Dependency Operation The soft update implementation does NOT require ANY changes to the on-disk format of your filesystems. Furthermore it is not used by default for any filesystems. It must be enabled on a filesystem by filesystem basis by running tunefs to set a bit in the superblock indicating that the filesystem should be managed using soft updates. If you wish to stop using soft updates due to performance or reliability reasons, you can simply run tunefs on it again to turn off the bit and revert to normal operation. The additional dynamic memory load placed on the kernel malloc arena is approximately equal to the amount of memory used by vnodes plus inodes (for a system with 1000 vnodes, the additional peak memory load is about 300K). Kernel Changes There are two new changes to the kernel functionality that are not contained in in the soft update files. The first is a `trickle sync' facility running in the kernel as process 3. This trickle sync process replaces the traditional `update' program (which should be commented out of the /etc/rc startup script). When a vnode is first written it is placed 30 seconds down on the trickle sync queue. If it still exists and has dirty data when it reaches the top of the queue, it is sync'ed. This approach evens out the load on the underlying I/O system and avoids writing short-lived files. The papers on trickle-sync tend to favor aging based on buffers rather than files. However, I sync on file age rather than buffer age because the data structures are much smaller as there are typically far fewer files than buffers. Although this can make the I/O spikey when a big file times out, it is still much better than the wholesale sync's that were happening before. It also adapts much better to the soft update code where I want to control aging to improve performance (inodes age in 10 seconds, directories in 15 seconds, files in 30 seconds). This ensures that most dependencies are gone (e.g., inodes are written when directory entries want to go to disk) reducing the amount of rollback that is needed. The other main kernel change is to split the vnode freelist into two separate lists. One for vnodes that are still being used to identify buffers and the other for those vnodes no longer identifying any buffers. The latter list is used by getnewvnode in preference to the former. Packaging of Kernel Changes The sys subdirectory contains the changes and additions to the kernel. My goal in writing this code was to minimize the changes that need to be made to the kernel. Thus, most of the new code is contained in the two new files softdep.h and ffs_softdep.c. The rest of the kernel changes are simply inserting hooks to call into these two new files. Although there has been some structural reorganization of the filesystem code to accommodate gathering the information required by the soft update code, the actual ordering of filesystem operations when soft updates are disabled is unchanged. The kernel changes are packaged as a set of diffs. As I am doing my development in BSD/OS, the diffs are relative to the BSD/OS versions of the files. Because BSD/OS recently had 4.4BSD-Lite2 merged into it, the Lite2 files are a good starting point for figuring out the changes. There are 40 files that require change plus the two new files. Most of these files have only a few lines of changes in them. However, four files have fairly extensive changes: kern/vfs_subr.c, ufs/ufs/ufs_lookup.c, ufs/ufs/ufs_vnops.c, and ufs/ffs/ffs_alloc.c. For these four files, I have provided the original Lite2 version, the Lite2 version with the diffs merged in, and the diffs between the BSD/OS and merged version. Even so, I expect that there will be some difficulty in doing the merge; I am certainly willing to assist in helping get the code merged into your system. Packaging of Utility Changes The utilities subdirectory contains the changes and additions to the utilities. There are diffs to three utilities enclosed: tunefs - add a flag to enable and disable soft updates mount - print out whether soft updates are enabled and also statistics on number of sync and async writes fsck - tighter checks on acceptable errors and a slightly different policy for what to put in lost+found on filesystems using soft updates In addition you should recompile vmstat so as to get reports on the 13 new memory types used by the soft update code. It is not necessary to use the new version of fsck, however it would aid in my debugging if you do. Also, because of the time lag between deleting a directory entry and the inode it references, you will find a lot more files showing up in your lost+found if you do not use the new version. Note that the new version checks for the soft update flag in the superblock and only uses the new algorithms if it is set. So, it will run unchanged on the filesystems that are not using soft updates. Operation Once you have booted a kernel that incorporates the soft update code and installed the updated utilities, do the following: 1) Comment out the update program in /etc/rc. 2) Run `tunefs -n enable' on one or more test filesystems. 3) Mount these filesystems and then type `mount' to ensure that they have been enabled for soft updates. 4) Copy the test directory to a softdep filesystem, chdir into it and run `./doit'. You may want to check out each of the three subtests individually first: doit1 - andrew benchmarks, doit2 - copy and removal of /etc, doit3 - find from /. ==== Additional notes from Feb 13 hen removing huge directories of files, it is possible to get the incore state arbitrarily far ahead of the disk. Maintaining all the associated depedency information can exhaust the kernel malloc arena. To avoid this senario, I have put some limits on the soft update code so that it will not be allowed to rampage through all of the kernel memory. I enclose below the relevant patches to vnode.h and vfs_subr.c (which allow the soft update code to speed up the filesystem syncer process). I have also included the diffs for ffs_softdep.c. I hope to make a pass over ffs_softdep.c to isolate the differences with my standard version so that these diffs are less painful to incorporate. Since I know you like to play with tuning, I have put the relevant knobs on sysctl debug variables. The tuning knobs can be viewed with `sysctl debug' and set with `sysctl -w debug.<name>=value'. The knobs are as follows: debug.max_softdeps - limit on any given resource debug.tickdelay - ticks to delay before allocating debug.max_limit_hit - number of times tickdelay imposed debug.rush_requests - number of rush requests to filesystem syncer The max_softdeps limit is derived from vnodesdesired which in turn is sized based on the amount of memory on the machine. When the limit is hit, a process requesting a resource first tries to speed up the filesystem syncer process. Such a request is recorded as a rush_request. After syncdelay / 2 unserviced rush requests (typically 15) are in the filesystem syncers queue (i.e., it is more than 15 seconds behind in its work), the process requesting the memory is put to sleep for tickdelay seconds. Such a delay is recorded in max_limit_hit. Following this delay it is granted its memory without further delay. I have tried the following experiments in which I delete an MH directory containing 16,703 files: Run # 1 2 3 max_softdeps 4496 4496 4496 tickdelay 100 == 1 sec 20 == 0.2 sec 2 == 0.02 sec max_limit_hit 16 == 16 sec 27 == 5.4 sec 203 == 4.1 sec rush_requests 147 102 93 run time 57 sec 46 sec 45 sec I/O's 781 859 936 When run with no limits, it completes in 40 seconds. So, the time spent in delay is directly added to the bottom line. Shortening the tick delay does cut down the total running time, but at the expense of generating more total I/O operations due to the rush orders being sent to the filesystem syncer. Although the number of rush orders decreases with a shorter tick delay, there are more requests in each order, hence the increase in I/O count. Also, although the I/O count does rise with a shorter delay, it is still at least an order of magnitude less than without soft updates. Anyway, you may want to play around with these value to see what works best and to see if you can get an insight into how best to tune them. If you get out of memory panic's, then you have max_softdeps set too high. The max_limit_hit and rush_requests show be reset to zero before each run. The minimum legal value for tickdelay is 2 (if you set it below that, the code will use 2).