Commit Graph

274 Commits

Author SHA1 Message Date
Jeff Roberson
280e091a99 Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

 - Append partial truncation freework structures to indirdeps while
   truncation is proceeding.  These prevent new block pointers from
   becoming valid until truncation completes and serialize truncations.
 - On completion of a partial truncate journal work waits for zeroed
   pointers to hit indirects.
 - softdep_journal_freeblocks() handles last frag allocation and last
   block zeroing.
 - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
   is only implemented in one place.
 - Block allocation failure handling moved up one level so it does not
   proceed with buf locks held.  This permits us to do more extensive
   reclaims when filesystem space is exhausted.
 - softdep_sync_metadata() is broken into two parts, the first executes
   once at the start of ffs_syncvnode() and flushes truncations and
   inode dependencies.  The second is called on each locked buf.  This
   eliminates excessive looping and rollbacks.
 - Improve the mechanism in process_worklist_item() that handles
   acquiring vnode locks for handle_workitem_remove() so that it works
   more generally and does not loop excessively over the same worklist
   items on each call.
 - Don't corrupt directories by zeroing the tail in fsck.  This is only
   done for regular files.
 - Push a fsync complete record for files that need it so the checker
   knows a truncation in the journal is no longer valid.

Discussed with:	mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by:	pho
2011-06-10 22:48:35 +00:00
Matthew D Fleming
3d08a76bbc Use a name instead of a magic number for kern_yield(9) when the priority
should not change.  Fetch the td_user_pri under the thread lock.  This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by:	Jason Behmer < jason DOT behmer AT isilon DOT com >
2011-05-13 05:27:58 +00:00
Jeff Roberson
273ca85137 - Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
 - Use dep_current[] in place of specific dependency counts.  This is
   automatically maintained when workitems are allocated and has
   less risk of becoming incorrect.
2011-04-11 01:43:59 +00:00
Jeff Roberson
4ac80906c3 Fix a long standing SUJ performance problem:
- Keep a hash of indirect blocks that have recently been freed and are
   still referenced in the journal.
 - Lookup blocks in this hash before forcing a new block write to wait on
   the journal entry to hit the disk.  This is only necessary to avoid
   confusion between old identities as indirects and new identities as
   file blocks.
 - Don't free jseg structures until the journal has written a record that
   invalidates it.  This keeps the indirect block information around for
   as long as is required to be safe.
 - Force an empty journal block write when required to flush out stale
   journal data that is simply waiting for the oldest valid sequence
   number to advance beyond it.
2011-04-10 03:49:53 +00:00
Jeff Roberson
59343c7b98 - Don't invalidate jnewblks immediately upon discovering that the block
will be removed.  Permit the journal to proceed so that we don't leave
   a rollback in a cg for a very long time as this can cause terrible perf
   problems in low memory situations.

Tested by:      pho
2011-04-07 03:19:10 +00:00
Kirk McKusick
4c821a3978 Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks
2011-04-05 21:26:05 +00:00
Jeff Roberson
f79d4144ab Fix problems that manifested from filesystem full conditions:
- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
   the jaddref so we can make assumptions about where the dotaddref is on
   the list.  cancel_jaddref() does not always remove items from the list
   anymore.
 - Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
   was never set.  This ensures that dependencies will continue to be
   processed on the inowait/bufwait list and is more an artifact of
   the structure of the code than a pure ordering problem.
 - Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
   appropriately.  This normally occurs when the refs are added to the
   journal but if they are canceled before this point the state would
   never be set and the dependency could never be freed.

Reported by:	pho
Tested by:	pho
2011-04-02 21:52:58 +00:00
Konstantin Belousov
861ed1162b Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.
Submitted by:	Aleksandr Rybalko <ray dlink ua>
2011-03-28 12:39:48 +00:00
Kirk McKusick
0a809056ce Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm
2011-03-23 05:13:54 +00:00
Konstantin Belousov
455a6e0ff3 Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. Journal need
to atomically write the block, that can only be guaranteed at the device
sector size, not larger. Attempt to write less then sector size results in
driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with:	pho
Reviewed by:	jeff
Tested by:	bz, pho
2011-02-12 12:52:12 +00:00
Alexander Leidinger
3eb6e1317c Add some FEATURE macros for some UFS features.
SU+J is not included as a FEATURE macro:
 - it was not in the tree during the GSoC
 - I do not see an option to en-/disable it in NOTES

Two minor changes where made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by:	Google Summer of Code 2010
Submitted by:	kibab
Reviewed by:	kib
X-MFC after:	to be determined in last commit with code from this project
2011-02-09 15:33:13 +00:00
Matthew D Fleming
e7ceb1e99b Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yeild() as being unnecessary,
   such as in a sysctl handler.

 - move should_yield() and maybe_yield() to kern_synch.c and move the
   prototypes from sys/uio.h to sys/proc.h

 - add a slightly more generic kern_yield() that can replace the
   functionality of uio_yield().

 - replace source uses of uio_yield() with the functional equivalent,
   or in some cases do not change the thread priority when switching.

 - fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

 - instead of using the per-cpu last switched ticks, use a per thread
   variable for should_yield().  With PREEMPTION, the only reasonable
   use of this is to determine if a lock has been held a long time and
   relinquish it.  Without PREEMPTION, this is essentially the same as
   the per-cpu variable.
2011-02-08 00:16:36 +00:00
Matthew D Fleming
08b163fa51 Put the general logic for being a CPU hog into a new function
should_yield().  Use this in various places.  Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after:	1 week
2011-02-02 16:35:10 +00:00
Matthew D Fleming
fbbb13f962 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
2011-01-12 19:54:19 +00:00
Konstantin Belousov
ac32f1176b Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by:	jeff
Tested by:	pho
2011-01-04 10:25:55 +00:00
Konstantin Belousov
465e3ccdbb Handle missing jremrefs when a directory is renamed overtop of
another, deleting it.  If the directory is removed, UFS always need to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by:	many
Submitted by:	jeff
Tested by:	pho
2010-12-30 10:52:07 +00:00
Konstantin Belousov
42a6fc4385 In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with:	pho
Reviewed by:	jeff, mckusick
Tested by:	mckusick, pho
2010-12-30 10:41:17 +00:00
Konstantin Belousov
d2d6c59245 Move the definition of mkdirlisthd from header to C file.
Reviewed by:	mckusick
Tested by:	pho
2010-12-29 12:16:06 +00:00
Kirk McKusick
84ad0a66d0 This patch fixes a soft update panic while running perl 5.12 tests
which produced:

    panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson
2010-12-23 00:38:57 +00:00
Peter Holm
bcc5c95b6b First step in fixing the handle_workitem_freeblocks panic.
In collaboration with:	 kib
2010-11-27 20:27:07 +00:00
Konstantin Belousov
be913821af The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
  A write of the indirect block causes the freeblks to become
  ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
  then, softdep_disk_write_complete() would reassign the workitem to the
  mount point worklist, causing premature processing of the workitem, or
  journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
  the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by:	jeff
Tested by:	pho
2010-11-11 11:54:01 +00:00
Konstantin Belousov
496fd81362 Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:41:52 +00:00
Konstantin Belousov
d23c72cdb5 In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:38:57 +00:00
Konstantin Belousov
fae5c47dd4 Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:35:42 +00:00
Konstantin Belousov
4e4ff01629 Fix typo. Function is called ffs_blkfree. 2010-11-11 11:26:59 +00:00
Alan Cox
a03e344a7f M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.
2010-10-02 17:58:57 +00:00
Kirk McKusick
e69bed360f Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
2010-09-29 14:46:57 +00:00
Konstantin Belousov
063045a555 Fix typo in comment. 2010-09-29 07:40:11 +00:00
Kirk McKusick
c0b2efce9e Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by:	jeff Roberson
2010-09-14 18:04:05 +00:00
John Baldwin
3634d5b241 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
Konstantin Belousov
691401eef8 Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
2010-08-12 08:35:24 +00:00
Jeff Roberson
9f9c8c59ae - Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count.  Previously
   all truncation as a result of unlink happened in the softdep flush
   thread.  This had the effect of being impossible to rate limit properly
   with the journal code.  Now the process issuing unlinks is suspended
   when the journal files.  This has a side-effect of improving rm
   performance by allowing more concurrent work.
 - Handle two cases in inactive, one for effnlink == 0 and another when
   nlink finally reaches 0.
 - Eliminate the SPACECOUNTED related code since the truncation is no
   longer delayed.

Discussed with:	mckusick
2010-07-06 07:11:04 +00:00
Andriy Gapon
d89c217f30 ffs_softdep: change K&R in function defintions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its defintion.

Complaint from:	clang
MFC after:	2 weeks
2010-06-11 18:26:53 +00:00
Jeff Roberson
f0268739c7 - Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration.  This can lead to a deadlock when we have
   worklist items that cannot be immediately satisfied.

Reported by:	uqs, Dimitry Andric <dimitry@andric.com>

 - Remove some unnecessary debugging code and place some other under
   SUJ_DEBUG.
 - Examine the journal state in softdep_slowdown().
 - Re-format some comments so I may more easily add flag descriptions.
2010-05-19 06:18:01 +00:00
Jeff Roberson
8ef48de888 - Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
 - Don't fsync() vnodes in prealloc if copy on write is in progress.  It
   is not safe to recurse back into the write path here.

Reported by:	Vladimir Grebenschikov <vova@fbsd.ru>
2010-05-07 08:45:21 +00:00
Jeff Roberson
2c3ae115b6 - Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not.  This fixes
   a deadlock that can occur with unlinked but referenced files.
   Journal space and inodedeps were not correctly reclaimed because
   the inode block was not left dirty.

Tested/Reported by:	lwindschuh@googlemail.com
2010-05-07 08:20:56 +00:00
Jeff Roberson
2bd20091e4 - When canceling jaddrefs they may not yet be in the journal if this is via
a revert call.  In this case don't attempt to remove something that
   has not yet been added.  Otherwise this jaddref must hang around
   to prevent the bitmap write as normal.
2010-04-28 07:57:37 +00:00
Jeff Roberson
3b32573a9f - Fix builds without SOFTUPDATES defined in the kernel config. 2010-04-28 07:26:41 +00:00
Pawel Jakub Dawidek
a8750f2dca Fix build for UFS without SOFTUPDATES. 2010-04-24 07:36:33 +00:00
Jeff Roberson
113db2dddb - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
Konstantin Belousov
5c61c646a3 The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 week
2009-09-06 11:46:51 +00:00
Konstantin Belousov
165a3b418f When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done syncronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by:	pluknet gmail com
Diagnosed and tested by:	pho
Approved by:	re (rwatson)
MFC after:	1 month
2009-08-14 11:00:38 +00:00
Konstantin Belousov
f1eccd05ec In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
Konstantin Belousov
a50d1b2a66 Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:00 +00:00
Attilio Rao
f083018223 Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
Konstantin Belousov
bc364c4e99 When removing or renaming snaphost, do not delve into request_cleanup().
The later may need blocks from the underlying device that belongs
to normal files, that should not be locked while snap lock is held.

Reported and tested by:	pho
MFC after:	1 month
2009-04-04 12:19:52 +00:00
Attilio Rao
83b3bdbc8a Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
  Change so its KPI by removing the interlock argument and defining 2 new
  flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
  old version (which was unlinked from the lockmgr alredy) and
  MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
  once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
  mnt_ref if running for more than 3 seconds, make it totally useless.
  Remove it as it was thought to work into older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if it the waiters were actually
  woken up. Just a place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
Dag-Erling Smørgrav
e11e3f187d Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Konstantin Belousov
52dfc8d7da Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00