of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and
D_ALLOCINDIR), which can exhaust kmem.
Handle an excess of D_NEWBLK in the same way as an excess of D_INODEDEP
and D_DIRREM: schedule an ast to flush dependencies once the thread
that created the new dep has left the VFS/FFS innards. For D_NEWBLK, the
only way to get rid of them is to do a full sync, since the items are
attached to data blocks of arbitrary vnodes. The check for D_NEWBLK
excess in softdep_ast_cleanup_proc() is unlocked.
For 32-bit arches, reduce the total number of allowed dependencies by
a factor of two. Increasing the limit for 64-bit platforms with direct
maps could be considered.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
By the time deallocate_dependencies() runs,
softdep_journal_freeblocks() has already called cancel_allocdirect(),
which should have eliminated direct dependencies for all truncated full
blocks. The indirect dependencies are allowed above, since second-
and third-level dependencies are only dealt with by the code which
frees the indirect block, which happens after the inode write.
Discussed with: mckusick, jeff
Reviewed by: jeff
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
buildkernel run.
Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros. It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.
Review: https://reviews.freebsd.org/D2665
Reviewed by: rodrigc
Looked at by: bjk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Eliminate it, and simplify code by removing the local dflags variable
always initialized to DEPALLOC.
Noted by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks. An attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.
Instead of synchronously waiting for the worker, tickle the worker and
pause until the backlog is cleaned at a safe point, during the return
from kernel to usermode. A new ast request to call
softdep_ast_cleanup() is created; the SU code now only checks the size
of the queue and schedules the ast.
There is no ast delivery for kernel threads, so they are exempted
from the mechanism, except for NFS daemon threads. The NFS server loop
explicitly checks for the request, and informs schedule_cleanup()
that it is capable of handling the requests via the process P2_AST_SU
flag. This is needed because nfsd may be the sole cause of the SU
workqueue overflow. But, to avoid causing nfsd to spawn additional
threads just because we slow down existing workers, only tickle the SU
threads, without waiting for the backlog cleanup.
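The idea of recording a cleanup request deep in the call stack and acting on it only at a safe point can be illustrated with a small userland sketch. This is not the kernel code: struct thread_state, MAX_BACKLOG, the backlog counter, and both function names are hypothetical stand-ins, and draining the backlog is simulated by resetting a counter.

```c
#include <stdbool.h>

#define MAX_BACKLOG 4		/* hypothetical queue-size limit */

/* Hypothetical per-thread state; not the kernel's struct thread. */
struct thread_state {
	bool cleanup_pending;	/* set deep in the call stack */
	int locks_held;		/* stand-in for owned vnode locks */
};

static int backlog;		/* stand-in for the SU work queue length */

/*
 * Called with locks held: only check the queue size and record the
 * request. Never wait here, since the worker may need our locks.
 */
static void
schedule_cleanup_sketch(struct thread_state *td)
{
	if (backlog > MAX_BACKLOG)
		td->cleanup_pending = true;
}

/*
 * Called at the safe point, where no locks are held. Returns true if
 * the deferred cleanup actually ran.
 */
static bool
safe_point_sketch(struct thread_state *td)
{
	if (!td->cleanup_pending || td->locks_held != 0)
		return (false);
	td->cleanup_pending = false;
	backlog = 0;		/* simulate waiting for the worker to drain */
	return (true);
}
```

The point of the split is that the function called under locks does no waiting at all; the wait happens only where the thread owns nothing the worker could need.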
Reviewed by: jhb, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init
Hang occurs on the rootfs unmount. There are two issues:
1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0. FFS is suspended during
unmount, so unmount just hangs.
2. As noted above, the inodedep is linked eventually. It is not
linked until the superblock is written. But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed. It is masked for
normal system operations, because syncer works in parallel and
eventually flushes the superblock. Syncer is stopped when rootfs is
unmounted, so ffs_sync() must do the sb update on its own.
Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and subtract the count
of such workitems from the total. For the second issue,
ffs_sbupdate() is called right after the device sync in the ffs_sync() loop.
There is a third problem, occurring with both SU and SU+J. The
softdep_waitidle() loop, which waits for the softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick
specified for msleep(9) was a typo.
Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons. Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.
PR: 195458
In collaboration with: gjb, pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
instead of waiting for the FLUSH_* flags. Also, when requesting
flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was
already set.
Reported and tested by: dim,
"Lundberg, Johannes" <johannes@brilliantservice.co.jp>
Bisected by: dim
MFC after: 2 weeks
thread started and incremented the stat_flush_threads [1].
Unconditionally wake up softdep_flush threads when needed; do not try
to check wchan, which is racy and breaks abstraction.
Reported by and discussed with: glebius, neel
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
are not suspended. In particular, on SU-enabled volumes, there is
no reason why, between the calls to softdep_flushfiles() and
softdep_waitidle(), SU work items cannot be queued.
Correct the condition to trigger the panic by only checking when
a forced operation is done. Convert the direct panic() call into a
KASSERT(); there are no invalid on-disk data structures directly
involved, so follow the usual debugging vs. non-debugging approach.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
r269533, which was tested before r269457 was committed, implicitly
relied on Giant to protect the manipulations of the softdepmounts
list. Use the softdep global lock consistently to guarantee the list
structure now.
Insert the new struct mount_softdeps into softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory. Similarly, remove the struct mount_softdeps for
the unmounted filesystem from the tailq before destroying the
structure's rwlock.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
    FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags is the number of fragments to free and oldfrags is the
number of fragments to keep. To block align it, we have to change it to
have a valid full-size blkno, so it becomes:
    FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6
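The adjustment in the example can be written out as a short sketch. The record layout and the function name below are hypothetical, modeled only on the fields shown in the FREEBLK lines above; the real journal record and the code that rewrites it may differ. The sketch assumes a non-negative blkno and a positive number of fragments per block (8 in the example).

```c
#include <stdint.h>

/* Hypothetical record holding the fields named in the example. */
struct freeblk_rec {
	uint64_t ino;
	int64_t blkno;		/* address of the first fragment */
	int64_t lbn;
	int frags;		/* number of fragments to free */
	int oldfrags;		/* number of fragments to keep */
};

/*
 * Move blkno back to the start of its full-size block and count the
 * fragments skipped over as fragments to keep. frags is unchanged.
 */
static void
freeblk_block_align(struct freeblk_rec *rec, int frags_per_block)
{
	int64_t offset = rec->blkno % frags_per_block;

	rec->blkno -= offset;
	rec->oldfrags += (int)offset;
}
```

With 8 fragments per block, blkno=14 has offset 6 within its block, so the record becomes blkno=8 with oldfrags=6, matching the second FREEBLK line.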
Submitted by: Mikihito Takehara
Tested by: Mikihito Takehara
Reviewed by: Jeff Roberson
MFC after: 1 week
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.
Second, the data buffer used for writing journal entries was never
being scrubbed of old data. Fix this.
Submitted by: Takehara Mikihito
Obtained from: Netflix, Inc.
MFC after: 3 days
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.
Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by: John-Mark Gurney <jmg@funkthat.com>
PR: 183424
MFC after: 2 weeks
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.
Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep. Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.
Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.
Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.
Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.
While there are many lines of changes, there should be no functional
change in the operation of soft updates.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Add KASSERTs that soft dependency functions only get called
for filesystems running with soft dependencies. Calls to these
functions when soft updates are not compiled into the system
now panic.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
persist much longer than previously. Historically we had at most 100
entries; now the count may reach a million. With the increased count
we spent far too much time looking them up in the grossly undersized
newblk hash table. Configure the newblk hash table to accurately reflect
the number of entries that it must index.
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
we need to collect the highest level of allocation for each of the
different soft update dependency structures. This change collects these
statistics and makes them available using `sysctl debug.softdep.highuse'.
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.
Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf
an error. One could argue that returning a buffer even when it is
not valid is incorrect, but bread has always returned a buffer
valid or not.
Reviewed by: kib
MFC after: 2 weeks
the return value is NULL. Based on the returned flags, the
return value should never be inspected in the case where NULL
is returned, but it is good coding practice not to return a
pointer to freed memory.
Found by: Coverity Scan, CID 1006096
Reviewed by: kib
MFC after: 2 weeks
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper, it also makes more
sense for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.
Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.
An operational failure from this is extremely
unlikely. These functions only need to find a single entry and are
only called when there are too many entries. The chance that they
would fail because all the entries are on the single skipped hash
chain is remote.
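A minimal sketch of how a variable that holds the table size minus one leads to a skipped chain when it is used as a strict loop bound. The function names and the chain_len array are hypothetical; the actual loops in the softdep code are not reproduced here.

```c
/*
 * Return the index of the first non-empty chain, or -1 if none is
 * found. hash_mask is the table size minus one.
 */
static long
find_nonempty_buggy(const int *chain_len, unsigned long hash_mask)
{
	unsigned long i;

	for (i = 0; i < hash_mask; i++)		/* never visits chain hash_mask */
		if (chain_len[i] > 0)
			return ((long)i);
	return (-1);
}

static long
find_nonempty_fixed(const int *chain_len, unsigned long hash_mask)
{
	unsigned long i;

	for (i = 0; i <= hash_mask; i++)	/* visits all hash_mask + 1 chains */
		if (chain_len[i] > 0)
			return ((long)i);
	return (-1);
}
```

The two versions differ only when every entry sits on the last chain, which is why the failure is so unlikely in practice.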
Submitted by: Pedro Martelletto
Reviewed by: kib
MFC after: 2 weeks
Current dqflush() panics when a dquot with a non-zero refcount is
encountered. The situation is possible because quotas are turned off
before the softdep workitem queue is flushed, since the quota file
writes might create softdep workitems.
Make encountering an active dquot in dqflush() non-fatal; return
the error from quotaoff() instead. Ignore quotaoff() failures
when ffs_flushfiles() is called in the course of the
softdep_flushfiles() loop, until the last iteration. On the last
iteration, the quotas must be closed, and because SU workitems should
already be flushed, the references to the dquot are gone.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could cope with such a change in
semantics, and this situation could eventually lead to bugs similar to
the ones fixed in r244240.
Re-specify the userland priority for the kthreads involved.
Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week
crash. When truncating a file that never made it to disk we use the
canceled allocation dependencies to hold the journal records until
the truncation completes. Previously allocdirect dependencies on
the id_bufwait list were not considered and their journal space
could expire before the bitmaps were written. Cancel them and attach
them to the freeblks as we do for other allocdirects.
- Add KTR traces that were used to debug this problem.
- When adding jsegdeps, always use jwork_insert() so we don't have more
than one segdep on a given jwork list.
Sponsored by: EMC / Isilon Storage Division
When a background copy of a cg is written we complete any work associated
with that bmsafemap. If new work has been added to the non-background
copy of the buffer it will be completed before the next write happens.
The solution is to do the rollbacks when we make the copy so only those
dependencies that were present at the time of writing will be completed
when the background write completes. The bug would have resulted in
various bitmap-related corruptions and panics. It also would have
expired journal entries early, causing journal replay to miss some
records.
MFC after: 2 weeks
solve power loss problems with dishonest write caches. However, it
should improve the situation and force a full fsck when it is unable
to resolve with the journal.
- Resolve a case where the journal could wrap in an unsafe way causing
us to prematurely lose journal entries in very specific scenarios.
Discussed with: mckusick
MFC after: 1 month
the previous diradd had already finished it could have been reclaimed
already. This would only happen under heavy dependency pressure.
Reported by: Andrey Zonov <zont@FreeBSD.org>
Discussed with: mckusick
MFC after: 1 week
with softupdates went away. Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.
Reviewed by: mckusick
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
VFS_VERSION is bumped to indicate the interface change, which does
not result in changes to the interface signatures.
Conducted and reviewed by: attilio
Tested by: pho
and bmsafemap dependency structures in inodedep_lookup() and
bmsafemap_lookup() respectively. The setup of these structures must
be done while holding the soft-dependency mutex. If the inodedep is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the bmsafemap. If the bmsafemap is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the inodedep.
To resolve this problem, bmsafemap_lookup has had a parameter added
that allows a pre-malloc'ed bmsafemap to be passed in so that it does
not need to release the mutex to create a new bmsafemap. The
softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency
before acquiring the mutex and starting to build the inodedep with a
call to inodedep_lookup(). The subsequent call to bmsafemap_lookup()
is passed this pre-allocated bmsafemap entry so that it need not
release the mutex if it needs to create a new one.
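The pre-allocation pattern described above can be sketched in userland C with a pthread mutex. Everything here is a hypothetical stand-in: struct entry, table_lock, lookup_locked(), and setup_entry() are illustrative names, not the kernel's bmsafemap_lookup() or softdep_setup_inomapdep(), and malloc() stands in for a kernel allocation that may sleep.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical table entry; stands in for a dependency structure. */
struct entry {
	int key;
	struct entry *next;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *table_head;

/*
 * Caller holds table_lock and passes a pre-allocated entry. The lock
 * is never dropped: if no entry exists for key, *prealloc is consumed
 * and set to NULL; otherwise it is left for the caller to free.
 */
static struct entry *
lookup_locked(int key, struct entry **prealloc)
{
	struct entry *e;

	for (e = table_head; e != NULL; e = e->next)
		if (e->key == key)
			return (e);
	e = *prealloc;
	*prealloc = NULL;
	e->key = key;
	e->next = table_head;
	table_head = e;
	return (e);
}

static struct entry *
setup_entry(int key)
{
	struct entry *spare, *e;

	spare = malloc(sizeof(*spare));	/* allocate before taking the lock */
	if (spare == NULL)
		return (NULL);
	pthread_mutex_lock(&table_lock);
	e = lookup_locked(key, &spare);
	pthread_mutex_unlock(&table_lock);
	free(spare);			/* NULL if the spare was consumed */
	return (e);
}
```

Because the lookup never has to release the lock to allocate, nothing built earlier under the same lock can be freed behind the caller's back. The cost is an occasional allocation that goes unused.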
Reported by: Peter Holm
Tested by: Peter Holm
MFC after: 1 week