for a new journal specific partial truncate routine.
- Use dep_current[] in place of specific dependency counts. This is
automatically maintained when workitems are allocated and has
less risk of becoming incorrect.
- Keep a hash of indirect blocks that have recently been freed and are
still referenced in the journal.
- Lookup blocks in this hash before forcing a new block write to wait on
the journal entry to hit the disk. This is only necessary to avoid
confusion between old identities as indirects and new identities as
file blocks.
- Don't free jseg structures until the journal has written a record that
invalidates it. This keeps the indirect block information around for
as long as is required to be safe.
- Force an empty journal block write when required to flush out stale
journal data that is simply waiting for the oldest valid sequence
number to advance beyond it.
will be removed. Permit the journal to proceed so that we don't leave
a rollback in a cg for a very long time as this can cause terrible perf
problems in low memory situations.
Tested by: pho
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.
Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks
- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
the jaddref so we can make assumptions about where the dotaddref is on
the list. cancel_jaddref() does not always remove items from the list
anymore.
- Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
was never set. This ensures that dependencies will continue to be
processed on the inowait/bufwait list and is more an artifact of
the structure of the code than a pure ordering problem.
- Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
appropriately. This normally occurs when the refs are added to the
journal but if they are canceled before this point the state would
never be set and the dependency could never be freed.
Reported by: pho
Tested by: pho
ffs_snapgone(). ufs.ko module is not build with FFS define, causing
snapshot inode number slots in superblock never be freed, as well as a
reference on the snapshot vnode.
IFS was removed several years ago, and UFS/FFS separation was not
maintained for real.
Reported, analyzed and tested by: Yamagi Burmeister <lists yamagi org>
MFC after: 3 days
from multiple threads while holding a shared lock during a lookup operation.
This could result in incorrect ENOENT failures which could then be
permanently stored in the name cache.
Specifically, the dirhash code optimizes the case that a single thread is
walking a directory sequentially opening (or stat'ing) each file. It uses
state in the dirhash structure to determine if a given lookup is using the
optimization. If the optimization fails, it disables it and restarts the
lookup. The problem arises when two threads both attempt the optimization
and fail. The first thread will restart the loop, but the second thread
will incorrectly think that it did not try the optimization and will only
examine a subset of the directory entires in its hash chain. As a result,
it may fail to find its directory entry and incorrectly fail with ENOENT.
To make this safe for use with shared locks, simplify the state stored in
the dirhash and move some of the state (the part that determines if the
current thread is trying the optimization) into a local variable. One
result is that we will now try the optimization more often. We still
update the value under the shared lock, but it is a single atomic store
similar to i_diroff that is stored in UFS directory i-nodes for the
non-dirhash lookup.
Reviewed by: kib
MFC after: 1 week
journal blocks, instead of hard coding 512 byte sector size. Journal need
to atomically write the block, that can only be guaranteed at the device
sector size, not larger. Attempt to write less then sector size results in
driver errors.
Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.
In collaboration with: pho
Reviewed by: jeff
Tested by: bz, pho
SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES
Two minor changes where made during the review compared to what was developed
during GSoC 2010.
No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.
Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: kib
X-MFC after: to be determined in last commit with code from this project
- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.
- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h
- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().
- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.
- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.
- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use
should_yield() instead.
MFC after: 1 week
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.
Requested and reviewed by: jeff
Tested by: pho
another, deleting it. If the directory is removed, UFS always need to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.
Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().
Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().
Reported by: many
Submitted by: jeff
Tested by: pho
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.
In collaboration with: pho
Reviewed by: jeff, mckusick
Tested by: mckusick, pho
The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.
Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.
Based on the patch by: mckusick
Reviewed by: mckusick, pjd
Tested by: pho
MFC after: 1 month
data. Otherwise, on 32bit systems, unlinked inode which size is the
multiple of 4GB was not truncated, causing corruption.
Reported by: brucec
Reviewed by: mckusick
Tested by: pho
As result, failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.
Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.
Initial report by: Garrett Cooper
Reproduced and tested by: pho
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.
MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
A write of the indirect block causes the freeblks to become
ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
then, softdep_disk_write_complete() would reassign the workitem to the
mount point worklist, causing premature processing of the workitem, or
journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.
Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.
Analyzed, suggested and reviewed by: jeff
Tested by: pho
autotuned. It is only an upper bound (the memory is not always allocated)
and the system contains a vm_lowmem handler so nothing will crash and burn
if it's tuned too high.
Reviewed by: mckusick
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.
Requested and reviewed by: bde
MFC after: 2 weeks
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.
Reviewed by: kib
MFC after: 3 days
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.
Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.
In collaboration with: pho
Tested by: bz
Reviewed by: jeff
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().
Discussed with: attilio
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.
MFC after: 3 days
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal files. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.
Discussed with: mckusick
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since
upgrade drops shared lock when non-blocked upgrade failed, LOR is there.
Reported and tested by: Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by: pho
PR: kern/147890
MFC after: 1 week