Other threads observing the non-NULL um_softdep can assume that it is
safe to use it. This is important for ro->rw remounts where change from
read-only to read-write status cannot be made atomic.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
otherwise we might follow a pointer in the freed memory.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
Suppose that we remount rw->ro and in parallel some reader tries to
instantiate a vnode, e.g. during lookup. Suppose that softdep_unmount()
already started, but we did not cleared the MNT_SOFTDEP flag yet.
Then ffs_vgetf() calls into softdep_load_inodeblock() which accessed
destroyed hashes and freed memory.
Set/clear fs_ronly simultaneously (WRT to files flush) with MNT_SOFTDEP.
It might be reasonable to move the change of fs_ronly to under MNT_ILOCK,
but no readers take it.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
Now MNT_SOFTDEP indicates that SU are active in any variant +-J, and
SU+J is indicated by MNT_SOFTDEP | MNT_SUJ combination. The reason is
that unmount will be able to easily hide SU from other operations by
clearing MNT_SOFTDEP while keeping the record of the active journal.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
It will be used to allow SU flush code to sync the volume while external
consumers see that SU is already disabled on the filesystem. Use it where
ffs_vgetf() called by SU code to process dependencies.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
This removes the need to check for error == 0.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178
There are three issues with change that stopped truncating ea area before
write, and resulted in possible zero tail in the ea area:
- Truncate to zero checked i_ea_len after the reference was dropped,
making the last drop effectively truncate to zero length always.
- Loop to fill uio for zeroing specified too large length, that triggered
assert in normal situation.
- Integrity check could trip over the tail, instead we must allow
partial header or header with zero length, and clamp ea image in
memory at it.
Reported by: arichardson
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Fixup: 5e198e7646
Differential Revision: https://reviews.freebsd.org/D28999
softdep_prealloc() must be called to ensure enough journal space is
available, before ffs_extwrite(). Also it must be done before taking
ffs_lock_ea(), because it calls ffs_syncvnode(), potentially dropping
the vnode lock.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
ffs_lock_ea is after the vnode lock, so vnode must not be relocked under
lock_ea. Move ffs_truncate() call in ffs_close_ea() after the lock_ea is
dropped, and only truncate to length zero, since this is the only mode
supported by ffs_truncate() for EAs. Previously code did truncation and
then write.
Zero the part of the ext area that is unused, if truncation is due but not
done because ea area is not zero-length.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Do it in ffs_write(), where we can gracefuly handle relock and its
consequences. In particular, recheck the v_data to see if the vnode
reclamation ended, and return EBADF when we cannot proceed with the
write.
Reviewed by: mckusick
Reported by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
instead of DOINGSOFTDEP(). The softdep_prealloc() function does nothing
in SU case.
Note that the call should be safe with regard to the vnode relock,
because it is called with MNT_NOWAIT, which does not descend into fsync.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.
Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679
Make sys/buf.h, sys/pipe.h, sys/fs/devfs/devfs*.h headers usable in
userspace, assuming that the consumer has an idea what it is for.
Unhide more material from sys/mount.h and sys/ufs/ufs/inode.h,
sys/ufs/ufs/ufsmount.h for consumption of userspace tools, with the
same caveat.
Remove unacceptable hack from usr.sbin/makefs which relied on sys/buf.h
being unusable in userspace, where it override struct buf with its own
definition. Instead, provide struct m_buf and struct m_vnode and adapt
code to use local variants.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D28679
Citing Kirk:
The previous code [before 8563de2f27 -- kib] did not call
vnode_pager_setsize() but worked because later in ffs_snapshot() it
does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE()
allocated the extra block at the end of the file which caused it to do
the needed vnode_pager_setsize(). But the new code had already allocated
the extra block, so UFS_WRITE() did not extend the size and thus did not
do the vnode_pager_setsize().
PR: 253158
Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Reviewed by: mckusick
Tested by: cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The panic reported in 253158 arises because the /mnt/.snap/.factory
snapshot allocated the last block in the filesystem. The snapshot
code allocates the last block in the filesystem as a way of setting
its length to be the size of the filesystem. Part of taking a
snapshot is to remove all the earlier snapshots from the image of
the newest snapshot so that newer snapshots will not claim the blocks
of the earlier snapshots. The panic occurs when the new snapshot
finds that both it and an earlier snapshot claim the same block.
The fix is to set the size of the snapshot to be one block after
the last block in the filesystem. This block can never be allocated
since it is not a valid block in the filesystem. This extra block
is used as a place to store the initial list of blocks that the
snapshot has already copied and is used to avoid a deadlock in and
speed up the ffs_copyonwrite() function.
Reported by: Harald Schmalzbauer
Tested by: Peter Holm
PR: 253158
Sponsored by: Netflix
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This catches both missed processing of IN_ENDOFF and missed application
of VOP_VPUT_PAIR() after VOP that created an entry in the directory.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc()
is unable to find a free inode.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
if we noted a parallel request is active and declined to overflow the
system with parallel redundant sync of the vnodes. But we need to wait
for the flush to finish to see if there are any freed resources.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
VFS should retry inactivation when possible, then. This should provide
timely removal of unlinked unreferenced inodes.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
When possible, relock the vnode and retry inactivation. Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.
This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
because softdep_prelink() is reverted to NOP for non-J case. There is no
need to do anything before ufs_direnter() in SU/non-J case, everything
required to sync the directory is done in VOP_VPUT_PAIR().
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 week
Sponsored by: The FreeBSD Foundation
Originally this was done in 8a1509e442 to forcibly cover cases
where a hole in the directory could be created by extending into
indirect block, since dependency of writing out indirect block is not
tracked. This results in excessive amount of fsyncing the directories,
where all creation of new entry forced fsync before it. This is not needed,
it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides
the required hook to only perform required syncing.
The series of changes culminating in this commit puts the performance of
metadata-intensive loads back to that before 8a1509e442.
Analyzed by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In ufs_rename case, tdvp is locked from the place where ufs_direnter()
is done till VOP_VPUT_PAIR(), which means that we no longer need to specially
handle rename in ufs_direnter(). Truncation, if possible, is done in the
same way in ffs_vput_pair() both for rename and other VOPs calling
ufs_direnter(). Remove isrename argument and set IN_ENDOFF if
ufs_direnter() succeeded and directory needs truncation.
In ffs_vput_pair(), stop verifying the condition that directory needs
truncation when IN_ENDOFF is set, instead assert that the condition is
true.
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
VOP_VPUT_PAIR() provides the hook to do the truncation right before
unlock, which is required since truncation might need to fsync(), which
itself might unlock the directory vnode.
Set new flag IN_ENDOFF which indicates that i_endoff is valid and should
be checked against inode size. Excessive size is chomped, but this
operation is advisory and failure to truncate should not result in the
failure of the main VOP.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In particular, if unlock_vp is false, save vp's inode number and
generation. If ffs_inotovp() can re-create the vnode with the same
number and generation after we finished with handling dvp, then we most
likely raced with unmount, and were able to restore atomicity of open.
We use FFSV_REPLACE_DOOMED there, to drop the old vnode.
This additional recovery is not strictly required, but it improves the
quality of the implementation.
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
It cleans IN_NEEDSYNC flag on dvp before returning, by applying
ffs_syncvnode() until success or an error different from ERELOOKUP.
IN_NEEDSYNC cleanup is required to avoid creating holes in the directories
when extended into indirect block.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
If the snapshot embrio was reclaimed under us, return error outright.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
for all kinds of async/SU mount variants.
Submitted by: mckusick
Reviewed by: chs
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
If it is cleaned before the sync, other threads might see the inode without
the flag set, because syncing could unlock it.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The function alone was not used for anything but ffs_fstovp() for long time.
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
It generalizes the VFS_FHTOVP() interface, making it possible to fetch
the inode without faking filehandle. Also it adds the ffs flags argument
which allows to control ffs_vgetf() call.
Requested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
It specifies that caller requests a fresh non-doomed vnode. If doomed
vnode is found in the hash, it should behave similarly to FFSV_REPLACE.
Or, to put it differently, the flag is same as FFSV_REPLACE, but only
when the found hashed vnode is doomed.
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Later processing of ffs_truncate() might temporary unlock the directory
vnode, causing unsychronized dirhash and inode sizes if update is
postponed to UFS_TRUNCATE() callers.
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
and only call buf_complete() if previously started. Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().
Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.
Reported by: pho
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundations
as it is done in other places. Header files might need options defined
for correct operation.
Reviewed by: chs, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After discussion with Chuck Silvers (chs@) we have decided that
there is a better way to resolve this lock order reversal which
will be committed separately.
Sponsored by: Netflix
disk failure.
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.
When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.
Sponsored by: Netflix
with snapshots.
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.
The lock order reversal happens because vnode locks must be acquired
before snapshot locks. When unmounting we must lock both the snapshot
lock and the vnode lock before swapping them so that the vnode will
be continuously locked during the swap. For each vnode representing
a snapshot, we must first acquire the snapshot lock to ensure
exclusive access to it and its original lock. We then face a lock
order reversal when we try to acquire the original vnode lock. The
problem is eliminated by doing a non-blocking exclusive lock on the
original lock which will always succeed since there are no users
of that lock.
Sponsored by: Netflix
UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.
The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.
Sponsored by: Netflix