volumes. Treat the field as a semaphore protecting availability of
the device for mounting. Do no access devvp->v_rdev without the vnode
lock owned.
Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
remounted to writeable after initial read-only. Assign to
dev->si_mountpt earlier to account the accesses done at the mount
time.
Based on submission by: bde
MFC after: 1 week
into per-mount taskqueue with the private taskqueue processing thread.
This allows to drain the taskqueue on unmount, to ensure that all
TRIMs are finished before mount structures are freed.
But just draining the taskqueue where TRIM biodone geom-up completions
are processed is not enough, since ffs_blkfree(), called by the task,
might result in more writes. Count inflight delayed blkfree's and
pause() unmount until the counter drains as well.
Reported by: Nick Evans <nevans@talkpoint.com>
Tested by: Nick Evans <nevans@talkpoint.com>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
crash a server that has exported UFS2 by presenting a filehandle
with an inode number that references an uninitialized inode in a
cylinder group. The problem is that UFS2 only initializes blocks
of inodes as they are first allocated and ffs_fhtovp() does not
validate that the inode is in a range of inodes that have been
initialized. Attempting to read an uninitialized inode gets random
data from the disk. When the kernel tries to interpret it as an
inode, panics often arise.
Reported by: Christos Zoulas (from NetBSD)
Reviewed by: kib
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.
Reviewed by: kib
Tested by: Peter Holm
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.
In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.
Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D2937
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active. This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.
The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list. Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation. In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.
Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list. When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
buildkernel run.
Some of them were write-only under some kernel options, e.g. variables
keeping values only used by CTR() macros. It costs nothing to the
code readability and correctness to eliminate the warnings in those
cases too by removing the local cached values used only for
single-access.
Review: https://reviews.freebsd.org/D2665
Reviewed by: rodrigc
Looked at by: bjk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
trying to build a cluster. The limit is tunable using the sysctl
vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups
per block allocation. It was previously limited to the number of
cylinder groups in the filesystem per block allocation. When there
were no clusters of the needed size left, it repeatedly searched
the whole filesystem for a non-existent cluster on every block
allocation. The result was very slow filesystem allocation with
100% CPU utilization. The old behavior can be had by setting
vfs.ffs.maxclustersearch to a huge number (1,000,000).
This change affects only the layout policy routines so is not able
to interfere with the integrity of the filesystem.
Reported by: Dmitry Sivachenko (demon@)
Tested by: Dmitry Sivachenko (demon@)
MFC after: 2 weeks
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.
Reviewed by: kib
MFC after: 2 weeks
sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init
Hang occurs on the rootfs unmount. There are two issues:
1. Removed init binary, which is still mapped, creates a reference to
the removed vnode. The inodeblock for such vnode must have active
inodedep, which is (eventually) linked through the unlinked list. This
means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
softdep workitems for the mp is always > 0. FFS is suspended during
unmount, so unmount just hangs.
2. As noted above, the inodedep is linked eventually. It is not
linked until the superblock is written. But at the vfs_unmountall()
time, when the rootfs is unmounted, the call is made to
ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
ffs_sbupdate() after all workitems are flushed. It is masked for
normal system operations, because syncer works in parallel and
eventually flushes superblock. Syncer is stopped when rootfs
unmounted, so ffs_sync() must do sb update on its own.
Correct the issues listed above. For MNT_SUSPEND, count the number of
linked unlinked inodedeps (this is not a typo) and substract the count
of such workitems from the total. For the second issue, the
ffs_sbupdate() is called right after device sync in ffs_sync() loop.
There is third problem, occuring with both SU and SU+J. The
softdep_waitidle() loop, which waits for softdep_flush() thread to
clear the worklist, only waits 20ms max. It seems that the 1 tick,
specified for msleep(9), was a typo.
Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
significantly help the softdep thread, and change the MNT_LAZY update
at the reboot time to MNT_WAIT for similar reasons. Note that
userspace cannot create more work while devvp is flushed, since the
mount point is always suspended before the call to softdep_waitidle()
in unmount or remount path.
PR: 195458
In collaboration with: gjb, pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
two.
nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.
This is a fixup for 273271.
No strong objections from: kib
Pointy hat to: mjg
MFC after: 2 weeks
unmount. There is no need to suspend read-only filesystem, while we
need suspension on modificable mount point.
Reported by: rwatson
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.
Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...
makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...
Reviewed by: silience from mckusick
MFC after: 1 week
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.
No functional change.
Tested by: Peter Holm and Scott Long
Sponsored by: Netflix
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.
Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.
Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Revert the simplification of the i_gen calculation.
It is still a good idea to avoid zero values and for the case
of old filesystems there is probably no advantage in using
the complete 32 bits anyways.
Discussed with: bde
MFC after: 4 weeks
Further simplify the i_gen calculation for older disks.
Having a zero here is not really a problem and this is more
similar to what is done in newfs_random().
Reported by: Xin Li
MFC after: 4 weeks
In UFS, i_gen is a random generated value and there is not way for
it to be negative. Actually, the value of i_gen is just used to
match bit patterns and it is of not consequence if the values are
signed or not.
Following other filesystems, set it to unsigned and use it as such,
Discussed by: mckusick
Reviewed by: mckusick (previous version)
MFC after: 4 weeks
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.
Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.
Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
Current dqflush() panics when a dquot with with non-zero refcount is
encountered. The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.
Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration. At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.
Removes comments that makes no sense now.
Reviewed by: kib
MFC after: 3 days
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.
The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.
The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.
In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).
To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.
The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.
Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.
This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.
Reviewed by: kib