group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.
In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS
Reviewed by: kib
host.
Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup. It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().
There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due
to the locking, they quickly start executing 'in phase' with the speed
of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads
non-blocking attempt to lock the vnode owned by other thread fail,
and loops executing without making the progress.
Retry logic does not allow the situation to recover. The result is
a livelock.
Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
even if there are still work items on the mp work list. An
assumption is that the items will be cleaned when other thread
either fsyncs its vnode, or unlock and allow yet another thread to
make the progress.
It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how can we reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry. But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.
Reported and tested by: pho
Style and comments by: mckusick
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the last step of ffs_unmount().
It is possible that the mount point is recorded for cleanup in AST
context while softdep flush is executed during unmount. The workitems
are flushed by other means for the unmount, but the stray reference to
struct mount blocks destruction of mount. Check for the situation and
manually call vfs_rel() before returning from ffs_unmount().
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
makefs(1) has a number of signedness warnings (when built with higher
WARNS), most of which can be addressed by careful application of casts
in makefs itself.
There is one case where a signedness warning arises from the blksize
macro, so must be addressed in the macro itself.
Reviewed by: kib, mckusick
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10589
Rather than the global NAME_MAX constant. This change is required to
support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN.
This was missed in r313475 due to the alternative spelling ("NAME_MAX") of
MAXNAMLEN. This change is also similar in spirit to r313780.
Reported by: ngie@
Sponsored by: Dell EMC Isilon
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.
Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
As suggested in r167010, use the structure type and macros to access and
modify UFS2 extended attributes. Add assertions that pointers are
aligned in places where we now access the data through a structure
pointer, instead of character-by-character.
PR: 216127
Reported by: dewayne at heuristicsystems.com.au
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9225
The ea_name string is not nul-terminated. Correct the documentation.
Because the subsequent field is padded to 8 bytes, and the padding is
zeroed, the ea_name string will appear to be nul-terminated whenever the
length isn't exactly one (mod eight).
This was introduced in r167010 (2007).
Additionally, mark the length fields as unsigned. This particularly
matters for the single byte ea_namelength field, which can represent
extended attribute names up to 255 bytes long.
No functional change.
PR: 216127
Reported by: dewayne at heuristicsystems.com.au
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9206
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
Currently mount update keeps vfs_busy(9) reference on the mount point
during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the
exclusion, but is problematic for filesystems which need to perform
namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh
mnt_from path, because namei(9) must not be called while the
vfs_busy(9) reference is owned.
Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for
MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing
syscalls with EBUSY if conflict is detected. Keep vfs_busy(9)
reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS
KPI.
In the update path in ffs_mount(), drop vfs_busy() reference around
namei(), which is now safe due to unmount never executing in parallel
with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
data structure sizes when mounting and reloading UFS/FFS
filesystems by using a u_long rather than an int for the size.
Reported by: Mariusz Zaborski <oshogbo@>
MFC after: 1 week
which also use buffer cache.
Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
directory create and delete operations. If it ever finds a directory
with a link count less than 2, it panics. Thus, an rm -rf that
encounters a directory with a link count below 2 causes a kernel
panic. The proposed fix is to return the error EINVAL rather than
panicing. The effect is that the requested operation is not done,
but the system continues to run. At a more convenient later time,
the filesystem can be unmounted and cleaned (with fsck or journal
run). Once cleaned, the operation can be rerun to successful
completion.
This fix takes that approach. The panic message has been converted
into a uprintf(9) to provide the user with the inode number and
filesystem mount point of the offending directory and EINVAL is
returned for the operation.
The long (three year) delay in fixing this problem occurred because
the bug was misclassified when originally assigned and only this week
was found during a sweep of old unresolved bug reports.
PR: 180894
Reviewed by: kib
MFC after: 2 weeks
See the comments for more detailed description of the algorithm.
The pager is used unconditionally when the block size of the
underlying device is larger than the machine page size, since local
vnode pager cannot handle the configuration [1]. Otherwise, the
vfs.ffs.use_buf_pager sysctl allows to switch to the local pager.
Measurements demonstrated no regression in the ever-important
buildworld benchmark, and small (~5%) throughput improvements in the
special microbenchmark configuration for dbench over swap-backed
md(4).
Code can be generalized and reused for other filesystems which use
buffer cache.
Reported by: Anton Yuzhaninov <citrin@citrin.ru> [1]
Tested by: pho
Benchmarked by: mjg, pho
Reviewed by: alc, markj, mckusick (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8198
Reclaimed vnode type is VBAD, so succesful comparision like
devvp->v_type != VREG does not imply that the devvp references
snapshot, it might be due to a reclaimed vnode. Explicitely check the
vnode type.
In the the most important case of ffs_blkfree(), the devfs vnode is
locked and its type is stable. In other cases, if the vnode is
reclaimed right after the check, hopefully the buffer methods return
right error codes.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Use _Bool to not require userspace to include stdbool.h.
- Make extattr.h usable without vnode_if.h.
- Follow i_ump to get cdev pointer.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2. Overall, this
actually reduces the amount of memory dereferences.
On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.
Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
mounts in almost all cases instead of in most cases. Don't override
DOINGASYNC() by any condition except IO_SYNC.
Fix previous sprinking of DOINGASYNC() checks. Don't override IO_SYNC
by DOINGASYNC(). In ffs_write() and ffs_extwrite(), there were
intentional overrides that just broke O_SYNC of data. In
ffs_truncate(), there are 5 calls to ffs_update(), 4 with
apparently-unintentional overrides and 1 without; this had no effect
due to the main async mount hack descibed below.
Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden
for the soft updates case too (to do a delayed write instead of a sync
write). This is supposed to be the only change that affects anything
except async mounts.
In ffs_update(), remove the 19 year old efficiency hack of ignoring
the waitfor flag for async mounts, so that fsync() almost works for
async mounts. All callers are supposed to be fixed to not ask for a
sync update unless they are for fsync() or [I]O_SYNC operations.
fsync() now almost works for async mounts. It used to sync the data
but not the most important metdata (the inode). It still doesn't sync
associated directories.
This gave 10-20% fewer writes for my makeworld benchmark with async
mounted tmp and obj directories from an already small number.
Style fixes:
- in ffs_balloc.c, remove rotted quadruplicated comments about the
simplest part of the DOING*() decisions and rearrange the nearly-
quadruplicated code to be more nearly so.
- in ufs_vnops.c, use a consistent style with less negative logic and
no manual "optimization" of || to | in DOING*() expressions.
Reviewed by: kib (previous version)
truncation failed.
Doing so resulted in inconsistent state of the ufs dirhash with regard
to the actual directory inode state, and could lead to spurious ENOENT
errors for lookups of existing files in production kernels, or
assertion failures in the debugging kernels.
Change the logic of calling ufsdirhash_dirtrunc() to be same as in
ufs_direnter(). Execute UFS_TRUNCATE() first, log error, and only do
dirtrunc() if UFS_TRUNCATE() succeeded.
Note that the problem was exacerbated by the bug in the
flush_newblk_dep() function (see r305599), which caused in the spurios
errors from ffs_sync() and then ffs_truncate().
In collaboration with: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error
from the failed attempt is irrelevant. But since there is path after
retry which does not clear error, it is possible to return spurious
error from the function.
The issue resulted in a spurious failure of softdep_sync_buf(),
causing further spurious failure of ffs_sync().
In collaboration with: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This change is formally not needed, since i_endoff not used in all
code paths after the call to ufs_direnter(), and i_endoff is
recalculated by the next lookup. But having the value correct makes
the reasoning about code simpler.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
cleared since nothing prevents completion of the parallel quotaoff.
There is nothing to sync in this case, and no reason to panic.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
operations. Instead of upgrading, assert that the lock is exclusive.
Explain the cause in comments.
This effectively reverts r209367.
Tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
going to re-read inodes.
Secondary write initiators, e.g. ufs_inactive(), might need to start a
write while owning the vnode lock. Since the suspended state
established by /dev/ufssuspend prevents them from entering
vn_start_secondary_write(), we get deadlock otherwise.
Note that it is arguably not very useful to re-read inodes after
/dev/ufssuspend suspension, because the suspension does not block
readers, and other threads might read existing files in parallel with
suspension owner (for now, only growfs(8)) operations. This
effectively means that suspension owner cannot safely modify existing
inodes, and then there is no sense in re-reading. But keep the code
enabled for now.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
with softupdates panics the kernel. The problem that has been pointed
out is that when there is a transient write error on certain metadata
blocks, specifically directory blocks (PAGEDEP), inode blocks
(INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group
(BMSAFEMAP, but only when journaling is enabled), we get a panic
in one of the routines called by softdep_disk_io_initiation that
the I/O is "already started" when we retry the write.
These dependency types potentially need to do roll-backs when called
by softdep_disk_io_initiation before doing a write and then a
roll-forward when called by softdep_disk_write_complete after the
I/O completes. The panic happens when there is a transient error.
At the top of softdep_disk_write_complete we check to see if the
write had an error and if an error occurred we just return. This
return is correct most of the time because the main role of the routines
called by softdep_disk_write_complete is to process the now-completed
dependencies so that the next I/O steps can happen.
But for the four types listed above, they do not get to do their
rollback operations. This causes the panic when softdep_disk_io_initiation
gets called on the second attempt to do the write and the roll-back
routines find that the roll-backs have already been done. As an
aside I note that there is also the problem that the buffer will
have been unlocked and thus made visible to the filesystem and to
user applications with the roll-backs in place.
The way to resolve the problem is to add a flag to the routines called
by softdep_disk_write_complete for the four dependency types noted
that indicates whether the write was successful (WRITESUCCEEDED).
If the write does not succeed, they do just the roll-backs and then
return. If the write was successful they also do their usual
processing of the now-completed dependencies.
The fix was tested by selectively injecting write errors for buffers
holding dependencies of each of the four types noted above and then
verifying that the kernel no longer paniced and that following the
successful retry of the write that the filesystem could be unmounted
and successfully checked cleanly.
PR: 211013
Reviewed by: kib
allocation unwinding.
Dandling buffers are released on UFS_BALLOC() failure to ensure that
later attempt to allocate blocks in close range do not find the blocks
with invalid content, since possible partial block allocations are
unwound. As such, it is not enough to just release the buffers, the
pages must also invalidated and removed from the vnode vm_object
queue. Otherwise the pages might be found later and used to
reconstruct indirect buffers when doing allocations at offset close to
the failure point, and their stale content compromise the filesystem
integrity.
Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE
is required. To be sure, clear the B_CACHE flag as well. This
complements the r174973, which started releasing buffers.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
that recorded allocated blocks numbers match the physical block
numbers of dandling buffers which are released.
When finally freeing the blocks during unwind, assert that dandling
buffers where not re-allocated. They shouldn't, because the vnode lock
is owned exclusive.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
have SU enabled, there is no point in calling softdep_request_cleanup().
The call cannot produce free blocks, but we unecessarily lock ufsmount
and do inode block write. Usual point of not doing optimizations for
the corner case of the full volume is not applicable there, the work
is easily avoidable, and the addition CPU and disk io load do not lead
to succeeding retry.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
overflow local arrays. This is not immediately obvious from the
static code inspection, due to retry logic.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
before UFS_BALLOC() is called. I do not believe that this caused any
real issue on FreeBSD because the exclusive vnode lock is held over
the balloc/resize, the change is to make formally correct KPI use.
Based on: the Matthew Dillon' patch from DragonFly BSD
PR: 93942
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
volumes. Treat the field as a semaphore protecting availability of
the device for mounting. Do no access devvp->v_rdev without the vnode
lock owned.
Protect change of the devvp->v_bufobj bo_ops vector with the vnode
lock.
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
opened in O_SYNC mode, at least for UFS. This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.
Noted by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks