The softdep lock names were unusually long and tended to stick out in
lock profiling reports. Abbreviate them and make them consistent with
our conventional style for lock names.
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22042
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.
Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s
Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637
When softdep_fsync() is running, a caller must already started write
for the mount point. Since unmount or remount to ro suspends mount
point, it cannot run in parallel with softdep_fsync(), which makes
vfs_busy() call there not needed.
Doing blocking vfs_busy() there effectively causes lock order reversal
between vn_start_write() and setting MNTK_UNMOUNT, because
vfs_busy(mp, 0) sleeps waiting for MNTK_UNMOUNT becoming clear, while
unmount sets the flag and starts the suspension.
Note that all other uses of vfs_busy() in SU code are non-blocking.
Reported by: chs by mckusick
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential .
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC: 1 week
Sponsored by: Netflix
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.
A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.
This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by: kib
Sponsored by: Netflix
Assert that the per-mountpoint softdep mutex is held in modified
functions that do not already have this assertion. No functional
change intended.
Reviewed by: kib, mckusick (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20741
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.
Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which lead to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VFIFO as one of the valid cases.
Although fifo's do not allocate blocks in the filesystem, they will
allocate blocks if they use extended attributes (such as ACLs). Thus,
softdep_bp_to_mp() needs to return a non-NULL mount pointer when
presented with a fifo vnode so that the soft updates write complete
will properly process the soft updates structures associated with the
extended attribute blocks. It was the failure to process these soft
updates structures, thus leaving them hanging off the buffer, which
lead to the "panic: softdep_deallocate_dependencies: dangling deps"
when trying to clean up the buffer after it was written.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.
Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.
Sponsored by: Netflix
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.
Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.
Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.
Convert the utilities that had been using the undocumented
interface to use the new documented interface.
No functional change (as for now the libufs library does not
do inode check-hashes).
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
driven by problems found with the algorithms being tested for TRIM
consolodation.
Reported by: Peter Holm
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.
The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.
Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix
Followup to r313780. Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.
Reported by: kib
Reviewed by: kib, mckusick
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9623
Coverity cannot determine that handle_written_indirdep() does not access
uninitialized 'sbp' when flags argument is zero.
So, simply move the initialization slightly sooner to silence the warning.
No functional change.
Reported by: Coverity
Sponsored by: Dell EMC Isilon
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 2 weeks
Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.
Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.
Reviewed by: mckusick
The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the derefenced workitem to be freed in
parallel. In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.
Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.
Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
handle_written_XXX() in case of processing the buffer with an error.
Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.
No functional change intended.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Normally wakeups() are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration. However, this
resulted in the item being free'd before it's address was used with
wakeup(). As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock. This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue. Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.
The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need. However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.
In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.
Reported by: pho
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12494
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits. This is a nop currently but will matter if LINK_MAX is
increased in the future.
Reviewed by: kib
Sponsored by: Chelsio Communications
vnode lock.
Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 10 days
softdep_count_dependencies().
Buffer's b_dep list is protected by the SU mount lock. Owning the
buffer lock is not enough to guarantee the stability of the list.
Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime. To get to ump, use the pointers chain which does not
involve workitems at all.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
host.
Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup. It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().
There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due
to the locking, they quickly start executing 'in phase' with the speed
of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads
non-blocking attempt to lock the vnode owned by other thread fail,
and loops executing without making the progress.
Retry logic does not allow the situation to recover. The result is
a livelock.
Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
even if there are still work items on the mp work list. An
assumption is that the items will be cleaned when other thread
either fsyncs its vnode, or unlock and allow yet another thread to
make the progress.
It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how can we reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry. But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.
Reported and tested by: pho
Style and comments by: mckusick
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.
Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding. To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2. Overall, this
actually reduces the amount of memory dereferences.
On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.
Tested by: pho (previous version)
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks