408 Commits

Author SHA1 Message Date
markj
6304690b84 Abbreviate softdep lock names.
The softdep lock names were unusually long and tended to stick out in
lock profiling reports.  Abbreviate them and make them consistent with
our conventional style for lock names.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22042
2019-10-18 17:01:27 +00:00
vangyzen
4956f72e24 Add CTLFLAG_STATS to several debug.softdep sysctl OIDs
Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:44:52 +00:00
mjg
6090f91124 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
mjg
099eed319c vfs: manage mnt_writeopcount with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21575
2019-09-16 21:33:16 +00:00
kib
479eaba412 Remove some unneeded vfs_busy() calls in SU code.
When softdep_fsync() is running, a caller must already started write
for the mount point.  Since unmount or remount to ro suspends mount
point, it cannot run in parallel with softdep_fsync(), which makes
vfs_busy() call there not needed.

Doing blocking vfs_busy() there effectively causes lock order reversal
between vn_start_write() and setting MNTK_UNMOUNT, because
vfs_busy(mp, 0) sleeps waiting for MNTK_UNMOUNT becoming clear, while
unmount sets the flag and starts the suspension.

Note that all other uses of vfs_busy() in SU code are non-blocking.

Reported by:	chs by mckusick
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-09 11:22:38 +00:00
cem
3a2c285092 ufs: Remove redundant brelse() after r294954
Same automation.

No functional change.
2019-09-06 08:08:33 +00:00
kib
8b8c10ee6c De-commision the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-23 19:40:10 +00:00
mckusick
7f0988ede6 When updating the user or group disk quotas for the return of inodes or
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential .

Reported by:    Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:    FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC:            1 week
Sponsored by:   Netflix
2019-07-31 22:44:58 +00:00
mckusick
eb5503c168 The error reported in FS-14-UFS-3 can only happen on UFS/FFS
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.

A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.

This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.

Reported by:  Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:  FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by:  kib
Sponsored by: Netflix
2019-07-17 22:07:43 +00:00
markj
944c67f6c1 Remove references to splbio in ffs_softdep.c.
Assert that the per-mountpoint softdep mutex is held in modified
functions that do not already have this assertion.  No functional
change intended.

Reviewed by:	kib, mckusick (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20741
2019-06-26 16:28:42 +00:00
mckusick
0d342021e5 Add a missing bresle() in seldom-used error return. 2019-05-28 17:31:35 +00:00
mckusick
447eec25c7 Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS
as appropriate. No functional change intended.

Suggested-by: markj
2019-05-28 16:32:04 +00:00
mckusick
0776148e01 Add function name and line number debugging information to softupdates
worklist structures to help track their movement between work lists.
No functional change to the operation of soft updates intended.
2019-05-27 06:22:43 +00:00
mckusick
1e2cc9b200 This is an additional and hopefully final fix for bug report 230962.
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.

Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which lead to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.

PR:           230962
Reported by:  2t8mr7kx9f@protonmail.com
Reviewed by:  kib
Tested by:    Peter Holm
MFC after:    1 week
Sponsored by: Netflix
2019-03-20 23:11:05 +00:00
mckusick
12a82b6654 Add KASSERT to the softdep_disk_write_complete() function in the
soft dependency code to ensure that it will be able to avoid a
dangling dependency.

Sponsored by: Netflix
2019-03-12 00:10:31 +00:00
mckusick
0c506150b7 This bug was introduced with the change to use softdep_bp_to_mp() in
January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VFIFO as one of the valid cases.

Although fifo's do not allocate blocks in the filesystem, they will
allocate blocks if they use extended attributes (such as ACLs). Thus,
softdep_bp_to_mp() needs to return a non-NULL mount pointer when
presented with a fifo vnode so that the soft updates write complete
will properly process the soft updates structures associated with the
extended attribute blocks. It was the failure to process these soft
updates structures, thus leaving them hanging off the buffer, which
lead to the "panic: softdep_deallocate_dependencies: dangling deps"
when trying to clean up the buffer after it was written.

PR:           230962
Reported by:  2t8mr7kx9f@protonmail.com
Reviewed by:  kib
Tested by:    Peter Holm
MFC after:    1 week
Sponsored by: Netflix
2019-01-28 21:36:45 +00:00
mckusick
23046c2dc3 Expand DDB's set of printable soft dependency data structures. The
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.

Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.

Sponsored by: Netflix
2019-01-26 05:35:24 +00:00
mckusick
830a63af76 Continuing efforts to provide hardening of FFS. This change adds a
check hash to the filesystem inodes. Access attempts to files
associated with an inode with an invalid check hash will fail with
EINVAL (Invalid argument). Access is reestablished after an fsck
is run to find and validate the inodes with invalid check-hashes.
This check avoids a class of filesystem panics related to corrupted
inodes. The hash is done using crc32c.

Note this check-hash is for the inode itself and not any of its
indirect blocks. Check-hash validation may be extended to also
cover indirect block pointers, but that will be a separate (and
more costly) feature.

Check hashes are added only to UFS2 and not to UFS1 as UFS1 is
primarily used in embedded systems with small memories and low-powered
processors which need as light-weight a filesystem as possible.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-12-11 22:14:37 +00:00
mjg
7e31d1de7e Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
mckusick
80199cdcb4 In preparation for adding inode check-hashes, clean up and
document the libufs interface for fetching and storing inodes.
The undocumented getino / putino interface has been replaced
with a new getinode / putinode interface.

Convert the utilities that had been using the undocumented
interface to use the new documented interface.

No functional change (as for now the libufs library does not
do inode check-hashes).

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-11-13 21:40:56 +00:00
kib
1d18e5f10a Correct panic messages.
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
Approved by:	re (rgrimes)
MFC after:	1 week
2018-09-22 17:05:49 +00:00
mckusick
3b4b479fff Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by:  Peter Holm
Suggested by: kib
Reviewed by:  kib
Sponsored by: Netflix
2018-08-18 22:21:59 +00:00
mckusick
8a71ad7c3d Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.
2018-08-18 21:21:06 +00:00
mckusick
e7f491b70c Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix
2018-08-06 21:09:11 +00:00
mckusick
4b68405396 Fix warning found by Coverity.
CID 1009353:  Error handling issues  (CHECKED_RETURN)
2018-05-16 23:42:02 +00:00
mckusick
50d7ca2454 Renumber soft-update types starting at 1 instead of 0 to avoid confusion
of zero'ed memory appearing to have a valid soft-update type.

Also correct some comments.

Reviewed by: kib
2018-04-05 00:32:01 +00:00
emaste
61bad5ab72 Revert r313780 (UFS_ prefix) 2018-03-17 12:59:55 +00:00
emaste
e23f2eb452 Prefix UFS symbols with UFS_ to reduce namespace pollution
Followup to r313780.  Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.

Reported by:	kib
Reviewed by:	kib, mckusick
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9623
2018-03-17 01:48:27 +00:00
cem
d0850a5df2 ffs: softdep_disk_write_complete: Quiesce spurious Coverity warning
Coverity cannot determine that handle_written_indirdep() does not access
uninitialized 'sbp' when flags argument is zero.

So, simply move the initialization slightly sooner to silence the warning.

No functional change.

Reported by:	Coverity
Sponsored by:	Dell EMC Isilon
2018-03-01 00:29:52 +00:00
pfg
944d693f04 ext2fs|ufs:Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	2 weeks
2018-01-24 17:58:48 +00:00
pfg
ca690ecdf9 Revert r327781, r328093, r328056:
ufs|ext2fs: Revert uses of mallocarray(9).

These aren't really useful: drop them.
Variable unsigning will be brought again later.
2018-01-24 16:44:57 +00:00
pfg
036ebddf97 ufs: use mallocarray(9).
Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.

Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.

Reviewed by:	mckusick
2018-01-17 18:18:33 +00:00
kib
75be4853cd Softlink inodes can own buffers with dependencies.
At least, softlinks longer than 120 bytes have data fragments.

Submitted by:	mckusick
MFC after:	5 days
2018-01-11 13:37:45 +00:00
kib
681df22eda Generalize the fix from r322757 and apply it to several more places.
The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the derefenced workitem to be freed in
parallel.  In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.

Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:51:44 +00:00
kib
3d0a89393a When handling write completion, take SU lock around calls to
handle_written_XXX() in case of processing the buffer with an error.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	mckusick (as part of the bigger patch)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-01-09 10:44:17 +00:00
eadler
421a929b1e kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
pfg
78a6b08618 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
markj
6e5d1bfd1d Remove a stale and incorrect comment.
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:51:27 +00:00
markj
6795d68368 Remove workqueue items after updating the workqueue tail pointer.
When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.

No functional change intended.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-10-28 02:48:37 +00:00
markj
668ea833d1 Make drain_output() use bufobj_wwait().
No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12790
2017-10-25 17:20:18 +00:00
jhb
c991591342 Don't defer wakeup()s for completed journal workitems.
Normally wakeups() are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration.  However, this
resulted in the item being free'd before it's address was used with
wakeup().  As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock.  This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue.  Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.

The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need.  However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.

In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.

Reported by:	pho
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12494
2017-09-26 23:24:15 +00:00
jhb
6ae33bb87c Add UFS_LINK_MAX for the UFS-specific limit on link counts.
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits.  This is a nop currently but will matter if LINK_MAX is
increased in the future.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2017-09-18 23:30:39 +00:00
kib
0cda26ac73 Protect v_rdev dereference with the vnode interlock instead of the
vnode lock.

Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	10 days
2017-08-25 09:51:22 +00:00
kib
3149ed68c4 Avoid dereferencing potentially freed workitem in
softdep_count_dependencies().

Buffer's b_dep list is protected by the SU mount lock.  Owning the
buffer lock is not enough to guarantee the stability of the list.

Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime.  To get to ump, use the pointers chain which does not
involve workitems at all.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-08-21 16:23:44 +00:00
kib
77de7ac78a Style.
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-08-21 16:16:02 +00:00
kib
d8402b7b53 Mitigate several problems with the softdep_request_cleanup() on busy
host.

Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup.  It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().

There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel.  Due
  to the locking, they quickly start executing 'in phase' with the speed
  of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, other threads
  non-blocking attempt to lock the vnode owned by other thread fail,
  and loops executing without making the progress.
Retry logic does not allow the situation to recover.  The result is
a livelock.

Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
  A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
  even if there are still work items on the mp work list.  An
  assumption is that the items will be cleaned when other thread
  either fsyncs its vnode, or unlock and allow yet another thread to
  make the progress.

It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how can we reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry.  But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.

Reported and tested by:	pho
Style and comments by:	mckusick
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-06-03 16:18:50 +00:00
kib
9973ae075e Do not leak mount references for dying threads.
Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit.  In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1().  softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2017-02-25 10:38:18 +00:00
emaste
8e79b56e85 prefix UFS symbols with UFS_ to reduce namespace pollution
Specifically:
  ROOTINO -> UFS_ROOTINO
  WINO -> UFS_WINO
  NXADDR -> UFS_NXADDR
  NDADDR -> UFS_NDADDR
  NIADDR -> UFS_NIADDR
  MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency)

Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_

Reviewed by:	kib, mckusick
Obtained from:	NetBSD
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D9536
2017-02-15 19:50:26 +00:00
kib
7c67dd5f60 Use type-independent formats for printing nlink_t and ino_t.
Extracted from:	ino64 work by gleb, mckusick
Discussed with:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-06 16:59:33 +00:00
kib
20f1e8ac63 Reduce size of ufs inode.
Remove redunand i_dev and i_fs pointers, which are available as
ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to
reduce padding.  To compensate added derefences, the most often i_ump
access to differentiate between UFS1 and UFS2 dinode layout is
removed, by addition of the new i_flag IN_UFS2.  Overall, this
actually reduces the amount of memory dereferences.

On 64bit machine, original struct inode size is 176, reduced to 152
bytes with the change.

Tested by:	pho (previous version)
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-09-17 16:47:34 +00:00