Commit Graph

2290 Commits

Author SHA1 Message Date
Stefan Eßer
23e84cf153 Fix obvious typo: IN_BLKDATA should be IN_IBLKDATA 2020-06-04 19:54:25 +00:00
Kirk McKusick
30296c428a Two additional places that need to identify IN_IBLKDATA.
Reviewed by: kib
MFC with: -r361785
Differential Revision:  https://reviews.freebsd.org/D25072
2020-06-04 18:35:21 +00:00
Konstantin Belousov
7428630b75 UFS: write inode block for fdatasync(2) if pointers in inode where allocated
The fdatasync() description in POSIX specifies that
    all I/O operations shall be completed as defined for synchronized I/O
    data integrity completion.
and then the explanation of Synchronized I/O Data Integrity Completion says
    The write is complete only when the data specified in the write
    request is successfully transferred and all file system
    information required to retrieve the data is successfully
    transferred.

For UFS this means that all pointers must be on disk. Indirect
pointers already contribute to the list of dirty data blocks, so only
direct blocks and root pointers to indirect blocks, both of which
reside in the inode block, should be taken care of. In ffs_balloc(),
mark the inode with the new flag IN_IBLKDATA that specifies that
ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the
inode block.

Reviewed by:	mckusick
Discussed with:	tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25072
2020-06-04 12:23:15 +00:00
Chuck Silvers
d79ff54b5c This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

   - ENXIO, which means that this disk is gone and the lower layers
     of the storage stack already guarantee that no future I/O to
     this disk will succeed.

   - EIO (or most other errors), which means that this particular
     I/O request has failed but subsequent I/O requests to this
     disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by:   kib
Tested by:     Peter Holm
Approved by:   mckusick (mentor)
Sponsored by:  Netflix
Differential Revision:  https://reviews.freebsd.org/D24088
2020-05-25 23:47:31 +00:00
John Baldwin
71d11ee322 Update name of description of vfs.ffs.setsize in comment.
Previously it used the name 'adjsize' instead of 'setsize'.
2020-05-22 17:23:43 +00:00
John Baldwin
f2620e9ceb Retire two unused background fsck sysctls.
These two sysctls were added to support UFS softupdates journalling
with snapshots.  However, the changes to fsck to use them were never
committed and there have never been any in-tree uses of these sysctls.

More details from Kirk:

When journalling got added to soft updates, its journal rollback freed
blocks that it thought were no longer in use. But it does not take
snapshots into account (i.e., if a snapshot is still using it, then it
cannot be freed). So I added the needed logic to fsck by having the
free go through the kernel's blkfree code so it could grab blocks that
were still needed by snapshots. That is done using the setbufoutput
hack. I never got that code working reliably, so it is still sitting
in my work directory. Which also explains why you still cannot take
snapshots on filesystems running with journalling...

In looking over my use of this feature, and in particular the troubles
I was having with it, I conclude that it may be better to extract the
code from the kernel that handles freeing blocks claimed by snapshots
and putting it into fsck directly. My original intent was that it is
complex and at the time changing, so only having to maintain it in one
place was appealing. But at this point it has not changed in years and
the hacks like setinode and setbufoutput to be able to use the kernel
code is sufficiently ugly, that I am leaning towards just extracting
it.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24484
2020-04-21 17:42:32 +00:00
Konstantin Belousov
71f2642988 ufs: apply suspension for non-forced rw unmounts.
Forced rw unmounts and remounts from rw to ro already suspend
filesystem, which closes races with writers instantiating new vnodes
while unmount flushes the queue.  Original intent of not including
non-forced unmounts into this regime was to allow such unmounts to
fail if writer was active, but this did not worked well.

Similar change, but causing all unmount, even involving only ro
filesystem, were proposed in D24088, but I believe that suspending ro
is undesirable, and definitely spends CPU time.

Reported by:	markj
Discussed with:	chs, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-04-10 01:24:16 +00:00
Kirk McKusick
621a274820 Fixing the soft update macros in -r359612 triggered a previously
hidden bug in the file truncation code. Until that bug is tracked
down and fixed, revert to the old behavior.

Reported by: Peter Holm
Reviewed by: kib, Chuck Silvers
2020-04-09 23:51:18 +00:00
Kirk McKusick
c79f5a4328 Revert -r359612 as it can cause other panics.
An updated version will be made when the issue has been resolved.

Reported by: Peter Holm
2020-04-06 20:23:47 +00:00
Kirk McKusick
2baca88584 When shrinking the size of a directory it is sometimes necessary to
sync it to disk before shrinking it. Complete the sync before getting
the buffer for the block to be updated to do the shrink to avoid
panicing with a recursive lock on one of the directory's buffers.

Reviewed by:  Chuck Silvers (chs)
MFC after:    3 days
Sponsored by: Netflix
2020-04-03 20:43:25 +00:00
Kirk McKusick
aedb9cc662 Convert DOINGSOFTDEP, MOUNTEDSOFTDEP, DOINGSUJ, and MOUNTEDSUJ to being
boolean expressions so that their values are not lost when assigned to
`bool' or `int' variables.

Reviewed by:  Chuck Silvers (chs)
MFC after:    3 days
Sponsored by: Netflix
2020-04-03 20:30:45 +00:00
Konstantin Belousov
abfdf76791 VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:44:30 +00:00
Kirk McKusick
95ca762da8 When mounting a UFS filesystem, return EINTEGRITY rather than EIO
when a superblock check-hash error is detected. This change clarifies
a mount that failed due to media hardware failures (EIO) from a mount
that failed due to media errors (EINTEGRITY) that can be corrected by
running fsck(8).

Sponsored by: Netflix
2020-03-11 21:00:40 +00:00
Chuck Silvers
69b3fdfa0b Use the devfs vnode rather than the mntfs vnode for permissions checks.
I missed this one in r358714.

Reported by:	pho
Reviewed by:	mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
2020-03-09 15:55:13 +00:00
Mateusz Guzik
d2222aa0e9 fd: use smr for managing struct pwd
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23889
2020-03-08 00:23:36 +00:00
Chuck Silvers
f15ccf8836 Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by:	kib, mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23787
2020-03-06 18:41:37 +00:00
Mateusz Guzik
8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Kirk McKusick
98b6844690 Additional KASSERTs to ensure the consistency of the soft updates
indirdep structure. No functional change.

Tested by:    Peter Holm (as part of a larger patch)
Sponsored by: Netflix
2020-02-18 23:56:23 +00:00
Scott Long
1353215314 Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command.  Add complimentary support to the CAM periphs that
support it.  This is a redo of r357710.
2020-02-16 23:10:59 +00:00
Mateusz Guzik
4d51e175f9 ufs: use faster lockgmr entry points in ffs_lock 2020-02-15 21:48:48 +00:00
Scott Long
85eb41f751 Revert r357710 and 357711 until they can be debugged 2020-02-10 14:27:28 +00:00
Scott Long
9ce150463c Missed a file in r357710, add it here. 2020-02-10 00:26:41 +00:00
Scott Long
7d99bda79e Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command.  Add complimentary support to the CAM periphs that
support it.
2020-02-10 00:23:20 +00:00
Chuck Silvers
62612737d6 With INVARIANTS, track all softdep dependency structures centrally
so that we can find them in dumps.

Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2020-02-03 17:47:14 +00:00
Mateusz Guzik
f1fa1ba3d0 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
Mateusz Guzik
643656cfaf vfs: replace VOP_MARKATIME with VOP_MMAPPED
The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23422
2020-02-01 06:46:55 +00:00
Mateusz Guzik
0a09292188 ufs: drop ufs_markatime from ufs_fifoops
The routine is only called on mmap and exec, both of which are invalid for
this type.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23421
2020-02-01 06:41:44 +00:00
Mateusz Guzik
901b05fbd2 ufs: add the missing vn_need_pageq_flush call to ufs_need_inactive 2020-01-30 05:37:35 +00:00
Mateusz Guzik
6c44a3e019 ufs: add vgone calls for unconstructed vnodes in the error path
This mostly eliminates the requirement that vput never unlocks the vnode
before calling VOP_INACTIVE. Note it may still be present for other
filesystems.

See r356126 for an example bug.

Note vput stopped doing early unlock in r357070 thus this change does
not affect correctness as it is.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23215
2020-01-26 00:38:06 +00:00
Mateusz Guzik
d93762b94d vfs: stop handling VI_OWEINACT in vget
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23184
2020-01-24 07:45:59 +00:00
Warner Losh
38b37b93d4 We only want to send the speedup to the lower layers when there's a shortage.
Only send a speedup when there's a shortage. While this is a little racy, lost
races aren't a big deal for this function. If there's a shorage just popping up
after we check these values, then we'll catch it next time. If there's a
shortage that's just clearing up, we may do some work at the lower layers a
little sooner than we otherwise would have. Sicne shortages are relatively rare
events, both races are acceptable.

Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23182
2020-01-17 01:16:23 +00:00
Warner Losh
3cf5dd8401 Use buf to send speedup
It turns out there's a problem with using g_io to send the speedup. It leads to
a race when there's a resource shortage when a disk fails.

Instead, send BIO_SPEEDUP via struct buf. This is pretty straight forward,
except we need to transfer the bio_flags from b_ioflags for BIO_SPEEDUP commands
in g_vfs_strategy.

Reviewed by: kirk, chs
Differential Revision: https://reviews.freebsd.org/D23117
2020-01-17 01:16:19 +00:00
Kirk McKusick
0297c1384a When sync'ing a mount point, the mount point's vnodes were scanned
twice. Once to update the changed inodes, and a second time to update
changed quota information. This change merges these two scans into a
single scan which does both inode and quota updates.

MFC after: 7 days
2020-01-14 22:27:46 +00:00
Jeff Roberson
815b74866a Fix a long standing bug in journaled soft-updates. The dirrem structure
needs to handle file removal, directory removal, file move, directory move,
etc.  The code in handle_workitem_remove() needs to propagate any completed
journal entries to the write that will render the change stable.  In the
case of a moved directory this means the new parent.  However, for an
overwrite that frees a directory (DIRCHG) we must move the jsegdep to the
removed inode to be released when it is stable in the cg bitmap or the
unlinked inode list.  This case was previously unhandled and caused a
panic.

Reported by:	mckusick, pho
Reviewed by:	mckusick
Tested by:	pho
2020-01-14 02:00:24 +00:00
Mateusz Guzik
f1fcaffd8e ufs: relax an overzealous assert added in r356671
Part of i_flag can persist across a drop to hold count of 0, at which
point the vnode is taken off the lazy list. Then whoever locks and unlocks
the vnode can trip on the assert.

This trips over kyua running a test untarring character devices to ufs.

Reported by:	lwhsu
2020-01-13 14:33:51 +00:00
Mateusz Guzik
cc3593fbd9 vfs: rework vnode list management
The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22997
2020-01-13 02:37:25 +00:00
Mateusz Guzik
80663cadb8 ufs: use lazy list instead of active list for syncer
Quota code is temporarily regressed to do a full vnode scan.

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22996
2020-01-13 02:35:15 +00:00
Mateusz Guzik
ac4ec14188 ufs: add a setter for inode i_flag field
This will be used later to add vnodes to the lazy list.

Reviewed by:	kib (previous version), jeff
Tested by:	pho (in a larger patch)
Differential Revision:	https://reviews.freebsd.org/D22994
2020-01-13 02:31:51 +00:00
Kirk McKusick
27a6257130 When a read error occurs while fetching a directory block to delete
or rename an entry in it, properly reset the link count of the inode
associated with the entry that was to have been changed.

Tested by: Peter Holm
MFC after: 7 days
2020-01-11 03:18:47 +00:00
Mateusz Guzik
8dbc63520c vfs: drop thread argument from vinactive 2020-01-05 00:59:47 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Konstantin Belousov
4085590d39 ufs: do not leave non-reclaimed vnodes with zero i_mode around.
After a recent change, vput() relocks even the exclusively locked
vnode before inactivating it.  Before that, UFS could safely
instantiate a vnode for cleared inode, then the last vput() after
ffs_vgetf() noted that ip->i_mode == 0 and recycled.  Now, it is
possible for other threads to note the half-constructed vnode, e.g. to
insert it into hash, which makes other threads to use it despite mode
is zero, before inactivation and reclaim.

Handle the found cases in SU code, by explicitly doing reclaim.
Assert that other places get fully constructed inode from ffs_vgetf(),
which cannot be cleared before dependencies are resolved.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-12-27 16:43:34 +00:00
Warner Losh
56e4d45895 Drop a sleepable lock when we plan on sleeping
g_io_speedup waits for the completion of the speedup request before proceeding
using biowait(), but check_clear_deps is called with the softdeps lock held
(which is non-sleepable). It's safe to drop this lock around the call to
speedup, so do that.

Submitted by: Peter Holm
Reviewed by: kib@
2019-12-18 16:01:15 +00:00
Warner Losh
22dd705f1f Add BIO_SPEEDUP signalling to UFS
When we have a resource shortage in UFS, send down a BIO_SPEEDUP to
give the CAM I/O scheduler a heads up that we have a resource shortage
and that it should bias its decisions knowing that.

Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
2019-12-17 00:13:40 +00:00
Mateusz Guzik
6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tesetd by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
Konstantin Belousov
332f313c45 UFS: implement VOP_INACTIVE()
The checks literally repeat conditions that make ufs_inactive() to
take some actions.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D22616
2019-12-10 14:07:05 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Kirk McKusick
d00066a5f9 Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2019-12-03 23:07:09 +00:00
Chuck Silvers
2ac044e6bc As part of creating a snapshot, set fs->fs_fmod to 0 in the snapshot image
because nothing ever changes this field for read-only mounts and we want
to verify that it is still 0 when we unmount.

Reviewed by:	mckusick
Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2019-11-28 00:37:43 +00:00
Chuck Silvers
2bcfb938f4 In ffs_freefile(), use a separate variable to hold the inode number within
the cg rather than reusuing "ino" for this purpose.  This reduces the diff
for an upcoming change that improves handling of I/O errors.

No functional change.

Reviewed by:	mckusick
Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2019-11-25 19:31:38 +00:00
Kirk McKusick
486b9a61f7 Add some KASSERTs. Reacquire a mutex after a kernel printf rather
than holding it during the printf. White space cleanup.

Sponsored by: Netflix
2019-11-20 01:10:01 +00:00
Chuck Silvers
b0cf923749 In ufs_dir_dd_ino(), always initialize *dd_vp since the caller expects it.
Reviewed by:	kib, mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
2019-11-12 00:32:33 +00:00
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Kirk McKusick
1a75045196 After the unlink() of one name of a file with multiple links, a
stat() of one of the remaining names of the file does not show an
updated ctime (inode modification time) until several seconds after
the unlink() completes. The problem only occurs when the filesystem
is running with soft updates enabled. When running with soft updates,
the ctime is not updated until the soft updates background process
has settled all the needed I/O operations.

This commit causes the ctime to be updated immediately during the
unlink(). A side effect of this change is that the ctime is updated
again when soft updates has finished its processing because that
is the time that is correct from the perspective of programs that
look at the disk (like dump). This change does not cause any extra
I/O to be done, it just ensures that stat() updates the ctime before
handing it back.

PR:           241373
Reported by:  Alan Somers
Tested by:    Alan Somers
MFC after:    3 days
Sponsored by: Netflix
2019-10-24 21:28:37 +00:00
Kirk McKusick
7792f70137 Soft updates needs to keep an on-disk linked list of inodes that
have been unlinked, but are still referenced by open file descriptors.
These inodes cannot be freed until the final file descriptor reference
has been closed. If the system crashes while they are still being
referenced, these inodes and their referenced blocks need to be
freed by fsck. By having them on a linked list with the head pointer
in the superblock, fsck can quickly find and process them rather
than having to check every inode in the filesystem to see if it is
unreferenced.

When updating the head pointer of this list of unlinked inodes in
the superblock, the superblock check-hash was not getting updated.
If the system crashed with the incorrect superblock check-hash, the
superblock would appear to be corrupted. This patch ensures that
the superblock check-hash is updated when updating the head pointer
of the unlinked inodes list.

There is no need to MFC as superblock check hashes first appeared in
13.0.

Tested by:    Peter Holm
Sponsored by: Netflix
2019-10-24 19:47:18 +00:00
Mark Johnston
c456a0a1a6 Abbreviate softdep lock names.
The softdep lock names were unusually long and tended to stick out in
lock profiling reports.  Abbreviate them and make them consistent with
our conventional style for lock names.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22042
2019-10-18 17:01:27 +00:00
Mateusz Guzik
e35cd9e38f ufs: add root vnode caching
See r353150.

Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D21646
2019-10-06 22:18:03 +00:00
Eric van Gyzen
fdd888dee3 Add CTLFLAG_STATS to several debug.softdep sysctl OIDs
Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:44:52 +00:00
Kirk McKusick
44d37182ce Update ffs_getcg() function to accept a flags parameter to be passed
to breadn_flags() in preparation for later need when doing forcible
unmount when disk dies or is removed.

No functional change.

Sponsored by: Netflix
2019-10-04 05:28:36 +00:00
Mateusz Guzik
4cace859c2 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
Mateusz Guzik
e87f3f72f1 vfs: manage mnt_writeopcount with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21575
2019-09-16 21:33:16 +00:00
Konstantin Belousov
d89ac450a7 Remove some unneeded vfs_busy() calls in SU code.
When softdep_fsync() is running, a caller must already started write
for the mount point.  Since unmount or remount to ro suspends mount
point, it cannot run in parallel with softdep_fsync(), which makes
vfs_busy() call there not needed.

Doing blocking vfs_busy() there effectively causes lock order reversal
between vn_start_write() and setting MNTK_UNMOUNT, because
vfs_busy(mp, 0) sleeps waiting for MNTK_UNMOUNT becoming clear, while
unmount sets the flag and starts the suspension.

Note that all other uses of vfs_busy() in SU code are non-blocking.

Reported by:	chs by mckusick
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-09 11:22:38 +00:00
Konstantin Belousov
f923be6b9a Properly check for writers when fetching quotas for writeable vnodes
in UFS quotaon().

Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21560
2019-09-07 15:57:23 +00:00
Conrad Meyer
f3cf622523 ufs: Remove redundant brelse() after r294954
Same automation.

No functional change.
2019-09-06 08:08:33 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Konstantin Belousov
1604022248 UFS: stop reusing the vnode for reallocated inode.
In ffs_valloc(), force reclaim existing vnode on inode reuse, instead
of trying to re-initialize the same vnode for new purposes.  This is
done in preparation of changes to the vp->v_object lifecycle handling.

A new FFSV_REPLACE flag to ffs_vgetf() directs the function to
vgone(9) the vnode if found in vfs hash, instead of returning it.

Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:45:23 +00:00
Konstantin Belousov
e671edac06 De-commision the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-23 19:40:10 +00:00
Kirk McKusick
5a0d467f5f Clarify comment that describes how the FS_METACKHASH is managed.
MFC after: 3 days
2019-08-13 20:56:44 +00:00
Kirk McKusick
9454b4fd78 A race condition existed between the time a UFS/FFS superblock check
hash was computed and the time that the superblock was copied to a
buffer to be written to disk. The result was a failed superblock
check hash the next time that the superblock was read.

The fix is to compute the check hash after the superblock has been
copied to a buffer to be written.

PR:           236504
Reported by:  Peter Holm
Tested by:    Peter Holm
Sponsored by: Netflix
2019-08-06 18:10:34 +00:00
Kirk McKusick
90381b1ca9 When updating the user or group disk quotas for the return of inodes or
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential .

Reported by:    Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:    FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC:            1 week
Sponsored by:   Netflix
2019-07-31 22:44:58 +00:00
Rick Macklem
b4c9955e41 Lock the vnode before calling ufs_bmap_seekdata().
r346932 replaced a call to vn_bmap_seekhole() with a call to
ufs_bmap_seekdata(). Although vn_bmap_seekhole() locks the vnode,
ufs_bmap_seekdata() assumes it is already locked.
This patch adds locking of the vnode before the ufs_bmap_seekdata() call.
If the vn_lock() call fails, it returns EBADF since that is the normal
error returned when a file system is forced dismounted and is already
listed as an error return in the lseek(2) man page.

Discussed with:	markj
Reviewed by:	kib
2019-07-27 01:52:34 +00:00
Kirk McKusick
fdf34aa3a5 The error reported in FS-14-UFS-3 can only happen on UFS/FFS
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.

A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.

This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.

Reported by:  Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:  FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by:  kib
Sponsored by: Netflix
2019-07-17 22:07:43 +00:00
Kirk McKusick
ba554157a3 Style.
No change intended.
2019-07-16 23:39:39 +00:00
Kirk McKusick
1fd136ec5e When a process attempts to allocate space on a full filesystem, a
filesystem full message is sent to the offending process or the
kernel log if the offending process cannot be identified.

To prevent an explotion of messages, the kernel ppsratecheck()
function is used to limit the messages to one per second. This
revision changes the variable that tracks the rate of these messages
from a systemwide limit to a per-filesystem limit by moving it from
a global variable to a variable in the ufsmount structure.

Suggested by: kib
Reviewed by:  kib
Sponsored by: Netflix
2019-07-16 23:12:27 +00:00
Kirk McKusick
daba4da81d Add a new "untrusted" option to the mount command. Its purpose
is to notify the kernel that the file system is untrusted and it
should use more extensive checks on the file-system's metadata
before using it. This option is intended to be used when mounting
file systems from untrusted media such as USB memory sticks or other
externally-provided media.

It will initially be used by the UFS/FFS file system, but should
likely be expanded to be used by other file systems that may appear
on external media like msdosfs, exfat, and ext2fs.

Reviewed by:  kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20786
2019-07-01 23:22:26 +00:00
Mark Johnston
6137883ff3 Remove references to splbio in ffs_softdep.c.
Assert that the per-mountpoint softdep mutex is held in modified
functions that do not already have this assertion.  No functional
change intended.

Reviewed by:	kib, mckusick (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20741
2019-06-26 16:28:42 +00:00
Alan Somers
d49b446bfb Add FIOBMAP2 ioctl
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux.  FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.

Reviewed by:	mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20705
2019-06-20 14:13:10 +00:00
Xin LI
f89d207279 Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR:		229763
Submitted by:	Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision:	https://reviews.freebsd.org/D20193
2019-06-17 19:49:08 +00:00
Kirk McKusick
e94828443c Add a missing bresle() in seldom-used error return. 2019-05-28 17:31:35 +00:00
Kirk McKusick
af6aeacb3e Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS
as appropriate. No functional change intended.

Suggested-by: markj
2019-05-28 16:32:04 +00:00
Kirk McKusick
298184acb8 Add function name and line number debugging information to softupdates
worklist structures to help track their movement between work lists.
No functional change to the operation of soft updates intended.
2019-05-27 06:22:43 +00:00
Alan Somers
65417f5e27 Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it.  Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20377
2019-05-24 20:27:50 +00:00
Conrad Meyer
daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Mark Johnston
9e56947ffc Ensure that error is initialized in ufs_bmap_seekdata().
Reported and tested by:	jhibbits
MFC with:	r346932
Sponsored by:	The FreeBSD Foundation
2019-05-05 16:57:03 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Kirk McKusick
44b193b09e Zero out the file directory entry metadata to reduce disk
scavenging disclosure.

Submitted by: David G. Lawrence <dg@dglawrence.com>
MFC after:    1 week
2019-05-04 18:00:57 +00:00
Kirk McKusick
0061238fb0 This update eliminates a kernel stack disclosure bug in UFS/FFS
directory entries that is caused by uninitialized directory entry
padding written to the disk. It can be viewed by any user with read
access to that directory. Up to 3 bytes of kernel stack are disclosed
per file entry, depending on the the amount of padding the kernel
needs to pad out the entry to a 32 bit boundry. The offset in the
kernel stack that is disclosed is a function of the filename size.
Furthermore, if the user can create files in a directory, this 3
byte window can be expanded 3 bytes at a time to a 254 byte window
with 75% of the data in that window exposed. The additional exposure
is done by removing the entry, creating a new entry with a 4-byte
longer name, extracting 3 more bytes by reading the directory, and
repeating until a 252 byte name is created.

This exploit works in part because the area of the kernel stack
that is being disclosed is in an area that typically doesn't change
that often (perhaps a few times a second on a lightly loaded system),
and these file creates and unlinks themselves don't overwrite the
area of kernel stack being disclosed.

It appears that this bug originated with the creation of the Fast
File System in 4.1b-BSD (Circa 1982, more than 36 years ago!), and
is likely present in every Unix or Unix-like system that uses
UFS/FFS. Amazingly, nobody noticed until now.

This update also adds the -z flag to fsck_ffs to have it scrub
the leaked information in the name padding of existing directories.
It only needs to be run once on each UFS/FFS filesystem after a
patched kernel is installed and running.

Submitted by: David G. Lawrence <dg@dglawrence.com>
Reviewed by:  kib
MFC after:    1 week
2019-05-03 21:54:14 +00:00
Kirk McKusick
ab2214d400 Simplify calculation of DIRECTSIZ. No functional change intended.
Suggested by: kib
MFC after:    1 week
2019-05-03 21:46:25 +00:00
Mark Johnston
cc2c33dfb1 Optimize lseek(SEEK_DATA) on UFS.
This version fixes the problems identified in r345244.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19598
2019-04-29 22:05:26 +00:00
Konstantin Belousov
5ffc99e2e4 Handle races when remounting UFS volume from ro to rw.
In particular, ensure that writers are not unleashed before SU
structures are initialized.  Also, correctly handle MNT_ASYNC before
this.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-04-08 15:20:05 +00:00
Mariusz Zaborski
a1304030b8 Introduce funlinkat syscall that always us to check if we are removing
the file associated with the given file descriptor.

Reviewed by:	kib, asomers
Reviewed by:	cem, jilles, brooks (they reviewed previous version)
Discussed with:	pjd, and many others
Differential Revision:	https://reviews.freebsd.org/D14567
2019-04-06 09:34:26 +00:00
Kirk McKusick
69166928c7 This is an additional and hopefully final fix for bug report 230962.
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.

Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which lead to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.

PR:           230962
Reported by:  2t8mr7kx9f@protonmail.com
Reviewed by:  kib
Tested by:    Peter Holm
MFC after:    1 week
Sponsored by: Netflix
2019-03-20 23:11:05 +00:00
Mark Johnston
783efeb544 Revert r345244 for now.
The code which advances the block number is simplistic and is not
correct when the starting offset is non-zero.  Revert the change until
this is fixed.
2019-03-18 05:03:55 +00:00
Mark Johnston
1a7f456a4b Fix the gcc build (-Wstrict-prototypes) after r345244.
Reported by:	jenkins
MFC with:	r345244
2019-03-17 18:06:13 +00:00
Mark Johnston
c676692c61 Optimize lseek(SEEK_DATA) on UFS.
The old implementation, at the VFS layer, would map the entire range of
logical blocks between the starting offset and the first data block
following that offset.  With large sparse files this is very
inefficient.  The VFS currently doesn't provide an interface to improve
upon the current implementation in a generic way.

Add ufs_bmap_seekdata(), which uses the obvious algorithm of scanning
indirect blocks to look for data blocks.  Use it instead of
vn_bmap_seekhole() to implement SEEK_DATA.

Reviewed by:	kib, mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19598
2019-03-17 17:34:06 +00:00
Kirk McKusick
42a5a356a8 Add KASSERT to the softdep_disk_write_complete() function in the
soft dependency code to ensure that it will be able to avoid a
dangling dependency.

Sponsored by: Netflix
2019-03-12 00:10:31 +00:00
Kirk McKusick
3532718257 Give more complete information in INVARIANTS panic messages at end of
the ffs_truncate() function.

Sponsored by: Netflix
2019-03-11 23:53:56 +00:00
Kirk McKusick
a9f59cc029 Augment the UFS filesystem specific print function (called by the
kernel vn_printf() routine when printing out vnodes associated with
a UFS filesystem) to also include the inode's link count, effective
link count, generation number, owner, group, flags, size, and for
UFS2 filesystems, the extent size.

Sponsored by: Netflix
2019-03-11 22:05:34 +00:00
Simon J. Gerraty
f5fdf82d82 Add _PC_ACL_* to vop_stdpathconf
This avoid EINVAL from tmpfs etc.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D19512
2019-03-11 20:40:56 +00:00