Commit Graph

2286 Commits

Author SHA1 Message Date
Thomas Munro
e7347be9e3 ffs: Support O_DSYNC.
Respect the new IO_DATASYNC flag when performing synchronous writes.
Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some
cases, matching the behaviour of fdatasync(2).

Reviewed by: kib
Differential Review: https://reviews.freebsd.org/D25160
2021-01-08 13:15:56 +13:00
Mateusz Guzik
3e506a67bb vfs: add v_irflag accessors
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27793
2021-01-03 06:50:06 +00:00
Mateusz Guzik
9997aedb8f ufs: use VNPASS when asserting on a vnode in ufs_read_pgcache 2021-01-01 03:14:11 +00:00
Mark Johnston
ace3d9475c ffs: Avoid out-of-bounds accesses in the fs_active bitmap
We use a bitmap to track which cylinder groups have changed between
snapshot creation and filesystem suspension.  The "legs" of the bitmap
are four bytes wide (see ACTIVESET()) so we must round up the allocation
size to a multiple of four bytes.

I believe this bug is harmless since UMA/kmem_* will both pad the
allocation and zero the full allocation.  Note that malloc() does inline
zeroing when the allocation size is known at compile-time.

Reported by:	pho (using KASAN)
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27731
2020-12-23 11:16:40 -05:00
Ryan Libby
93dba42c0e ffs: quiet -Wstrict-prototypes
Reviewed by:	kib, markj, mckusick
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27558
2020-12-11 22:51:57 +00:00
Kirk McKusick
bb3c01ec79 Document the BA_CLRBUF flag used in ufs and ext2fs filesystems.
Suggested by: kib
MFC after:    3 days
Sponsored by: Netflix
2020-12-06 20:50:21 +00:00
Konstantin Belousov
2c7ada9917 ufs: handle two more cases of possible VNON vnode returned from VFS_VGET().
Reported by:	kevans
Reviewed by:	mckusick, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27457
2020-12-06 18:09:14 +00:00
Konstantin Belousov
21a45add50 ffs: do not read full direct blocks if they are going to be overwritten.
BA_CLRBUF specifies that existing context of the block will be
completely overwritten by caller, so there is no reason to spend io
fetching existing data.  We do the same for indirect blocks.

Reported by:	tmunro
Reviewed by:	mckusick, tmunro
Tested by:	pho, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27353
2020-11-30 17:03:26 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Konstantin Belousov
92bcefd1d2 clear_inodedeps: handle ERELOOKUP from ffs_syncvnode().
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-11-26 18:03:24 +00:00
Konstantin Belousov
07ef907f6e ffs_softdep.c: get_parent_vp(): Fix bp lock leak when inum inode was already freed.
Reported by:	markj, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-11-25 17:12:21 +00:00
Konstantin Belousov
8a1509e442 Handle LoR in flush_pagedep_deps().
When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock.  Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed.  This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps().  The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
     ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed.  All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP.  For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied.  Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes.  [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged.  A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive().  In this case, dropping the
vnode lock is not safe.  Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT.  ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:30:10 +00:00
Konstantin Belousov
738ea0010b Add ffs_inode_bwrite() helper.
In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:19:59 +00:00
Konstantin Belousov
7b795aa3c0 Revert r367669 to re-commit with proper message 2020-11-14 05:19:44 +00:00
Konstantin Belousov
c0d2077f41 Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:17:04 +00:00
Konstantin Belousov
61846fc4dc Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:10:39 +00:00
Mark Johnston
f44994874b ffs: Clamp BIO_SPEEDUP length
On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by
softdep_request_cleanup() may be negative when assigned to bp->b_bcount,
which has type "long".

Clamp the size to LONG_MAX.  Also convert the unused g_io_speedup() to
use an off_t for the magnitude of the shortage for consistency with
softdep_send_speedup().

Reviewed by:	chs, kib
Reported by:	pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27081
2020-11-11 13:48:07 +00:00
Conrad Meyer
e6790841f7 UFS2: Fix DoS due to corrupted extattrfile
Prior versions of FreeBSD (11.x) may have produced a corrupt extattr file.
(Specifically, r312416 accidentally fixed this defect by removing a strcpy.)
CURRENT FreeBSD supports disk images from those prior versions of FreeBSD.
Validate the internal structure as soon as we read it in from disk, to
prevent these extattr files from causing invariants violations and DoS.

Attempting to access the extattr portion of these files results in
EINTEGRITY.  At this time, the only way to repair files damaged in this way
is to copy the contents to another file and move it over the original.

PR:		244089
Reported by:	Andrea Venturoli <ml AT netfence.it>
Reviewed by:	kib
Discussed with:	mckusick (earlier draft)
Security:	no
Differential Revision:	https://reviews.freebsd.org/D27010
2020-10-30 19:00:42 +00:00
Mateusz Guzik
4bfebc8d2c cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename 2020-10-30 10:46:35 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Kirk McKusick
996d40f91d Various new check-hash checks have been added to the UFS filesystem
over various major releases. Superblock check hashes were added for
the 12 release and cylinder-group and inode check hashes will appear
in the 13 release.

When a disk with a UFS filesystem is writably mounted, the kernel
clears the feature flags for anything that it does not support. For
example, if a UFS disk from a 12-stable kernel is mounted on an
11-stable system, the 11-stable kernel will clear the flag in the
filesystem superblock that indicates that superblock check-hashs
are being maintained. Thus if the disk is later moved back to a
12-stable system, the 12-stable system will know to ignore its
incorrect check-hash.

If the only filesystem modification done on the earlier kernel is
to run a utility such as growfs(8) that modifies the superblock but
neither updates the check-hash nor clears the feature flag indicating
that it does not support the check-hash, the disk will fail to mount
if it is moved back to its original newer kernel.

This patch moves the code that clears the filesystem feature flags
from the mount code (ffs_mountfs()) to the code that reads the
superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
code and is imported into libufs(3), all the filesystem utilities
will now also clear these flags when they make modifications to the
filesystem.

As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
and repair bad superblock check-hashes rather than refusing to run.
This change allows fsck to recover filesystems that have been impacted
by utilities older than those created after this change and is a
sensible thing to do in any event.

Reported by:  John Baldwin (jhb@)
MFC after:    2 weeks
Sponsored by: Netflix
2020-10-25 00:43:48 +00:00
Mateusz Guzik
25fb30bd9a vfs: drop spurious cache_purge on rmdir
The removed directory gets cache_purged which is sufficient to remove any entries
related to the parent.

Note only tmpfs, ufs and zfs are patched.
2020-10-23 15:50:49 +00:00
Brooks Davis
44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Mateusz Guzik
e9fb2bd9b8 ufs: catch up with removal of thread argument from VOP_INACTIVE 2020-10-20 09:46:20 +00:00
Konstantin Belousov
e1ef4c29a3 Do not leak B_BARRIER.
Normally when a buffer with B_BARRIER is written, the flag is cleared
by g_vfs_strategy() when creating bio.  But in some cases FFS buffer
might not reach g_vfs_strategy(), for instance when copy-on-write
reports an error like ENOSPC.  In this case buffer is returned to
dirty queue and might be written later by other means.  Among then
bdwrite() reasonably asserts that B_BARRIER is not set.

In fact, the only current use of B_BARRIER is for lazy inode block
initialization, where write of the new inode block is fenced against
cylinder group write to mark inode as used.  The situation could be
seen that we break dependency by updating cg without written out
inode.  Practically since CoW was not able to find space for a copy of
inode block, for the same reason cg group block write should fail.

Reported by:	pho
Discussed with:	chs, imp, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26511
2020-10-08 22:41:02 +00:00
Chuck Silvers
8b88330ed6 ufs: restore uniqueness of st_dev as returned by ufs_stat()
switch ufs_stat() to use the same value for st_dev as was used by
the previous ufs_getattr() stat path.

Submitted by:	gallatin
Reviewed by:	mjg, imp, kib, mckusick
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26596
2020-10-05 18:17:50 +00:00
Konstantin Belousov
3c484f325e Convert page cache read to VOP.
There are several negative side-effects of not calling into VOP layer
at all for page cache reads.  The biggest is the missed activation of
EVFILT_READ knotes.

Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.

Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.

Reviewed by:	markj
Tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 22:06:36 +00:00
Konstantin Belousov
96474d2a3f Do not copy vp into f_data for DTYPE_VNODE files.
The pointer to vnode is already stored into f_vnode, so f_data can be
reused.  Fix all found users of f_data for DTYPE_VNODE.

Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.

Reviewed by:	markj (previous version)
Discussed with:	freqlabs (openzfs chunk)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 21:55:21 +00:00
Mateusz Guzik
d90f2c3617 ufs: clean up empty lines in .c and .h files 2020-09-01 21:23:00 +00:00
Mateusz Guzik
39f8815070 cache: add cache_rename, a dedicated helper to use for renames
While here make both tmpfs and ufs use it.

No fuctional changes.
2020-08-20 10:05:46 +00:00
Mateusz Guzik
8f226f4c23 vfs: remove the always-curthread td argument from VOP_RECLAIM 2020-08-19 07:28:01 +00:00
Mateusz Guzik
7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Konstantin Belousov
779ad2acf1 VMIO reads: enable for UFS
Move v_object creation earlier, so that VIRF_PGREAD is never set if
v_object is NULL.  There is no much harm from instantiating v_object
when later check for append-only flags disallows open.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D25968
2020-08-16 21:07:19 +00:00
Mateusz Guzik
a92a971bbb vfs: remove the thread argument from vget
It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)
2020-08-16 17:18:54 +00:00
Mateusz Guzik
03337743db vfs: clean MNTK_FPLOOKUP if MNT_UNION is set
Elides checking it during lookup.
2020-08-10 11:51:21 +00:00
Mateusz Guzik
76dc5d3224 ufs: add VOP_STAT handler 2020-08-07 23:08:17 +00:00
Mateusz Guzik
d292b1940c vfs: remove the obsolete privused argument from vaccess
This brings argument count down to 6, which is passable without the
stack on amd64.
2020-08-05 09:27:03 +00:00
Mateusz Guzik
e5e10c82ec ufs: only pass LK_ADAPTIVE if LK_NODDLKTREAT is set
This restores the pre-adaptive spinning state for SU which livelocks
otherwise. Note this is a bug in SU.

Reported by:	pho
2020-08-04 23:09:15 +00:00
Mateusz Guzik
9d5a594f0b ufs: add support for lockless lookup
ACLs are not supported, meaning their presence will force the use of the old lookup.

Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25579
2020-07-25 10:38:05 +00:00
Mateusz Guzik
31ad4050fe lockmgr: add adaptive spinning
It is very conservative. Only spinning when LK_ADAPTIVE is passed, only on
exclusive lock and never when any waiters are present. buffer cache is remains
not spinning.

This reduces total sleep times during buildworld etc., but it does not shorten
total real time (culprits are contention in the vm subsystem along with slock +
upgrade which is not covered).

For microbenchmarks: open3_processes -t 52 (open/close of the same file for
writing) ops/s:
before: 258845
after: 801638

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25753
2020-07-22 12:30:31 +00:00
Kirk McKusick
93440bbefd The binary representation of the superblock (the fs structure) is written
out verbatim to the disk: see ffs_sbput() in sys/ufs/ffs/ffs_subr.c.
It contains a pointer to the fs_summary_info structure. This pointer
value inadvertently causes garbage to be stored. It is garbage because
the pointer to the fs_summary_info structure is the address the then
current stack or heap. Although a mere pointer does not reveal anything
useful (like a part of a private key) to an attacker, garbage output
deteriorates reproducibility.

This commit zeros out the pointer to the fs_summary_info structure
before writing the out the superblock.

Reviewed by:  kib
Tested by:    Peter Holm
PR:           246983
Sponsored by: Netflix
2020-06-19 01:04:25 +00:00
Kirk McKusick
34816cb9ae Move the pointers stored in the superblock into a separate
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.

This cleanup of the superblock eases the task of the commit
that immediately follows this one.

Suggested by: brooks
Reviewed by:  kib
PR:           246983
Sponsored by: Netflix
2020-06-19 01:02:53 +00:00
Chuck Silvers
d9a8abf6c2 Move all of the functions in ffs_subr.c that are only used by the ufs kernel
module from that file into ffs_vfsops.c.  This fixes the build for kernel
configs that don't include FFS.

PR:		247256
Submitted by:	glebius
Reviewed by:	mckusick (earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25285
2020-06-17 23:39:52 +00:00
Rick Macklem
1f7104d720 Fix export_args ex_flags field so that is 64bits, the same as mnt_flags.
Since mnt_flags was upgraded to 64bits there has been a quirk in
"struct export_args", since it hold a copy of mnt_flags
in ex_flags, which is an "int" (32bits).
This happens to currently work, since all the flag bits used in ex_flags are
defined in the low order 32bits. However, new export flags cannot be defined.
Also, ex_anon is a "struct xucred", which limits it to 16 additional groups.
This patch revises "struct export_args" to make ex_flags 64bits and replaces
ex_anon with ex_uid, ex_ngroups and ex_groups (which points to a
groups list, so it can be malloc'd up to NGROUPS in size.
This requires that the VFS_CHECKEXP() arguments change, so I also modified the
last "secflavors" argument to be an array pointer, so that the
secflavors could be copied in VFS_CHECKEXP() while the export entry is locked.
(Without this patch VFS_CHECKEXP() returns a pointer to the secflavors
array and then it is used after being unlocked, which is potentially
a problem if the exports entry is changed.
In practice this does not occur when mountd is run with "-S",
but I think it is worth fixing.)

This patch also deleted the vfs_oexport_conv() function, since
do_mount_update() does the conversion, as required by the old vfs_cmount()
calls.

Reviewed by:	kib, freqlabs
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D25088
2020-06-14 00:10:18 +00:00
Kirk McKusick
513274c79c Clear the IN_SIZEMOD and IN_IBLKDATA flags only when doing a
synchronous inode update.

The IN_SIZEMOD and IN_IBLKDATA flags indicate changes to the
file size and block pointer fields in the inode. When these
fields have been changed, the fsync() and fsyncdata() system
calls must write the inode to ensure their semantics that the
file is on stable store.

The IN_SIZEMOD and IN_IBLKDATA flags cannot be cleared until
a synchronous write of the inode is done. If they are cleared
on an asynchronous write, then the inode may not yet have been
written to the disk when an fsync() or fsyncdata() call is done.
Absent these flags, these calls would not know that they needed
to write the inode. Thus, these flags only can be cleared on
synchronous writes of the inode. Since the inode will be locked
for the duration of the I/O that writes it to disk, no fsync()
or fsyncdata() will be able to run before the on-disk inode
is complete.

Reviewed by: kib
MFC with: -r361785
Differential revision:  https://reviews.freebsd.org/D25072
2020-06-06 20:17:56 +00:00
Kirk McKusick
52488b5148 Further evaluation of the POSIX spec for fdatasync() shows that it
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.

This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.

Reviewed by: kib
MFC with: -r361785
Differential revision:  https://reviews.freebsd.org/D25072
2020-06-05 01:00:55 +00:00
Stefan Eßer
23e84cf153 Fix obvious typo: IN_BLKDATA should be IN_IBLKDATA 2020-06-04 19:54:25 +00:00
Kirk McKusick
30296c428a Two additional places that need to identify IN_IBLKDATA.
Reviewed by: kib
MFC with: -r361785
Differential Revision:  https://reviews.freebsd.org/D25072
2020-06-04 18:35:21 +00:00
Konstantin Belousov
7428630b75 UFS: write inode block for fdatasync(2) if pointers in inode where allocated
The fdatasync() description in POSIX specifies that
    all I/O operations shall be completed as defined for synchronized I/O
    data integrity completion.
and then the explanation of Synchronized I/O Data Integrity Completion says
    The write is complete only when the data specified in the write
    request is successfully transferred and all file system
    information required to retrieve the data is successfully
    transferred.

For UFS this means that all pointers must be on disk. Indirect
pointers already contribute to the list of dirty data blocks, so only
direct blocks and root pointers to indirect blocks, both of which
reside in the inode block, should be taken care of. In ffs_balloc(),
mark the inode with the new flag IN_IBLKDATA that specifies that
ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the
inode block.

Reviewed by:	mckusick
Discussed with:	tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25072
2020-06-04 12:23:15 +00:00
Chuck Silvers
d79ff54b5c This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

   - ENXIO, which means that this disk is gone and the lower layers
     of the storage stack already guarantee that no future I/O to
     this disk will succeed.

   - EIO (or most other errors), which means that this particular
     I/O request has failed but subsequent I/O requests to this
     disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by:   kib
Tested by:     Peter Holm
Approved by:   mckusick (mentor)
Sponsored by:  Netflix
Differential Revision:  https://reviews.freebsd.org/D24088
2020-05-25 23:47:31 +00:00