The K&R style in UFS and other places in the tree's days are numbered
as this syntax is removed in C2x proposal N2432:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf
Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.
Requested by: Warner Losh
MFC after: 2 weeks
The "offset" argument to vn_io_fault_pgmove() is supposed to be
the offset within the page, but for ffs we currently use the offset
within the block. When the block size is at least as large as the
page size then these values are the same, but when the page size is
larger than the block size then we need to add the offset of
the block within the page as well.
Sponsored by: Netflix
Reviewed by: mckusick, kib, markj
Differential Revision: https://reviews.freebsd.org/D34835
when the vnode is doomed after relock. The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it is
still belongs to UFS, which is determined by non-NULL v_data. Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.
Add macro IS_UFS() which incapsulates the v_data != NULL, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
Mark variables as __diagused for invariant-only vars
Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577
The trunc_dependencies() issue was reported by Alexander Lochmann
<alexander.lochmann@tu-dortmund.de>, who found the problem by performing
lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
There are three issues with change that stopped truncating ea area before
write, and resulted in possible zero tail in the ea area:
- Truncate to zero checked i_ea_len after the reference was dropped,
making the last drop effectively truncate to zero length always.
- Loop to fill uio for zeroing specified too large length, that triggered
assert in normal situation.
- Integrity check could trip over the tail, instead we must allow
partial header or header with zero length, and clamp ea image in
memory at it.
Reported by: arichardson
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Fixup: 5e198e7646
Differential Revision: https://reviews.freebsd.org/D28999
softdep_prealloc() must be called to ensure enough journal space is
available, before ffs_extwrite(). Also it must be done before taking
ffs_lock_ea(), because it calls ffs_syncvnode(), potentially dropping
the vnode lock.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
ffs_lock_ea is after the vnode lock, so vnode must not be relocked under
lock_ea. Move ffs_truncate() call in ffs_close_ea() after the lock_ea is
dropped, and only truncate to length zero, since this is the only mode
supported by ffs_truncate() for EAs. Previously code did truncation and
then write.
Zero the part of the ext area that is unused, if truncation is due but not
done because ea area is not zero-length.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Do it in ffs_write(), where we can gracefuly handle relock and its
consequences. In particular, recheck the v_data to see if the vnode
reclamation ended, and return EBADF when we cannot proceed with the
write.
Reviewed by: mckusick
Reported by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.
Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679
This catches both missed processing of IN_ENDOFF and missed application
of VOP_VPUT_PAIR() after VOP that created an entry in the directory.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In ufs_rename case, tdvp is locked from the place where ufs_direnter()
is done till VOP_VPUT_PAIR(), which means that we no longer need to specially
handle rename in ufs_direnter(). Truncation, if possible, is done in the
same way in ffs_vput_pair() both for rename and other VOPs calling
ufs_direnter(). Remove isrename argument and set IN_ENDOFF if
ufs_direnter() succeeded and directory needs truncation.
In ffs_vput_pair(), stop verifying the condition that directory needs
truncation when IN_ENDOFF is set, instead assert that the condition is
true.
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
VOP_VPUT_PAIR() provides the hook to do the truncation right before
unlock, which is required since truncation might need to fsync(), which
itself might unlock the directory vnode.
Set new flag IN_ENDOFF which indicates that i_endoff is valid and should
be checked against inode size. Excessive size is chomped, but this
operation is advisory and failure to truncate should not result in the
failure of the main VOP.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In particular, if unlock_vp is false, save vp's inode number and
generation. If ffs_inotovp() can re-create the vnode with the same
number and generation after we finished with handling dvp, then we most
likely raced with unmount, and were able to restore atomicity of open.
We use FFSV_REPLACE_DOOMED there, to drop the old vnode.
This additional recovery is not strictly required, but it improves the
quality of the implementation.
Suggested by: mckusick
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
It cleans IN_NEEDSYNC flag on dvp before returning, by applying
ffs_syncvnode() until success or an error different from ERELOOKUP.
IN_NEEDSYNC cleanup is required to avoid creating holes in the directories
when extended into indirect block.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
If it is cleaned before the sync, other threads might see the inode without
the flag set, because syncing could unlock it.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
as it is done in other places. Header files might need options defined
for correct operation.
Reviewed by: chs, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Respect the new IO_DATASYNC flag when performing synchronous writes.
Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some
cases, matching the behaviour of fdatasync(2).
Reviewed by: kib
Differential Review: https://reviews.freebsd.org/D25160
When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock. Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.
More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed. This creates another instance of the
same LoR as was fixed in softdep_sync().
Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps(). The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().
Due to call sequence
ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed. All callers of the function were inspected.
Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP. For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied. Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes. [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]
Similarly, unsafe calls to fsync in snapshot creation code were
plugged. A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.
It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive(). In this case, dropping the
vnode lock is not safe. Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT. ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.
In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().
In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored. Enabled under DIAGNOSTICS.
UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents. If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.
Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter. Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.
Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock. It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.
In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136
Prior versions of FreeBSD (11.x) may have produced a corrupt extattr file.
(Specifically, r312416 accidentally fixed this defect by removing a strcpy.)
CURRENT FreeBSD supports disk images from those prior versions of FreeBSD.
Validate the internal structure as soon as we read it in from disk, to
prevent these extattr files from causing invariants violations and DoS.
Attempting to access the extattr portion of these files results in
EINTEGRITY. At this time, the only way to repair files damaged in this way
is to copy the contents to another file and move it over the original.
PR: 244089
Reported by: Andrea Venturoli <ml AT netfence.it>
Reviewed by: kib
Discussed with: mckusick (earlier draft)
Security: no
Differential Revision: https://reviews.freebsd.org/D27010
ACLs are not supported, meaning their presence will force the use of the old lookup.
Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25579
It is very conservative. Only spinning when LK_ADAPTIVE is passed, only on
exclusive lock and never when any waiters are present. buffer cache is remains
not spinning.
This reduces total sleep times during buildworld etc., but it does not shorten
total real time (culprits are contention in the vm subsystem along with slock +
upgrade which is not covered).
For microbenchmarks: open3_processes -t 52 (open/close of the same file for
writing) ops/s:
before: 258845
after: 801638
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D25753
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.
This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.
Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072
The fdatasync() description in POSIX specifies that
all I/O operations shall be completed as defined for synchronized I/O
data integrity completion.
and then the explanation of Synchronized I/O Data Integrity Completion says
The write is complete only when the data specified in the write
request is successfully transferred and all file system
information required to retrieve the data is successfully
transferred.
For UFS this means that all pointers must be on disk. Indirect
pointers already contribute to the list of dirty data blocks, so only
direct blocks and root pointers to indirect blocks, both of which
reside in the inode block, should be taken care of. In ffs_balloc(),
mark the inode with the new flag IN_IBLKDATA that specifies that
ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the
inode block.
Reviewed by: mckusick
Discussed with: tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25072
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.
The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.
There are two cases for disk I/O errors:
- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.
- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.
For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.
Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.
Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.
The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.
Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088
Part of i_flag can persist across a drop to hold count of 0, at which
point the vnode is taken off the lazy list. Then whoever locks and unlocks
the vnode can trip on the assert.
This trips over kyua running a test untarring character devices to ufs.
Reported by: lwhsu
Quota code is temporarily regressed to do a full vnode scan.
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22996
This will be used later to add vnodes to the lazy list.
Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.
No functional change.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
Implement ffs_getpages_async(), which when possible calls the asynchronous
flavor of the generic pager's getpages function. When the underlying
block size is larger than the system page size, however, it will invoke
the (synchronous) buffer cache pager, followed by a call to the client
completion routine. This retains true asynchronous completion in the most
common (block size <= page size) case, which is important for the performance
of the new sendfile(2). The behavior in the larger block size case mirrors
the default implementation of VOP_GETPAGES_ASYNC, which most other
filesystems use anyway as they do not override the getpages_async method.
PR: 235708
Reported by: pho
Reviewed by: kib, glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19340
vnodeops make FFS1's fifoops1 use ffs_lock. Also delete ffs_reallocblks
from fifoops1 which is needed only for fifoops2 because of its
support for extended attributes that need to allocate blocks.
Suggested by: kib
It can occur during ordinary use of softupdates, or perhaps if writes to the
underlying media fail (causing bufs to be redirtied). Either way, it is not
particularly actionable.
Reviewed by: imp, kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16258
- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
was performed before the buffer is created, and EJUSTRETURN returned
in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
allocating the pages for it), copying from zero_region instead.
The end result is less page allocations and buffer recycling when a
hole is read, which is important for some benchmarks.
Requested and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14917
Followup to r313780. Also prefix ext2's and nandfs's versions with
EXT2_ and NANDFS_.
Reported by: kib
Reviewed by: kib, mckusick
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9623