folks running filesystems created on check-hash enabled kernels
(which I will call "new") on non-check-hash enabled kernels (which
I will call "old"). The idea here is to detect when a filesystem is
run on an old kernel and flag the filesystem so that when it gets
moved back to a new kernel, it will not start getting a slew of
check-hash errors.
Back when the UFS version 2 filesystem was created, it added a file
flag FS_INDEXDIRS that was to be set on any filesystem that kept
some sort of on-disk indexing for directories. The idea was precisely
to solve the issue we have today. Specifically that a newer kernel
that supported indexing would be able to tell that the filesystem
had been run on an older non-indexing kernel and that the indexes
should not be used until they had been rebuilt. Since we have never
implemented on-disk directory indices, the FS_INDEXDIRS flag is
cleared every time any UFS version 2 filesystem ever created is
mounted for writing.
This commit repurposes the FS_INDEXDIRS flag as the FS_METACKHASH
flag. Thus, the FS_METACKHASH is definitively known to have always
been cleared. The FS_INDEXDIRS flag has been moved to a new block
of flags that will always be cleared starting with this commit
(until they get used to implement some future feature which needs
to detect that the filesystem was mounted on a kernel that predates
the new feature).
If a filesystem with check-hashes enabled is mounted on an old
kernel the FS_METACKHASH flag is cleared. When that filesystem is
mounted on a new kernel it will see that the FS_METACKHASH flag has
been cleared and will clear all of the fs_metackhash flags. To get them
re-enabled the user must run fsck (in interactive mode without the
-y flag) which will ask for each supported check hash whether it
should be rebuilt and enabled. When fsck is run in its default preen
mode, it will just ignore the check hashes so they will remain
disabled.
The kernel has always disabled any check hash functions that it
does not support, so as more types of check hashes are added, we
will get a non-surprising result. Specifically if filesystems get
moved to kernels supporting fewer of the check hashes, those that
are not supported will be disabled. If the filesystem is moved back
to a kernel with more of the check-hashes available and fsck is run
interactively to rebuild them, then their checking will resume.
Otherwise just the smaller subset will be checked.
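The flag handling described above can be sketched roughly as follows.
This is a simplified userland model, not the kernel code: the struct,
the sk_ names and the bit values are assumptions made up for this
illustration; only FS_METACKHASH and fs_metackhash come from the text.

```c
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_FS_METACKHASH 0x0200u /* stands in for the FS_METACKHASH flag */
#define SK_CK_CYLGRP     0x0001u /* cylinder-group check hash */
#define SK_CK_OTHER      0x0002u /* some later kind of check hash */

struct sk_superblock {
	uint32_t flags;      /* stands in for fs_flags */
	uint32_t metackhash; /* stands in for fs_metackhash */
};

/* An old kernel clears the repurposed flag when mounting for writing. */
void
sk_mount_old(struct sk_superblock *sb)
{
	sb->flags &= ~SK_FS_METACKHASH;
}

/*
 * A new kernel distrusts every check hash if the marker flag is gone,
 * and in any case keeps only the kinds of check hash it supports.
 */
void
sk_mount_new(struct sk_superblock *sb, uint32_t supported)
{
	if ((sb->flags & SK_FS_METACKHASH) == 0)
		sb->metackhash = 0;
	sb->metackhash &= supported;
}
```

In this model nothing on the mount path turns the hashes back on; that
step is left to an interactive fsck run, as described above.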
A side effect of this commit is that filesystems running with
cylinder-group check hashes will stop having them checked until
fsck is run to re-enable them (since none of them currently have
the FS_METACKHASH flag set). So, if you want check hashes enabled
on your filesystems after booting a kernel with these changes, you
need to run fsck to enable them. Any newly created filesystems will
have check hashes enabled. If in doubt as to whether you have check
hashes enabled, run dumpfs and look at the list of enabled flags
at the end of the superblock details.
ncookies cannot be negative or the allocator will fail. This should only
happen if a caller is very broken but we can still try to survive the
event.
We should probably also verify for uio_resid > MAXPHYS but in that case
it is not clear that just clipping the ncookies value is an adequate
response.
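A minimal sketch of the kind of clipping meant here, written as a
userland function. The name sk_ncookies(), its parameters and the
"divide by the smallest entry size, plus one" estimate are assumptions
for illustration and are not taken from the committed code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate how many cookies a directory read of `resid` bytes may need.
 * A negative residual can only come from a broken caller, so it is
 * clipped to zero rather than passed on to the allocator.
 */
size_t
sk_ncookies(long resid, size_t min_entry_size)
{
	assert(min_entry_size > 0);
	if (resid < 0)
		resid = 0;
	return (size_t)resid / min_entry_size + 1;
}
```

The upper bound (uio_resid > MAXPHYS) is deliberately not handled in
this sketch, for the reason given above.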
MFC after: 2 weeks
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.
The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:
- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
never mark it dirty, thus do not write it out, and hence never
update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
the old check hash is still in the VM cache, but some other pages
are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
error to be printed.
The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.
The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.
- When a read completes we had the following sequence:
- bufdone()
-- b_ckhashcalc (calculates check hash)
-- bufdone_finish()
--- vfs_vmio_iodone() (replaces bogus pages with the cached ones)
- When we are reading a buffer where one or more pages are already
in memory (but not all pages, or we wouldn't be doing the read),
the I/O is done with bogus_page mapped in for the pages that exist
in the VM cache. This mapping is done to avoid corrupting the
cached pages if there is any I/O overrun. The vfs_vmio_iodone()
function is responsible for replacing the bogus_page(s) with the
cached ones. But we were calculating the check hash before the
bogus_page(s) were replaced. Hence, when we were calculating the
check hash, we were partly reading from bogus_page, which means
we calculated a bad check hash (e.g., because multiple pages have
been mapped to bogus_page, so its contents are indeterminate).
The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.
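The ordering problem can be modelled in a few lines of userland C. This
is only a sketch under stated assumptions: the struct, the toy hash and
the sk_ function names are invented here and do not mirror the real
bufdone()/vfs_vmio_iodone() code.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_NPAGES 2
#define SK_PAGESZ 4

struct sk_buf {
	const uint8_t *pages[SK_NPAGES];  /* pages currently mapped */
	const uint8_t *cached[SK_NPAGES]; /* valid VM-cache page, or NULL */
	uint32_t ckhash;
};

/* Toy hash over whatever pages are mapped right now. */
uint32_t
sk_hash(const struct sk_buf *bp)
{
	uint32_t h = 0;

	for (size_t i = 0; i < SK_NPAGES; i++)
		for (size_t j = 0; j < SK_PAGESZ; j++)
			h = h * 31u + bp->pages[i][j];
	return h;
}

/* Swap each bogus mapping for the cached page it was shielding. */
void
sk_replace_bogus(struct sk_buf *bp)
{
	for (size_t i = 0; i < SK_NPAGES; i++)
		if (bp->cached[i] != NULL)
			bp->pages[i] = bp->cached[i];
}

/* Old order: hash first, so bogus page contents leak into the hash. */
void
sk_bufdone_old(struct sk_buf *bp)
{
	bp->ckhash = sk_hash(bp);
	sk_replace_bogus(bp);
}

/* Fixed order: replace the bogus pages, then hash the real contents. */
void
sk_bufdone_fixed(struct sk_buf *bp)
{
	sk_replace_bogus(bp);
	bp->ckhash = sk_hash(bp);
}
```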
With these two changes, the occasional cylinder-group check-hash
errors are gone.
Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
Sanitize the values that will be assigned to ncookies so that we ensure
they are sane and we can handle them.
Leave ncookies signed as it was before r328346. The valid range is such
that unsigned values are not required and we are not able to avoid at
least one cast anyways.
Hinted by: bde
Specifically reading is done in ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which are then used in the various
filesystem utilities. This work is in preparation for adding
superblock check hashes.
No functional change intended.
Reviewed by: kib
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 2 weeks
Basic use of mallocarray to prevent overflows: static analyzers are also
likely to perform additional checks.
Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.
Reviewed by: mckusick
Basic use of mallocarray to prevent overflows. Here allocation is done
with M_NOWAIT so the code is prepared for the possibility of returning
NULL values. Since mallocarray expects unsigned parameters, unsign some
related variables to minimize sign conversions.
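For readers unfamiliar with the idea, an overflow-checked array
allocation can be sketched in userland as below. This is an analogue
only: sk_mallocarray() is a made-up name, it wraps malloc(3) rather
than malloc(9), and returning NULL on overflow is an assumption of this
sketch, not a statement about how the kernel's mallocarray(9) behaves.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for nmemb objects of `size` bytes each, refusing the
 * request when nmemb * size would not fit in a size_t.
 */
void *
sk_mallocarray(size_t nmemb, size_t size)
{
	if (size != 0 && nmemb > SIZE_MAX / size)
		return NULL; /* the multiplication would overflow */
	return malloc(nmemb * size);
}
```

Taking unsigned parameters means a negative count cannot be passed in
silently; it shows up as a huge value that the check rejects.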
Reviewed by: mckusick
The code accesses bp->b_dep without owning the ufs mount softdep lock,
which makes it possible for the dereferenced workitem to be freed in
parallel. In particular, the deallocate_dependencies(),
softdep_disk_io_initiation() and softdep_disk_write_complete() are
affected.
Move the code to safely calculate ump from the buffer with
dependencies into the helper softdep_bp_to_mp() and use it for all
found cases.
Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
handle_written_XXX() in case of processing the buffer with an error.
Tested by: pho (as part of the bigger patch)
Reviewed by: mckusick (as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
so that buf_complete() sees fully constructed buffer.
This is a NOP right now, but will be needed by the forthcoming SU change.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.
To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set its
VOP_PATHCONF to VOP_PANIC since it should no longer be used.
While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.
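The resulting rule for _PC_PIPE_BUF can be sketched as a small
stand-alone function. The enum, the function name and the constant are
hypothetical; a real VOP_PATHCONF method takes a vnode and handles many
more names, and the EINVAL return for other vnode types is an assumption
of this sketch.

```c
#include <errno.h>

enum sk_vtype { SK_VREG, SK_VDIR, SK_VFIFO };

#define SK_PIPE_BUF 512L /* placeholder value for this sketch */

/*
 * Report a pipe buffer size only for directories (which may contain
 * FIFOs) and for FIFOs themselves; reject the query for anything else.
 */
int
sk_pathconf_pipe_buf(enum sk_vtype type, long *retval)
{
	if (type == SK_VDIR || type == SK_VFIFO) {
		*retval = SK_PIPE_BUF;
		return 0;
	}
	return EINVAL;
}
```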
Discussed with: bde
Reviewed by: kib (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D12572
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
FFS performs asynchronous inode initialization, using a barrier write
to ensure that the inode block is written before the corresponding
cylinder group header update. Some GEOMs do not appear to handle
BIO_ORDERED correctly, meaning that the barrier write may not work as
intended. The sysctl allows one to work around this problem at the
cost of expensive file creation on new filesystems. The default
behaviour is unchanged.
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13428
When more than one thread enters ufsdirhash_create() for the same
directory and the inode dirhash is instantiated, but the dirhash's hash
is not, all of them lock the dirhash shared and then try to upgrade.
Since there are several threads owning the lock shared, upgrade fails
and the same attempt is repeated, ad infinitum.
To break the lockstep, lock the dirhash in exclusive mode after the
failed try-upgrade.
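Why the upgrade can never succeed while several sharers remain, and why
giving up the shared hold breaks the cycle, can be shown with a toy,
single-threaded lock model. Everything here is hypothetical: the struct
and sk_ functions are not the kernel's sx(9) interface, and the "try"
functions return 0 where a real lock would block.

```c
struct sk_lock {
	int sharers;   /* number of shared holders */
	int exclusive; /* 1 while held exclusively */
};

int
sk_lock_shared(struct sk_lock *l)
{
	if (l->exclusive)
		return 0;
	l->sharers++;
	return 1;
}

void
sk_unlock_shared(struct sk_lock *l)
{
	l->sharers--;
}

/* An upgrade only works for the sole shared holder. */
int
sk_try_upgrade(struct sk_lock *l)
{
	if (l->sharers != 1)
		return 0;
	l->sharers = 0;
	l->exclusive = 1;
	return 1;
}

/* Exclusive acquisition needs the lock to be completely free. */
int
sk_try_exclusive(struct sk_lock *l)
{
	if (l->sharers != 0 || l->exclusive)
		return 0;
	l->exclusive = 1;
	return 1;
}

void
sk_unlock_exclusive(struct sk_lock *l)
{
	l->exclusive = 0;
}
```

With two sharers, both upgrades fail for as long as both keep their
shared hold. Once one thread drops its hold and queues for the exclusive
lock instead, the other can proceed, and the first gets its turn after.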
Reported and tested by: pho
Sponsored by: Mellanox Technologies
MFC after: 1 week
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
No functional change intended.
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
When QUEUE_MACRO_DEBUG_TRASH is configured, the queue linkage fields
are trashed upon removal of the item, so be sure to only read them before
removing the item.
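The pattern is easy to show with a hand-rolled list that trashes the
link on removal. This is a sketch only: it does not use the queue(3)
macros, and the sk_ names and the trash value are made up to imitate
what QUEUE_MACRO_DEBUG_TRASH does.

```c
#include <stddef.h>
#include <stdint.h>

struct sk_item {
	struct sk_item *next;
	int val;
};

/* Poison value written into the link of a removed item. */
#define SK_TRASH ((struct sk_item *)(uintptr_t)-1)

/* Remove the head item and trash its linkage field. */
void
sk_remove_head(struct sk_item **head)
{
	struct sk_item *it = *head;

	*head = it->next;
	it->next = SK_TRASH;
}

/* Drain the list, reading each link before the item is removed. */
int
sk_drain_sum(struct sk_item **head)
{
	struct sk_item *it, *next;
	int sum = 0;

	for (it = *head; it != NULL; it = next) {
		next = it->next; /* must happen before removal */
		sk_remove_head(head);
		sum += it->val;
	}
	return sum;
}
```

Reading it->next after sk_remove_head() would follow the trash pointer
instead of reaching the next item.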
No functional change intended.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Normally wakeup()s are performed for completed softupdates work items
in workitem_free() before the underlying memory is free()'d.
complete_jseg() was clearing the "wakeup needed" flag in work items to
defer the wakeup until the end of each loop iteration. However, this
resulted in the item being free'd before its address was used with
wakeup(). As a result, another part of the kernel could allocate this
memory from malloc() and use it as a wait channel for a different
"event" with a different lock. This triggered an assertion failure
when the lock passed to sleepq_add() did not match the existing lock
associated with the sleep queue. Fix this by removing the code to
defer the wakeup in complete_jseg() allowing the wakeup to occur
slightly earlier in workitem_free() before free() is called.
The main reason I can think of for deferring a wakeup() would be to
avoid waking up a waiter while holding a lock that the waiter would
need. However, no locks are dropped in between the wakeup() in
workitem_free() and the end of the loop in complete_jseg() as far as I
can tell.
In general I think it is not safe to do a wakeup() after free() as one
cannot control how other parts of the kernel that might reuse the
address for a different wait channel will handle spurious wakeups.
Reported by: pho
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12494
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.
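As a rough illustration of the scheme, the sketch below computes a
CRC-32C over a toy cylinder-group structure with its check-hash field
treated as zero, then uses the result to decide whether the group may
be used. The struct layout and sk_ names are invented, and the seed and
final-inversion convention shown is the common CRC-32C one, which may
not match what the kernel's crc32c routine uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
sk_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	while (len-- > 0) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Toy cylinder group: no padding, hash stored inside the structure. */
struct sk_cg {
	uint32_t magic;
	uint32_t ckhash;
	uint8_t  map[24];
};

/* Hash the group as if its own check-hash field were zero. */
uint32_t
sk_cg_hash(const struct sk_cg *cg)
{
	struct sk_cg tmp = *cg;

	tmp.ckhash = 0;
	return sk_crc32c(&tmp, sizeof(tmp));
}

/* Allocation from a group is only attempted when the hash matches. */
int
sk_cg_usable(const struct sk_cg *cg)
{
	return cg->ckhash == sk_cg_hash(cg);
}
```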
Check hashes are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.
Specifics of the changes:
sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.
sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
function and add flags and a function pointer to a check-hash
function.
sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.
sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.
sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.
sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.
sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.
Four utilities are affected:
sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.
sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.
sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)
sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)
sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.
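The "hash only when the data really came from disk" mechanism described
under sys/kern/vfs_bio.c above can be modelled as follows. This is a
userland sketch with invented names; it does not reproduce the
breadn_flags() signature, and the has_ckhash field merely stands in for
the B_CKHASH flag.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sk_ckhash_fn)(const uint8_t *data, size_t len);

struct sk_buf {
	int      in_cache;   /* contents already valid in memory */
	int      has_ckhash; /* stands in for the B_CKHASH flag */
	uint32_t ckhash;     /* stands in for b_ckhash */
	uint8_t  data[16];
};

/* Trivial stand-in for a real check-hash function. */
uint32_t
sk_sum(const uint8_t *data, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += data[i];
	return s;
}

/*
 * Return the buffer contents, calling the hash function only when an
 * actual (here simulated) read from disk was needed.
 */
void
sk_bread(struct sk_buf *bp, const uint8_t *disk, sk_ckhash_fn fn)
{
	bp->has_ckhash = 0;
	if (bp->in_cache)
		return; /* cache hit: no I/O, so no hash is computed */
	for (size_t i = 0; i < sizeof(bp->data); i++)
		bp->data[i] = disk[i];
	bp->in_cache = 1;
	if (fn != NULL) {
		bp->ckhash = fn(bp->data, sizeof(bp->data));
		bp->has_ckhash = 1;
	}
}
```

A caller such as the ffs_getcg path described above would then compare
the stored hash with the computed one only when has_ckhash is set.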
Reviewed by: kib
Tested by: Peter Holm (pho)
ino64 expanded nlink_t to 64 bits, but the on-disk format for UFS is still
limited to 16 bits. This is a nop currently but will matter if LINK_MAX is
increased in the future.
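The point can be illustrated with a link-count check that uses a
filesystem-specific limit rather than the width of nlink_t. The names
and the limit value below are assumptions for this sketch and are not
taken from the UFS sources.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-filesystem limit that fits the 16-bit on-disk field. */
#define SK_UFS_LINK_MAX 32767

/*
 * Add one link to an in-memory 64-bit count, refusing to exceed what
 * the on-disk format can store.
 */
int
sk_add_link(uint64_t *nlink)
{
	if (*nlink >= SK_UFS_LINK_MAX)
		return EMLINK;
	(*nlink)++;
	return 0;
}
```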
Reviewed by: kib
Sponsored by: Chelsio Communications
superblocks created in revision 322297 only works on disks
with sector sizes up to 4K. This update allows the recovery
information to be created by newfs and used by fsck on disks
with sector sizes up to 64K. Note that FFS currently limits
filesystems to being mounted from disks with up to 8K sectors.
Expanding this limitation will be the subject of another
commit.
Reported by: Peter Holm
Reviewed with: kib
vnode lock.
Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 10 days
softdep_count_dependencies().
Buffer's b_dep list is protected by the SU mount lock. Owning the
buffer lock is not enough to guarantee the stability of the list.
Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime. To get to ump, use the pointers chain which does not
involve workitems at all.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
unable to automatically find alternate superblocks. This checkin
places the information needed to find alternate superblocks to the
end of the area reserved for the boot block.
Filesystems created with a newfs of this vintage or later will
create the recovery information. If you have a filesystem created
prior to this change and wish to have a recovery block created for
your filesystem, you can do so by running fsck in foreground mode
(i.e., do not use the -p or -y options). As it starts, fsck will
ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should
answer yes.
Discussed with: kib, imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11589
Update the use of the B_CACHE flag (since the May 1999 commit
that made it the correct test here).
Reported by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
For freshly allocated snapdata, lock sn_lock in advance, so that
si_snapdata readers see the locked snapdata and do not race.
For existing snapdata, if the thread was put to sleep waiting for
sn_lock, re-read si_snapdata. This either closes the race or makes
the reliance on LK_DRAIN less important.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
It is possible for ffs_snapblkfree() to race and lock snaplock while
the devvp snapdata is instantiated, but no snapshots exist. In this
case the loop over snapshots in ffs_snapblkfree() is not executed, and
the local variable vp is left initialized to NULL.
Unlock using &sn->sn_lock and not vp->v_vnlock. For the inodes on the
snapshot list, the locks are the same.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
in ffs_snapremove().
Apparently ffs_snapremove() may be called with the snap lock recursed,
at least one trace demonstrated this when snapshot vnode was unlinked
while synced. It was inactivated from the syncer thread.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values. vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits. Filesystems can still
explicitly override individual settings.
PR: 219851
Reported by: cem
Reviewed by: cem, kib, ngie
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11541
group. Change all code points that open-coded this functionality
to use the new function. This commit is a refactoring with no
change in functionality.
In the future this change allows more robust checking of cylinder
group reads along the lines discussed in the hardening UFS session
at BSDCan (retry I/O, add checksums, etc). For more detail see the
session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS
Reviewed by: kib
host.
Problems start appearing when there are several threads all doing
operations on a UFS volume and the SU workqueue needs a cleanup. It is
possible that each thread calling softdep_request_cleanup() owns the
lock for some dirty vnode (e.g. all of them are executing mkdir(2),
mknod(2), creat(2) etc) and all vnodes which must be flushed are locked
by corresponding thread. Then, we get all the threads simultaneously
entering softdep_request_cleanup().
There are two problems:
- Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due
to the locking, they quickly start executing 'in phase' with the speed
of the slowest thread.
- Since each thread already owns the lock for a dirty vnode, the other
threads' non-blocking attempts to lock that vnode fail, and the loops
keep executing without making progress.
Retry logic does not allow the situation to recover. The result is
a livelock.
Fix these problems by making the following changes:
- Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp.
A new flag FLUSH_RC_ACTIVE guards the loop.
- If there were failed locking attempts during the loop, abort retry
even if there are still work items on the mp work list. An
assumption is that the items will be cleaned when another thread
either fsyncs its vnode, or unlocks it and allows yet another thread
to make progress.
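The first change amounts to a per-mount "only one at a time" guard. A
minimal sketch, assuming the caller already holds whatever lock
protects the flags word; the struct, the bit value and the function
names are hypothetical, and only FLUSH_RC_ACTIVE is named in the text.

```c
struct sk_ump {
	unsigned flags; /* assumed to be protected by the caller's lock */
};

#define SK_FLUSH_RC_ACTIVE 0x1u /* stands in for FLUSH_RC_ACTIVE */

/* Returns 1 if this thread may run the vnode loop, 0 if one already is. */
int
sk_cleanup_enter(struct sk_ump *ump)
{
	if ((ump->flags & SK_FLUSH_RC_ACTIVE) != 0)
		return 0;
	ump->flags |= SK_FLUSH_RC_ACTIVE;
	return 1;
}

/* Called by the thread that ran the loop once it is finished. */
void
sk_cleanup_leave(struct sk_ump *ump)
{
	ump->flags &= ~SK_FLUSH_RC_ACTIVE;
}
```

Threads that are turned away simply skip the loop instead of spinning
in step with the thread that is already running it.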
It is possible now that some calls would get undeserved ENOSPC from
ffs_alloc(), because the cleanup is not aggressive enough. But I do
not see how we can reliably clean up workitems if calling
softdep_request_cleanup() while still owning the vnode lock. I thought
about scheme where ffs_alloc() returns ERESTART and saves the retry
counter somewhere in struct thread, to return to the top level, unlock
the vnode and retry. But IMO the very rare (and unproven) spurious
ENOSPC is not worth the complications.
Reported and tested by: pho
Style and comments by: mckusick
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the last step of ffs_unmount().
It is possible that the mount point is recorded for cleanup in AST
context while softdep flush is executed during unmount. The workitems
are flushed by other means for the unmount, but the stray reference to
struct mount blocks destruction of mount. Check for the situation and
manually call vfs_rel() before returning from ffs_unmount().
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week