The cookies argument is only used by the NFS server. NFSv2 defines the
cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits. Our
VOP_READDIR, however, has always defined it as u_long, which is 32 bits
on some architectures. Change it to 64 bits on all architectures. This
doesn't matter for any in-tree file systems, but it matters for some
FUSE file systems that use 64-bit directory cookies.
PR: 260375
Reviewed by: rmacklem
Differential Revision: https://reviews.freebsd.org/D33404
The mntfs vnode lock should be before topology, as established in
ffs_mountfs(). Extend the locked region in ffs_unmount().
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33013
Normally a UFS/FFS filesystem with a bad check hash can only be
mounted read only. With this commit the mount(8) -f (force) option
can be used to force a read-write mount of a UFS/FFS filesystem with
a bad check hash. Conveniently the filesystem will proceed to
update its on-disk superblock with a corrected check hash.
Sponsored by: Netflix
When reading UFS/FFS superblocks that have check hashes, both the kernel
and libufs print an error message if the check hash is incorrect. This
commit adds the ability to request that the error message not be made.
It is intended for use by programs like fsck that wants to print its
own error message and by kernel subsystems like glabel that just wants
to check for possible filesystem types.
This capability will be used in followup commits.
Sponsored by: Netflix
The STDSB macro is passed to the ffs_sbget() routine to fetch a
UFS/FFS superblock "from the stadard place". It was identically defined
in lib/libufs/libufs.h, stand/libsa/ufs.c, sys/ufs/ffs/ffs_extern.h,
and sys/ufs/ffs/ffs_subr.c. Delete it from these four files and
define it instead in sys/ufs/ffs/fs.h. All existing uses of this macro
already include sys/ufs/ffs/fs.h so no include changes need to be made.
No functional change intended.
Sponsored by: Netflix
It is not, and the lock is not needed there
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761
Require devvp locked for mntfs_freevp(), to have it locked around
vgone(). Make that true for ffs, which is the only consumer of
the interface.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761
Mark variables as __diagused for invariant-only vars
Reviewed by: imp, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32577
These began to become obsolete in d6d64f0f2c (r137739) and the deal
was later sealed in 003e18aef4 (r137801) when vfs.fifofs.fops was
dropped and vop-bypass for pipes became mandatory.
PR: 225934
Suggested by: markj
Reviewe by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32270
The fsckpid mount option was introduced in 927a12ae16 along with a
couple sysctl's to support SU+J with snapshots. However, those sysctl's
were never used and eventually removed in f2620e9ceb.
There are no in-tree consumers of this mount option.
Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D32015
while using a UFS snapshot.
The UFS filesystem supports snapshots. Each snapshot is a file whose
contents are a frozen image of the disk partition on which the filesystem
resides. Each time an existing block in the filesystem is modified,
the filesystem checks whether that block was in use at the time that
the snapshot was taken. If so, and if it has not already been copied,
a new block is allocated from among the blocks that were not in use
at the time that the snapshot was taken and placed in the snapshot file
to replace the entry that has not yet been copied. The previous contents
of the block are copied to the newly allocated snapshot file block,
and the write to the original is then allowed to proceed.
The block allocation is done using the usual UFS_BALLOC() routine
which allocates the needed block in the snapshot and returns a
buffer that is set up to write data into the newly allocated block.
In usual filesystem operation, the contents for the new block is
copied from user space into the buffer and the buffer is then written
to the file using bwrite(), bawrite(), or bdwrite(). In the case of a
snapshot the new block must be filled from the disk block that is about
to be rewritten. The snapshot routine has a function readblock() that
it uses to read the `about to be rewritten' disk block.
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct bio *bip;
struct fs *fs;
ip = VTOI(snapvp);
fs = ITOFS(ip);
bip = g_alloc_bio();
bip->bio_cmd = BIO_READ;
bip->bio_offset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
bip->bio_data = bp->b_data;
bip->bio_length = bp->b_bcount;
bip->bio_done = NULL;
g_io_request(bip, ITODEVVP(ip)->v_bufobj.bo_private);
bp->b_error = biowait(bip, "snaprdb");
g_destroy_bio(bip);
return (bp->b_error);
}
When the underlying disk fails, its GEOM module is removed.
Subsequent attempts to access it should return the ENXIO error.
The functionality of checking for the lost disk and returning
ENXIO is handled by the g_vfs_strategy() routine:
void
g_vfs_strategy(struct bufobj *bo, struct buf *bp)
{
struct g_vfs_softc *sc;
struct g_consumer *cp;
struct bio *bip;
cp = bo->bo_private;
sc = cp->geom->softc;
/*
* If the provider has orphaned us, just return ENXIO.
*/
mtx_lock(&sc->sc_mtx);
if (sc->sc_orphaned || sc->sc_enxio_active) {
mtx_unlock(&sc->sc_mtx);
bp->b_error = ENXIO;
bp->b_ioflags |= BIO_ERROR;
bufdone(bp);
return;
}
sc->sc_active++;
mtx_unlock(&sc->sc_mtx);
bip = g_alloc_bio();
bip->bio_cmd = bp->b_iocmd;
bip->bio_offset = bp->b_iooffset;
bip->bio_length = bp->b_bcount;
bdata2bio(bp, bip);
if ((bp->b_flags & B_BARRIER) != 0) {
bip->bio_flags |= BIO_ORDERED;
bp->b_flags &= ~B_BARRIER;
}
if (bp->b_iocmd == BIO_SPEEDUP)
bip->bio_flags |= bp->b_ioflags;
bip->bio_done = g_vfs_done;
bip->bio_caller2 = bp;
g_io_request(bip, cp);
}
Only after checking that the device is present does it construct
the "bio" request and call g_io_request(). When readblock()
constructs its own "bio" request and calls g_io_request() directly
it panics with "consumer not attached in g_io_request" when the
underlying device no longer exists.
The fix is to have readblock() call g_vfs_strategy() rather than
constructing its own "bio" request:
/*
* Read the specified block into the given buffer.
*/
static int
readblock(snapvp, bp, lbn)
struct vnode *snapvp;
struct buf *bp;
ufs2_daddr_t lbn;
{
struct inode *ip;
struct fs *fs;
ip = VTOI(snapvp);
fs = ITOFS(ip);
bp->b_iocmd = BIO_READ;
bp->b_iooffset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
bp->b_iodone = bdone;
g_vfs_strategy(&ITODEVVP(ip)->v_bufobj, bp);
bufwait(bp);
return (bp->b_error);
}
Here it uses the buffer that will eventually be written to the disk.
The g_vfs_strategy() routine uses four parts of the buffer: b_bcount,
b_iocmd, b_iooffset, and b_data.
The b_bcount field is already correctly set for the buffer. It is
safe to set the b_iocmd and b_iooffset fields as they are set
correctly when the later write is done. The write path will also
clear the B_DONE flag that our use of the buffer will set. The
b_iodone callback has to be set to bdone() which will do just
notification that the I/O is done in bufdone(). The rest of
bufdone() includes things like processing the softdeps associated
with the buffer should not be done until the buffer has been
written. Bufdone() will set b_iodone back to NULL after using it,
so the full bufdone() processing will be done when the buffer is
written. The final change from the previous version of readblock()
is that it used the b_data for the destination of the read while
g_vfs_strategy() uses the bdata2bio() function to take advantage
of VMIO when it is available.
Differential revision: https://reviews.freebsd.org/D32150
Reviewed by: kib, chs
MFC after: 1 week
Sponsored by: Netflix
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of creating a
new UFS snapshot, it has to have its individual vnode lock replaced
with the filesystem's snapshot lock.
The lock order for regular vnodes with respect to buffer locks is that
they must first acquire the vnode lock, then a buffer lock. The order
for the snapshot lock is reversed: a buffer lock must be acquired before
the snapshot lock.
When creating a new snapshot, the snapshot file must retain its vnode
lock until it has allocated all the blocks that it needs before
switching to the snapshot lock. This update moves one final piece of
the initial snapshot block allocation so that it is done before the
newly created snapshot is switched to use the snapshot lock.
Reported by: Witness code
MFC after: 1 week
Sponsored by: Netflix
Instead do protective check for the local flags and do not interpret
EBUSY specially if we did not request trylock mode for bread().
Reviewed by: mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Inode type could migrate between snapshot and regular types while the
vnode is unlocked. Recalculate flags specific for snapshot after relock.
Reviewed by: mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
ufs_rename() calls ufs_checkpath() to ensure that the target directory
is not a child of the source. If not, rename would create a loop.
For instance:
source->X1->X2->target
and if source moved under target, we get corrupted filesystem.
Suppose that we initially have
source->X1 .... and X2->target
where X1 is not on path from root to X2. Then ufs_checkpath() accepts
the inodes, but there is nothing preventing parallel rename of X2 to become
under X1, after checkpath finished.
Ensure stability of ufs_checkpath() result by taking a per-mount sx in
ufs_rename right before ufs_checkpath() and till the end.
Reviewed by: chs, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
During forcible unmount after a disk failure there is a bug that
causes one or more indirdep dependency structures to fail to be
deallocated. Until we manage to track down why they fail to get
cleaned up, this code tracks them down and eliminates them so that
the unmount can succeed.
Reported by: Peter Holm
Help from: kib
Reviewed by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix
The soft updates diagnotic code keeps a list for each type of soft
update dependency. When a new block is allocated for a file it is
initially tracked by a "newblk" dependency. The "newblk" dependency
eventually becomes either an "allocdirect" dependency or an "indiralloc"
dependency. The diagnotic code failed to move the "newblk" from the list
of "newblk"s to its new type list.
No functional change intended.
Reviewed by: Chuck Silvers (as part of a larger change)
Tested by: Peter Holm (as part of a larger change)
Sponsored by: Netflix
Now that dounmount() supports a dedicated taskqueue, we can simply call
it with MNT_DEFERRED directly from the failing context. This also
avoids blocking taskqueue_thread with a potentially-expensive unmount
operation.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.
This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.
To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
This effectively causes syncing of the mount point from softdep_prealloc(),
softdep_prerename(), and softdep_prelink(). Typically it avoids the need
for journal suspension at this point, at all.
Suggested and reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
We call into softdep_prerename() and softdep_prelink() when there is
low free space in the journal. Functions sync all vnodes participating
in the VOP, in the hope that this would reduce journal utilization. But
if the vnodes are already synced, doing sync would only spend writes,
journal is filled not due to the records from modifications of our
vnodes.
Remember original seqc numbers for vnodes, and only initiate syncs when
seqc changed.
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
so call it only in SU+J case
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
Make sure we return an error if no input was specified, since
SYSCTL_IN() will report success in that case.
Reported by: KMSAN
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30586
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9). The implementation may then indicate to the caller
whether it needed to unbusy the mount.
Also, add stbool.h to libprocstat modules which #define _KERNEL
before including sys/mount.h. Otherwise they'll pull in sys/types.h
before defining _KERNEL and therefore won't have the bool definition
they need for mp_busy.
Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30556
Parts of libprocstat like to pretend they're kernel components for the
sake of including mount.h, and including sys/types.h in the _KERNEL
case doesn't fix the build for some reason. Revert both the
VFS_QUOTACTL() change and the follow-up "fix" for now.
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9). The implementation may then indicate to the caller
whether it needed to unbusy the mount.
Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30218
At this point the directory's vnode lock is held, so blocking while
waiting for free pages makes the system more susceptible to deadlock in
low memory conditions. This is particularly problematic on NUMA systems
as UMA currently implements a strict first-touch policy.
ufsdirhash_build() already uses M_NOWAIT for other allocations and
already handled failures for the block array allocation, so just convert
to M_NOWAIT.
PR: 253992
Reviewed by: markj, mckusick, vangyzen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29045
The original filesystem release (4.2BSD) had no embedded sysmlinks.
Historically symbolic links were just a different type of file, so
the content of the symbolic link was contained in a single disk block
fragment. We observed that most symbolic links were short enough that
they could fit in the area of the inode that normally holds the block
pointers. So we created embedded symlinks where the content of the
link was held in the inode's pointer area thus avoiding the need to
seek and read a data fragment and reducing the pressure on the block
cache. At the time we had only UFS1 with 32-bit block pointers,
so the test for a fastlink was:
di_size < (NDADDR + NIADDR) * sizeof(daddr_t)
(where daddr_t would be ufs1_daddr_t today).
When embedded symlinks were added, a spare field in the superblock
with a known zero value became fs_maxsymlinklen. New filesystems
set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded
symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus
filesystems that preceeded this change always read from blocks
(since fs->fs_maxsymlinklen == 0) and newer ones used embedded
symlinks if they fit. Similarly symlinks created on pre-embedded
symlink filesystems always spill into blocks while newer ones will
embed if they fit.
At the same time that the embedded symbolic links were added, the
on-disk directory structure was changed splitting the former
u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen.
Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can
be used to distinguish old directory formats. In retrospect that
should have just been an added flag, but we did not realize we
needed to know about that change until it was already in production.
Code was split into ufs/ffs so that the log structured filesystem could
use ufs functionality while doing its own disk layout. This meant
that no ffs superblock fields could be used in the ufs code. Thus
ffs superblock fields that were needed in ufs code had to be copied
to fields in the mount structure. Since ufs_readlink needed to know
if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen.
The kernel panic that arose to making this fix was triggered when a
disk error created an inode of type symlink with no allocated data
blocks but a large size. When readlink was called the uiomove was
attempted which segment faulted.
static int
ufs_readlink(ap)
struct vop_readlink_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
} */ *ap;
{
struct vnode *vp = ap->a_vp;
struct inode *ip = VTOI(vp);
doff_t isize;
isize = ip->i_size;
if ((isize < vp->v_mount->mnt_maxsymlinklen) ||
DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
return (uiomove(SHORTLINK(ip), isize, ap->a_uio));
}
return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred));
}
The second part of the "if" statement that adds
DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
is problematic. It never appeared in BSD released by Berkeley because
as noted above mnt_maxsymlinklen is 0 for old format filesystems, so
will always fall through to the VOP_READ as it should. I had to dig
back through `git blame' to find that Rodney Grimes added it as
part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.''
He must have brought it across from an earlier FreeBSD. Unfortunately
the source-control logs for FreeBSD up to the merger with the
AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the
agreement to let FreeBSD remain unencumbered, so I cannot pin-point
where that line got added on the FreeBSD side.
The one change needed here is that mnt_maxsymlinklen is declared as
an `int' and should be changed to be `u_int64_t'.
This discovery led us to check out the code that deletes symbolic
links. Specifically
if (vp->v_type == VLNK &&
(ip->i_size < vp->v_mount->mnt_maxsymlinklen ||
datablocks == 0)) {
if (length != 0)
panic("ffs_truncate: partial truncate of symlink");
bzero(SHORTLINK(ip), (u_int)ip->i_size);
ip->i_size = 0;
DIP_SET(ip, i_size, 0);
UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE);
if (needextclean)
goto extclean;
return (ffs_update(vp, waitforupdate));
}
Here too our broken symlink inode with no data blocks allocated
and a large size will segment fault as we are incorrectly using the
test that we have no data blocks to decide that it is an embdedded
symbolic link and attempting to bzero past the end of the inode.
The test for datablocks == 0 is unnecessary as the test for
ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right
thing in all cases.
The test for datablocks == 0 was added by David Greenman in this commit:
Author: David Greenman <dg@FreeBSD.org>
Date: Tue Aug 2 13:51:05 1994 +0000
Completed (hopefully) the kernel support for old style "fastlinks".
Notes:
svn path=/head/; revision=1821
I am guessing that he likely earlier added the incorrect test in the
ufs_readlink code.
I asked David if he had any recollection of why he made this change.
Amazingly, he still had a recollection of why he had made a one-line
change more than twenty years ago. And unsurpisingly it was because
he had been stuck between a rock and a hard place.
FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code
base. Prior to that, there were three years of development in all
areas of the kernel, including the filesystem code, from the combined
set of people including Bill Jolitz, Patchkit contributors, and
FreeBSD Project members. The compatibility issue at hand was caused
by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite
changes David had to find a way to provide compatibility with both
the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite.
He felt that these changes would provide compatibility with both systems.
In his words:
``My recollection is that the 'FASTLINKS' symlinks support in
FreeBSD-1.x, as implemented by Curt Mayer, worked differently than
4.4BSD. He used a spare field in the inode to duplicately store the
length. When the 4.4BSD-Lite merge was done, the optimized symlinks
support for existing filesystems (those that were initialized in
FreeBSD-1.x) were broken due to the FFS on-disk structure of
4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to
restore the backward compatibility with FreeBSD-1.x filesystems.
I think it was the best that could be done in the somewhat urgent
circumstances of the post Berkeley-USL settlement. Also, regarding
Rod's massive commit with little explanation, some context: John
Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to
the 386 platform in just 10 days. It was by far the most intense
hacking effort of my life. In addition to the porting of tons of
FreeBSD-1 code, I think we wrote more than 30,000 lines of new code
in that time to deal with the missing pieces and architectural
changes of 4.4BSD-Lite. We didn't make many notes along the way.
There was a lot of pressure to get something out to the rest of the
developer community as fast as possible, so detailed discrete commits
didn't happen - it all came as a giant wad, which is why Rod's
commit message was worded the way it was.''
Reported by: Chuck Silvers
Tested by: Chuck Silvers
History by: David Greenman Lawrence
MFC after: 1 week
Sponsored by: Netflix
The trunc_dependencies() issue was reported by Alexander Lochmann
<alexander.lochmann@tu-dortmund.de>, who found the problem by performing
lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When quotas are enabled with the quotaon(8) command, it sets the
MNT_QUOTA flag in the mount structure mnt_flag field. The mount
structure holds a cached copy of the filesystem statfs structure
in mnt_stat that includes a copy of the mnt_flag field in
mnt_stat.f_flags. The mnt_stat structure may not be updated for
hours. Since the mount command requests mount details using the
MNT_NOWAIT option, it gets the mount's mnt_stat statfs structure
whose f_flags field does not yet show the MNT_QUOTA flag being set
in mnt_flag.
The fix is to have quotaon(8) set the MNT_QUOTA flag in both mnt_flag
and in mnt_stat.f_flags so that it will be immediately visible to
callers of statfs(2).
Reported by: Christos Chatzaras
Tested by: Christos Chatzaras
PR: 254682
MFC after: 3 days
Sponsored by: Netflix