freebsd-dev

Author	SHA1	Message	Date
Alan Somers	b214fcceac	Change VOP_READDIR's cookies argument to a **uint64_t The cookies argument is only used by the NFS server. NFSv2 defines the cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits. Our VOP_READDIR, however, has always defined it as u_long, which is 32 bits on some architectures. Change it to 64 bits on all architectures. This doesn't matter for any in-tree file systems, but it matters for some FUSE file systems that use 64-bit directory cookies. PR: 260375 Reviewed by: rmacklem Differential Revision: https://reviews.freebsd.org/D33404	2021-12-15 20:54:57 -07:00
Gordon Bergling	f9af3151fa	Revert "ffs(3): Fix a typo in a sysctl description" It should be - s/contigous/contiguous/ not continuous Reported by: tuexen@ This reverts commit `42efe994ec`.	2021-12-05 13:45:47 +01:00
Gordon Bergling	42efe994ec	ffs(3): Fix a typo in a sysctl description - s/contigous/continuous/ MFC after: 3 days	2021-12-04 12:15:34 +01:00
Mateusz Guzik	7e1d3eefd4	vfs: remove the unused thread argument from NDINIT* See `b4a58fbf64` ("vfs: remove cn_thread") Bump __FreeBSD_version to 1400043.	2021-11-25 22:50:42 +00:00
Gordon Bergling	bebff61587	ffs_softdep: Fix a typo in a source code comment - s/conditonally/conditionally/ MFC after: 3 days	2021-11-19 19:17:41 +01:00
Konstantin Belousov	c34a5148e8	ffs: fix newly introduced LOR between mntfs vnode lock and topology lock The mntfs vnode lock should be before topology, as established in ffs_mountfs(). Extend the locked region in ffs_unmount(). Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D33013	2021-11-16 20:01:31 +02:00
Kirk McKusick	9b8eb1c5b6	Followup to `f2b391528a` to improve printed message. Sponsored by: Netflix	2021-11-15 16:10:02 -08:00
Kirk McKusick	9e9dcac95a	Allow forced r/w mount of UFS/FFS filesystem with a bad check hash. Normally a UFS/FFS filesystem with a bad check hash can only be mounted read only. With this commit the mount(8) -f (force) option can be used to force a read-write mount of a UFS/FFS filesystem with a bad check hash. Conveniently the filesystem will proceed to update its on-disk superblock with a corrected check hash. Sponsored by: Netflix	2021-11-15 16:03:47 -08:00
Kirk McKusick	f2b391528a	Add ability to suppress UFS/FFS superblock check-hash failure messages. When reading UFS/FFS superblocks that have check hashes, both the kernel and libufs print an error message if the check hash is incorrect. This commit adds the ability to request that the error message not be printed. It is intended for use by programs like fsck that want to print their own error message and by kernel subsystems like glabel that just want to check for possible filesystem types. This capability will be used in followup commits. Sponsored by: Netflix	2021-11-15 09:11:54 -08:00
Kirk McKusick	b366ee4868	Consolidate four copies of the STDSB define into a single place. The STDSB macro is passed to the ffs_sbget() routine to fetch a UFS/FFS superblock "from the standard place". It was identically defined in lib/libufs/libufs.h, stand/libsa/ufs.c, sys/ufs/ffs/ffs_extern.h, and sys/ufs/ffs/ffs_subr.c. Delete it from these four files and define it instead in sys/ufs/ffs/fs.h. All existing uses of this macro already include sys/ufs/ffs/fs.h so no include changes need to be made. No functional change intended. Sponsored by: Netflix	2021-11-14 22:10:16 -08:00
Konstantin Belousov	eede22d66d	ffs_snapshot: do not assert that um_devvp is locked It is not, and the lock is not needed there Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761	2021-11-13 01:00:54 +02:00
Konstantin Belousov	25809a018d	mntfs: lock mntfs pseudo devfs vnode properly Require devvp locked for mntfs_freevp(), to have it locked around vgone(). Make that true for ffs, which is the only consumer of the interface. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761	2021-11-13 01:00:41 +02:00
Konstantin Belousov	76b05e3e39	ffs: Remove assertions about locked um_devvp in several places Namely, ffs_blkfree_cg(), and ffs_flushfiles(). Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761	2021-11-13 01:00:33 +02:00
Konstantin Belousov	2030ee0e1b	ufs: remove write-only variables Mark variables as __diagused for invariant-only vars Reviewed by: imp, mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32577	2021-10-21 21:40:46 +03:00
Mateusz Guzik	b4a58fbf64	vfs: remove cn_thread It is always curthread. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D32453	2021-10-11 13:21:47 +00:00
Kyle Evans	6b88668f0b	vfs: remove dead fifoop VOP_KQFILTER implementations These began to become obsolete in `d6d64f0f2c` (r137739) and the deal was later sealed in `003e18aef4` (r137801) when vfs.fifofs.fops was dropped and vop-bypass for pipes became mandatory. PR: 225934 Suggested by: markj Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D32270	2021-10-03 01:02:51 -05:00
Robert Wing	9acea16404	ffs: retire unused fsckpid mount option The fsckpid mount option was introduced in `927a12ae16` along with a couple sysctl's to support SU+J with snapshots. However, those sysctl's were never used and eventually removed in `f2620e9ceb`. There are no in-tree consumers of this mount option. Reviewed by: mckusick, kib Differential Revision: https://reviews.freebsd.org/D32015	2021-10-02 15:11:40 -08:00
Kirk McKusick	4a365e863f	Avoid "consumer not attached in g_io_request" panic when disk lost while using a UFS snapshot. The UFS filesystem supports snapshots. Each snapshot is a file whose contents are a frozen image of the disk partition on which the filesystem resides. Each time an existing block in the filesystem is modified, the filesystem checks whether that block was in use at the time that the snapshot was taken. If so, and if it has not already been copied, a new block is allocated from among the blocks that were not in use at the time that the snapshot was taken and placed in the snapshot file to replace the entry that has not yet been copied. The previous contents of the block are copied to the newly allocated snapshot file block, and the write to the original is then allowed to proceed. The block allocation is done using the usual UFS_BALLOC() routine which allocates the needed block in the snapshot and returns a buffer that is set up to write data into the newly allocated block. In usual filesystem operation, the contents for the new block are copied from user space into the buffer and the buffer is then written to the file using bwrite(), bawrite(), or bdwrite(). In the case of a snapshot the new block must be filled from the disk block that is about to be rewritten. The snapshot routine has a function readblock() that it uses to read the `about to be rewritten' disk block. /* Read the specified block into the given buffer. */ static int readblock(snapvp, bp, lbn) struct vnode *snapvp; struct buf *bp; ufs2_daddr_t lbn; { struct inode *ip; struct bio *bip; struct fs *fs; ip = VTOI(snapvp); fs = ITOFS(ip); bip = g_alloc_bio(); bip->bio_cmd = BIO_READ; bip->bio_offset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn))); bip->bio_data = bp->b_data; bip->bio_length = bp->b_bcount; bip->bio_done = NULL; g_io_request(bip, ITODEVVP(ip)->v_bufobj.bo_private); bp->b_error = biowait(bip, "snaprdb"); g_destroy_bio(bip); return (bp->b_error); } When the underlying disk fails, its GEOM module is removed. Subsequent attempts to access it should return the ENXIO error. The functionality of checking for the lost disk and returning ENXIO is handled by the g_vfs_strategy() routine: void g_vfs_strategy(struct bufobj *bo, struct buf *bp) { struct g_vfs_softc *sc; struct g_consumer *cp; struct bio *bip; cp = bo->bo_private; sc = cp->geom->softc; /* If the provider has orphaned us, just return ENXIO. */ mtx_lock(&sc->sc_mtx); if (sc->sc_orphaned || sc->sc_enxio_active) { mtx_unlock(&sc->sc_mtx); bp->b_error = ENXIO; bp->b_ioflags |= BIO_ERROR; bufdone(bp); return; } sc->sc_active++; mtx_unlock(&sc->sc_mtx); bip = g_alloc_bio(); bip->bio_cmd = bp->b_iocmd; bip->bio_offset = bp->b_iooffset; bip->bio_length = bp->b_bcount; bdata2bio(bp, bip); if ((bp->b_flags & B_BARRIER) != 0) { bip->bio_flags |= BIO_ORDERED; bp->b_flags &= ~B_BARRIER; } if (bp->b_iocmd == BIO_SPEEDUP) bip->bio_flags |= bp->b_ioflags; bip->bio_done = g_vfs_done; bip->bio_caller2 = bp; g_io_request(bip, cp); } Only after checking that the device is present does it construct the "bio" request and call g_io_request(). When readblock() constructs its own "bio" request and calls g_io_request() directly it panics with "consumer not attached in g_io_request" when the underlying device no longer exists. The fix is to have readblock() call g_vfs_strategy() rather than constructing its own "bio" request: /* Read the specified block into the given buffer. */ static int readblock(snapvp, bp, lbn) struct vnode *snapvp; struct buf *bp; ufs2_daddr_t lbn; { struct inode *ip; struct fs *fs; ip = VTOI(snapvp); fs = ITOFS(ip); bp->b_iocmd = BIO_READ; bp->b_iooffset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn))); bp->b_iodone = bdone; g_vfs_strategy(&ITODEVVP(ip)->v_bufobj, bp); bufwait(bp); return (bp->b_error); } Here it uses the buffer that will eventually be written to the disk. The g_vfs_strategy() routine uses four parts of the buffer: b_bcount, b_iocmd, b_iooffset, and b_data. The b_bcount field is already correctly set for the buffer. It is safe to set the b_iocmd and b_iooffset fields as they are set correctly when the later write is done. The write path will also clear the B_DONE flag that our use of the buffer will set. The b_iodone callback has to be set to bdone() which will do just notification that the I/O is done in bufdone(). The rest of bufdone(), which includes things like processing the softdeps associated with the buffer, should not be done until the buffer has been written. Bufdone() will set b_iodone back to NULL after using it, so the full bufdone() processing will be done when the buffer is written. The final change from the previous version of readblock() is that it used the b_data for the destination of the read while g_vfs_strategy() uses the bdata2bio() function to take advantage of VMIO when it is available. Differential revision: https://reviews.freebsd.org/D32150 Reviewed by: kib, chs MFC after: 1 week Sponsored by: Netflix	2021-09-27 20:04:51 -07:00
Kirk McKusick	d7770a5495	Eliminate snaplk / bufwait LOR when creating UFS snapshots Each vnode has an embedded lock that controls access to its contents. However vnodes describing a UFS snapshot all share a single snapshot lock to coordinate their access and update. As part of creating a new UFS snapshot, it has to have its individual vnode lock replaced with the filesystem's snapshot lock. The lock order for regular vnodes with respect to buffer locks is that they must first acquire the vnode lock, then a buffer lock. The order for the snapshot lock is reversed: a buffer lock must be acquired before the snapshot lock. When creating a new snapshot, the snapshot file must retain its vnode lock until it has allocated all the blocks that it needs before switching to the snapshot lock. This update moves one final piece of the initial snapshot block allocation so that it is done before the newly created snapshot is switched to use the snapshot lock. Reported by: Witness code MFC after: 1 week Sponsored by: Netflix	2021-09-18 17:02:30 -07:00
Konstantin Belousov	197a4f29f3	buffer pager: allow get_blksize method to return error Reported and reviewed by: asomers Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31998	2021-09-17 20:29:55 +03:00
Robert Wing	440320b620	ffs: remove unused thread argument from ffs_reload() MFC After: 1 week Reviewed by: imp, kib Differential Revision: https://reviews.freebsd.org/D31127	2021-09-04 12:25:10 -08:00
Konstantin Belousov	bb536de6c0	ffs_update(): Do not assume that EBUSY can only come from an LK_NOWAIT trylock Instead do a protective check for the local flags and do not interpret EBUSY specially if we did not request trylock mode for bread(). Reviewed by: mckusick Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-08-31 07:38:35 +03:00
Konstantin Belousov	f822d4feb8	ffs_update(): recalculate flags after relocking the vnode Inode type could migrate between snapshot and regular types while the vnode is unlocked. Recalculate flags specific for snapshot after relock. Reviewed by: mckusick Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-08-31 07:38:35 +03:00
Keith Owens	3b29c8b4bd	ddb: do not assume that ffs is mounted with softdep Avoid a panic when debugging with "show ffs" in ddb. Reviewed By: kib, markj, mckusick MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D31622	2021-08-24 21:00:19 -05:00
Gordon Bergling	464a166c27	ufs_dirhash: Correct a typo in a comment - s/memry/memory/ MFC after: 3 days	2021-08-20 09:59:18 +02:00
Konstantin Belousov	8df4bc48c8	ufs rename: ensure that the result of ufs_checkpath() is stable ufs_rename() calls ufs_checkpath() to ensure that the target directory is not a child of the source. If not, rename would create a loop. For instance: source->X1->X2->target and if source moved under target, we get corrupted filesystem. Suppose that we initially have source->X1 .... and X2->target where X1 is not on path from root to X2. Then ufs_checkpath() accepts the inodes, but there is nothing preventing parallel rename of X2 to become under X1, after checkpath finished. Ensure stability of ufs_checkpath() result by taking a per-mount sx in ufs_rename right before ufs_checkpath() and till the end. Reviewed by: chs, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2021-08-13 17:52:26 +03:00
Konstantin Belousov	2e2212b4f5	Style: wrap the long line, definition of ufs_checkpath() Sponsored by: The FreeBSD Foundation MFC after: 3 days	2021-08-13 17:52:20 +03:00
Kirk McKusick	a91716efeb	Clean up orphaned indirdep dependency structures after disk failure. During forcible unmount after a disk failure there is a bug that causes one or more indirdep dependency structures to fail to be deallocated. Until we manage to track down why they fail to get cleaned up, this code tracks them down and eliminates them so that the unmount can succeed. Reported by: Peter Holm Help from: kib Reviewed by: Chuck Silvers Tested by: Peter Holm MFC after: 7 days Sponsored by: Netflix	2021-07-29 16:31:16 -07:00
Kirk McKusick	412b5e40a7	Diagnostic improvement to soft dependency structure management. The soft updates diagnostic code keeps a list for each type of soft update dependency. When a new block is allocated for a file it is initially tracked by a "newblk" dependency. The "newblk" dependency eventually becomes either an "allocdirect" dependency or an "indiralloc" dependency. The diagnostic code failed to move the "newblk" from the list of "newblk"s to its new type list. No functional change intended. Reviewed by: Chuck Silvers (as part of a larger change) Tested by: Peter Holm (as part of a larger change) Sponsored by: Netflix	2021-07-29 16:13:54 -07:00
Jason A. Harmening	211ec9b7d6	FFS: remove ffs_fsfail_task Now that dounmount() supports a dedicated taskqueue, we can simply call it with MNT_DEFERRED directly from the failing context. This also avoids blocking taskqueue_thread with a potentially-expensive unmount operation. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:41 -07:00
Jason A. Harmening	c746ed724d	Allow stacked filesystems to be recursively unmounted In certain emergency cases such as media failure or removal, UFS will initiate a forced unmount in order to prevent dirty buffers from accumulating against the no-longer-usable filesystem. The presence of a stacked filesystem such as nullfs or unionfs above the UFS mount will prevent this forced unmount from succeeding. This change addresses the situation by allowing stacked filesystems to be recursively unmounted on a taskqueue thread when the MNT_RECURSE flag is specified to dounmount(). This call will block until all upper mounts have been removed unless the caller specifies the MNT_DEFERRED flag to indicate the base filesystem should also be unmounted from the taskqueue. To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs have been combined with the existing 'mnt_uppers' list used by nullfs and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper(). The format of the mnt_uppers list has also been changed to accommodate filesystems such as unionfs in which a given mount may be stacked atop more than one lower mount. Additionally, management of lower FS reclaim/unlink notifications has been split into a separate list managed by a separate set of KPIs, as registration of an upper FS no longer implies interest in these notifications. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016	2021-07-24 12:52:00 -07:00
John Baldwin	58109a87d4	Use an ANSI C function declaration for journal_check_space. GCC6 fails to compile this due to a -Wstrict-prototypes error. Sponsored by: Chelsio Communications	2021-07-23 15:59:11 -07:00
Konstantin Belousov	50acaaef54	ffs_softdep: force sync if journal is low in journal_check_space This effectively causes syncing of the mount point from softdep_prealloc(), softdep_prerename(), and softdep_prelink(). Typically it avoids the need for journal suspension at this point, at all. Suggested and reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:47:05 +03:00
Konstantin Belousov	2126f103e0	ffs_softdep.c: add journal_check_space() helper Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:47:05 +03:00
Konstantin Belousov	64b494a105	softdep_prelink(): only do sync if other thread changed the vnode metadata since previous prelink We call into softdep_prerename() and softdep_prelink() when there is low free space in the journal. Functions sync all vnodes participating in the VOP, in the hope that this would reduce journal utilization. But if the vnodes are already synced, doing sync would only spend writes, journal is filled not due to the records from modifications of our vnodes. Remember original seqc numbers for vnodes, and only initiate syncs when seqc changed. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:54 +03:00
Konstantin Belousov	f756546662	ufs_rename(): only do softdep_prerename() when other thread changed a vnode Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Konstantin Belousov	d4d289cd51	ffs: mark block (re-)allocations as seqc writes Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Konstantin Belousov	5eacde3eb8	ufs_rename(): softdep_prerename() does something only for SU+J so call it only in SU+J case Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Konstantin Belousov	d0929a990c	ffs: reduce number of dvp relocks in softdep_prelink() If vp == NULL, we unlocked and then immediately relocked dvp there. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Konstantin Belousov	b2b40b28b1	ufs_vnops.c: style Wrap too long functions declarations. Reviewed by: mckusick Discussed with: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D30041	2021-06-23 23:46:15 +03:00
Mark Johnston	b2f9575646	ffs: Correct the input size check in sysctl_ffs_fsck() Make sure we return an error if no input was specified, since SYSCTL_IN() will report success in that case. Reported by: KMSAN Reviewed by: mckusick MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30586	2021-05-31 18:59:18 -04:00
Jason A. Harmening	a4b07a2701	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Also, add stdbool.h to libprocstat modules which #define _KERNEL before including sys/mount.h. Otherwise they'll pull in sys/types.h before defining _KERNEL and therefore won't have the bool definition they need for mp_busy. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30556	2021-05-30 14:53:47 -07:00
Jason A. Harmening	271fcf1c28	Revert commits `6d3e78ad6c` and `54256e7954` Parts of libprocstat like to pretend they're kernel components for the sake of including mount.h, and including sys/types.h in the _KERNEL case doesn't fix the build for some reason. Revert both the VFS_QUOTACTL() change and the follow-up "fix" for now.	2021-05-29 17:48:02 -07:00
Jason A. Harmening	6d3e78ad6c	VFS_QUOTACTL(9): allow implementation to indicate busy state changes Instead of requiring all implementations of vfs_quotactl to unbusy the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param to VFS_QUOTACTL(9). The implementation may then indicate to the caller whether it needed to unbusy the mount. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30218	2021-05-29 14:05:39 -07:00
Konstantin Belousov	f784da883f	Move mnt_maxsymlinklen into appropriate fs mount data structures Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC-Note: struct mount layout Differential revision: https://reviews.freebsd.org/D30325	2021-05-22 15:16:09 +03:00
Don Morris	f17a590085	ufs: Avoid M_WAITOK allocations when building a dirhash At this point the directory's vnode lock is held, so blocking while waiting for free pages makes the system more susceptible to deadlock in low memory conditions. This is particularly problematic on NUMA systems as UMA currently implements a strict first-touch policy. ufsdirhash_build() already uses M_NOWAIT for other allocations and already handled failures for the block array allocation, so just convert to M_NOWAIT. PR: 253992 Reviewed by: markj, mckusick, vangyzen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29045	2021-05-20 11:25:45 -04:00
Kirk McKusick	9a2fac6ba6	Fix handling of embedded symbolic links (and history lesson). The original filesystem release (4.2BSD) had no embedded symlinks. Historically symbolic links were just a different type of file, so the content of the symbolic link was contained in a single disk block fragment. We observed that most symbolic links were short enough that they could fit in the area of the inode that normally holds the block pointers. So we created embedded symlinks where the content of the link was held in the inode's pointer area thus avoiding the need to seek and read a data fragment and reducing the pressure on the block cache. At the time we had only UFS1 with 32-bit block pointers, so the test for a fastlink was: di_size < (NDADDR + NIADDR) * sizeof(daddr_t) (where daddr_t would be ufs1_daddr_t today). When embedded symlinks were added, a spare field in the superblock with a known zero value became fs_maxsymlinklen. New filesystems set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus filesystems that preceded this change always read from blocks (since fs->fs_maxsymlinklen == 0) and newer ones used embedded symlinks if they fit. Similarly symlinks created on pre-embedded symlink filesystems always spill into blocks while newer ones will embed if they fit. At the same time that the embedded symbolic links were added, the on-disk directory structure was changed splitting the former u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen. Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can be used to distinguish old directory formats. In retrospect that should have just been an added flag, but we did not realize we needed to know about that change until it was already in production. Code was split into ufs/ffs so that the log structured filesystem could use ufs functionality while doing its own disk layout. 
This meant that no ffs superblock fields could be used in the ufs code. Thus ffs superblock fields that were needed in ufs code had to be copied to fields in the mount structure. Since ufs_readlink needed to know if a link was embedded, fs_maxsymlinklen gets copied to mnt_maxsymlinklen. The kernel panic that led to this fix was triggered when a disk error created an inode of type symlink with no allocated data blocks but a large size. When readlink was called the uiomove was attempted which segment faulted. static int ufs_readlink(ap) struct vop_readlink_args /* { struct vnode *a_vp; struct uio *a_uio; struct ucred *a_cred; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); doff_t isize; isize = ip->i_size; if ((isize < vp->v_mount->mnt_maxsymlinklen) || DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ return (uiomove(SHORTLINK(ip), isize, ap->a_uio)); } return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred)); } The second part of the "if" statement that adds DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ is problematic. It never appeared in BSD released by Berkeley because as noted above mnt_maxsymlinklen is 0 for old format filesystems, so will always fall through to the VOP_READ as it should. I had to dig back through `git blame' to find that Rodney Grimes added it as part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.'' He must have brought it across from an earlier FreeBSD. Unfortunately the source-control logs for FreeBSD up to the merger with the AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the agreement to let FreeBSD remain unencumbered, so I cannot pin-point where that line got added on the FreeBSD side. The one change needed here is that mnt_maxsymlinklen is declared as an `int' and should be changed to be `u_int64_t'. This discovery led us to check out the code that deletes symbolic links. 
Specifically if (vp->v_type == VLNK && (ip->i_size < vp->v_mount->mnt_maxsymlinklen || datablocks == 0)) { if (length != 0) panic("ffs_truncate: partial truncate of symlink"); bzero(SHORTLINK(ip), (u_int)ip->i_size); ip->i_size = 0; DIP_SET(ip, i_size, 0); UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE); if (needextclean) goto extclean; return (ffs_update(vp, waitforupdate)); } Here too our broken symlink inode with no data blocks allocated and a large size will segment fault as we are incorrectly using the test that we have no data blocks to decide that it is an embedded symbolic link and attempting to bzero past the end of the inode. The test for datablocks == 0 is unnecessary as the test for ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right thing in all cases. The test for datablocks == 0 was added by David Greenman in this commit: Author: David Greenman <dg@FreeBSD.org> Date: Tue Aug 2 13:51:05 1994 +0000 Completed (hopefully) the kernel support for old style "fastlinks". Notes: svn path=/head/; revision=1821 I am guessing that he likely earlier added the incorrect test in the ufs_readlink code. I asked David if he had any recollection of why he made this change. Amazingly, he still had a recollection of why he had made a one-line change more than twenty years ago. And unsurprisingly it was because he had been stuck between a rock and a hard place. FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code base. Prior to that, there were three years of development in all areas of the kernel, including the filesystem code, from the combined set of people including Bill Jolitz, Patchkit contributors, and FreeBSD Project members. The compatibility issue at hand was caused by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite changes David had to find a way to provide compatibility with both the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite. 
He felt that these changes would provide compatibility with both systems. In his words: ``My recollection is that the 'FASTLINKS' symlinks support in FreeBSD-1.x, as implemented by Curt Mayer, worked differently than 4.4BSD. He used a spare field in the inode to duplicately store the length. When the 4.4BSD-Lite merge was done, the optimized symlinks support for existing filesystems (those that were initialized in FreeBSD-1.x) were broken due to the FFS on-disk structure of 4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to restore the backward compatibility with FreeBSD-1.x filesystems. I think it was the best that could be done in the somewhat urgent circumstances of the post Berkeley-USL settlement. Also, regarding Rod's massive commit with little explanation, some context: John Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to the 386 platform in just 10 days. It was by far the most intense hacking effort of my life. In addition to the porting of tons of FreeBSD-1 code, I think we wrote more than 30,000 lines of new code in that time to deal with the missing pieces and architectural changes of 4.4BSD-Lite. We didn't make many notes along the way. There was a lot of pressure to get something out to the rest of the developer community as fast as possible, so detailed discrete commits didn't happen - it all came as a giant wad, which is why Rod's commit message was worded the way it was.'' Reported by: Chuck Silvers Tested by: Chuck Silvers History by: David Greenman Lawrence MFC after: 1 week Sponsored by: Netflix	2021-05-16 17:04:11 -07:00
Konstantin Belousov	e3d6759585	b_vflags update requires bufobj lock The trunc_dependencies() issue was reported by Alexander Lochmann <alexander.lochmann@tu-dortmund.de>, who found the problem by performing lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-04-15 15:47:42 +03:00
Kirk McKusick	14d0cd7225	Ensure that the mount command shows "with quotas" when quotas are enabled. When quotas are enabled with the quotaon(8) command, it sets the MNT_QUOTA flag in the mount structure mnt_flag field. The mount structure holds a cached copy of the filesystem statfs structure in mnt_stat that includes a copy of the mnt_flag field in mnt_stat.f_flags. The mnt_stat structure may not be updated for hours. Since the mount command requests mount details using the MNT_NOWAIT option, it gets the mount's mnt_stat statfs structure whose f_flags field does not yet show the MNT_QUOTA flag being set in mnt_flag. The fix is to have quotaon(8) set the MNT_QUOTA flag in both mnt_flag and in mnt_stat.f_flags so that it will be immediately visible to callers of statfs(2). Reported by: Christos Chatzaras Tested by: Christos Chatzaras PR: 254682 MFC after: 3 days Sponsored by: Netflix	2021-04-14 15:25:08 -07:00
Konstantin Belousov	0b3948e73b	softdep_unmount: assert that no dangling dependencies are left Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	7a8d4b4da6	FFS: assign fully initialized struct mount_softdeps to um_softdep Other threads observing the non-NULL um_softdep can assume that it is safe to use it. This is important for ro->rw remounts where change from read-only to read-write status cannot be made atomic. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	2af934cc15	Assert that um_softdep is NULL on free(ump), i.e. softdep_unmount() was called Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	f776c54cee	ffs_mount: when remounting ro->rw and sbupdate failed, cleanup softdeps Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	d7e5e37416	softdep_unmount: handle spurious wakeups Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	fabbc3d879	softdep_flush(): do not access ump after we acked FLUSH_EXIT and unlocked SU lock otherwise we might follow a pointer in the freed memory. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:08 +02:00
Konstantin Belousov	7c7a6681fa	ffs: clear MNT_SOFTDEP earlier when remounting rw to ro Suppose that we remount rw->ro and in parallel some reader tries to instantiate a vnode, e.g. during lookup. Suppose that softdep_unmount() already started, but we did not clear the MNT_SOFTDEP flag yet. Then ffs_vgetf() calls into softdep_load_inodeblock(), which accesses destroyed hashes and freed memory. Set/clear fs_ronly simultaneously (WRT files flush) with MNT_SOFTDEP. It might be reasonable to move the change of fs_ronly to under MNT_ILOCK, but no readers take it. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:07 +02:00
Konstantin Belousov	7f682bdcab	Rework MOUNTED/DOING SOFTDEP/SUJ macros Now MNT_SOFTDEP indicates that SU are active in any variant +-J, and SU+J is indicated by MNT_SOFTDEP \| MNT_SUJ combination. The reason is that unmount will be able to easily hide SU from other operations by clearing MNT_SOFTDEP while keeping the record of the active journal. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:07 +02:00
Konstantin Belousov	81cdb19e04	ffs softdep: clear ump->um_softdep on softdep_unmount() Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:31:07 +02:00
Konstantin Belousov	a285d3edac	ffs_extern.h: Add comments for ffs_vgetf() flags Requested and reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:30:59 +02:00
Konstantin Belousov	fd97fa6463	Add FFSV_FORCEINODEDEP flag for ffs_vgetf() It will be used to allow SU flush code to sync the volume while external consumers see that SU is already disabled on the filesystem. Use it where ffs_vgetf() called by SU code to process dependencies. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:30:38 +02:00
Konstantin Belousov	25aac48d2c	simplify journal_mount: move the out label after success block This removes the need to check for error == 0. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178	2021-03-12 13:30:37 +02:00
Konstantin Belousov	8742817ba6	FFS extattr: fix handling of the tail There are three issues with change that stopped truncating ea area before write, and resulted in possible zero tail in the ea area: - Truncate to zero checked i_ea_len after the reference was dropped, making the last drop effectively truncate to zero length always. - Loop to fill uio for zeroing specified too large length, that triggered assert in normal situation. - Integrity check could trip over the tail, instead we must allow partial header or header with zero length, and clamp ea image in memory at it. Reported by: arichardson Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 3 days Fixup: `5e198e7646` Differential Revision: https://reviews.freebsd.org/D28999	2021-03-02 02:19:34 +02:00
Konstantin Belousov	6f30ac9995	Call softdep_prealloc() before taking ffs_lock_ea(), if unlock is committing softdep_prealloc() must be called to ensure enough journal space is available, before ffs_extwrite(). Also it must be done before taking ffs_lock_ea(), because it calls ffs_syncvnode(), potentially dropping the vnode lock. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-02-24 09:55:21 +02:00
Konstantin Belousov	5e198e7646	ffs_close_ea: do not relock vnode under lock_ea ffs_lock_ea is after the vnode lock, so vnode must not be relocked under lock_ea. Move ffs_truncate() call in ffs_close_ea() after the lock_ea is dropped, and only truncate to length zero, since this is the only mode supported by ffs_truncate() for EAs. Previously code did truncation and then write. Zero the part of the ext area that is unused, if truncation is due but not done because ea area is not zero-length. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-02-24 09:55:04 +02:00
Konstantin Belousov	c6d68ca842	ffs_vnops.c: style Use local var to shorten ap->a_vp expression. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-02-24 09:54:53 +02:00
Konstantin Belousov	4983146279	ffs: do not call softdep_prealloc() from UFS_BALLOC() Do it in ffs_write(), where we can gracefully handle relock and its consequences. In particular, recheck the v_data to see if the vnode reclamation ended, and return EBADF when we cannot proceed with the write. Reviewed by: mckusick Reported by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-02-24 09:54:50 +02:00
Konstantin Belousov	cc9958bf22	ffs_reallocblks: change the guard for softdep_prealloc() call to DOINGSUJ() instead of DOINGSOFTDEP(). The softdep_prealloc() function does nothing in SU case. Note that the call should be safe with regard to the vnode relock, because it is called with MNT_NOWAIT, which does not descend into fsync. Reviewed by: mckusick Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-02-24 09:54:30 +02:00
Konstantin Belousov	2bfd8992c7	vnode: move write cluster support data to inodes. The data is only needed by filesystems that 1. use buffer cache 2. utilize clustering write support. Requested by: mjg Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Konstantin Belousov	d485c77f20	Remove #define _KERNEL hacks from libprocstat Make sys/buf.h, sys/pipe.h, sys/fs/devfs/devfs*.h headers usable in userspace, assuming that the consumer has an idea what it is for. Unhide more material from sys/mount.h and sys/ufs/ufs/inode.h, sys/ufs/ufs/ufsmount.h for consumption of userspace tools, with the same caveat. Remove unacceptable hack from usr.sbin/makefs which relied on sys/buf.h being unusable in userspace, where it override struct buf with its own definition. Instead, provide struct m_buf and struct m_vnode and adapt code to use local variants. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D28679	2021-02-21 11:38:21 +02:00
Konstantin Belousov	c31480a1f6	UFS snapshots: properly set the vm object size. Citing Kirk: The previous code [before `8563de2f27` -- kib] did not call vnode_pager_setsize() but worked because later in ffs_snapshot() it does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE() allocated the extra block at the end of the file which caused it to do the needed vnode_pager_setsize(). But the new code had already allocated the extra block, so UFS_WRITE() did not extend the size and thus did not do the vnode_pager_setsize(). PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Reviewed by: mckusick Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-02-16 07:11:52 +02:00
Kirk McKusick	8563de2f27	Fix bug 253158 - Panic: snapacct_ufs2: bad block - mksnap_ffs(8) crash The panic reported in 253158 arises because the /mnt/.snap/.factory snapshot allocated the last block in the filesystem. The snapshot code allocates the last block in the filesystem as a way of setting its length to be the size of the filesystem. Part of taking a snapshot is to remove all the earlier snapshots from the image of the newest snapshot so that newer snapshots will not claim the blocks of the earlier snapshots. The panic occurs when the new snapshot finds that both it and an earlier snapshot claim the same block. The fix is to set the size of the snapshot to be one block after the last block in the filesystem. This block can never be allocated since it is not a valid block in the filesystem. This extra block is used as a place to store the initial list of blocks that the snapshot has already copied and is used to avoid a deadlock in and speed up the ffs_copyonwrite() function. Reported by: Harald Schmalzbauer Tested by: Peter Holm PR: 253158 Sponsored by: Netflix	2021-02-11 21:31:16 -08:00
Konstantin Belousov	adf28ab456	fifo: minor comment and assert improvements. In particular, replace a note that reload through vget() is obsoleted, with explanation why this code is required. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	26af9f72f7	ffs_unlock: assert that IN_ENDOFF is not leaked past locked scope This catches both missed processing of IN_ENDOFF and missed application of VOP_VPUT_PAIR() after VOP that created an entry in the directory. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	28703d2713	ffs softdep: Force processing of VI_OWEINACT vnodes when there is inode shortage Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc() is unable to find a free inode. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	2011b44fa3	softdep_request_cleanup: wait for softdep_request_clean_flush() to pass if we noted a parallel request is active and declined to overflow the system with parallel redundant sync of the vnodes. But we need to wait for the flush to finish to see if there are any freed resources. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:22 +02:00
Konstantin Belousov	013168db8c	ufs_inactive(): stop hiding ERELOOKUP from ffs_truncate(), return it. VFS should retry inactivation when possible, then. This should provide timely removal of unlinked unreferenced inodes. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	b59a8e63d6	Stop ignoring ERELOOKUP from VOP_INACTIVE() When possible, relock the vnode and retry inactivation. Only vunref() is required not to drop the vnode lock, so handle it specially by not retrying. This is a part of the efforts to ensure that unlinked not referenced vnode does not prevent inode from reusing. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	6aed2435c8	ufs vnops: brace softdep_prelink() with DOINGSUJ instead of DOINGSOFTDEP because softdep_prelink() is reverted to NOP for non-J case. There is no need to do anything before ufs_direnter() in SU/non-J case, everything required to sync the directory is done in VOP_VPUT_PAIR(). Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	ede40b0675	ffs softdep: remove will_direnter argument of softdep_prelink() Originally this was done in `8a1509e442` to forcibly cover cases where a hole in the directory could be created by extending into indirect block, since dependency of writing out indirect block is not tracked. This results in excessive amount of fsyncing the directories, where all creation of new entry forced fsync before it. This is not needed, it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides the required hook to only perform required syncing. The series of changes culminating in this commit puts the performance of metadata-intensive loads back to that before `8a1509e442`. Analyzed by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	06f2918ab8	ufs_direnter: directory truncation does not need special case for rename In ufs_rename case, tdvp is locked from the place where ufs_direnter() is done till VOP_VPUT_PAIR(), which means that we no longer need to specially handle rename in ufs_direnter(). Truncation, if possible, is done in the same way in ffs_vput_pair() both for rename and other VOPs calling ufs_direnter(). Remove isrename argument and set IN_ENDOFF if ufs_direnter() succeeded and directory needs truncation. In ffs_vput_pair(), stop verifying the condition that directory needs truncation when IN_ENDOFF is set, instead assert that the condition is true. Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	038fe6e089	ufs_rename: use VOP_VPUT_PAIR and rely on directory sync/truncation there Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	74a3652f83	ufs_direnter: move directory truncation to ffs_vput_pair(). VOP_VPUT_PAIR() provides the hook to do the truncation right before unlock, which is required since truncation might need to fsync(), which itself might unlock the directory vnode. Set new flag IN_ENDOFF which indicates that i_endoff is valid and should be checked against inode size. Excessive size is chomped, but this operation is advisory and failure to truncate should not result in the failure of the main VOP. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	30bfb2fa0f	ffs_vput_pair(): try harder to recover from the vnode reclaim In particular, if unlock_vp is false, save vp's inode number and generation. If ffs_inotovp() can re-create the vnode with the same number and generation after we finished with handling dvp, then we most likely raced with unmount, and were able to restore atomicity of open. We use FFSV_REPLACE_DOOMED there, to drop the old vnode. This additional recovery is not strictly required, but it improves the quality of the implementation. Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	f2c9d038bd	FFS: implement special VOP_VPUT_PAIR(). It cleans IN_NEEDSYNC flag on dvp before returning, by applying ffs_syncvnode() until success or an error different from ERELOOKUP. IN_NEEDSYNC cleanup is required to avoid creating holes in the directories when extended into indirect block. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:21 +02:00
Konstantin Belousov	be44e98637	ffs_snapshot: use VOP_VPUT_PAIR after VOP_CREATE. If the snapshot embryo was reclaimed under us, return error outright. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	08c2dc2841	ufs_direnter/SU: unconditionally UFS_UPDATE inode when extending directory for all kinds of async/SU mount variants. Submitted by: mckusick Reviewed by: chs Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	1de1e2bfbf	ffs_syncvnode: only clear IN_NEEDSYNC after successful sync If it is cleaned before the sync, other threads might see the inode without the flag set, because syncing could unlock it. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	89fd61d955	Merge ufs_fhtovp() into ffs_inotovp(). The function alone was not used for anything but ffs_fstovp() for long time. Suggested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	5952c86c78	ffs_inotovp(): interface to convert (ino, gen) into alive vnode It generalizes the VFS_FHTOVP() interface, making it possible to fetch the inode without faking filehandle. Also it adds the ffs flags argument which allows to control ffs_vgetf() call. Requested by: mckusick Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	f16c26b1c0	ffs: Add FFSV_REPLACE_DOOMED flag to ffs_vgetf() It specifies that caller requests a fresh non-doomed vnode. If doomed vnode is found in the hash, it should behave similarly to FFSV_REPLACE. Or, to put it differently, the flag is same as FFSV_REPLACE, but only when the found hashed vnode is doomed. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:20 +02:00
Konstantin Belousov	e94f2f1be3	ffs: call ufsdirhash_dirtrunc() right after setting directory size Later processing of ffs_truncate() might temporarily unlock the directory vnode, causing unsynchronized dirhash and inode sizes if update is postponed to UFS_TRUNCATE() callers. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:19 +02:00
Konstantin Belousov	bf0db19339	buf SU hooks: track buf_start() calls with B_IOSTARTED flag and only call buf_complete() if previously started. Some error paths, like CoW failure, might skip buf_start() and do bufdone(), which itself calls buf_complete(). Various SU handle_written_XXX() functions check that io was started and incomplete parts of the buffer data reverted before restoring them. This is a useful invariant that B_IO_STARTED on buffer layer allows to keep instead of changing check and panic into check and return. Reported by: pho Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-02-12 03:02:19 +02:00
Konstantin Belousov	0281f88e5d	ffs_vnops.c: Move opt_*.h includes to the top. as it is done in other places. Header files might need options defined for correct operation. Reviewed by: chs, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2021-02-12 03:02:19 +02:00
Kirk McKusick	a63eae65ff	Revert `2d4422e799`, Eliminate lock order reversal in UFS ffs_unmount(). After discussion with Chuck Silvers (chs@) we have decided that there is a better way to resolve this lock order reversal which will be committed separately. Sponsored by: Netflix	2021-01-30 00:03:37 -08:00
Mateusz Guzik	c892d60a1d	ufs: denote lack of support for lockless symlink lookup It is unclear without investigating if it can be provided without using extra memory, so for the time being just don't.	2021-01-23 15:04:43 +00:00
Kirk McKusick	79a5c790bd	Eliminate a locking panic when cleaning up UFS snapshots after a disk failure. Each vnode has an embedded lock that controls access to its contents. However vnodes describing a UFS snapshot all share a single snapshot lock to coordinate their access and update. As part of mounting a UFS filesystem with snapshots, each of the vnodes describing a snapshot has its individual lock replaced with the snapshot lock. When the filesystem is unmounted the vnode's original lock is returned replacing the snapshot lock. When a disk fails while the UFS filesystem it contains is still mounted (for example when a thumb drive is removed) UFS forcibly unmounts the filesystem. The loss of the drive causes the GEOM subsystem to orphan the provider, but the consumer remains until the filesystem has finished with the unmount. Information describing the snapshot locks was being prematurely cleared during the orphaning causing the return of the snapshot vnode's original locks to fail. The fix is to not clear the needed information prematurely. Sponsored by: Netflix	2021-01-15 16:36:42 -08:00
Kirk McKusick	173779b98f	Eliminate lock order reversal in UFS when unmounting filesystems with snapshots. Each vnode has an embedded lock that controls access to its contents. However vnodes describing a UFS snapshot all share a single snapshot lock to coordinate their access and update. As part of mounting a UFS filesystem with snapshots, each of the vnodes describing a snapshot has its individual lock replaced with the snapshot lock. When the filesystem is unmounted the vnode's original lock is returned replacing the snapshot lock. The lock order reversal happens because vnode locks must be acquired before snapshot locks. When unmounting we must lock both the snapshot lock and the vnode lock before swapping them so that the vnode will be continuously locked during the swap. For each vnode representing a snapshot, we must first acquire the snapshot lock to ensure exclusive access to it and its original lock. We then face a lock order reversal when we try to acquire the original vnode lock. The problem is eliminated by doing a non-blocking exclusive lock on the original lock which will always succeed since there are no users of that lock. Sponsored by: Netflix	2021-01-15 16:03:01 -08:00
Mateusz Guzik	6b3a9a0f3d	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)	2021-01-12 13:16:10 +00:00
Kirk McKusick	2d4422e799	Eliminate lock order reversal in UFS ffs_unmount(). UFS uses a new "mntfs" pseudo file system which provides private device vnodes for a file system to safely access its disk device. The original device vnode is saved in um_odevvp to hold the exclusive lock on the device so that any attempts to open it for writing will fail. But it is otherwise unused and has its BO_NOBUFS flag set to enforce that file systems using mntfs vnodes do not accidentally use the original devfs vnode. When the file system is unmounted, um_odevvp is no longer needed and is released. The lock order reversal happens because device vnodes must be locked before UFS vnodes. During unmount, the root directory vnode lock is held. When calling vrele() on um_odevvp, vrele() attempts to exclusive lock um_odevvp causing the lock order reversal. The problem is eliminated by doing a non-blocking exclusive lock on um_odevvp which will always succeed since there are no users of um_odevvp. With um_odevvp locked, it can be released using vput which does not attempt to do a blocking exclusive lock request and thus avoids the lock order reversal. Sponsored by: Netflix	2021-01-11 16:49:07 -08:00
Thomas Munro	e7347be9e3	ffs: Support O_DSYNC. Respect the new IO_DATASYNC flag when performing synchronous writes. Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some cases, matching the behaviour of fdatasync(2). Reviewed by: kib Differential Review: https://reviews.freebsd.org/D25160	2021-01-08 13:15:56 +13:00
Mateusz Guzik	3e506a67bb	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793	2021-01-03 06:50:06 +00:00
Mateusz Guzik	9997aedb8f	ufs: use VNPASS when asserting on a vnode in ufs_read_pgcache	2021-01-01 03:14:11 +00:00
Mark Johnston	ace3d9475c	ffs: Avoid out-of-bounds accesses in the fs_active bitmap We use a bitmap to track which cylinder groups have changed between snapshot creation and filesystem suspension. The "legs" of the bitmap are four bytes wide (see ACTIVESET()) so we must round up the allocation size to a multiple of four bytes. I believe this bug is harmless since UMA/kmem_* will both pad the allocation and zero the full allocation. Note that malloc() does inline zeroing when the allocation size is known at compile-time. Reported by: pho (using KASAN) Reviewed by: kib, mckusick MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27731	2020-12-23 11:16:40 -05:00
Ryan Libby	93dba42c0e	ffs: quiet -Wstrict-prototypes Reviewed by: kib, markj, mckusick Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D27558	2020-12-11 22:51:57 +00:00
Kirk McKusick	bb3c01ec79	Document the BA_CLRBUF flag used in ufs and ext2fs filesystems. Suggested by: kib MFC after: 3 days Sponsored by: Netflix	2020-12-06 20:50:21 +00:00
Konstantin Belousov	2c7ada9917	ufs: handle two more cases of possible VNON vnode returned from VFS_VGET(). Reported by: kevans Reviewed by: mckusick, mjg Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27457	2020-12-06 18:09:14 +00:00
Konstantin Belousov	21a45add50	ffs: do not read full direct blocks if they are going to be overwritten. BA_CLRBUF specifies that existing context of the block will be completely overwritten by caller, so there is no reason to spend io fetching existing data. We do the same for indirect blocks. Reported by: tmunro Reviewed by: mckusick, tmunro Tested by: pho, tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D27353	2020-11-30 17:03:26 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Konstantin Belousov	92bcefd1d2	clear_inodedeps: handle ERELOOKUP from ffs_syncvnode(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation	2020-11-26 18:03:24 +00:00
Konstantin Belousov	07ef907f6e	ffs_softdep.c: get_parent_vp(): Fix bp lock leak when inum inode was already freed. Reported by: markj, pho Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-11-25 17:12:21 +00:00
Konstantin Belousov	8a1509e442	Handle LoR in flush_pagedep_deps(). When operating in SU or SU+J mode, ffs_syncvnode() might need to instantiate other vnode by inode number while owning syncing vnode lock. Typically this other vnode is the parent of our vnode, but due to renames occurring right before fsync (or during fsync when we drop the syncing vnode lock, see below) it might be no longer parent. More, the called function flush_pagedep_deps() needs to lock other vnode while owning the lock for vnode which owns the buffer, for which the dependencies are flushed. This creates another instance of the same LoR as was fixed in softdep_sync(). Put the generic code for safe relocking into new SU helper get_parent_vp() and use it in flush_pagedep_deps(). The case for safe relocking of two vnodes with undefined lock order was extracted into vn helper vn_lock_pair(). Due to call sequence ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(), ffs_syncvnode() indicates with ERELOOKUP that passed vnode was unlocked in process, and can return ENOENT if the passed vnode was reclaimed. All callers of the function were inspected. Because UFS namei lookups store auxiliary information about directory entry in in-memory directory inode, and this information is then used by UFS code that creates/removes directory entry in the actual mutating VOPs, it is critical that directory vnode lock is not dropped between lookup and VOP. For softdep_prelink(), which ensures that later link/unlink operation can proceed without overflowing the journal, calls were moved to the place where it is safe to drop processing VOP because mutations are not yet applied. Then, ERELOOKUP causes restart of the whole VFS operation (typically VFS syscall) at top level, including the re-lookup of the involved paths. [Note that we already do the same restart for failing calls to vn_start_write(), so formally this patch does not introduce new behavior.]
Similarly, unsafe calls to fsync in snapshot creation code were plugged. A possible view on these failures is that it does not make sense to continue creating snapshot if the snapshot vnode was reclaimed due to forced unmount. It is possible that relock/ERELOOKUP situation occurs in ffs_truncate() called from ufs_inactive(). In this case, dropping the vnode lock is not safe. Detect the situation with VI_DOINGINACT and reschedule inactivation by setting VI_OWEINACT. ufs_inactive() rechecks VI_OWEINACT and avoids reclaiming vnode if truncation failed this way. In ffs_truncate(), allocation of the EOF block for partial truncation is re-done after vnode is synced, since we cannot leave the buffer locked through ffs_syncvnode(). In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-14 05:30:10 +00:00
Konstantin Belousov	738ea0010b	Add ffs_inode_bwrite() helper. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-14 05:19:59 +00:00
Konstantin Belousov	7b795aa3c0	Revert r367669 to re-commit with proper message	2020-11-14 05:19:44 +00:00
Konstantin Belousov	c0d2077f41	Add a framework that tracks exclusive vnode lock generation count for UFS. This count is memoized together with the lookup metadata in directory inode, and we assert that accesses to lookup metadata are done under the same lock generation as they were stored. Enabled under DIAGNOSTICS. UFS saves additional data for parent dirent when doing lookup (i_offset, i_count, i_endoff), and this data is used later by VOPs operating on dirents. If parent vnode exclusive lock is dropped and re-acquired between lookup and the VOP call, we corrupt directories. Framework asserts that corruption cannot occur that way, by tracking vnode lock generation counter. Updates to inode dirent members also save the counter, while users compare current and saved counter values. Also, fix a case in ufs_lookup_ino() where i_offset and i_count could be updated under shared lock. It is not a bug on its own since dvp i_offset results from such lookup cannot be used, but it causes false positive in the checker. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-14 05:17:04 +00:00
Konstantin Belousov	61846fc4dc	Add a framework that tracks exclusive vnode lock generation count for UFS. This count is memoized together with the lookup metadata in directory inode, and we assert that accesses to lookup metadata are done under the same lock generation as they were stored. Enabled under DIAGNOSTICS. UFS saves additional data for parent dirent when doing lookup (i_offset, i_count, i_endoff), and this data is used later by VOPs operating on dirents. If parent vnode exclusive lock is dropped and re-acquired between lookup and the VOP call, we corrupt directories. Framework asserts that corruption cannot occur that way, by tracking vnode lock generation counter. Updates to inode dirent members also save the counter, while users compare current and saved counter values. Also, fix a case in ufs_lookup_ino() where i_offset and i_count could be updated under shared lock. It is not a bug on its own since dvp i_offset results from such lookup cannot be used, but it causes false positive in the checker. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-14 05:10:39 +00:00
Mark Johnston	f44994874b	ffs: Clamp BIO_SPEEDUP length On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by softdep_request_cleanup() may be negative when assigned to bp->b_bcount, which has type "long". Clamp the size to LONG_MAX. Also convert the unused g_io_speedup() to use an off_t for the magnitude of the shortage for consistency with softdep_send_speedup(). Reviewed by: chs, kib Reported by: pho Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27081	2020-11-11 13:48:07 +00:00
Conrad Meyer	e6790841f7	UFS2: Fix DoS due to corrupted extattrfile Prior versions of FreeBSD (11.x) may have produced a corrupt extattr file. (Specifically, r312416 accidentally fixed this defect by removing a strcpy.) CURRENT FreeBSD supports disk images from those prior versions of FreeBSD. Validate the internal structure as soon as we read it in from disk, to prevent these extattr files from causing invariants violations and DoS. Attempting to access the extattr portion of these files results in EINTEGRITY. At this time, the only way to repair files damaged in this way is to copy the contents to another file and move it over the original. PR: 244089 Reported by: Andrea Venturoli <ml AT netfence.it> Reviewed by: kib Discussed with: mckusick (earlier draft) Security: no Differential Revision: https://reviews.freebsd.org/D27010	2020-10-30 19:00:42 +00:00
Mateusz Guzik	4bfebc8d2c	cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename	2020-10-30 10:46:35 +00:00
Edward Tomasz Napierala	bce7ee9d41	Drop "All rights reserved" from all my stuff. This includes Foundation copyrights, approved by emaste@. It does not include files which carry other people's copyrights; if you're one of those people, feel free to make similar change. Reviewed by: emaste, imp, gbe (manpages) Differential Revision: https://reviews.freebsd.org/D26980	2020-10-28 13:46:11 +00:00
Kirk McKusick	996d40f91d	Various new check-hash checks have been added to the UFS filesystem over various major releases. Superblock check hashes were added for the 12 release and cylinder-group and inode check hashes will appear in the 13 release. When a disk with a UFS filesystem is writably mounted, the kernel clears the feature flags for anything that it does not support. For example, if a UFS disk from a 12-stable kernel is mounted on an 11-stable system, the 11-stable kernel will clear the flag in the filesystem superblock that indicates that superblock check-hashes are being maintained. Thus if the disk is later moved back to a 12-stable system, the 12-stable system will know to ignore its incorrect check-hash. If the only filesystem modification done on the earlier kernel is to run a utility such as growfs(8) that modifies the superblock but neither updates the check-hash nor clears the feature flag indicating that it does not support the check-hash, the disk will fail to mount if it is moved back to its original newer kernel. This patch moves the code that clears the filesystem feature flags from the mount code (ffs_mountfs()) to the code that reads the superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount code and is imported into libufs(3), all the filesystem utilities will now also clear these flags when they make modifications to the filesystem. As suggested by John Baldwin, fsck_ffs(8) has been changed to accept and repair bad superblock check-hashes rather than refusing to run. This change allows fsck to recover filesystems that have been impacted by utilities older than those created after this change and is a sensible thing to do in any event. Reported by: John Baldwin (jhb@) MFC after: 2 weeks Sponsored by: Netflix	2020-10-25 00:43:48 +00:00
Mateusz Guzik	25fb30bd9a	vfs: drop spurious cache_purge on rmdir The removed directory gets cache_purged which is sufficient to remove any entries related to the parent. Note only tmpfs, ufs and zfs are patched.	2020-10-23 15:50:49 +00:00
Brooks Davis	44ca4575ea	vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784	2020-10-21 16:00:15 +00:00
Mateusz Guzik	e9fb2bd9b8	ufs: catch up with removal of thread argument from VOP_INACTIVE	2020-10-20 09:46:20 +00:00
Konstantin Belousov	e1ef4c29a3	Do not leak B_BARRIER. Normally when a buffer with B_BARRIER is written, the flag is cleared by g_vfs_strategy() when creating bio. But in some cases FFS buffer might not reach g_vfs_strategy(), for instance when copy-on-write reports an error like ENOSPC. In this case buffer is returned to dirty queue and might be written later by other means. Among them, bdwrite() reasonably asserts that B_BARRIER is not set. In fact, the only current use of B_BARRIER is for lazy inode block initialization, where write of the new inode block is fenced against cylinder group write to mark inode as used. It could appear that we break the dependency by updating cg without the inode block having been written out. Practically, since CoW was not able to find space for a copy of inode block, the cg block write should fail for the same reason. Reported by: pho Discussed with: chs, imp, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D26511	2020-10-08 22:41:02 +00:00
Chuck Silvers	8b88330ed6	ufs: restore uniqueness of st_dev as returned by ufs_stat() switch ufs_stat() to use the same value for st_dev as was used by the previous ufs_getattr() stat path. Submitted by: gallatin Reviewed by: mjg, imp, kib, mckusick Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26596	2020-10-05 18:17:50 +00:00
Konstantin Belousov	3c484f325e	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:06:36 +00:00
Konstantin Belousov	96474d2a3f	Do not copy vp into f_data for DTYPE_VNODE files. The pointer to vnode is already stored into f_vnode, so f_data can be reused. Fix all found users of f_data for DTYPE_VNODE. Provide finit_vnode() helper to initialize file of DTYPE_VNODE type. Reviewed by: markj (previous version) Discussed with: freqlabs (openzfs chunk) Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 21:55:21 +00:00
Mateusz Guzik	d90f2c3617	ufs: clean up empty lines in .c and .h files	2020-09-01 21:23:00 +00:00
Mateusz Guzik	39f8815070	cache: add cache_rename, a dedicated helper to use for renames While here make both tmpfs and ufs use it. No functional changes.	2020-08-20 10:05:46 +00:00
Mateusz Guzik	8f226f4c23	vfs: remove the always-curthread td argument from VOP_RECLAIM	2020-08-19 07:28:01 +00:00
Mateusz Guzik	7ad2a82da2	vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error Most consumers pass NULL.	2020-08-19 02:51:17 +00:00
Konstantin Belousov	779ad2acf1	VMIO reads: enable for UFS Move v_object creation earlier, so that VIRF_PGREAD is never set if v_object is NULL. There is not much harm from instantiating v_object when later check for append-only flags disallows open. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 21:07:19 +00:00
Mateusz Guzik	a92a971bbb	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)	2020-08-16 17:18:54 +00:00
Mateusz Guzik	03337743db	vfs: clean MNTK_FPLOOKUP if MNT_UNION is set Elides checking it during lookup.	2020-08-10 11:51:21 +00:00
Mateusz Guzik	76dc5d3224	ufs: add VOP_STAT handler	2020-08-07 23:08:17 +00:00
Mateusz Guzik	d292b1940c	vfs: remove the obsolete privused argument from vaccess This brings argument count down to 6, which is passable without the stack on amd64.	2020-08-05 09:27:03 +00:00
Mateusz Guzik	e5e10c82ec	ufs: only pass LK_ADAPTIVE if LK_NODDLKTREAT is set This restores the pre-adaptive spinning state for SU which livelocks otherwise. Note this is a bug in SU. Reported by: pho	2020-08-04 23:09:15 +00:00
Mateusz Guzik	9d5a594f0b	ufs: add support for lockless lookup ACLs are not supported, meaning their presence will force the use of the old lookup. Reviewed by: kib Tested by: pho (in a patchset) Differential Revision: https://reviews.freebsd.org/D25579	2020-07-25 10:38:05 +00:00
Mateusz Guzik	31ad4050fe	lockmgr: add adaptive spinning It is very conservative: it spins only when LK_ADAPTIVE is passed, only on an exclusive lock, and never when any waiters are present. The buffer cache remains non-spinning. This reduces total sleep times during buildworld etc., but it does not shorten total real time (culprits are contention in the vm subsystem along with slock + upgrade which is not covered). For microbenchmarks: open3_processes -t 52 (open/close of the same file for writing) ops/s: before: 258845 after: 801638 Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D25753	2020-07-22 12:30:31 +00:00
Kirk McKusick	93440bbefd	The binary representation of the superblock (the fs structure) is written out verbatim to the disk: see ffs_sbput() in sys/ufs/ffs/ffs_subr.c. It contains a pointer to the fs_summary_info structure. This pointer value inadvertently causes garbage to be stored. It is garbage because the pointer to the fs_summary_info structure is the address of the then-current stack or heap. Although a mere pointer does not reveal anything useful (like a part of a private key) to an attacker, garbage output deteriorates reproducibility. This commit zeros out the pointer to the fs_summary_info structure before writing out the superblock. Reviewed by: kib Tested by: Peter Holm PR: 246983 Sponsored by: Netflix	2020-06-19 01:04:25 +00:00
Kirk McKusick	34816cb9ae	Move the pointers stored in the superblock into a separate fs_summary_info structure. This change was originally done by the CheriBSD project as they need larger pointers that do not fit in the existing superblock. This cleanup of the superblock eases the task of the commit that immediately follows this one. Suggested by: brooks Reviewed by: kib PR: 246983 Sponsored by: Netflix	2020-06-19 01:02:53 +00:00
Chuck Silvers	d9a8abf6c2	Move all of the functions in ffs_subr.c that are only used by the ufs kernel module from that file into ffs_vfsops.c. This fixes the build for kernel configs that don't include FFS. PR: 247256 Submitted by: glebius Reviewed by: mckusick (earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D25285	2020-06-17 23:39:52 +00:00
Rick Macklem	1f7104d720	Fix export_args ex_flags field so that it is 64 bits, the same as mnt_flags. Since mnt_flags was upgraded to 64 bits there has been a quirk in "struct export_args", since it holds a copy of mnt_flags in ex_flags, which is an "int" (32 bits). This happens to currently work, since all the flag bits used in ex_flags are defined in the low order 32 bits. However, new export flags cannot be defined. Also, ex_anon is a "struct xucred", which limits it to 16 additional groups. This patch revises "struct export_args" to make ex_flags 64 bits and replaces ex_anon with ex_uid, ex_ngroups and ex_groups (which points to a groups list, so it can be malloc'd up to NGROUPS in size). This requires that the VFS_CHECKEXP() arguments change, so I also modified the last "secflavors" argument to be an array pointer, so that the secflavors could be copied in VFS_CHECKEXP() while the export entry is locked. (Without this patch VFS_CHECKEXP() returns a pointer to the secflavors array and then it is used after being unlocked, which is potentially a problem if the exports entry is changed. In practice this does not occur when mountd is run with "-S", but I think it is worth fixing.) This patch also deleted the vfs_oexport_conv() function, since do_mount_update() does the conversion, as required by the old vfs_cmount() calls. Reviewed by: kib, freqlabs Relnotes: yes Differential Revision: https://reviews.freebsd.org/D25088	2020-06-14 00:10:18 +00:00
Kirk McKusick	513274c79c	Clear the IN_SIZEMOD and IN_IBLKDATA flags only when doing a synchronous inode update. The IN_SIZEMOD and IN_IBLKDATA flags indicate changes to the file size and block pointer fields in the inode. When these fields have been changed, the fsync() and fdatasync() system calls must write the inode to ensure their semantics that the file is on stable storage. The IN_SIZEMOD and IN_IBLKDATA flags cannot be cleared until a synchronous write of the inode is done. If they are cleared on an asynchronous write, then the inode may not yet have been written to the disk when an fsync() or fdatasync() call is done. Absent these flags, these calls would not know that they needed to write the inode. Thus, these flags can only be cleared on synchronous writes of the inode. Since the inode will be locked for the duration of the I/O that writes it to disk, no fsync() or fdatasync() will be able to run before the on-disk inode is complete. Reviewed by: kib MFC with: -r361785 Differential revision: https://reviews.freebsd.org/D25072	2020-06-06 20:17:56 +00:00
Kirk McKusick	52488b5148	Further evaluation of the POSIX spec for fdatasync() shows that it requires that new data on growing files be accessible. Thus, the fdatasync() system call must update the on-disk inode when the size of the file has changed. This commit adds another inode update flag, IN_SIZEMOD, that gets set any time that the file size changes. If either the IN_IBLKDATA or the IN_SIZEMOD flag is set when fdatasync() is called, the associated inode is synchronously written to disk. We could have overloaded the IN_IBLKDATA flag to also track size changes since the only (current) use case for these flags is fdatasync(), but it does seem useful for possible future uses to separately track the file size changes and the inode block pointer changes. Reviewed by: kib MFC with: -r361785 Differential revision: https://reviews.freebsd.org/D25072	2020-06-05 01:00:55 +00:00
Stefan Eßer	23e84cf153	Fix obvious typo: IN_BLKDATA should be IN_IBLKDATA	2020-06-04 19:54:25 +00:00
Kirk McKusick	30296c428a	Two additional places that need to identify IN_IBLKDATA. Reviewed by: kib MFC with: -r361785 Differential Revision: https://reviews.freebsd.org/D25072	2020-06-04 18:35:21 +00:00
Konstantin Belousov	7428630b75	UFS: write inode block for fdatasync(2) if pointers in inode were allocated The fdatasync() description in POSIX specifies that "all I/O operations shall be completed as defined for synchronized I/O data integrity completion", and then the explanation of Synchronized I/O Data Integrity Completion says: "The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred." For UFS this means that all pointers must be on disk. Indirect pointers already contribute to the list of dirty data blocks, so only direct blocks and root pointers to indirect blocks, both of which reside in the inode block, should be taken care of. In ffs_balloc(), mark the inode with the new flag IN_IBLKDATA that specifies that ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the inode block. Reviewed by: mckusick Discussed with: tmunro Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25072	2020-06-04 12:23:15 +00:00
Chuck Silvers	d79ff54b5c	This commit enables a UFS filesystem to do a forcible unmount when the underlying media fails or becomes inaccessible, for example when a USB flash memory card hosting a UFS filesystem is unplugged. The strategy for handling disk I/O errors when soft updates are enabled is to stop writing to the disk of the affected file system but continue to accept I/O requests and report that all future writes by the file system to that disk actually succeed. Then initiate an asynchronous forced unmount of the affected file system. There are two cases for disk I/O errors: - ENXIO, which means that this disk is gone and the lower layers of the storage stack already guarantee that no future I/O to this disk will succeed. - EIO (or most other errors), which means that this particular I/O request has failed but subsequent I/O requests to this disk might still succeed. For ENXIO, we can just clear the error and continue, because we know that the file system cannot affect the on-disk state after we see this error. For EIO or other errors, we arrange for the geom_vfs layer to reject all future I/O requests with ENXIO just like is done when the geom_vfs is orphaned. In both cases, the file system code can just clear the error and proceed with the forcible unmount. This new treatment of I/O errors is needed for writes of any buffer that is involved in a dependency. Most dependencies are described by a structure attached to the buffer's b_dep field. But some are created and processed as a result of the completion of the dependencies attached to the buffer. Clearing of some dependencies requires a read. For example if there is a dependency that requires an inode to be written, the disk block containing that inode must be read, the updated inode copied into place in that buffer, and the buffer then written back to disk. Often the needed buffer is already in memory and can be used.
But if it needs to be read from the disk, the read will fail, so we fabricate a buffer full of zeroes and pretend that the read succeeded. This zero'ed buffer can be updated and written back to disk. The only case where a buffer full of zeros causes the code to do the wrong thing is when reading an inode buffer containing an inode that still has an inode dependency in memory that will reinitialize the effective link count (i_effnlink) based on the actual link count (i_nlink) that we read. To handle this case we now store the i_nlink value that we wrote in the inode dependency so that it can be restored into the zero'ed buffer thus keeping the tracking of the inode link count consistent. Because applications depend on knowing when an attempt to write their data to stable storage has failed, the fsync(2) and msync(2) system calls need to return errors if data fails to be written to stable storage. So these operations return ENXIO for every call made on files in a file system where we have otherwise been ignoring I/O errors. Co-authored by: mckusick Reviewed by: kib Tested by: Peter Holm Approved by: mckusick (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24088	2020-05-25 23:47:31 +00:00
John Baldwin	71d11ee322	Update name of description of vfs.ffs.setsize in comment. Previously it used the name 'adjsize' instead of 'setsize'.	2020-05-22 17:23:43 +00:00

2485 Commits