freebsd-skq

Author	SHA1	Message	Date
kib	0cda26ac73	Protect v_rdev dereference with the vnode interlock instead of the vnode lock. Caller of softdep_count_dependencies() may own a buffer lock, which might conflict with the lock order. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 10 days	2017-08-25 09:51:22 +00:00
kib	3149ed68c4	Avoid dereferencing potentially freed workitem in softdep_count_dependencies(). Buffer's b_dep list is protected by the SU mount lock. Owning the buffer lock is not enough to guarantee the stability of the list. Calculation of the UFS mount owning the workitems from the buffer must be much more careful to not dereference the work item which might be freed meantime. To get to ump, use the pointers chain which does not involve workitems at all. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-08-21 16:23:44 +00:00
kib	77de7ac78a	Style. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 3 days	2017-08-21 16:16:02 +00:00
mckusick	7898ca2150	Since the switch to GPT disk labels, fsck for UFS/FFS has been unable to automatically find alternate superblocks. This checkin places the information needed to find alternate superblocks to the end of the area reserved for the boot block. Filesystems created with a newfs of this vintage or later will create the recovery information. If you have a filesystem created prior to this change and wish to have a recovery block created for your filesystem, you can do so by running fsck in forground mode (i.e., do not use the -p or -y options). As it starts, fsck will ask ``SAVE DATA TO FIND ALTERNATE SUPERBLOCKS'' to which you should answer yes. Discussed with: kib, imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11589	2017-08-09 05:17:21 +00:00
mckusick	940856bc57	Avoid reading a snapshot block when it is already in the cache. Update the use of the B_CACHE flag (since the May 1999 commit that made it the correct test here). Reported by: Andreas Longwitz <longwitz@incore.de> Reviewed by: kib Tested by: Peter Holm MFC after: 1 week	2017-07-31 20:41:45 +00:00
kib	24abfe4267	Improve publication of the newly allocated snapdata. For freshly allocated snapdata, Lock sn_lock in advance, so si_snapdata readers see the locked snapdata and not race. For existing snapdata, if the thread was put to sleep waiting for sn_lock, re-read si_snapdata. This either closes the race or makes the reliance on LK_DRAIN less important. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-07-21 18:42:35 +00:00
kib	4047279466	Unlock correct lock in ffs_snapblkfree(). It is possible for ffs_snapblkfree() to race and lock snaplock while the devvp snapdata is instantiated, but no snapshots exist. In this case the loop over snapshots in ffs_snapblkfree() is not executed, and the local variable vp is left initialized to NULL. Unlock using &sn->sn_lock and not vp->v_vnlock. For the inodes on the snapshot list, the locks are same. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-07-21 18:36:17 +00:00
kib	9a191031a0	Account for lock recursion when transfering snaplock to the vnode lock in ffs_snapremove(). Apparently ffs_snapremove() may be called with the snap lock recursed, at least one trace demonstrated this when snapshot vnode was unlinked while synced. It was inactivated from the syncer thread. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-07-21 18:28:27 +00:00
kib	cce23ef417	Remove write-only variable. Tested by: pho Sponsored by: The FreeBSD Foundation	2017-07-16 07:12:04 +00:00
kib	3daaee64c3	A followup to r320453, correct removal of the blocks from UFS snapshots. Tested by: pho PR: 220693 Sponsored by: The FreeBSD Foundation	2017-07-16 07:11:29 +00:00
jhb	b4c54ab330	Consistently use vop_stdpathconf() for default pathconf values. Update filesystems not currently using vop_stdpathconf() in pathconf VOPs to use vop_stdpathconf() for any configuration variables that do not have filesystem-specific values. vop_stdpathconf() is used for variables that have system-wide settings as well as providing default values for some values based on system limits. Filesystems can still explicitly override individual settings. PR: 219851 Reported by: cem Reviewed by: cem, kib, ngie MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D11541	2017-07-11 21:55:20 +00:00
mckusick	6d2611016c	Create a new function ffs_getcg() to read in and verify a cylinder group. Change all code points that open-coded this functionality to use the new function. This commit is a refactoring with no change in functionality. In the future this change allows more robust checking of cylinder group reads along the lines discussed in the hardening UFS session at BSDCan (retry I/O, add checksums, etc). For more detail see the session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS Reviewed by: kib	2017-06-28 17:32:09 +00:00
kib	d8402b7b53	Mitigate several problems with the softdep_request_cleanup() on busy host. Problems start appearing when there are several threads all doing operations on a UFS volume and the SU workqueue needs a cleanup. It is possible that each thread calling softdep_request_cleanup() owns the lock for some dirty vnode (e.g. all of them are executing mkdir(2), mknod(2), creat(2) etc) and all vnodes which must be flushed are locked by corresponding thread. Then, we get all the threads simultaneously entering softdep_request_cleanup(). There are two problems: - Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due to the locking, they quickly start executing 'in phase' with the speed of the slowest thread. - Since each thread already owns the lock for a dirty vnode, other threads non-blocking attempt to lock the vnode owned by other thread fail, and loops executing without making the progress. Retry logic does not allow the situation to recover. The result is a livelock. Fix these problems by making the following changes: - Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp. A new flag FLUSH_RC_ACTIVE guards the loop. - If there were failed locking attempts during the loop, abort retry even if there are still work items on the mp work list. An assumption is that the items will be cleaned when other thread either fsyncs its vnode, or unlock and allow yet another thread to make the progress. It is possible now that some calls would get undeserved ENOSPC from ffs_alloc(), because the cleanup is not aggressive enough. But I do not see how can we reliably clean up workitems if calling softdep_request_cleanup() while still owning the vnode lock. I thought about scheme where ffs_alloc() returns ERESTART and saves the retry counter somewhere in struct thread, to return to the top level, unlock the vnode and retry. But IMO the very rare (and unproven) spurious ENOSPC is not worth the complications. Reported and tested by: pho Style and comments by: mckusick Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-06-03 16:18:50 +00:00
kib	98a2e55649	Clean possible td_su reference on the struct mount being unmounted as the last step of ffs_unmount(). It is possible that the mount point is recorded for cleanup in AST context while softdep flush is executed during unmount. The workitems are flushed by other means for the unmount, but the stray reference to struct mount blocks destruction of mount. Check for the situation and manually call vfs_rel() before returning from ffs_unmount(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-03 14:15:14 +00:00
kib	a0ed3ee5d2	Remove spl() calls from UFS code. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-05-07 14:59:45 +00:00
emaste	1d5c0f311c	UFS fs.h: clear warning from use in makefs(1) makefs(1) has a number of signedness warnings (when built with higher WARNS), most of which can be addressed by careful application of casts in makefs itself. There is one case where a signedness warning arises from the blksize macro, so must be addressed in the macro itself. Reviewed by: kib, mckusick MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10589	2017-05-05 15:26:55 +00:00
glebius	5763443023	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
cem	1860c3b5c0	ufs: Export UFS_MAXNAMLEN to pathconf, statfs Rather than the global NAME_MAX constant. This change is required to support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN. This was missed in r313475 due to the alternative spelling ("NAME_MAX") of MAXNAMLEN. This change is also similar in spirit to r313780. Reported by: ngie@ Sponsored by: Dell EMC Isilon	2017-04-05 01:44:03 +00:00
imp	7e6cabd06e	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
kib	9973ae075e	Do not leak mount references for dying threads. Thread might create a condition for delayed SU cleanup, which creates a reference to the mount point in td_su, but exit without returning through userret(), e.g. when terminating due to single-threading or process exit. In this case, td_su reference is not dropped and mount point cannot be freed. Handle the situation by clearing td_su also in the thread destructor and in exit1(). softdep_ast_cleanup() has to receive the thread as argument, since e.g. thread destructor is executed in different context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-25 10:38:18 +00:00
emaste	8e79b56e85	prefix UFS symbols with UFS_ to reduce namespace pollution Specifically: ROOTINO -> UFS_ROOTINO WINO -> UFS_WINO NXADDR -> UFS_NXADDR NDADDR -> UFS_NDADDR NIADDR -> UFS_NIADDR MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency) Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_ Reviewed by: kib, mckusick Obtained from: NetBSD MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9536	2017-02-15 19:50:26 +00:00
cem	d21c7f090e	ufs: Use UFS_MAXNAMLEN constant (like NFS, EXT2FS, SVR4, IBCS2) instead of redefining the MAXNAMLEN constant. No functional change. Reviewed by: kib@, markj@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9500	2017-02-09 17:47:01 +00:00
cem	c3bd748d53	ffs_vnops: Simplify extattr access As suggested in r167010, use the structure type and macros to access and modify UFS2 extended attributes. Add assertions that pointers are aligned in places where we now access the data through a structure pointer, instead of character-by-character. PR: 216127 Reported by: dewayne at heuristicsystems.com.au Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9225	2017-01-19 16:46:05 +00:00
cem	63708c4468	restore(8): Handle extended attribute names correctly UFS2 extended attribute names are not NUL-terminated. Handle appropriately. Correct the EXTATTR_BASE_LENGTH() macro, which handled ea_namelength == one (mod eight) extended attributes incorrectly. PR: 216127 Reported by: dewayne at heuristicsystems.com.au Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9208	2017-01-18 18:16:57 +00:00
cem	4d542f9fe8	ufs/extattr.h: Fix documentation of ea_name termination The ea_name string is not nul-terminated. Correct the documentation. Because the subsequent field is padded to 8 bytes, and the padding is zeroed, the ea_name string will appear to be nul-terminated whenever the length isn't exactly one (mod eight). This was introduced in r167010 (2007). Additionally, mark the length fields as unsigned. This particularly matters for the single byte ea_namelength field, which can represent extended attribute names up to 255 bytes long. No functional change. PR: 216127 Reported by: dewayne at heuristicsystems.com.au Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9206	2017-01-18 17:55:49 +00:00
kib	7c67dd5f60	Use type-independent formats for printing nlink_t and ino_t. Extracted from: ino64 work by gleb, mckusick Discussed with: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-06 16:59:33 +00:00
markj	4159d33f6b	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
kib	aef0061353	Provide simple mutual exclusion between mount point update and unmount. Currently mount update keeps vfs_busy(9) reference on the mount point during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the exclusion, but is problematic for filesystems which need to perform namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh mnt_from path, because namei(9) must not be called while the vfs_busy(9) reference is owned. Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing syscalls with EBUSY if conflict is detected. Keep vfs_busy(9) reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS KPI. In the update path in ffs_mount(), drop vfs_busy() reference around namei(), which is now safe due to unmount never executing in parallel with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-11-13 21:49:51 +00:00
emaste	242aa658d8	ANSIfy ffs_subr.c Also renumber license clause to avoid skipping #3	2016-10-31 20:43:43 +00:00
mckusick	7e51aa668a	Avoid possible overflow when calclating malloc size for auxillary data structure sizes when mounting and reloading UFS/FFS filesystems by using a u_long rather than an int for the size. Reported by: Mariusz Zaborski <oshogbo@> MFC after: 1 week	2016-10-28 20:15:19 +00:00
kib	1005ab8477	Generalize UFS buffer pager to allow it serving other filesystems which also use buffer cache. Most important addition to the code is the handling of filesystems where the block size is less than the machine page size, which might require reading several buffers to validate single page. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-10-28 11:43:59 +00:00
mckusick	ae1163bd55	The UFS/FFS filesystem checks directory link counts when doing directory create and delete operations. If it ever finds a directory with a link count less than 2, it panics. Thus, an rm -rf that encounters a directory with a link count below 2 causes a kernel panic. The proposed fix is to return the error EINVAL rather than panicing. The effect is that the requested operation is not done, but the system continues to run. At a more convenient later time, the filesystem can be unmounted and cleaned (with fsck or journal run). Once cleaned, the operation can be rerun to successful completion. This fix takes that approach. The panic message has been converted into a uprintf(9) to provide the user with the inode number and filesystem mount point of the offending directory and EINVAL is returned for the operation. The long (three year) delay in fixing this problem occurred because the bug was misclassified when originally assigned and only this week was found during a sweep of old unresolved bug reports. PR: 180894 Reviewed by: kib MFC after: 2 weeks	2016-10-26 20:28:23 +00:00
marcel	03c2e26593	Include <sys/types.h> explicitly instead of depending on that header being included by <sys/param.h>. When compiled as part of makefs(8) and on macOS or Linux, <sys/param.h> is not our own.	2016-10-24 18:12:57 +00:00
kib	23dcf48f60	Add FFS pager, which uses buffer cache read operation to validate pages. See the comments for more detailed description of the algorithm. The pager is used unconditionally when the block size of the underlying device is larger than the machine page size, since local vnode pager cannot handle the configuration [1]. Otherwise, the vfs.ffs.use_buf_pager sysctl allows to switch to the local pager. Measurements demonstrated no regression in the ever-important buildworld benchmark, and small (~5%) throughput improvements in the special microbenchmark configuration for dbench over swap-backed md(4). Code can be generalized and reused for other filesystems which use buffer cache. Reported by: Anton Yuzhaninov <citrin@citrin.ru> [1] Tested by: pho Benchmarked by: mjg, pho Reviewed by: alc, markj, mckusick (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8198	2016-10-19 11:09:29 +00:00
mjg	6a50fe29a5	vfs: remove the __bo_vnode field from struct vnode The pointer can be obtained using __containerof instead. Reviewed by: kib	2016-09-30 17:11:03 +00:00
kib	800b88629e	Be more strict when selecting between snapshot/regular mount. Reclaimed vnode type is VBAD, so succesful comparision like devvp->v_type != VREG does not imply that the devvp references snapshot, it might be due to a reclaimed vnode. Explicitely check the vnode type. In the the most important case of ffs_blkfree(), the devfs vnode is locked and its type is stable. In other cases, if the vnode is reclaimed right after the check, hopefully the buffer methods return right error codes. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-19 15:58:33 +00:00
kib	ce2b0686e5	Fix libprocstat build after r305902. - Use _Bool to not require userspace to include stdbool.h. - Make extattr.h usable without vnode_if.h. - Follow i_ump to get cdev pointer. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-17 18:14:31 +00:00
kib	20f1e8ac63	Reduce size of ufs inode. Remove redunand i_dev and i_fs pointers, which are available as ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to reduce padding. To compensate added derefences, the most often i_ump access to differentiate between UFS1 and UFS2 dinode layout is removed, by addition of the new i_flag IN_UFS2. Overall, this actually reduces the amount of memory dereferences. On 64bit machine, original struct inode size is 176, reduced to 152 bytes with the change. Tested by: pho (previous version) Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-17 16:47:34 +00:00
bde	4000d6ac7e	Sprinkle DOINGASYNC() checks so as to do delayed writes for async mounts in almost all cases instead of in most cases. Don't override DOINGASYNC() by any condition except IO_SYNC. Fix previous sprinking of DOINGASYNC() checks. Don't override IO_SYNC by DOINGASYNC(). In ffs_write() and ffs_extwrite(), there were intentional overrides that just broke O_SYNC of data. In ffs_truncate(), there are 5 calls to ffs_update(), 4 with apparently-unintentional overrides and 1 without; this had no effect due to the main async mount hack descibed below. Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden for the soft updates case too (to do a delayed write instead of a sync write). This is supposed to be the only change that affects anything except async mounts. In ffs_update(), remove the 19 year old efficiency hack of ignoring the waitfor flag for async mounts, so that fsync() almost works for async mounts. All callers are supposed to be fixed to not ask for a sync update unless they are for fsync() or [I]O_SYNC operations. fsync() now almost works for async mounts. It used to sync the data but not the most important metdata (the inode). It still doesn't sync associated directories. This gave 10-20% fewer writes for my makeworld benchmark with async mounted tmp and obj directories from an already small number. Style fixes: - in ffs_balloc.c, remove rotted quadruplicated comments about the simplest part of the DOING() decisions and rearrange the nearly- quadruplicated code to be more nearly so. - in ufs_vnops.c, use a consistent style with less negative logic and no manual "optimization" of \|\| to \| in DOING() expressions. Reviewed by: kib (previous version)	2016-09-08 17:40:40 +00:00
kib	8d2c7fdc8d	On rename, do not perform truncation of dirhash if the vnode truncation failed. Doing so resulted in inconsistent state of the ufs dirhash with regard to the actual directory inode state, and could lead to spurious ENOENT errors for lookups of existing files in production kernels, or assertion failures in the debugging kernels. Change the logic of calling ufsdirhash_dirtrunc() to be same as in ufs_direnter(). Execute UFS_TRUNCATE() first, log error, and only do dirtrunc() if UFS_TRUNCATE() succeeded. Note that the problem was exacerbated by the bug in the flush_newblk_dep() function (see r305599), which caused in the spurios errors from ffs_sync() and then ffs_truncate(). In collaboration with: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:09:34 +00:00
kib	03f5910da5	Do not leak transient ENOLCK error from flush_newblk_dep() loop. The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error from the failed attempt is irrelevant. But since there is path after retry which does not clear error, it is possible to return spurious error from the function. The issue resulted in a spurious failure of softdep_sync_buf(), causing further spurious failure of ffs_sync(). In collaboration with: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:08:54 +00:00
kib	835bb1130c	When logging unlikely UFS_TRUNCATE() failure in ufs_direnter(), include error code. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:08:08 +00:00
kib	666d76bef3	When externding directory inode in ufs_direnter(), adjust i_endoff. This change is formally not needed, since i_endoff not used in all code paths after the call to ufs_direnter(), and i_endoff is recalculated by the next lookup. But having the value correct makes the reasoning about code simpler. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:07:25 +00:00
kib	dc190244af	In dqsync(), when called from quotactl(), um_quotas entry might appear cleared since nothing prevents completion of the parallel quotaoff. There is nothing to sync in this case, and no reason to panic. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:06:43 +00:00
kib	c8c9eee1f7	In softdep_prealloc(), return early not only for snapshots, but for the quota files as well. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:05:13 +00:00
kib	a483a2c0bd	There is no need to upgrade the last dvp lock on lookups for modifying operations. Instead of upgrading, assert that the lock is exclusive. Explain the cause in comments. This effectively reverts r209367. Tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:04:45 +00:00
kib	ded52104dc	Partially lift suspension when ffs_reload() finished with cgs and going to re-read inodes. Secondary write initiators, e.g. ufs_inactive(), might need to start a write while owning the vnode lock. Since the suspended state established by /dev/ufssuspend prevents them from entering vn_start_secondary_write(), we get deadlock otherwise. Note that it is arguably not very useful to re-read inodes after /dev/ufssuspend suspension, because the suspension does not block readers, and other threads might read existing files in parallel with suspension owner (for now, only growfs(8)) operations. This effectively means that suspension owner cannot safely modify existing inodes, and then there is no sense in re-reading. But keep the code enabled for now. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:01:28 +00:00
imp	a88b45a2a8	Renumber the advertising clause.	2016-09-06 15:17:35 +00:00
mckusick	d1132818f9	Bug 211013 reports that a write error to a UFS filesystem running with softupdates panics the kernel. The problem that has been pointed out is that when there is a transient write error on certain metadata blocks, specifically directory blocks (PAGEDEP), inode blocks (INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group (BMSAFEMAP, but only when journaling is enabled), we get a panic in one of the routines called by softdep_disk_io_initiation that the I/O is "already started" when we retry the write. These dependency types potentially need to do roll-backs when called by softdep_disk_io_initiation before doing a write and then a roll-forward when called by softdep_disk_write_complete after the I/O completes. The panic happens when there is a transient error. At the top of softdep_disk_write_complete we check to see if the write had an error and if an error occurred we just return. This return is correct most of the time because the main role of the routines called by softdep_disk_write_complete is to process the now-completed dependencies so that the next I/O steps can happen. But for the four types listed above, they do not get to do their rollback operations. This causes the panic when softdep_disk_io_initiation gets called on the second attempt to do the write and the roll-back routines find that the roll-backs have already been done. As an aside I note that there is also the problem that the buffer will have been unlocked and thus made visible to the filesystem and to user applications with the roll-backs in place. The way to resolve the problem is to add a flag to the routines called by softdep_disk_write_complete for the four dependency types noted that indicates whether the write was successful (WRITESUCCEEDED). If the write does not succeed, they do just the roll-backs and then return. If the write was successful they also do their usual processing of the now-completed dependencies. The fix was tested by selectively injecting write errors for buffers holding dependencies of each of the four types noted above and then verifying that the kernel no longer paniced and that following the successful retry of the write that the filesystem could be unmounted and successfully checked cleanly. PR: 211013 Reviewed by: kib	2016-08-16 21:02:30 +00:00
kib	73de606a86	In UFS_BALLOC(), invalidate pages of indirect buffers on failed block allocation unwinding. Dandling buffers are released on UFS_BALLOC() failure to ensure that later attempt to allocate blocks in close range do not find the blocks with invalid content, since possible partial block allocations are unwound. As such, it is not enough to just release the buffers, the pages must also invalidated and removed from the vnode vm_object queue. Otherwise the pages might be found later and used to reconstruct indirect buffers when doing allocations at offset close to the failure point, and their stale content compromise the filesystem integrity. Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE is required. To be sure, clear the B_CACHE flag as well. This complements the r174973, which started releasing buffers. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:30:58 +00:00
kib	fd0eaea7aa	On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert that recorded allocated blocks numbers match the physical block numbers of dandling buffers which are released. When finally freeing the blocks during unwind, assert that dandling buffers where not re-allocated. They shouldn't, because the vnode lock is owned exclusive. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:18:38 +00:00
kib	e037bf91c2	When looking up dandling buffers for unwing after failing block allocation in UFS_BALLOC(), there is no need to map them. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:05:15 +00:00
kib	4e01efcc0c	When block allocation fails in UFS_BALLOC(), and the volume does not have SU enabled, there is no point in calling softdep_request_cleanup(). The call cannot produce free blocks, but we unecessarily lock ufsmount and do inode block write. Usual point of not doing optimizations for the corner case of the full volume is not applicable there, the work is easily avoidable, and the addition CPU and disk io load do not lead to succeeding retry. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 16:50:48 +00:00
kib	03c6968a37	In ffs_balloc_ufs{1,2} routines, assert that unwind records do not overflow local arrays. This is not immediately obvious from the static code inspection, due to retry logic. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 16:49:56 +00:00
kib	20b89ec291	Implement VOP_FDATASYNC() for UFS. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7471	2016-08-15 19:22:23 +00:00
trasz	255ed885fa	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
kib	9009fc003e	Ensure that the UFS directory vnode' vm_object is properly sized before UFS_BALLOC() is called. I do not believe that this caused any real issue on FreeBSD because the exclusive vnode lock is held over the balloc/resize, the change is to make formally correct KPI use. Based on: the Matthew Dillon' patch from DragonFly BSD PR: 93942 Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-07-20 14:40:56 +00:00
kevlo	1006f009c6	arc4random() returns 0 to (2**32)−1, use an alternative to initialize i_gen if it's zero rather than a divide by 2. With inputs from delphij, mckusick, rmacklem Reviewed by: mckusick	2016-05-22 14:31:20 +00:00
kib	e006a8a1fd	Stop dropping and reacquiring Giant around geom calls in UFS. Sponsored by: The FreeBSD Foundation	2016-05-21 10:13:25 +00:00
kib	93315c1cff	Improve handling of rdev->si_mountpt on mount and unmount of FFS volumes. Treat the field as a semaphore protecting availability of the device for mounting. Do no access devvp->v_rdev without the vnode lock owned. Protect change of the devvp->v_bufobj bo_ops vector with the vnode lock. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-21 09:49:35 +00:00
kib	54336eef75	Ensure that ftruncate(2) is performed synchronously when file is opened in O_SYNC mode, at least for UFS. This also handles truncation, done due to the O_SYNC \| O_TRUNC flags combination to open(2), in synchronous way. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-18 12:03:57 +00:00
kib	393eee329b	Do enable io accounting for read-only mounts and mounts which are remounted to writeable after initial read-only. Assign to dev->si_mountpt earlier to account the accesses done at the mount time. Based on submission by: bde MFC after: 1 week	2016-05-17 21:35:35 +00:00
kib	6c68797acf	If IO_SYNC was passed to ffs_truncate(), request synchronous inode update from the final ffs_update(). Noted by: bde MFC after: 1 week	2016-05-17 21:30:58 +00:00
kib	b27a7ca7b1	For async UFS mounts, shrink the directory asynchronously, at least do not pass IO_SYNC to ffs_truncate() unneccessary. Submitted by: bde MFC after: 1 week	2016-05-17 21:28:28 +00:00
kib	e1e51d237c	Fix comments. Submitted by: bde MFC after: 1 week	2016-05-17 08:24:27 +00:00
kib	67809d96fd	Fix typo in the message. Submitted by: bde MFC after: 1 week	2016-05-17 08:19:20 +00:00
pfg	6ac8bdb320	UFS: spelling fixes on comments. No functional change.	2016-04-29 20:43:51 +00:00
ngie	35bd8367fe	Add FEATURE knob for testing for UFS extended attribute kernel support Support can be verified via `feature_present("ufs_extattr")`, etc. Differential Revision: https://reviews.freebsd.org/D6053 MFC after: 2 weeks Relnotes: yes Reviewed by: asomers, kib Sponsored by: EMC / Isilon Storage Division	2016-04-22 08:09:27 +00:00
pfg	729533413f	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
pfg	858f336801	ufs: replace 0 with NULL for pointers. While here also do late initialization of the variables we are changing. Found with devel/coccinelle. Reviewed by: mckusick MFC after: 2 weeks	2016-04-10 21:48:11 +00:00
trasz	825d80e01c	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
trasz	ca92bb3067	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-29 13:56:59 +00:00
kib	dae532b324	Split the global taskqueue used to process all UFS trim completions, into per-mount taskqueue with the private taskqueue processing thread. This allows to drain the taskqueue on unmount, to ensure that all TRIMs are finished before mount structures are freed. But just draining the taskqueue where TRIM biodone geom-up completions are processed is not enough, since ffs_blkfree(), called by the task, might result in more writes. Count inflight delayed blkfree's and pause() unmount until the counter drains as well. Reported by: Nick Evans <nevans@talkpoint.com> Tested by: Nick Evans <nevans@talkpoint.com>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-27 08:21:17 +00:00
kib	6fb3ce9534	Style: wrap long lines. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-27 08:07:12 +00:00
kib	e8766307c3	Fix locking mistake in softdep_waitidle(). The surrounding code expects that the loop is always exited with the SU lock owned, even on error. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-03-23 09:58:51 +00:00
mckusick	6a020d6c89	The UFS filesystem requires that the last block of a file always be allocated. When shortening the length of a file in which the new end of the file contains a hole, the hole must have a block allocated. Reported by: Maxim Sobolev Reviewed by: kib Tested by: Peter Holm	2016-02-24 01:58:40 +00:00
trasz	0fe7e27aea	Remove ffs_mountroot() prototype; seems to be long gone. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-01-28 12:21:23 +00:00
mckusick	32cb8b928a	This fixes a bug in UFS2 exported NFS volumes. An NFS client can crash a server that has exported UFS2 by presenting a filehandle with an inode number that references an uninitialized inode in a cylinder group. The problem is that UFS2 only initializes blocks of inodes as they are first allocated and ffs_fhtovp() does not validate that the inode is in a range of inodes that have been initialized. Attempting to read an uninitialized inode gets random data from the disk. When the kernel tries to interpret it as an inode, panics often arise. Reported by: Christos Zoulas (from NetBSD) Reviewed by: kib	2016-01-27 21:27:05 +00:00
mckusick	0b10a802f8	The bread() function was inconsistent about whether it would return a buffer pointer in the event of an error (for some errors it would return a buffer pointer and for other errors it would not return a buffer pointer). The cluster_read() function was similarly inconsistent. Clients of these functions were inconsistent in handling errors. Some would assume that no buffer was returned after an error and would thus lose buffers under certain error conditions. Others would assume that brelse() should always be called after an error and would thus panic the system under certain error conditions. To correct both of these problems with minimal code churn, bread() and cluster_write() now always free the buffer when returning an error thus ensuring that buffers will never be lost. The brelse() routine checks for being passed a NULL buffer pointer and silently returns to avoid panics. Thus both approaches to handling error returns from bread() and cluster_read() will work correctly. Future code should be written assuming that bread() and cluster_read() will never return a buffer with an error, so should not attempt to brelse() the buffer when an error is returned. Reviewed by: kib	2016-01-27 21:23:01 +00:00
kib	70adc1e216	Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the ast was rescheduled during VFS_SYNC(). It is possible that enough parallel writes or slow/hung volume result in VFS_SYNC() deferring to the ast flushing of workqueue. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-12-21 11:50:32 +00:00
kib	29c1a1655d	Update ctime when atime or birthtime are updated. Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or IN_UPDATE, then clear them with ufs_itimes(), making transient (possibly inconsistent) change to the times, and then copy user-supplied times into the inode. Instead, directly clear IN_ACCESS or IN_UPDATE when user supplied the time, and copy the value into the inode. Minor inconsistency left is that the inode ctime is updated even when birthtime update attempt is performed on a UFS1 volume. Submitted by: bde MFC after: 2 weeks	2015-12-07 12:09:04 +00:00
mckusick	cb4ab786a1	For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:01:02 +00:00
kib	e768957c56	Do not perform read-ahead for BA_CLRBUF request when we are low on memory or when dirty buffer queue is too large. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-10-27 13:44:13 +00:00
imp	fb9846240a	Do not relocate extents to make them contiguous if the underlying drive can do deletions. Ability to do deletions is a strong indication that this optimization will not help performance. It will only generate extra write traffic. These devices are typically flash based and have a limited number of write cycles. In addition, making the file contiguous in LBA space doesn't improve the access times from flash devices because they have no seek time. Reviewed by: mckusick@	2015-10-16 03:06:02 +00:00
glebius	a2200eb70b	In softdep_setup_freeblocks(): - Move the bread() to the beginning of function. - Return if it fails, otherwise we will panic. Submitted by: mckusick Sponsored by: Netflix	2015-10-07 12:36:28 +00:00
kib	9a0916a6b3	Do not consume extra reference. This is a bug in r287479. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 12:28:18 +00:00
kib	4bfbbf8647	Declare the writes around the call to VFS_SYNC() in softdep_ast_cleanup_proc(). Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 08:48:24 +00:00
kib	d285ab0209	By doing file extension fast, it is possible to create excess supply of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and D_ALLOCINDIR), which can exhaust kmem. Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and D_DIRREM, by scheduling ast to flush dependencies, after the thread, which created new dep, left the VFS/FFS innards. For D_NEWBLK, the only way to get rid of them is to do full sync, since items are attached to data blocks of arbitrary vnodes. The check for D_NEWBLK excess in softdep_ast_cleanup_proc() is unlocked. For 32bit arches, reduce the total amount of allowed dependencies by two. It could be considered increasing the limit for 64 bit platforms with direct maps. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:07:27 +00:00
jeff	44267026a0	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has lead me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
jeff	3fb666cfae	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
mjg	c71e9ab863	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib	2015-07-11 16:19:11 +00:00
markj	d19ba3f89d	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
markm	d586165577	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. * src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'staic struct start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
kib	f07d3d4559	Simplify code, no need to test the flag before clearing it. Submitted by: ed MFC after: 12 days	2015-06-29 13:06:24 +00:00
kib	9f65a2d8d9	Handle errors from background write of the cylinder group blocks. First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR. Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps. We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan. In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks	2015-06-27 09:44:14 +00:00
kib	becc575eec	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
mjg	1a3e7a935e	Replace struct filedesc argument in getvnode with struct thread This is is a step towards removal of spurious arguments.	2015-06-16 13:09:18 +00:00
kib	5ad0e0e2f3	Syncing a directory vnode might drop the vnode lock in the softdep_sync() similarly to the regular vnode sync. Allow retry for both vnode types. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-06-03 20:48:00 +00:00
kib	63f2992877	Remove unused variable. When deallocate_dependencies() is performed, softdep_journal_freeblocks() already called cancel_allocdirect() which should have eliminated direct dependencies for all truncated full blocks. The indirect dependencies are allowed above, since second- and third-level dependencies are only dealt with by the code which frees indirect block, which happens after the inode write. Discussed with: mckusick, jeff Reviewed by: jeff Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-31 15:50:54 +00:00
kib	e2f56205b5	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
kib	b5e9f683a3	After r283600, NODELAY flag to inodedep_lookup() function is unused. Eliminate it, and simplify code by removing the local dflags variable always initialized to DEPALLOC. Noted by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:49:04 +00:00
kib	ff588ae9b0	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
mckusick	d1dcdd191e	Limit the number of cylinder groups that will be searched when trying to build a cluster. The limit is tunable using the sysctl vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups per block allocation. It was previously limited to the number of cylinder groups in the filesystem per block allocation. When there were no clusters of the needed size left, it repeatedly searched the whole filesystem for a non-existent cluster on every block allocation. The result was very slow filesystem allocation with 100% CPU utilization. The old behavior can be had by setting vfs.ffs.maxclustersearch to a huge number (1,000,000). This change affects only the layout policy routines so is not able to interfere with the integrity of the filesystem. Reported by: Dmitry Sivachenko (demon@) Tested by: Dmitry Sivachenko (demon@) MFC after: 2 weeks	2015-04-24 23:27:50 +00:00
rmacklem	ad77d0b1c1	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
kib	92897c6ace	Fix build (with gcc). Reported by: bz, ian Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 15:49:21 +00:00
kib	871e29c970	Fix the hand after the immediate reboot when the following command sequence is performed on UFS SU+J rootfs: cp -Rp /sbin/init /sbin/init.old mv -f /sbin/init.old /sbin/init Hang occurs on the rootfs unmount. There are two issues: 1. Removed init binary, which is still mapped, creates a reference to the removed vnode. The inodeblock for such vnode must have active inodedep, which is (eventually) linked through the unlinked list. This means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of softdep workitems for the mp is always > 0. FFS is suspended during unmount, so unmount just hangs. 2. As noted above, the inodedep is linked eventually. It is not linked until the superblock is written. But at the vfs_unmountall() time, when the rootfs is unmounted, the call is made to ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls ffs_sbupdate() after all workitems are flushed. It is masked for normal system operations, because syncer works in parallel and eventually flushes superblock. Syncer is stopped when rootfs unmounted, so ffs_sync() must do sb update on its own. Correct the issues listed above. For MNT_SUSPEND, count the number of linked unlinked inodedeps (this is not a typo) and substract the count of such workitems from the total. For the second issue, the ffs_sbupdate() is called right after device sync in ffs_sync() loop. There is third problem, occuring with both SU and SU+J. The softdep_waitidle() loop, which waits for softdep_flush() thread to clear the worklist, only waits 20ms max. It seems that the 1 tick, specified for msleep(9), was a typo. Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to significantly help the softdep thread, and change the MNT_LAZY update at the reboot time to MNT_WAIT for similar reasons. Note that userspace cannot create more work while devvp is flushed, since the mount point is always suspended before the call to softdep_waitidle() in unmount or remount path. PR: 195458 In collaboration with: gjb, pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 13:55:56 +00:00
kib	37c9c38900	Partially revert r277922, avoid sleeping and do flush if we a awaken, instead of waiting for the FLUSH_* flags. Also, when requesting flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was already set. Reported and tested by: dim, "Lundberg, Johannes" <johannes@brilliantservice.co.jp> Bisected by: dim MFC after: 2 weeks	2015-02-05 13:00:27 +00:00
kib	83723416a6	When mounting SU-enabled mount point, wait until the softdep_flush() thread started and incremented the stat_flush_threads [1]. Unconditionally wakeup softdep_flush threads when needed, do not try to check wchan, which is racy and breaks abstraction. Reported by and discussed with: glebius, neel Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-30 11:41:46 +00:00
kib	da0490b2e8	The sys_quotactl() contract demands that the mount point is vfs_unbusy()ed when the cmd is Q_QUOTAON, regardless of other input parameters or error return. Submitted by: Conrad Meyer Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D1684 Tested by: pho MFC after: 1 week	2015-01-27 10:32:49 +00:00
kib	4e541c8756	Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some fs, e.g. smbfs, already did it. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-21 13:29:33 +00:00
kib	77c9d3f4e8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
gleb	5c99f46b3b	Adjust printf format specifiers for dev_t and ino_t in kernel. ino_t and dev_t are about to become uint64_t. Reviewed by: kib, mckusick	2014-12-17 07:27:19 +00:00
glebius	b4ef8e602d	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
glebius	fb402568c4	buf.h is not needed here, and pollutes when ufsmount.h is included from userland code. Sponsored by: Nginx, Inc.	2014-11-23 01:02:19 +00:00
glebius	69a603c777	Include required files directly instead of pollution via ufs/ufsmount.h. Sponsored by: Nginx, Inc.	2014-11-23 01:01:14 +00:00
davide	64ef011694	Use the correct variable name.	2014-11-22 00:42:30 +00:00
davide	12cc45e8da	Make ufs_dirhashreclaimperc a percentage for real and rename it to ufs_dirhashreclaimpercent, as suggested by jhb@. As an added bonus this avoids divide-by-zero errors. Requested by: jhb, markj Reviewied by: jhb, markj	2014-11-22 00:37:37 +00:00
kib	a6f1fd882b	When non-forced unmount or remount rw->ro is performed, writes on UFS are not suspended. In particular, on the SU-enabled vulumes, there is no reason why, between the call to softdep_flushfiles() and softdep_waitidle(), SU work items cannot be queued. Correct the condition to trigger the panic by only checking when forced operation is done. Convert direct panic() call into KASSERT(), there is no invalid on-disk data structures directly involved, so follow the usual debugging vs. non-debugging approach. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-02 13:14:55 +00:00
mjg	12e0034dd0	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks	2014-10-20 18:00:50 +00:00
mjg	cdeccc2a52	Use lockless quota checks in qsync and qsyncvp. No strong objections from: kib, mckusick MFC after: 1 week	2014-10-16 12:41:14 +00:00
kib	1b90da9ab8	Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS survives remount in rw, also it is set for vnodes on rootfs before noatime can be set or clock is adjusted. All conditions result in wrong atime for accessed vnodes. Submitted by: bde MFC after: 1 week	2014-10-11 19:09:56 +00:00
imp	a2b4dd0675	Restore the backed-out change, using __offsetof instead.	2014-10-10 00:35:08 +00:00
bapt	05bd7a92d7	Backout r272825 every useland usage of ufs/ufs/dir.h are now broken with that change	2014-10-09 17:26:29 +00:00
bapt	c47600c4ec	Use offsetof() from sys/types.h instead of a custom one This fixes build with recent gcc versions	2014-10-09 15:26:22 +00:00
kib	ebd8a253bb	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
alc	e169525bb8	We don't need an exclusive object lock on the expected execution path through {ext2,ffs}_getpages(). Reviewed by: kib, pfg MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-13 18:26:13 +00:00
kib	ece271f9c9	Direct access to the quota files, in particular, lookup, causes lock conflict with the quota metadata access. Mark quota vnode lock as recursive and always exclusive to avoid the problem. Reported by: hrs Tested by: hrs, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-29 09:04:24 +00:00
davide	a8c92b02a8	Rather than using an hardcoded reclaim age, rely on an LRU-like approach for dirhash cache, setting a target percent to reclaim (exposed via SYSCTL). This allows to always make some amount of progress keeping the maximum reclaim age dynamic. Tested by: pho Reviewed by: jhb	2014-08-25 17:06:18 +00:00
kib	c3baa63a48	Do not busy the UFS mount point inside VOP_RENAME(). The kern_renameat() already starts write on the mp, which prevents parallel unmount from proceed. Busying mp after vn_start_write() deadlocks the unmount. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:15:23 +00:00
kib	37cbc4993d	Correct the test for condition to suspend UFS filesystem during unmount. There is no need to suspend read-only filesystem, while we need suspension on modificable mount point. Reported by: rwatson Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:13:03 +00:00
kib	c721b1c6ff	Revision r269457 removed the Giant around mount and unmount code, but r269533, which was tested before r269457 was committed, implicitely relied on the Giant to protect the manipulations of the softdepmounts list. Use softdep global lock consistently to guarantee the list structure now. Insert the new struct mount_softdeps into the softdepmounts only after it is sufficiently initialized, to prevent softdep_speedup() from accessing bare memory. Similarly, remove struct mount_softdeps for the unmounted filesystem from the tailq before destroying structure rwlock. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-12 09:33:00 +00:00
mckusick	8565976fe4	The SUJ journal is only prepared to handle full-size block numbers, so we have to adjust freeblk records to reflect the change to a full-size block. For example, suppose we have a block made up of fragments 8-15 and want to free its last two fragments. We are given a request that says: FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0 where frags are the number of fragments to free and oldfrags are the number of fragments to keep. To block align it, we have to change it to have a valid full-size blkno, so it becomes: FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6 Submitted by: Mikihito Takehara Tested by: Mikihito Takehara Reviewed by: Jeff Roberson MFC after: 1 week	2014-08-07 16:53:07 +00:00
mckusick	52fea1f7b2	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
imp	8570c379e9	Simplify comment to remove multiple negative and passive voice.	2014-07-23 16:18:54 +00:00
kib	32a7383c85	Check for the cross-device cross-link attempt in the VFS, instead of forcing filesystem VOP_LINK() methods to repeat the code. In tmpfs_link(), remove redundand check for the type of the source, already done by VFS. Note that NFS server already performs this check before calling VOP_LINK(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-16 14:04:46 +00:00
kib	a698488f0f	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:10:00 +00:00
kib	f5cc3af6c8	In msdosfs_setattr(), add a check for result of the utimes(2) permissions test, forgotten in r164033. Refactor the permission checks for utimes(2) into vnode helper function vn_utimes_perm(9), and simplify its code comparing with the UFS origin, by writing the call to VOP_ACCESSX only once. Use the helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5). Reported by: bde Reviewed by: bde, trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-17 07:11:00 +00:00
kib	321a65896d	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module' sources. In addition to stopping breaking the layering violation, it also allows to link kernel when FFS is configured as module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
jmg	43af3dde11	don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags.. If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2, you'll end up corrupting your FS as the superblock is updated and written to a different location... makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing this issue... Reviewed by: silience from mckusick MFC after: 1 week	2014-06-03 21:46:13 +00:00
scottl	66aaa46593	Due to reasons unknown at this time, the system can be forced to write a journal block even when there are no journal entries to be written. Until the root cause is found, handle this case by ensuring that a valid journal segment is always written. Second, the data buffer used for writing journal entries was never being scrubbed of old data. Fix this. Submitted by: Takehara Mikihito Obtained from: Netflix, Inc. MFC after: 3 days	2014-05-06 20:40:16 +00:00
mckusick	de9a58cc91	Update comment to explain search order reverted to historical order in -r254996. Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days	2014-03-22 11:26:39 +00:00
rwatson	33fdc14c0c	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
jeff	9ee698e16a	- If we fail to do a non-blocking acquire of a buf lock while doing a waiting sync pass we need to do a blocking acquire and restart. Another thread, typically the buf daemon, may have this buf locked and if we don't wait we can fail to sync the file. This lead to a great variety of softdep panics because we rely on all dependencies being flushed before proceeding in several cases. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:13:21 +00:00
jeff	f1c05bd42b	- Gracefully handle truncation failures when trying to shrink directories. This could cause dirhash panics since the dirhash state would be successfully truncated while the directory was not. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:10:07 +00:00
pfg	dc2809ceff	ufs: small formatting fixes. Cleanup some extra space. Use of tabs vs. spaces. No functional change. MFC after: 3 days Reviewed by: mckusick	2014-03-02 02:52:34 +00:00
mckusick	88a7200f04	Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks	2013-12-30 17:04:24 +00:00
mckusick	34936e35bb	Properly handle unsigned comparison. MFC after: 2 weeks	2013-12-30 06:19:42 +00:00
mckusick	ec81cfc872	We needlessly panic when trying to flush MKDIR_PARENT dependencies. We had previously tried to flush all MKDIR_PARENT dependencies (and all the NEWBLOCK pagedeps) by calling ffs_update(). However this will only resolve these dependencies in direct blocks. So very large directories with MKDIR_PARENT dependencies in indirect blocks had not yet gotten flushed. As the directory is in the midst of doing a complete sync, we simply defer the checking of the MKDIR_PARENT dependencies until the indirect blocks have been sync'ed. Reported by: Shawn Wallbridge of imaginaryforces.com Tested by: John-Mark Gurney <jmg@funkthat.com> PR: 183424 MFC after: 2 weeks	2013-12-01 07:34:21 +00:00
jmg	7383d16d60	fix white space... MFC after: 1 week	2013-11-20 21:21:29 +00:00
jmg	231f0a43ec	fix a use after free, jsegdep_merge will free wk, avoid the next check... CID: 1006098 Sponsored by: Imaginary Forces Reviewed by: mckusick MFC after: 1 week	2013-11-20 21:16:53 +00:00

1 2 3 4 5 ...

2148 Commits