freebsd-dev

Author	SHA1	Message	Date
Kirk McKusick	9c4f551e98	Create a new function ffs_getcg() to read in and verify a cylinder group. Change all code points that open-coded this functionality to use the new function. This commit is a refactoring with no change in functionality. In the future this change allows more robust checking of cylinder group reads along the lines discussed in the hardening UFS session at BSDCan (retry I/O, add checksums, etc). For more detail see the session notes at https://wiki.freebsd.org/DevSummit/201706/HardeningUFS Reviewed by: kib	2017-06-28 17:32:09 +00:00
Konstantin Belousov	698f05ab95	Mitigate several problems with the softdep_request_cleanup() on busy host. Problems start appearing when there are several threads all doing operations on a UFS volume and the SU workqueue needs a cleanup. It is possible that each thread calling softdep_request_cleanup() owns the lock for some dirty vnode (e.g. all of them are executing mkdir(2), mknod(2), creat(2) etc) and all vnodes which must be flushed are locked by corresponding thread. Then, we get all the threads simultaneously entering softdep_request_cleanup(). There are two problems: - Several threads execute MNT_VNODE_FOREACH_ALL() loops in parallel. Due to the locking, they quickly start executing 'in phase' with the speed of the slowest thread. - Since each thread already owns the lock for a dirty vnode, other threads non-blocking attempt to lock the vnode owned by other thread fail, and loops executing without making the progress. Retry logic does not allow the situation to recover. The result is a livelock. Fix these problems by making the following changes: - Allow only one thread to enter MNT_VNODE_FOREACH_ALL() loop per mp. A new flag FLUSH_RC_ACTIVE guards the loop. - If there were failed locking attempts during the loop, abort retry even if there are still work items on the mp work list. An assumption is that the items will be cleaned when other thread either fsyncs its vnode, or unlock and allow yet another thread to make the progress. It is possible now that some calls would get undeserved ENOSPC from ffs_alloc(), because the cleanup is not aggressive enough. But I do not see how can we reliably clean up workitems if calling softdep_request_cleanup() while still owning the vnode lock. I thought about scheme where ffs_alloc() returns ERESTART and saves the retry counter somewhere in struct thread, to return to the top level, unlock the vnode and retry. But IMO the very rare (and unproven) spurious ENOSPC is not worth the complications. Reported and tested by: pho Style and comments by: mckusick Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-06-03 16:18:50 +00:00
Konstantin Belousov	4cbc378c61	Clean possible td_su reference on the struct mount being unmounted as the last step of ffs_unmount(). It is possible that the mount point is recorded for cleanup in AST context while softdep flush is executed during unmount. The workitems are flushed by other means for the unmount, but the stray reference to struct mount blocks destruction of mount. Check for the situation and manually call vfs_rel() before returning from ffs_unmount(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-06-03 14:15:14 +00:00
Konstantin Belousov	215b29f62c	Remove spl() calls from UFS code. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-05-07 14:59:45 +00:00
Ed Maste	3cf259c390	UFS fs.h: clear warning from use in makefs(1) makefs(1) has a number of signedness warnings (when built with higher WARNS), most of which can be addressed by careful application of casts in makefs itself. There is one case where a signedness warning arises from the blksize macro, so must be addressed in the macro itself. Reviewed by: kib, mckusick MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10589	2017-05-05 15:26:55 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Conrad Meyer	a96da1c3fb	ufs: Export UFS_MAXNAMLEN to pathconf, statfs Rather than the global NAME_MAX constant. This change is required to support systems with a NAME_MAX/MAXNAMLEN that differs from UFS_MAXNAMLEN. This was missed in r313475 due to the alternative spelling ("NAME_MAX") of MAXNAMLEN. This change is also similar in spirit to r313780. Reported by: ngie@ Sponsored by: Dell EMC Isilon	2017-04-05 01:44:03 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Konstantin Belousov	aca4bb9112	Do not leak mount references for dying threads. Thread might create a condition for delayed SU cleanup, which creates a reference to the mount point in td_su, but exit without returning through userret(), e.g. when terminating due to single-threading or process exit. In this case, td_su reference is not dropped and mount point cannot be freed. Handle the situation by clearing td_su also in the thread destructor and in exit1(). softdep_ast_cleanup() has to receive the thread as argument, since e.g. thread destructor is executed in different context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-25 10:38:18 +00:00
Ed Maste	1dc349ab95	prefix UFS symbols with UFS_ to reduce namespace pollution Specifically: ROOTINO -> UFS_ROOTINO WINO -> UFS_WINO NXADDR -> UFS_NXADDR NDADDR -> UFS_NDADDR NIADDR -> UFS_NIADDR MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency) Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_ Reviewed by: kib, mckusick Obtained from: NetBSD MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9536	2017-02-15 19:50:26 +00:00
Conrad Meyer	675c187cc4	ffs_vnops: Simplify extattr access As suggested in r167010, use the structure type and macros to access and modify UFS2 extended attributes. Add assertions that pointers are aligned in places where we now access the data through a structure pointer, instead of character-by-character. PR: 216127 Reported by: dewayne at heuristicsystems.com.au Reviewed by: kib@ Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9225	2017-01-19 16:46:05 +00:00
Konstantin Belousov	1c32456953	Use type-independent formats for printing nlink_t and ino_t. Extracted from: ino64 work by gleb, mckusick Discussed with: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-06 16:59:33 +00:00
Mark Johnston	99e6e1930c	Release laundered vnode pages to the head of the inactive queue. The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. Reviewed by: alc, kib Differential Revision: https://reviews.freebsd.org/D8589	2016-11-23 17:53:07 +00:00
Konstantin Belousov	714b7df502	Provide simple mutual exclusion between mount point update and unmount. Currently mount update keeps vfs_busy(9) reference on the mount point during MNT_UPDATE VFS_MOUNT() vfsops call. This already provides the exclusion, but is problematic for filesystems which need to perform namei(9) during VFS_MOUNT(MNT_UPDATE) operations, e.g. to refresh mnt_from path, because namei(9) must not be called while the vfs_busy(9) reference is owned. Check for MNT_UPDATE flag before setting MNTK_UNMOUNT, and for MNTK_UNMOUNT before entering innards of vfs_domount_update(), failing syscalls with EBUSY if conflict is detected. Keep vfs_busy(9) reference around VFS_MOUNT(MNT_UPDATE) calls still to not change VFS KPI. In the update path in ffs_mount(), drop vfs_busy() reference around namei(), which is now safe due to unmount never executing in parallel with VFS_MOUNT(MNT_UPDATE), and which avoids the deadlock. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-11-13 21:49:51 +00:00
Ed Maste	15c377c3cc	ANSIfy ffs_subr.c Also renumber license clause to avoid skipping #3	2016-10-31 20:43:43 +00:00
Kirk McKusick	ad544726aa	Avoid possible overflow when calclating malloc size for auxillary data structure sizes when mounting and reloading UFS/FFS filesystems by using a u_long rather than an int for the size. Reported by: Mariusz Zaborski <oshogbo@> MFC after: 1 week	2016-10-28 20:15:19 +00:00
Konstantin Belousov	c39baa7480	Generalize UFS buffer pager to allow it serving other filesystems which also use buffer cache. Most important addition to the code is the handling of filesystems where the block size is less than the machine page size, which might require reading several buffers to validate single page. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-10-28 11:43:59 +00:00
Marcel Moolenaar	d67ec19c3e	Include <sys/types.h> explicitly instead of depending on that header being included by <sys/param.h>. When compiled as part of makefs(8) and on macOS or Linux, <sys/param.h> is not our own.	2016-10-24 18:12:57 +00:00
Konstantin Belousov	895219834e	Add FFS pager, which uses buffer cache read operation to validate pages. See the comments for more detailed description of the algorithm. The pager is used unconditionally when the block size of the underlying device is larger than the machine page size, since local vnode pager cannot handle the configuration [1]. Otherwise, the vfs.ffs.use_buf_pager sysctl allows to switch to the local pager. Measurements demonstrated no regression in the ever-important buildworld benchmark, and small (~5%) throughput improvements in the special microbenchmark configuration for dbench over swap-backed md(4). Code can be generalized and reused for other filesystems which use buffer cache. Reported by: Anton Yuzhaninov <citrin@citrin.ru> [1] Tested by: pho Benchmarked by: mjg, pho Reviewed by: alc, markj, mckusick (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D8198	2016-10-19 11:09:29 +00:00
Mateusz Guzik	8660b707ff	vfs: remove the __bo_vnode field from struct vnode The pointer can be obtained using __containerof instead. Reviewed by: kib	2016-09-30 17:11:03 +00:00
Konstantin Belousov	bf9c87c813	Be more strict when selecting between snapshot/regular mount. Reclaimed vnode type is VBAD, so succesful comparision like devvp->v_type != VREG does not imply that the devvp references snapshot, it might be due to a reclaimed vnode. Explicitely check the vnode type. In the the most important case of ffs_blkfree(), the devfs vnode is locked and its type is stable. In other cases, if the vnode is reclaimed right after the check, hopefully the buffer methods return right error codes. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-19 15:58:33 +00:00
Konstantin Belousov	e1db68971e	Reduce size of ufs inode. Remove redunand i_dev and i_fs pointers, which are available as ip->i_ump->um_dev and ip->i_ump->um_fs, and reorder members by size to reduce padding. To compensate added derefences, the most often i_ump access to differentiate between UFS1 and UFS2 dinode layout is removed, by addition of the new i_flag IN_UFS2. Overall, this actually reduces the amount of memory dereferences. On 64bit machine, original struct inode size is 176, reduced to 152 bytes with the change. Tested by: pho (previous version) Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-17 16:47:34 +00:00
Bruce Evans	0c01bcb9ff	Sprinkle DOINGASYNC() checks so as to do delayed writes for async mounts in almost all cases instead of in most cases. Don't override DOINGASYNC() by any condition except IO_SYNC. Fix previous sprinking of DOINGASYNC() checks. Don't override IO_SYNC by DOINGASYNC(). In ffs_write() and ffs_extwrite(), there were intentional overrides that just broke O_SYNC of data. In ffs_truncate(), there are 5 calls to ffs_update(), 4 with apparently-unintentional overrides and 1 without; this had no effect due to the main async mount hack descibed below. Fix 1 place in ffs_truncate() where the caller's IO_ASYNC was overridden for the soft updates case too (to do a delayed write instead of a sync write). This is supposed to be the only change that affects anything except async mounts. In ffs_update(), remove the 19 year old efficiency hack of ignoring the waitfor flag for async mounts, so that fsync() almost works for async mounts. All callers are supposed to be fixed to not ask for a sync update unless they are for fsync() or [I]O_SYNC operations. fsync() now almost works for async mounts. It used to sync the data but not the most important metdata (the inode). It still doesn't sync associated directories. This gave 10-20% fewer writes for my makeworld benchmark with async mounted tmp and obj directories from an already small number. Style fixes: - in ffs_balloc.c, remove rotted quadruplicated comments about the simplest part of the DOING() decisions and rearrange the nearly- quadruplicated code to be more nearly so. - in ufs_vnops.c, use a consistent style with less negative logic and no manual "optimization" of \|\| to \| in DOING() expressions. Reviewed by: kib (previous version)	2016-09-08 17:40:40 +00:00
Konstantin Belousov	7b05b8a29c	Do not leak transient ENOLCK error from flush_newblk_dep() loop. The buffer lock is retried on failed LK_SLEEPFAIL attempt, and error from the failed attempt is irrelevant. But since there is path after retry which does not clear error, it is possible to return spurious error from the function. The issue resulted in a spurious failure of softdep_sync_buf(), causing further spurious failure of ffs_sync(). In collaboration with: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:08:54 +00:00
Konstantin Belousov	60f1c000f3	In softdep_prealloc(), return early not only for snapshots, but for the quota files as well. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:05:13 +00:00
Konstantin Belousov	df4265774e	Partially lift suspension when ffs_reload() finished with cgs and going to re-read inodes. Secondary write initiators, e.g. ufs_inactive(), might need to start a write while owning the vnode lock. Since the suspended state established by /dev/ufssuspend prevents them from entering vn_start_secondary_write(), we get deadlock otherwise. Note that it is arguably not very useful to re-read inodes after /dev/ufssuspend suspension, because the suspension does not block readers, and other threads might read existing files in parallel with suspension owner (for now, only growfs(8)) operations. This effectively means that suspension owner cannot safely modify existing inodes, and then there is no sense in re-reading. But keep the code enabled for now. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-09-08 12:01:28 +00:00
Warner Losh	949b56ba66	Renumber the advertising clause.	2016-09-06 15:17:35 +00:00
Kirk McKusick	988fd417a0	Bug 211013 reports that a write error to a UFS filesystem running with softupdates panics the kernel. The problem that has been pointed out is that when there is a transient write error on certain metadata blocks, specifically directory blocks (PAGEDEP), inode blocks (INODEDEP), indirect pointer blocks (INDIRDEPS), and cylinder group (BMSAFEMAP, but only when journaling is enabled), we get a panic in one of the routines called by softdep_disk_io_initiation that the I/O is "already started" when we retry the write. These dependency types potentially need to do roll-backs when called by softdep_disk_io_initiation before doing a write and then a roll-forward when called by softdep_disk_write_complete after the I/O completes. The panic happens when there is a transient error. At the top of softdep_disk_write_complete we check to see if the write had an error and if an error occurred we just return. This return is correct most of the time because the main role of the routines called by softdep_disk_write_complete is to process the now-completed dependencies so that the next I/O steps can happen. But for the four types listed above, they do not get to do their rollback operations. This causes the panic when softdep_disk_io_initiation gets called on the second attempt to do the write and the roll-back routines find that the roll-backs have already been done. As an aside I note that there is also the problem that the buffer will have been unlocked and thus made visible to the filesystem and to user applications with the roll-backs in place. The way to resolve the problem is to add a flag to the routines called by softdep_disk_write_complete for the four dependency types noted that indicates whether the write was successful (WRITESUCCEEDED). If the write does not succeed, they do just the roll-backs and then return. If the write was successful they also do their usual processing of the now-completed dependencies. The fix was tested by selectively injecting write errors for buffers holding dependencies of each of the four types noted above and then verifying that the kernel no longer paniced and that following the successful retry of the write that the filesystem could be unmounted and successfully checked cleanly. PR: 211013 Reviewed by: kib	2016-08-16 21:02:30 +00:00
Konstantin Belousov	948137db10	In UFS_BALLOC(), invalidate pages of indirect buffers on failed block allocation unwinding. Dandling buffers are released on UFS_BALLOC() failure to ensure that later attempt to allocate blocks in close range do not find the blocks with invalid content, since possible partial block allocations are unwound. As such, it is not enough to just release the buffers, the pages must also invalidated and removed from the vnode vm_object queue. Otherwise the pages might be found later and used to reconstruct indirect buffers when doing allocations at offset close to the failure point, and their stale content compromise the filesystem integrity. Note that just marking the buffer as B_INVAL is not enough, B_NOCACHE is required. To be sure, clear the B_CACHE flag as well. This complements the r174973, which started releasing buffers. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:30:58 +00:00
Konstantin Belousov	2da4bcb82b	On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert that recorded allocated blocks numbers match the physical block numbers of dandling buffers which are released. When finally freeing the blocks during unwind, assert that dandling buffers where not re-allocated. They shouldn't, because the vnode lock is owned exclusive. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:18:38 +00:00
Konstantin Belousov	39f7cbe9ab	When looking up dandling buffers for unwing after failing block allocation in UFS_BALLOC(), there is no need to map them. Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 17:05:15 +00:00
Konstantin Belousov	554248587e	When block allocation fails in UFS_BALLOC(), and the volume does not have SU enabled, there is no point in calling softdep_request_cleanup(). The call cannot produce free blocks, but we unecessarily lock ufsmount and do inode block write. Usual point of not doing optimizations for the corner case of the full volume is not applicable there, the work is easily avoidable, and the addition CPU and disk io load do not lead to succeeding retry. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 16:50:48 +00:00
Konstantin Belousov	f1503c5a49	In ffs_balloc_ufs{1,2} routines, assert that unwind records do not overflow local arrays. This is not immediately obvious from the static code inspection, due to retry logic. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-16 16:49:56 +00:00
Konstantin Belousov	2f514f92cf	Implement VOP_FDATASYNC() for UFS. Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7471	2016-08-15 19:22:23 +00:00
Edward Tomasz Napierala	411455a8fb	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month	2016-08-10 16:12:31 +00:00
Kevin Lo	57d2ac2f90	arc4random() returns 0 to (2**32)−1, use an alternative to initialize i_gen if it's zero rather than a divide by 2. With inputs from delphij, mckusick, rmacklem Reviewed by: mckusick	2016-05-22 14:31:20 +00:00
Konstantin Belousov	5f36cc5bfa	Stop dropping and reacquiring Giant around geom calls in UFS. Sponsored by: The FreeBSD Foundation	2016-05-21 10:13:25 +00:00
Konstantin Belousov	c70b3cd28d	Improve handling of rdev->si_mountpt on mount and unmount of FFS volumes. Treat the field as a semaphore protecting availability of the device for mounting. Do no access devvp->v_rdev without the vnode lock owned. Protect change of the devvp->v_bufobj bo_ops vector with the vnode lock. Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-21 09:49:35 +00:00
Konstantin Belousov	bfe89a8ee1	Do enable io accounting for read-only mounts and mounts which are remounted to writeable after initial read-only. Assign to dev->si_mountpt earlier to account the accesses done at the mount time. Based on submission by: bde MFC after: 1 week	2016-05-17 21:35:35 +00:00
Konstantin Belousov	f30ddc49e6	If IO_SYNC was passed to ffs_truncate(), request synchronous inode update from the final ffs_update(). Noted by: bde MFC after: 1 week	2016-05-17 21:30:58 +00:00
Konstantin Belousov	98cbffd7bd	Fix comments. Submitted by: bde MFC after: 1 week	2016-05-17 08:24:27 +00:00
Pedro F. Giffuni	08e941833d	UFS: spelling fixes on comments. No functional change.	2016-04-29 20:43:51 +00:00
Pedro F. Giffuni	d9c9c81c08	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
Pedro F. Giffuni	abafa4db03	ufs: replace 0 with NULL for pointers. While here also do late initialization of the variables we are changing. Found with devel/coccinelle. Reviewed by: mckusick MFC after: 2 weeks	2016-04-10 21:48:11 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Konstantin Belousov	c79dff0fbe	Split the global taskqueue used to process all UFS trim completions, into per-mount taskqueue with the private taskqueue processing thread. This allows to drain the taskqueue on unmount, to ensure that all TRIMs are finished before mount structures are freed. But just draining the taskqueue where TRIM biodone geom-up completions are processed is not enough, since ffs_blkfree(), called by the task, might result in more writes. Count inflight delayed blkfree's and pause() unmount until the counter drains as well. Reported by: Nick Evans <nevans@talkpoint.com> Tested by: Nick Evans <nevans@talkpoint.com>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-27 08:21:17 +00:00
Konstantin Belousov	6336aefc1d	Fix locking mistake in softdep_waitidle(). The surrounding code expects that the loop is always exited with the SU lock owned, even on error. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-03-23 09:58:51 +00:00
Kirk McKusick	24b6748c42	The UFS filesystem requires that the last block of a file always be allocated. When shortening the length of a file in which the new end of the file contains a hole, the hole must have a block allocated. Reported by: Maxim Sobolev Reviewed by: kib Tested by: Peter Holm	2016-02-24 01:58:40 +00:00
Edward Tomasz Napierala	50102a6342	Remove ffs_mountroot() prototype; seems to be long gone. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-01-28 12:21:23 +00:00
Kirk McKusick	abe53f7e91	This fixes a bug in UFS2 exported NFS volumes. An NFS client can crash a server that has exported UFS2 by presenting a filehandle with an inode number that references an uninitialized inode in a cylinder group. The problem is that UFS2 only initializes blocks of inodes as they are first allocated and ffs_fhtovp() does not validate that the inode is in a range of inodes that have been initialized. Attempting to read an uninitialized inode gets random data from the disk. When the kernel tries to interpret it as an inode, panics often arise. Reported by: Christos Zoulas (from NetBSD) Reviewed by: kib	2016-01-27 21:27:05 +00:00
Kirk McKusick	d2a28cb080	The bread() function was inconsistent about whether it would return a buffer pointer in the event of an error (for some errors it would return a buffer pointer and for other errors it would not return a buffer pointer). The cluster_read() function was similarly inconsistent. Clients of these functions were inconsistent in handling errors. Some would assume that no buffer was returned after an error and would thus lose buffers under certain error conditions. Others would assume that brelse() should always be called after an error and would thus panic the system under certain error conditions. To correct both of these problems with minimal code churn, bread() and cluster_write() now always free the buffer when returning an error thus ensuring that buffers will never be lost. The brelse() routine checks for being passed a NULL buffer pointer and silently returns to avoid panics. Thus both approaches to handling error returns from bread() and cluster_read() will work correctly. Future code should be written assuming that bread() and cluster_read() will never return a buffer with an error, so should not attempt to brelse() the buffer when an error is returned. Reviewed by: kib	2016-01-27 21:23:01 +00:00
Konstantin Belousov	44de1ba141	Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the ast was rescheduled during VFS_SYNC(). It is possible that enough parallel writes or slow/hung volume result in VFS_SYNC() deferring to the ast flushing of workqueue. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-12-21 11:50:32 +00:00
Kirk McKusick	43a993bb7d	For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:01:02 +00:00
Konstantin Belousov	fe920528c1	Do not perform read-ahead for BA_CLRBUF request when we are low on memory or when dirty buffer queue is too large. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-10-27 13:44:13 +00:00
Warner Losh	476dfdc3bf	Do not relocate extents to make them contiguous if the underlying drive can do deletions. Ability to do deletions is a strong indication that this optimization will not help performance. It will only generate extra write traffic. These devices are typically flash based and have a limited number of write cycles. In addition, making the file contiguous in LBA space doesn't improve the access times from flash devices because they have no seek time. Reviewed by: mckusick@	2015-10-16 03:06:02 +00:00
Gleb Smirnoff	20e8b32db4	In softdep_setup_freeblocks(): - Move the bread() to the beginning of function. - Return if it fails, otherwise we will panic. Submitted by: mckusick Sponsored by: Netflix	2015-10-07 12:36:28 +00:00
Konstantin Belousov	7a82f35c9d	Do not consume extra reference. This is a bug in r287479. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 12:28:18 +00:00
Konstantin Belousov	c65f759837	Declare the writes around the call to VFS_SYNC() in softdep_ast_cleanup_proc(). Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 08:48:24 +00:00
Konstantin Belousov	76cae067b8	By doing file extension fast, it is possible to create excess supply of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and D_ALLOCINDIR), which can exhaust kmem. Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and D_DIRREM, by scheduling ast to flush dependencies, after the thread, which created new dep, left the VFS/FFS innards. For D_NEWBLK, the only way to get rid of them is to do full sync, since items are attached to data blocks of arbitrary vnodes. The check for D_NEWBLK excess in softdep_ast_cleanup_proc() is unlocked. For 32bit arches, reduce the total amount of allowed dependencies by two. It could be considered increasing the limit for 64 bit platforms with direct maps. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:07:27 +00:00
Jeff Roberson	98082691bb	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has lead me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
Jeff Roberson	fade8dd714	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
Mateusz Guzik	f0725a8e1e	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib	2015-07-11 16:19:11 +00:00
Mark Johnston	5f34e93c58	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
Mark Murray	d1b06863fb	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. * src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'staic struct start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
Konstantin Belousov	521987c3e5	Simplify code, no need to test the flag before clearing it. Submitted by: ed MFC after: 12 days	2015-06-29 13:06:24 +00:00
Konstantin Belousov	b2c3df842b	Handle errors from background write of the cylinder group blocks. First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR. Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps. We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan. In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks	2015-06-27 09:44:14 +00:00
Konstantin Belousov	1eabd96728	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
Mateusz Guzik	4da8456f0a	Replace struct filedesc argument in getvnode with struct thread This is is a step towards removal of spurious arguments.	2015-06-16 13:09:18 +00:00
Konstantin Belousov	46c3d3ac99	Syncing a directory vnode might drop the vnode lock in the softdep_sync() similarly to the regular vnode sync. Allow retry for both vnode types. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-06-03 20:48:00 +00:00
Konstantin Belousov	aef373ce38	Remove unused variable. When deallocate_dependencies() is performed, softdep_journal_freeblocks() already called cancel_allocdirect() which should have eliminated direct dependencies for all truncated full blocks. The indirect dependencies are allowed above, since second- and third-level dependencies are only dealt with by the code which frees indirect block, which happens after the inode write. Discussed with: mckusick, jeff Reviewed by: jeff Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-31 15:50:54 +00:00
Konstantin Belousov	69baeadc31	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
Konstantin Belousov	0bc4fe10ad	After r283600, NODELAY flag to inodedep_lookup() function is unused. Eliminate it, and simplify code by removing the local dflags variable always initialized to DEPALLOC. Noted by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:49:04 +00:00
Konstantin Belousov	1bc93bb7b9	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
Kirk McKusick	74a87c3804	Limit the number of cylinder groups that will be searched when trying to build a cluster. The limit is tunable using the sysctl vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups per block allocation. It was previously limited to the number of cylinder groups in the filesystem per block allocation. When there were no clusters of the needed size left, it repeatedly searched the whole filesystem for a non-existent cluster on every block allocation. The result was very slow filesystem allocation with 100% CPU utilization. The old behavior can be had by setting vfs.ffs.maxclustersearch to a huge number (1,000,000). This change affects only the layout policy routines so is not able to interfere with the integrity of the filesystem. Reported by: Dmitry Sivachenko (demon@) Tested by: Dmitry Sivachenko (demon@) MFC after: 2 weeks	2015-04-24 23:27:50 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Konstantin Belousov	9928931186	Fix build (with gcc). Reported by: bz, ian Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 15:49:21 +00:00
Konstantin Belousov	4af9f77ecb	Fix the hand after the immediate reboot when the following command sequence is performed on UFS SU+J rootfs: cp -Rp /sbin/init /sbin/init.old mv -f /sbin/init.old /sbin/init Hang occurs on the rootfs unmount. There are two issues: 1. Removed init binary, which is still mapped, creates a reference to the removed vnode. The inodeblock for such vnode must have active inodedep, which is (eventually) linked through the unlinked list. This means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of softdep workitems for the mp is always > 0. FFS is suspended during unmount, so unmount just hangs. 2. As noted above, the inodedep is linked eventually. It is not linked until the superblock is written. But at the vfs_unmountall() time, when the rootfs is unmounted, the call is made to ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls ffs_sbupdate() after all workitems are flushed. It is masked for normal system operations, because syncer works in parallel and eventually flushes superblock. Syncer is stopped when rootfs unmounted, so ffs_sync() must do sb update on its own. Correct the issues listed above. For MNT_SUSPEND, count the number of linked unlinked inodedeps (this is not a typo) and substract the count of such workitems from the total. For the second issue, the ffs_sbupdate() is called right after device sync in ffs_sync() loop. There is third problem, occuring with both SU and SU+J. The softdep_waitidle() loop, which waits for softdep_flush() thread to clear the worklist, only waits 20ms max. It seems that the 1 tick, specified for msleep(9), was a typo. Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to significantly help the softdep thread, and change the MNT_LAZY update at the reboot time to MNT_WAIT for similar reasons. Note that userspace cannot create more work while devvp is flushed, since the mount point is always suspended before the call to softdep_waitidle() in unmount or remount path. PR: 195458 In collaboration with: gjb, pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 13:55:56 +00:00
Konstantin Belousov	f6f76e1747	Partially revert r277922, avoid sleeping and do flush if we a awaken, instead of waiting for the FLUSH_* flags. Also, when requesting flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was already set. Reported and tested by: dim, "Lundberg, Johannes" <johannes@brilliantservice.co.jp> Bisected by: dim MFC after: 2 weeks	2015-02-05 13:00:27 +00:00
Konstantin Belousov	1c9b585695	When mounting SU-enabled mount point, wait until the softdep_flush() thread started and incremented the stat_flush_threads [1]. Unconditionally wakeup softdep_flush threads when needed, do not try to check wchan, which is racy and breaks abstraction. Reported by and discussed with: glebius, neel Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-30 11:41:46 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Gleb Kurtsou	dde58752db	Adjust printf format specifiers for dev_t and ino_t in kernel. ino_t and dev_t are about to become uint64_t. Reviewed by: kib, mckusick	2014-12-17 07:27:19 +00:00
Gleb Smirnoff	90effb2341	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
Gleb Smirnoff	8c7f0b92b4	Include required files directly instead of pollution via ufs/ufsmount.h. Sponsored by: Nginx, Inc.	2014-11-23 01:01:14 +00:00
Konstantin Belousov	ca109b01cf	When non-forced unmount or remount rw->ro is performed, writes on UFS are not suspended. In particular, on the SU-enabled vulumes, there is no reason why, between the call to softdep_flushfiles() and softdep_waitidle(), SU work items cannot be queued. Correct the condition to trigger the panic by only checking when forced operation is done. Convert direct panic() call into KASSERT(), there is no invalid on-disk data structures directly involved, so follow the usual debugging vs. non-debugging approach. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-02 13:14:55 +00:00
Mateusz Guzik	4fce16e4c9	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks	2014-10-20 18:00:50 +00:00
Konstantin Belousov	df5c9c0411	Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS survives remount in rw, also it is set for vnodes on rootfs before noatime can be set or clock is adjusted. All conditions result in wrong atime for accessed vnodes. Submitted by: bde MFC after: 1 week	2014-10-11 19:09:56 +00:00
Konstantin Belousov	d15b55c554	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
Alan Cox	3e5c84e292	We don't need an exclusive object lock on the expected execution path through {ext2,ffs}_getpages(). Reviewed by: kib, pfg MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-13 18:26:13 +00:00
Konstantin Belousov	4ce90426e0	Correct the test for condition to suspend UFS filesystem during unmount. There is no need to suspend read-only filesystem, while we need suspension on modificable mount point. Reported by: rwatson Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:13:03 +00:00
Konstantin Belousov	a6b5e6e32f	Revision r269457 removed the Giant around mount and unmount code, but r269533, which was tested before r269457 was committed, implicitely relied on the Giant to protect the manipulations of the softdepmounts list. Use softdep global lock consistently to guarantee the list structure now. Insert the new struct mount_softdeps into the softdepmounts only after it is sufficiently initialized, to prevent softdep_speedup() from accessing bare memory. Similarly, remove struct mount_softdeps for the unmounted filesystem from the tailq before destroying structure rwlock. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-12 09:33:00 +00:00
Kirk McKusick	3da0b29d99	The SUJ journal is only prepared to handle full-size block numbers, so we have to adjust freeblk records to reflect the change to a full-size block. For example, suppose we have a block made up of fragments 8-15 and want to free its last two fragments. We are given a request that says: FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0 where frags are the number of fragments to free and oldfrags are the number of fragments to keep. To block align it, we have to change it to have a valid full-size blkno, so it becomes: FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6 Submitted by: Mikihito Takehara Tested by: Mikihito Takehara Reviewed by: Jeff Roberson MFC after: 1 week	2014-08-07 16:53:07 +00:00
Kirk McKusick	5f9500c358	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
Konstantin Belousov	895b3782c6	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:10:00 +00:00
Konstantin Belousov	23f6698fbd	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module' sources. In addition to stopping breaking the layering violation, it also allows to link kernel when FFS is configured as module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
John-Mark Gurney	e9838c1140	don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags.. If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2, you'll end up corrupting your FS as the superblock is updated and written to a different location... makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing this issue... Reviewed by: silience from mckusick MFC after: 1 week	2014-06-03 21:46:13 +00:00
Scott Long	7d155880ee	Due to reasons unknown at this time, the system can be forced to write a journal block even when there are no journal entries to be written. Until the root cause is found, handle this case by ensuring that a valid journal segment is always written. Second, the data buffer used for writing journal entries was never being scrubbed of old data. Fix this. Submitted by: Takehara Mikihito Obtained from: Netflix, Inc. MFC after: 3 days	2014-05-06 20:40:16 +00:00
Kirk McKusick	e24784d32a	Update comment to explain search order reverted to historical order in -r254996. Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days	2014-03-22 11:26:39 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Jeff Roberson	4803948fe2	- If we fail to do a non-blocking acquire of a buf lock while doing a waiting sync pass we need to do a blocking acquire and restart. Another thread, typically the buf daemon, may have this buf locked and if we don't wait we can fail to sync the file. This lead to a great variety of softdep panics because we rely on all dependencies being flushed before proceeding in several cases. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:13:21 +00:00
Pedro F. Giffuni	4896af9f55	ufs: small formatting fixes. Cleanup some extra space. Use of tabs vs. spaces. No functional change. MFC after: 3 days Reviewed by: mckusick	2014-03-02 02:52:34 +00:00
Kirk McKusick	be1fd82346	Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks	2013-12-30 17:04:24 +00:00
Kirk McKusick	2e436d88a3	We needlessly panic when trying to flush MKDIR_PARENT dependencies. We had previously tried to flush all MKDIR_PARENT dependencies (and all the NEWBLOCK pagedeps) by calling ffs_update(). However this will only resolve these dependencies in direct blocks. So very large directories with MKDIR_PARENT dependencies in indirect blocks had not yet gotten flushed. As the directory is in the midst of doing a complete sync, we simply defer the checking of the MKDIR_PARENT dependencies until the indirect blocks have been sync'ed. Reported by: Shawn Wallbridge of imaginaryforces.com Tested by: John-Mark Gurney <jmg@funkthat.com> PR: 183424 MFC after: 2 weeks	2013-12-01 07:34:21 +00:00
John-Mark Gurney	e0ce310797	fix white space... MFC after: 1 week	2013-11-20 21:21:29 +00:00
John-Mark Gurney	b6ffc3b567	fix a use after free, jsegdep_merge will free wk, avoid the next check... CID: 1006098 Sponsored by: Imaginary Forces Reviewed by: mckusick MFC after: 1 week	2013-11-20 21:16:53 +00:00
Pedro F. Giffuni	4b367145f7	UFS2: make di_extsize unsigned. di_extsize is the EA size and as such it should be unsigned. Adjust related types for consistency. Reviewed by: mckusick (previous version) MFC after: 3 weeks	2013-10-24 00:33:29 +00:00
Brooks Davis	cf058082cd	Allow kernels without options SOFTUPDATES to build. This should fix the embedded tinderboxes. Reviewed by: emaste	2013-10-21 20:51:08 +00:00
Kirk McKusick	07599ccb28	Fix build problem on ARM (which defaults to building without soft updates). Reported by: Tinderbox Sponsored by: Netflix	2013-10-21 13:09:09 +00:00
Kirk McKusick	58941b9f15	Restructuring of the soft updates code to set it up so that the single kernel-wide soft update lock can be replaced with a per-filesystem soft-updates lock. This per-filesystem lock will allow each filesystem to have its own soft-updates flushing thread rather than being limited to a single soft-updates flushing thread for the entire kernel. Move soft update variables out of the ufsmount structure and into their own mount_softdeps structure referenced by ufsmount field um_softdep. Eventually the per-filesystem lock will be in this structure. For now there is simply a pointer to the kernel-wide soft updates lock. Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock pointer in the mount_softdeps structure instead of a pointer to the kernel-wide soft-updates lock. Replace the five hash tables used by soft updates with per-filesystem copies of these tables allocated in the mount_softdeps structure. Several functions that flush dependencies when too many are allocated in the kernel used to operate across all filesystems. They are now parameterized to flush dependencies from a specified filesystem. For now, we stick with the round-robin flushing strategy when the kernel as a whole has too many dependencies allocated. While there are many lines of changes, there should be no functional change in the operation of soft updates. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-21 00:28:02 +00:00
Kirk McKusick	cc76ac5a6c	Fourth of several cleanups to soft dependency implementation. Add KASSERTS that soft dependency functions only get called for filesystems running with soft dependencies. Calling these functions when soft updates are not compiled into the system become panic's. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 22:21:01 +00:00
Kirk McKusick	519e3c3b9f	Third of several cleanups to soft dependency implementation. Ensure that softdep_unmount() and softdep_setup_sbupdate() only get called for filesystems running with soft dependencies. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 21:11:40 +00:00
Kirk McKusick	90a306d8af	Second of several cleanups to soft dependency implementation. Delete two unused functions in ffs_sofdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:52:07 +00:00
Kirk McKusick	8850120f48	First of several cleanups to soft dependency implementation. Convert three functions exported from ffs_softdep.c to static functions as they are not used outside of ffs_softdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:41:38 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Kirk McKusick	2ce451f089	In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), determine that old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order. MFC after: 2 weeks	2013-08-28 17:46:32 +00:00
Kirk McKusick	28702816d8	A performance problem was reported in PR kern/181226: I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally to gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782. The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas. The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks	2013-08-28 17:38:05 +00:00
Kirk McKusick	8cf85cf292	With the addition of journalled soft updates, the "newblk" structures persist much longer than previously. Historically we had at most 100 entries; now the count may reach a million. With the increased count we spent far too much time looking them up in the grossly undersized newblk hash table. Configure the newblk hash table to accurately reflect the number of entries that it must index. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2013-08-05 22:02:45 +00:00
Kirk McKusick	57591d8e78	To better understand performance problems with journalled soft updates, we need to collect the highest level of allocation for each of the different soft update dependency structures. This change collects these statistics and makes them available using `sysctl debug.softdep.highuse'. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2013-08-05 22:01:16 +00:00
Kirk McKusick	fdd9a6f478	Update to comments describing block allocation policy. Submitted by: Bruce Evans	2013-07-14 18:44:33 +00:00
Konstantin Belousov	2aea094f65	Only copy as much bytes as there in superblock, instead of the full block copy, when copying the superblock into the snapshot. UFS1 does not align superblock on the block boundary, and bcopy runs off the end of the buffer. Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-07-12 18:52:33 +00:00
Konstantin Belousov	cc3d8c35f5	There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if other thread starts unmount between two calls. The unmount starts a write, while vfs_write_suspend() drain writers. On the other hand, unmount drains busy references, causing the deadlock. Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress. Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2013-07-09 20:49:32 +00:00
Kirk McKusick	3f8db5b147	Make better use of metadata area by avoiding using it for data blocks that no should no longer immediately follow their indirect blocks. MFC after: 2 weeks	2013-07-02 21:07:08 +00:00
Pedro F. Giffuni	5f3a9c4000	Style fix: spaces. Cleanup the incomplete revert. Reported by: bde MFC after: 4 weeks	2013-07-02 18:45:37 +00:00
Pedro F. Giffuni	9a0aea4625	Change i_gen in UFS to an unsigned type. Revert the simplification of the i_gen calculation. It is still a good idea to avoid zero values and for the case of old filesystems there is probably no advantage in using the complete 32 bits anyways. Discussed with: bde MFC after: 4 weeks	2013-07-01 21:43:40 +00:00
Pedro F. Giffuni	bcb2f550be	Change i_gen in UFS to an unsigned type. Further simplify the i_gen calculation for older disks. Having a zero here is not really a problem and this is more similar to what is done in newfs_random(). Reported by: Xin Li MFC after: 4 weeks	2013-07-01 14:49:23 +00:00
Pedro F. Giffuni	eee4072f13	Change i_gen in UFS to an unsigned type. In UFS, i_gen is a random generated value and there is not way for it to be negative. Actually, the value of i_gen is just used to match bit patterns and it is of not consequence if the values are signed or not. Following other filesystems, set it to unsigned and use it as such, Discussed by: mckusick Reviewed by: mckusick (previous version) MFC after: 4 weeks	2013-07-01 03:00:15 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Kirk McKusick	97371fa56d	Properly spell sentinel (missed in 250891) No functional changes. Spotted by: Navdeep Parhar and Alexey Dokuchaev MFC after: 2 weeks	2013-05-22 05:07:55 +00:00
Kirk McKusick	b1bd9340fa	Add missing buffer releases (brelse) after bread calls that return an error. One could argue that returning a buffer even when it is not valid is incorrect, but bread has always returned a buffer valid or not. Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:57:22 +00:00
Kirk McKusick	21844a3d5d	Add missing 28th element to softdep types name array. Found by: Coverity Scan, CID 1007621 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:48:24 +00:00
Kirk McKusick	d80dbbdb4a	Null a pointer after it is freed so that when it is returned the return value is NULL. Based on the returned flags, the return value should never be inspected in the case where NULL is returned, but it is good coding practice not to return a pointer to freed memory. Found by: Coverity Scan, CID 1006096 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:40:26 +00:00
Kirk McKusick	64e2b0887c	Remove a bogus check for a NULL buffer pointer. Add a KASSERT that it is not NULL. Found by: Coverity Scan, CID 1009114 Reviewed by: kib MFC after: 2 weeks	2013-05-22 00:30:34 +00:00
Kirk McKusick	13e369a747	Properly spell sentinel (not sintenel or sentinal). No functional changes. Spotted by: kib MFC after: 2 weeks	2013-05-22 00:17:50 +00:00
Eitan Adler	a164074fc4	Fix several typos PR: kern/176054 Submitted by: Christoph Mallon <christoph.mallon@gmx.de> MFC after: 3 days	2013-05-12 16:43:26 +00:00
Gabor Kovesdan	ab3f6b347e	- Correct mispellings of the word occurrence Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)	2013-04-17 11:40:10 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Kirk McKusick	cd861931e6	The code in clear_remove() and clear_inodedeps() skips one entry in the pagedep and inodedep hash tables. An entry in the table is skipped because 'pagedep_hash' and 'inodedep_hash' hold the size of the hash tables - 1. The chance that this would have any operational failure is extremely unlikely. These funtions only need to find a single entry and are only called when there are too many entries. The chance that they would fail because all the entries are on the single skipped hash chain are remote. Submitted by: Pedro Martelletto Reviewed by: kib MFC after: 2 weeks	2013-04-03 19:26:32 +00:00
Kirk McKusick	baa12a84a7	The purpose of this change to the FFS layout policy is to reduce the running time for a full fsck. It also reduces the random access time for large files and speeds the traversal time for directory tree walks. The key idea is to reserve a small area in each cylinder group immediately following the inode blocks for the use of metadata, specifically indirect blocks and directory contents. The new policy is to preferentially place metadata in the metadata area and everything else in the blocks that follow the metadata area. The size of this area can be set when creating a filesystem using newfs(8) or changed in an existing filesystem using tunefs(8). Both utilities use the `-k held-for-metadata-blocks' option to specify the amount of space to be held for metadata blocks in each cylinder group. By default, newfs(8) sets this area to half of minfree (typically 4% of the data area). This work was inspired by a paper presented at Usenix's FAST '13: www.usenix.org/conference/fast13/ffsck-fast-file-system-checker Details of this implementation appears in the April 2013 of ;login: www.usenix.org/publications/login/april-2013-volume-38-number-2. A copy of the April 2013 ;login: paper can also be downloaded from: www.mckusick.com/publications/faster_fsck.pdf. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks	2013-03-22 21:45:28 +00:00
Konstantin Belousov	59a01b70af	UFS support of the unmapped i/o for the user data buffers. Sponsored by: The FreeBSD Foundation Tested by: pho, scottl, jhb, bf	2013-03-19 15:08:15 +00:00
Konstantin Belousov	e81ff91e62	Do not remap usermode pages into KVA for physio. Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-19 14:43:57 +00:00
Konstantin Belousov	70e198dd07	Some style fixes. Sponsored by: The FreeBSD Foundation	2013-03-14 20:31:39 +00:00
Konstantin Belousov	c535690b33	Add currently unused flag argument to the cluster_read(), cluster_write() and cluster_wbuild() functions. The flags to be allowed are a subset of the GB_* flags for getblk(). Sponsored by: The FreeBSD Foundation Tested by: pho	2013-03-14 20:28:26 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Konstantin Belousov	ba05dec5a4	The softdep freeblks workitem might hold a reference on the dquot. Current dqflush() panics when a dquot with with non-zero refcount is encountered. The situation is possible, because quotas are turned off before softdep workitem queue if flushed, due to the quota file writes might create softdep workitems. Make the encountering an active dquot in dqflush() not fatal, return the error from quotaoff() instead. Ignore the quotaoff() failures when ffs_flushfiles() is called in the course of softdep_flushfiles() loop, until the last iteration. At the last loop, the quotas must be closed, and because SU workitems should be already flushed, the references to dquot are gone. Sponsored by: The FreeBSD Foundation Reported and tested by: pho Reviewed by: mckusick MFC after: 2 weeks	2013-02-27 07:32:39 +00:00
Konstantin Belousov	94f4ac214c	An inode block must not be blockingly read while cg block is owned. The order is inode buffer lock -> snaplk -> cg buffer lock, reversing the order causes deadlocks. Inode block must not be written while cg block buffer is owned. The FFS copy on write needs to allocate a block to copy the content of the inode block, and the cylinder group selected for the allocation might be the same as the owned cg block. The reserved block detection code in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the situation, because the locked cg buffer is not exposed to it. In order to maintain the dependency between initialized inode block and the cg_initediblk pointer, look up the inode buffer in non-blocking mode. If succeeded, brelse cg block, initialize the inode block and write it. After the write is finished, reread cg block and update the cg_initediblk. If inode block is already locked by another thread, let the another thread initialize it. If another thread raced with us after we started writing inode block, the situation is detected by an update of cg_initediblk. Note that double-initialization of the inode block is harmless, the block cannot be used until cg_initediblk is incremented. Sponsored by: The FreeBSD Foundation In collaboration with: pho Reviewed by: mckusick MFC after: 1 month X-MFC-note: after r246877	2013-02-27 07:31:23 +00:00
Kirk McKusick	7839b23f00	The UFS2 filesystem allocates new blocks of inodes as they are needed. When a cylinder group runs short of inodes, a new block for inodes is allocated, zero'ed, and written to the disk. The zero'ed inodes must be on the disk before the cylinder group can be updated to claim them. If the cylinder group claiming the new inodes were written before the zero'ed block of inodes, the system could crash with the filesystem in an unrecoverable state. Rather than adding a soft updates dependency to ensure that the new inode block is written before it is claimed by the cylinder group map, we just do a barrier write of the zero'ed inode block to ensure that it will get written before the updated cylinder group map can be written. This change should only slow down bulk loading of newly created filesystems since that is the primary time that new inode blocks need to be created. Reported by: Robert Watson Reviewed by: kib Tested by: Peter Holm	2013-02-16 15:11:40 +00:00
Konstantin Belousov	9604a7f1b8	Fix several unsafe pointer dereferences in the buffered_write() function, implementing the sysctl vfs.ffs.set_bufoutput (not used in the tree yet). - The current directory vnode dereference is unsafe since fd_cdir could be changed and unreferenced, lock the filedesc around and vref the fd_cdir. - The VTOI() conversion of the fd_cdir is unsafe without first checking that the vnode is indeed from an FFS mount, otherwise the code dereferences a random memory. - The cdir could be reclaimed from under us, lock it around the checks. - The type of the fp vnode might be not a disk, or it might have changed while the thread was in flight, check the type. Reviewed and tested by: mckusick MFC after: 2 weeks	2013-02-10 10:17:33 +00:00
Kirk McKusick	fe85d98a5b	For UFS2 i_blocks is unsigned. The current "sanity" check that it has gone below zero after the blocks in its inode are freed is a no-op which the compiler fails to warn about because of the use of the DIP macro. Change the sanity check to compare the number of blocks being freed against the value i_blocks. If the number of blocks being freed exceeds i_blocks, just set i_blocks to zero. Reported by: Pedro Giffuni (pfg@) MFC after: 2 weeks	2013-02-03 17:16:32 +00:00
Konstantin Belousov	ddd6b3fc33	Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags(). Sponsored by: The FreeBSD Foundation	2013-01-11 06:08:32 +00:00
Konstantin Belousov	f99cb34c4f	The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that is must not be called while the snaplock is owned. The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock. Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-01-01 16:14:48 +00:00
Konstantin Belousov	91e9474552	Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags. Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervene between resume and vn_start_write(), the deadlock occured after the suspending thread tried to lock the snaplock, most typically during the write in the ffs_copyonwrite(). Reported and tested by: Andreas Longwitz <longwitz@incore.de> Reviewed by: mckusick MFC after: 2 weeks X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD	2012-12-28 23:08:30 +00:00

1 2 3 4 5 ...

1389 Commits