freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	f30ddc49e6	If IO_SYNC was passed to ffs_truncate(), request synchronous inode update from the final ffs_update(). Noted by: bde MFC after: 1 week	2016-05-17 21:30:58 +00:00
Konstantin Belousov	3d0691acdc	For async UFS mounts, shrink the directory asynchronously, at least do not pass IO_SYNC to ffs_truncate() unneccessary. Submitted by: bde MFC after: 1 week	2016-05-17 21:28:28 +00:00
Konstantin Belousov	98cbffd7bd	Fix comments. Submitted by: bde MFC after: 1 week	2016-05-17 08:24:27 +00:00
Konstantin Belousov	5b6526413f	Fix typo in the message. Submitted by: bde MFC after: 1 week	2016-05-17 08:19:20 +00:00
Pedro F. Giffuni	08e941833d	UFS: spelling fixes on comments. No functional change.	2016-04-29 20:43:51 +00:00
Enji Cooper	028d33c96a	Add FEATURE knob for testing for UFS extended attribute kernel support Support can be verified via `feature_present("ufs_extattr")`, etc. Differential Revision: https://reviews.freebsd.org/D6053 MFC after: 2 weeks Relnotes: yes Reviewed by: asomers, kib Sponsored by: EMC / Isilon Storage Division	2016-04-22 08:09:27 +00:00
Pedro F. Giffuni	d9c9c81c08	sys: use our roundup2/rounddown2() macros when param.h is available. rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement. This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.	2016-04-21 19:57:40 +00:00
Pedro F. Giffuni	abafa4db03	ufs: replace 0 with NULL for pointers. While here also do late initialization of the variables we are changing. Found with devel/coccinelle. Reviewed by: mckusick MFC after: 2 weeks	2016-04-10 21:48:11 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Edward Tomasz Napierala	35030a5dd4	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-29 13:56:59 +00:00
Konstantin Belousov	c79dff0fbe	Split the global taskqueue used to process all UFS trim completions, into per-mount taskqueue with the private taskqueue processing thread. This allows to drain the taskqueue on unmount, to ensure that all TRIMs are finished before mount structures are freed. But just draining the taskqueue where TRIM biodone geom-up completions are processed is not enough, since ffs_blkfree(), called by the task, might result in more writes. Count inflight delayed blkfree's and pause() unmount until the counter drains as well. Reported by: Nick Evans <nevans@talkpoint.com> Tested by: Nick Evans <nevans@talkpoint.com>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-27 08:21:17 +00:00
Konstantin Belousov	d016708959	Style: wrap long lines. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-03-27 08:07:12 +00:00
Konstantin Belousov	6336aefc1d	Fix locking mistake in softdep_waitidle(). The surrounding code expects that the loop is always exited with the SU lock owned, even on error. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2016-03-23 09:58:51 +00:00
Kirk McKusick	24b6748c42	The UFS filesystem requires that the last block of a file always be allocated. When shortening the length of a file in which the new end of the file contains a hole, the hole must have a block allocated. Reported by: Maxim Sobolev Reviewed by: kib Tested by: Peter Holm	2016-02-24 01:58:40 +00:00
Edward Tomasz Napierala	50102a6342	Remove ffs_mountroot() prototype; seems to be long gone. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-01-28 12:21:23 +00:00
Kirk McKusick	abe53f7e91	This fixes a bug in UFS2 exported NFS volumes. An NFS client can crash a server that has exported UFS2 by presenting a filehandle with an inode number that references an uninitialized inode in a cylinder group. The problem is that UFS2 only initializes blocks of inodes as they are first allocated and ffs_fhtovp() does not validate that the inode is in a range of inodes that have been initialized. Attempting to read an uninitialized inode gets random data from the disk. When the kernel tries to interpret it as an inode, panics often arise. Reported by: Christos Zoulas (from NetBSD) Reviewed by: kib	2016-01-27 21:27:05 +00:00
Kirk McKusick	d2a28cb080	The bread() function was inconsistent about whether it would return a buffer pointer in the event of an error (for some errors it would return a buffer pointer and for other errors it would not return a buffer pointer). The cluster_read() function was similarly inconsistent. Clients of these functions were inconsistent in handling errors. Some would assume that no buffer was returned after an error and would thus lose buffers under certain error conditions. Others would assume that brelse() should always be called after an error and would thus panic the system under certain error conditions. To correct both of these problems with minimal code churn, bread() and cluster_write() now always free the buffer when returning an error thus ensuring that buffers will never be lost. The brelse() routine checks for being passed a NULL buffer pointer and silently returns to avoid panics. Thus both approaches to handling error returns from bread() and cluster_read() will work correctly. Future code should be written assuming that bread() and cluster_read() will never return a buffer with an error, so should not attempt to brelse() the buffer when an error is returned. Reviewed by: kib	2016-01-27 21:23:01 +00:00
Konstantin Belousov	44de1ba141	Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the ast was rescheduled during VFS_SYNC(). It is possible that enough parallel writes or slow/hung volume result in VFS_SYNC() deferring to the ast flushing of workqueue. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-12-21 11:50:32 +00:00
Konstantin Belousov	dc15d94fd9	Update ctime when atime or birthtime are updated. Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or IN_UPDATE, then clear them with ufs_itimes(), making transient (possibly inconsistent) change to the times, and then copy user-supplied times into the inode. Instead, directly clear IN_ACCESS or IN_UPDATE when user supplied the time, and copy the value into the inode. Minor inconsistency left is that the inode ctime is updated even when birthtime update attempt is performed on a UFS1 volume. Submitted by: bde MFC after: 2 weeks	2015-12-07 12:09:04 +00:00
Kirk McKusick	43a993bb7d	For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:01:02 +00:00
Konstantin Belousov	fe920528c1	Do not perform read-ahead for BA_CLRBUF request when we are low on memory or when dirty buffer queue is too large. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-10-27 13:44:13 +00:00
Warner Losh	476dfdc3bf	Do not relocate extents to make them contiguous if the underlying drive can do deletions. Ability to do deletions is a strong indication that this optimization will not help performance. It will only generate extra write traffic. These devices are typically flash based and have a limited number of write cycles. In addition, making the file contiguous in LBA space doesn't improve the access times from flash devices because they have no seek time. Reviewed by: mckusick@	2015-10-16 03:06:02 +00:00
Gleb Smirnoff	20e8b32db4	In softdep_setup_freeblocks(): - Move the bread() to the beginning of function. - Return if it fails, otherwise we will panic. Submitted by: mckusick Sponsored by: Netflix	2015-10-07 12:36:28 +00:00
Konstantin Belousov	7a82f35c9d	Do not consume extra reference. This is a bug in r287479. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 12:28:18 +00:00
Konstantin Belousov	c65f759837	Declare the writes around the call to VFS_SYNC() in softdep_ast_cleanup_proc(). Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-05 08:48:24 +00:00
Konstantin Belousov	76cae067b8	By doing file extension fast, it is possible to create excess supply of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and D_ALLOCINDIR), which can exhaust kmem. Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and D_DIRREM, by scheduling ast to flush dependencies, after the thread, which created new dep, left the VFS/FFS innards. For D_NEWBLK, the only way to get rid of them is to do full sync, since items are attached to data blocks of arbitrary vnodes. The check for D_NEWBLK excess in softdep_ast_cleanup_proc() is unlocked. For 32bit arches, reduce the total amount of allowed dependencies by two. It could be considered increasing the limit for 64 bit platforms with direct maps. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:07:27 +00:00
Jeff Roberson	98082691bb	- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has lead me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-07-29 02:26:57 +00:00
Jeff Roberson	fade8dd714	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division	2015-07-23 19:13:41 +00:00
Mateusz Guzik	f0725a8e1e	Move chdir/chroot-related fdp manipulation to kern_descrip.c Prefix exported functions with pwd_. Deduplicate some code by adding a helper for setting fd_cdir. Reviewed by: kib	2015-07-11 16:19:11 +00:00
Mark Johnston	5f34e93c58	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
Mark Murray	d1b06863fb	Huge cleanup of random(4) code. * GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments. * src/share/man/... - Update the man pages. * src/etc/... - The startup/shutdown work is done in D2924. * src/UPDATING - Add UPDATING announcement. * src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests. * src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.. live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise. * src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go. * src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching. * src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed. * src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL. * src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test. * src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) * src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'staic struct start_cache'. - Tighten up up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explictit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW. Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)	2015-06-30 17:00:45 +00:00
Konstantin Belousov	521987c3e5	Simplify code, no need to test the flag before clearing it. Submitted by: ed MFC after: 12 days	2015-06-29 13:06:24 +00:00
Konstantin Belousov	b2c3df842b	Handle errors from background write of the cylinder group blocks. First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR. Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps. We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan. In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks	2015-06-27 09:44:14 +00:00
Konstantin Belousov	1eabd96728	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
Mateusz Guzik	4da8456f0a	Replace struct filedesc argument in getvnode with struct thread This is is a step towards removal of spurious arguments.	2015-06-16 13:09:18 +00:00
Konstantin Belousov	46c3d3ac99	Syncing a directory vnode might drop the vnode lock in the softdep_sync() similarly to the regular vnode sync. Allow retry for both vnode types. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-06-03 20:48:00 +00:00
Konstantin Belousov	aef373ce38	Remove unused variable. When deallocate_dependencies() is performed, softdep_journal_freeblocks() already called cancel_allocdirect() which should have eliminated direct dependencies for all truncated full blocks. The indirect dependencies are allowed above, since second- and third-level dependencies are only dealt with by the code which frees indirect block, which happens after the inode write. Discussed with: mckusick, jeff Reviewed by: jeff Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-31 15:50:54 +00:00
Konstantin Belousov	69baeadc31	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-29 13:24:17 +00:00
Konstantin Belousov	0bc4fe10ad	After r283600, NODELAY flag to inodedep_lookup() function is unused. Eliminate it, and simplify code by removing the local dflags variable always initialized to DEPALLOC. Noted by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:49:04 +00:00
Konstantin Belousov	1bc93bb7b9	Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread. Instead of synchronously waiting for the worker, perform the worker' tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast. There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitely checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nsfd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup. Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:20:42 +00:00
Kirk McKusick	74a87c3804	Limit the number of cylinder groups that will be searched when trying to build a cluster. The limit is tunable using the sysctl vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups per block allocation. It was previously limited to the number of cylinder groups in the filesystem per block allocation. When there were no clusters of the needed size left, it repeatedly searched the whole filesystem for a non-existent cluster on every block allocation. The result was very slow filesystem allocation with 100% CPU utilization. The old behavior can be had by setting vfs.ffs.maxclustersearch to a huge number (1,000,000). This change affects only the layout policy routines so is not able to interfere with the integrity of the filesystem. Reported by: Dmitry Sivachenko (demon@) Tested by: Dmitry Sivachenko (demon@) MFC after: 2 weeks	2015-04-24 23:27:50 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Konstantin Belousov	9928931186	Fix build (with gcc). Reported by: bz, ian Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 15:49:21 +00:00
Konstantin Belousov	4af9f77ecb	Fix the hand after the immediate reboot when the following command sequence is performed on UFS SU+J rootfs: cp -Rp /sbin/init /sbin/init.old mv -f /sbin/init.old /sbin/init Hang occurs on the rootfs unmount. There are two issues: 1. Removed init binary, which is still mapped, creates a reference to the removed vnode. The inodeblock for such vnode must have active inodedep, which is (eventually) linked through the unlinked list. This means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of softdep workitems for the mp is always > 0. FFS is suspended during unmount, so unmount just hangs. 2. As noted above, the inodedep is linked eventually. It is not linked until the superblock is written. But at the vfs_unmountall() time, when the rootfs is unmounted, the call is made to ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls ffs_sbupdate() after all workitems are flushed. It is masked for normal system operations, because syncer works in parallel and eventually flushes superblock. Syncer is stopped when rootfs unmounted, so ffs_sync() must do sb update on its own. Correct the issues listed above. For MNT_SUSPEND, count the number of linked unlinked inodedeps (this is not a typo) and substract the count of such workitems from the total. For the second issue, the ffs_sbupdate() is called right after device sync in ffs_sync() loop. There is third problem, occuring with both SU and SU+J. The softdep_waitidle() loop, which waits for softdep_flush() thread to clear the worklist, only waits 20ms max. It seems that the 1 tick, specified for msleep(9), was a typo. Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to significantly help the softdep thread, and change the MNT_LAZY update at the reboot time to MNT_WAIT for similar reasons. Note that userspace cannot create more work while devvp is flushed, since the mount point is always suspended before the call to softdep_waitidle() in unmount or remount path. PR: 195458 In collaboration with: gjb, pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-27 13:55:56 +00:00
Konstantin Belousov	f6f76e1747	Partially revert r277922, avoid sleeping and do flush if we a awaken, instead of waiting for the FLUSH_* flags. Also, when requesting flush, do the wakeups unconditionally even when FLUSH_CLEANUP flag was already set. Reported and tested by: dim, "Lundberg, Johannes" <johannes@brilliantservice.co.jp> Bisected by: dim MFC after: 2 weeks	2015-02-05 13:00:27 +00:00
Konstantin Belousov	1c9b585695	When mounting SU-enabled mount point, wait until the softdep_flush() thread started and incremented the stat_flush_threads [1]. Unconditionally wakeup softdep_flush threads when needed, do not try to check wchan, which is racy and breaks abstraction. Reported by and discussed with: glebius, neel Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-30 11:41:46 +00:00
Konstantin Belousov	cda374fe00	The sys_quotactl() contract demands that the mount point is vfs_unbusy()ed when the cmd is Q_QUOTAON, regardless of other input parameters or error return. Submitted by: Conrad Meyer Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D1684 Tested by: pho MFC after: 1 week	2015-01-27 10:32:49 +00:00
Konstantin Belousov	789bdfdbc6	Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some fs, e.g. smbfs, already did it. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-21 13:29:33 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Gleb Kurtsou	dde58752db	Adjust printf format specifiers for dev_t and ino_t in kernel. ino_t and dev_t are about to become uint64_t. Reviewed by: kib, mckusick	2014-12-17 07:27:19 +00:00
Gleb Smirnoff	90effb2341	Merge from projects/sendfile: o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument. o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager. o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager. Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-23 12:01:52 +00:00
Gleb Smirnoff	dbfd8ef2e4	buf.h is not needed here, and pollutes when ufsmount.h is included from userland code. Sponsored by: Nginx, Inc.	2014-11-23 01:02:19 +00:00
Gleb Smirnoff	8c7f0b92b4	Include required files directly instead of pollution via ufs/ufsmount.h. Sponsored by: Nginx, Inc.	2014-11-23 01:01:14 +00:00
Davide Italiano	0a4102656e	Use the correct variable name.	2014-11-22 00:42:30 +00:00
Davide Italiano	c38b12220f	Make ufs_dirhashreclaimperc a percentage for real and rename it to ufs_dirhashreclaimpercent, as suggested by jhb@. As an added bonus this avoids divide-by-zero errors. Requested by: jhb, markj Reviewied by: jhb, markj	2014-11-22 00:37:37 +00:00
Konstantin Belousov	ca109b01cf	When non-forced unmount or remount rw->ro is performed, writes on UFS are not suspended. In particular, on the SU-enabled vulumes, there is no reason why, between the call to softdep_flushfiles() and softdep_waitidle(), SU work items cannot be queued. Correct the condition to trigger the panic by only checking when forced operation is done. Convert direct panic() call into KASSERT(), there is no invalid on-disk data structures directly involved, so follow the usual debugging vs. non-debugging approach. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-11-02 13:14:55 +00:00
Mateusz Guzik	4fce16e4c9	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks	2014-10-20 18:00:50 +00:00
Mateusz Guzik	8f889ce7c8	Use lockless quota checks in qsync and qsyncvp. No strong objections from: kib, mckusick MFC after: 1 week	2014-10-16 12:41:14 +00:00
Konstantin Belousov	df5c9c0411	Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS survives remount in rw, also it is set for vnodes on rootfs before noatime can be set or clock is adjusted. All conditions result in wrong atime for accessed vnodes. Submitted by: bde MFC after: 1 week	2014-10-11 19:09:56 +00:00
Warner Losh	34eed6d3de	Restore the backed-out change, using __offsetof instead.	2014-10-10 00:35:08 +00:00
Baptiste Daroussin	16b2833f90	Backout r272825 every useland usage of ufs/ufs/dir.h are now broken with that change	2014-10-09 17:26:29 +00:00
Baptiste Daroussin	92a1d420ba	Use offsetof() from sys/types.h instead of a custom one This fixes build with recent gcc versions	2014-10-09 15:26:22 +00:00
Konstantin Belousov	d15b55c554	Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq(). Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)	2014-09-15 12:28:29 +00:00
Alan Cox	3e5c84e292	We don't need an exclusive object lock on the expected execution path through {ext2,ffs}_getpages(). Reviewed by: kib, pfg MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-13 18:26:13 +00:00
Konstantin Belousov	c6fef2d49a	Direct access to the quota files, in particular, lookup, causes lock conflict with the quota metadata access. Mark quota vnode lock as recursive and always exclusive to avoid the problem. Reported by: hrs Tested by: hrs, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-29 09:04:24 +00:00
Davide Italiano	266d427bc0	Rather than using an hardcoded reclaim age, rely on an LRU-like approach for dirhash cache, setting a target percent to reclaim (exposed via SYSCTL). This allows to always make some amount of progress keeping the maximum reclaim age dynamic. Tested by: pho Reviewed by: jhb	2014-08-25 17:06:18 +00:00
Konstantin Belousov	29d03af1e4	Do not busy the UFS mount point inside VOP_RENAME(). The kern_renameat() already starts write on the mp, which prevents parallel unmount from proceed. Busying mp after vn_start_write() deadlocks the unmount. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:15:23 +00:00
Konstantin Belousov	4ce90426e0	Correct the test for condition to suspend UFS filesystem during unmount. There is no need to suspend read-only filesystem, while we need suspension on modificable mount point. Reported by: rwatson Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-20 08:13:03 +00:00
Konstantin Belousov	a6b5e6e32f	Revision r269457 removed the Giant around mount and unmount code, but r269533, which was tested before r269457 was committed, implicitely relied on the Giant to protect the manipulations of the softdepmounts list. Use softdep global lock consistently to guarantee the list structure now. Insert the new struct mount_softdeps into the softdepmounts only after it is sufficiently initialized, to prevent softdep_speedup() from accessing bare memory. Similarly, remove struct mount_softdeps for the unmounted filesystem from the tailq before destroying structure rwlock. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-08-12 09:33:00 +00:00
Kirk McKusick	3da0b29d99	The SUJ journal is only prepared to handle full-size block numbers, so we have to adjust freeblk records to reflect the change to a full-size block. For example, suppose we have a block made up of fragments 8-15 and want to free its last two fragments. We are given a request that says: FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0 where frags are the number of fragments to free and oldfrags are the number of fragments to keep. To block align it, we have to change it to have a valid full-size blkno, so it becomes: FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6 Submitted by: Mikihito Takehara Tested by: Mikihito Takehara Reviewed by: Jeff Roberson MFC after: 1 week	2014-08-07 16:53:07 +00:00
Kirk McKusick	5f9500c358	Add support for multi-threading of soft updates. Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process. Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix	2014-08-04 22:03:58 +00:00
Warner Losh	ddd812b850	Simplify comment to remove multiple negative and passive voice.	2014-07-23 16:18:54 +00:00
Konstantin Belousov	65589a29f4	Check for the cross-device cross-link attempt in the VFS, instead of forcing filesystem VOP_LINK() methods to repeat the code. In tmpfs_link(), remove redundand check for the type of the source, already done by VFS. Note that NFS server already performs this check before calling VOP_LINK(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-16 14:04:46 +00:00
Konstantin Belousov	895b3782c6	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:10:00 +00:00
Konstantin Belousov	7b81a399a4	In msdosfs_setattr(), add a check for result of the utimes(2) permissions test, forgotten in r164033. Refactor the permission checks for utimes(2) into vnode helper function vn_utimes_perm(9), and simplify its code comparing with the UFS origin, by writing the call to VOP_ACCESSX only once. Use the helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5). Reported by: bde Reviewed by: bde, trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-17 07:11:00 +00:00
Konstantin Belousov	23f6698fbd	Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko module' sources. In addition to stopping breaking the layering violation, it also allows to link kernel when FFS is configured as module and DIRECTIO is enabled. One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-06-08 10:55:06 +00:00
John-Mark Gurney	e9838c1140	don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags.. If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2, you'll end up corrupting your FS as the superblock is updated and written to a different location... makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing this issue... Reviewed by: silience from mckusick MFC after: 1 week	2014-06-03 21:46:13 +00:00
Scott Long	7d155880ee	Due to reasons unknown at this time, the system can be forced to write a journal block even when there are no journal entries to be written. Until the root cause is found, handle this case by ensuring that a valid journal segment is always written. Second, the data buffer used for writing journal entries was never being scrubbed of old data. Fix this. Submitted by: Takehara Mikihito Obtained from: Netflix, Inc. MFC after: 3 days	2014-05-06 20:40:16 +00:00
Kirk McKusick	e24784d32a	Update comment to explain search order reverted to historical order in -r254996. Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days	2014-03-22 11:26:39 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Jeff Roberson	4803948fe2	- If we fail to do a non-blocking acquire of a buf lock while doing a waiting sync pass we need to do a blocking acquire and restart. Another thread, typically the buf daemon, may have this buf locked and if we don't wait we can fail to sync the file. This lead to a great variety of softdep panics because we rely on all dependencies being flushed before proceeding in several cases. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:13:21 +00:00
Jeff Roberson	c13a58b022	- Gracefully handle truncation failures when trying to shrink directories. This could cause dirhash panics since the dirhash state would be successfully truncated while the directory was not. Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks	2014-03-06 00:10:07 +00:00
Pedro F. Giffuni	4896af9f55	ufs: small formatting fixes. Cleanup some extra space. Use of tabs vs. spaces. No functional change. MFC after: 3 days Reviewed by: mckusick	2014-03-02 02:52:34 +00:00
Kirk McKusick	be1fd82346	Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks	2013-12-30 17:04:24 +00:00
Kirk McKusick	9cd7cb1bf1	Properly handle unsigned comparison. MFC after: 2 weeks	2013-12-30 06:19:42 +00:00
Kirk McKusick	2e436d88a3	We needlessly panic when trying to flush MKDIR_PARENT dependencies. We had previously tried to flush all MKDIR_PARENT dependencies (and all the NEWBLOCK pagedeps) by calling ffs_update(). However this will only resolve these dependencies in direct blocks. So very large directories with MKDIR_PARENT dependencies in indirect blocks had not yet gotten flushed. As the directory is in the midst of doing a complete sync, we simply defer the checking of the MKDIR_PARENT dependencies until the indirect blocks have been sync'ed. Reported by: Shawn Wallbridge of imaginaryforces.com Tested by: John-Mark Gurney <jmg@funkthat.com> PR: 183424 MFC after: 2 weeks	2013-12-01 07:34:21 +00:00
John-Mark Gurney	e0ce310797	fix white space... MFC after: 1 week	2013-11-20 21:21:29 +00:00
John-Mark Gurney	b6ffc3b567	fix a use after free, jsegdep_merge will free wk, avoid the next check... CID: 1006098 Sponsored by: Imaginary Forces Reviewed by: mckusick MFC after: 1 week	2013-11-20 21:16:53 +00:00
Pedro F. Giffuni	4b367145f7	UFS2: make di_extsize unsigned. di_extsize is the EA size and as such it should be unsigned. Adjust related types for consistency. Reviewed by: mckusick (previous version) MFC after: 3 weeks	2013-10-24 00:33:29 +00:00
Brooks Davis	cf058082cd	Allow kernels without options SOFTUPDATES to build. This should fix the embedded tinderboxes. Reviewed by: emaste	2013-10-21 20:51:08 +00:00
Kirk McKusick	07599ccb28	Fix build problem on ARM (which defaults to building without soft updates). Reported by: Tinderbox Sponsored by: Netflix	2013-10-21 13:09:09 +00:00
Kirk McKusick	58941b9f15	Restructuring of the soft updates code to set it up so that the single kernel-wide soft update lock can be replaced with a per-filesystem soft-updates lock. This per-filesystem lock will allow each filesystem to have its own soft-updates flushing thread rather than being limited to a single soft-updates flushing thread for the entire kernel. Move soft update variables out of the ufsmount structure and into their own mount_softdeps structure referenced by ufsmount field um_softdep. Eventually the per-filesystem lock will be in this structure. For now there is simply a pointer to the kernel-wide soft updates lock. Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock pointer in the mount_softdeps structure instead of a pointer to the kernel-wide soft-updates lock. Replace the five hash tables used by soft updates with per-filesystem copies of these tables allocated in the mount_softdeps structure. Several functions that flush dependencies when too many are allocated in the kernel used to operate across all filesystems. They are now parameterized to flush dependencies from a specified filesystem. For now, we stick with the round-robin flushing strategy when the kernel as a whole has too many dependencies allocated. While there are many lines of changes, there should be no functional change in the operation of soft updates. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-21 00:28:02 +00:00
Kirk McKusick	cc76ac5a6c	Fourth of several cleanups to soft dependency implementation. Add KASSERTS that soft dependency functions only get called for filesystems running with soft dependencies. Calling these functions when soft updates are not compiled into the system become panic's. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 22:21:01 +00:00
Kirk McKusick	519e3c3b9f	Third of several cleanups to soft dependency implementation. Ensure that softdep_unmount() and softdep_setup_sbupdate() only get called for filesystems running with soft dependencies. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 21:11:40 +00:00
Kirk McKusick	90a306d8af	Second of several cleanups to soft dependency implementation. Delete two unused functions in ffs_sofdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:52:07 +00:00
Kirk McKusick	8850120f48	First of several cleanups to soft dependency implementation. Convert three functions exported from ffs_softdep.c to static functions as they are not used outside of ffs_softdep.c. No functional change. Tested by: Peter Holm and Scott Long Sponsored by: Netflix	2013-10-20 20:41:38 +00:00
Pedro F. Giffuni	b876c780d0	Make di_blocks unsigned in UFS1 as is the case already for UFS2. Most of the code between UFS1 and UFS2 is shared so this change is pretty safe. Not only this makes UFS1 and 2 consistent but it also matches what NetBSD and MacOS X have for some years now. Reviewed by: mckusick MFC after: 1 month	2013-10-14 18:17:09 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Kirk McKusick	2ce451f089	In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), determine that old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order. MFC after: 2 weeks	2013-08-28 17:46:32 +00:00
Kirk McKusick	28702816d8	A performance problem was reported in PR kern/181226: I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally to gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782. The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas. The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks	2013-08-28 17:38:05 +00:00

1 2 3 4 5 ...

2036 Commits