freebsd-nq

Author	SHA1	Message	Date
Konstantin Belousov	1eabd96728	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-06-17 04:46:58 +00:00
Konstantin Belousov	780dca1b1e	Right now, dounmount() is called with unreferenced mount point. Nothing stops a parallel unmount to suceed before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of the sucessfull or erronous result. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:22:50 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Konstantin Belousov	08189ed667	The VNASSERT in vflush() FORCECLOSE case is trying to panic early to prevent errors from yanking devices out from under filesystems. Only care about special vnodes on devfs, special nodes on other kinds of filesystems do not have special properties. Sponsored by: EMC / Isilon Storage Division Submitted by: Conrad Meyer MFC after: 1 week	2015-02-27 16:43:50 +00:00
Enji Cooper	c514f051b7	Add the mnt_lockref field to the ddb(4) 'show mount' command MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D1688 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC / Isilon Storage Division	2015-02-17 09:31:58 +00:00
John Baldwin	1b76e0b732	Add two new counters for vnode life cycle events: - vfs.recycles counts the number of vnodes forcefully recycled to avoid exceeding kern.maxvnodes. - vfs.vnodes_created counts the number of vnodes created by successful calls to getnewvnode(). Differential Revision: https://reviews.freebsd.org/D1671 Reviewed by: kib MFC after: 1 week	2015-02-14 17:02:51 +00:00
John Baldwin	a77a12340a	Change the default VFS timestamp precision from seconds to microseconds. Discussed on: arch@ MFC after: 2 weeks	2015-01-25 19:56:45 +00:00
Konstantin Belousov	ea117d1735	The vinactive() call in vgonel() may start writes for the dirty pages, creating delayed write buffers belonging to the reclaimed vnode. Put the buffer cleanup code after inactivation. Add asserts that ensure that buffer queues are empty and add BO_DEAD flag for bufobj to check that no buffers are added after the cleanup. BO_DEAD is only used by INVARIANTS-enabled kernels. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-13 16:02:37 +00:00
Konstantin Belousov	a77c72f5ae	Apply chunk forgotten in r275620. Remove local variable for real. CID: 1257462 Sponsored by: The FreeBSD Foundation	2014-12-09 09:36:28 +00:00
Konstantin Belousov	a25100c539	Add functions syncer_suspend() and syncer_resume(), which are supposed to be called before suspension and after resume, correspondingly. The syncer_suspend() ensures that all filesystems dirty data and metadata are saved to the permanent storage, and stops kernel threads which might modify filesystems. The syncer_resume() restores stopped threads. For now, only syncer is stopped. This is needed, because each sync loop causes superblock updates for UFS. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-08 16:48:57 +00:00
Mateusz Guzik	32e7f8e4d5	Don't take devmtx unnecessarily in vn_isdisk. MFC after: 1 week	2014-10-15 05:17:36 +00:00
Will Andrews	9832a24d27	In the syncer, drop the sync mutex while patting the watchdog. Some watchdog drivers (like ipmi) need to sleep while patting the watchdog. See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK). Submitted by: asomers MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 637548 on 2012/10/04	2014-10-01 15:32:28 +00:00
Konstantin Belousov	168f4ee0a8	Remove Giant acquisition from the mount and unmount pathes. It could be claimed that two things were reasonable protected by Giant. One is vfsconf list links, which is converted to the new dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is now updated with atomics. Note that vfc_refcount still has the same races now as it has under the Giant, the unload of filesystem modules can happen while the module is still in use. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-08-03 03:27:54 +00:00
Konstantin Belousov	634012b917	Remove one-time use macros which check for the vnode lifecycle. More, some parts of the checks are in fact redundand in the surrounding code, and it is more clear what the conditions are by direct testing of the flags. Two of the three macros were only used in assertions. In vnlru_free(), all relevant parts of vholdl() were already inlined, except the increment of v_holdcnt itself. Do not call vholdl() to do the increment as well, this allows to make assertions in vholdl()/vhold() more strict. In v_incr_usecount(), call vholdl() before incrementing other ref counters. The change is no-op, but it makes less surprising to see the vnode state in debugger if interrupted inside v_incr_usecount(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-29 16:42:34 +00:00
Alexander Motin	781c93d405	Implement simple direct-mapped cache for popular filesystem identifiers to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while traversing through the list of mount points. This change significantly improves NFS server scalability, since it had to do this translation for every request, and the global lock becomes quite congested. This code is more optimized for relatively small number of mount points. On systems with hundreds of active mount points this simple cache may have many collisions. But the original traversal code in that case should also behave much worse, so we are not loosing much. Reviewed by: attilio MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 12:43:48 +00:00
Alexander Motin	4f655310bf	Remove unneeded mountlist_mtx acquisition from sync_fsync(). All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.	2014-06-11 12:56:49 +00:00
Alexander Motin	3345d73ca8	Remove extra branching from r267232. MFC after: 2 weeks	2014-06-08 19:01:37 +00:00
Alexander Motin	590d636321	Use atomics to modify numvnodes variable. This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(), that reduces number of global vnode_free_list_mtx mutex acquisitions from 4 to 2 per NFS request on ZFS, improving SMP scalability. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-08 15:38:40 +00:00
Benjamin Kaduk	bf09eca2cb	Check for mismatched vref()/vdrop() Assert that the hold count has not fallen below the use count, a situation that would only happen when a vref() (or similar) is erroneously paired with a vdrop(). This situation has not been observed in the wild, but could be helpful for someone implementing a new filesystem. Reviewed by: kib Approved by: hrs (mentor)	2014-05-21 03:11:27 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	2c1531e746	Do not flush buffers when the v_object of the passed vnode does not really belong to it. Such vnodes, with the pointers to other vnodes v_objects, are typically instantiated by the bypass filesystems. Invalidating mappings of other vnode pages and the pages is wrong, since reclamation of the upper vnode does not imply that lower vnode is reclaimed too. One of the consequences of the improper reclamation was destruction of the wired mappings of the lower vnode pages, triggering miscellaneous assertions in the VM system. Reported by: John Marshall <john.marshall@riverwillow.com.au> Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-09 18:43:29 +00:00
Konstantin Belousov	d6498b153e	When printing the vnode information from ddb, print the lengths of the dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-01 20:18:33 +00:00
Konstantin Belousov	fe39412e99	For vunref(), try to upgrade the vnode lock if the function was called with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:07:14 +00:00
Konstantin Belousov	27884e3bd1	Acquire a hold reference on the vnode when a knote is instantiated. Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-09-26 13:14:51 +00:00
Pawel Jakub Dawidek	4593c0ad6b	In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().	2013-08-17 14:13:45 +00:00
Konstantin Belousov	2c38cc792e	When creation of the v_pollinfo raced and our instance of vpollinfo must be destroyed, knlist_clear() and seldrain() calls could be avoided, since vpollinfo was not used. More, the knlist_clear() calling protocol requires the knlist locked, which is not true at the call site. Split the destruction into the helper destroy_vpollinfo_free(), and call it when raced, instead of destroy_vpollinfo(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2013-07-28 06:59:29 +00:00
Konstantin Belousov	a8b0523ae7	Clear the vnode knotes before destroying vpollinfo. Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-17 10:56:21 +00:00
Konstantin Belousov	d39116f5d5	Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments. Submitted by: Roger Pau Monn? <roger.pau@citrix.com> Discussed with: jilles Tested by: pho MFC after: 2 weeks	2013-06-03 17:36:43 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Jeff Roberson	f2cc1285c2	- Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers. - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking. Reviewed by: alc, attilio Sponsored by: EMC / Isilon Storage Division	2013-05-12 04:05:01 +00:00
Konstantin Belousov	0fc6daa72d	- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-05-11 11:17:44 +00:00
Marcel Moolenaar	e63091ea6c	Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.	2013-05-09 16:28:18 +00:00
Matthew D Fleming	770b41b349	Add missing vdrop() in error case. Submitted by: Fahad (mohd.fahadullah@isilon.com) MFC after: 1 week	2013-05-04 18:38:16 +00:00
Rick Macklem	64fa8df6e0	Allow the vnode to be unlocked for the weird case of LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a usecount on a vnode during NFSv4 recovery from an expired lease. Reported and tested by: pho MFC after: 2 weeks	2013-04-16 14:22:16 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Konstantin Belousov	10b4bb0b33	Add a trivial comment to record the proper commit log for r245407: Set the v_hash for a new vnode in the getnewvnode() to the value calculated based on the vnode structure address. Filesystems using vfs_hash_insert() override the v_hash using the standard formula of (inode_number + mnt_hashseed). For other filesystems, the initialization allows the vfs_hash_index() to provide useful hash too. Suggested, reviewed and tested by: peter Sponsored by: The FreeBSD Foundation MFC after: 5 days	2013-01-14 05:52:23 +00:00
Konstantin Belousov	a41df84820	diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c index 7c243b6..0bdaf36 100644 --- a/sys/kern/vfs_subr.c +++ b/sys/kern/vfs_subr.c @@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt) #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt) +static int vnsz2log; /* * Initialize the vnode management data structures. @@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, static void vntblinit(void dummy __unused) { + u_int i; int physvnodes, virtvnodes; / @@ -332,6 +334,9 @@ vntblinit(void dummy __unused) syncer_maxdelay = syncer_mask + 1; mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF); cv_init(&sync_wakeup, "syncer"); + for (i = 1; i <= sizeof(struct vnode); i <<= 1) + vnsz2log++; + vnsz2log--; } SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL); @@ -1067,6 +1072,14 @@ alloc: } rangelock_init(&vp->v_rl); + / + * For the filesystems which do not use vfs_hash_insert(), + * still initialize v_hash to have vfs_hash_index() useful. + * E.g., nullfs uses vfs_hash_index() on the lower vnode for + * its own hashing. + / + vp->v_hash = (uintptr_t)vp >> vnsz2log; + vpp = vp; return (0); }	2013-01-14 05:42:54 +00:00
Attilio Rao	c92c859b7b	Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1 case. There is no point in optimizing further the code and use a TRUE litteral for a path that does heavyweight stuff anyway (like lock acq), at the price of obfuscated code. Use the appropriate check where necessary and remove a macro. Sponsored by: EMC / Isilon storage division MFC after: 3 days	2012-12-26 15:20:32 +00:00
Attilio Rao	b1308d72c2	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There are no evidences that consumers could bear with such change in semantic and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week	2012-12-21 13:14:12 +00:00
Konstantin Belousov	14df601e47	When mnt_vnode_next_active iterator cannot lock the next vnode and yields, specify the user priority for the yield. Otherwise, a higher-priority (kernel) thread could fall into the priority-inversion with the thread owning the mutex lock. On single-processor machines or UP kernels, do not loop adaptively when the next vnode cannot be locked, instead yield unconditionally. Restructure the iteration initializer and the iterator to remove code duplication. Put the code to fetch and lock a vnode next to the current marker, into the mnt_vnode_next_active() function, and use it instead of repeating the loop. Reported by: hrs, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2012-12-15 02:04:46 +00:00
Konstantin Belousov	686ffcaceb	Do not yield while owning a mutex. The Giant reacquire in the kern_yield() is problematic than. The owned mutex is the mount interlock, and it is in fact not needed to guarantee the stability of the mount list of active vnodes, so fix the the issue by only taking the mount interlock for MNT_REF and MNT_REL operations. While there, augment the unconditional yield by some amount of spinning [1]. Reported and tested by: pho Reviewed by: attilio Submitted by: attilio [1] MFC after: 3 days	2012-12-10 20:44:09 +00:00
Konstantin Belousov	07840861b1	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active lists iterators helper functions did not provided the neccessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week	2012-12-03 22:15:16 +00:00
David Xu	3da9ab75f4	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days	2012-11-27 06:07:58 +00:00
Andriy Gapon	6b991098a7	assert_vop_locked: make the assertion race-free and more efficient this is really a minor improvement for the sake of correctness MFC after: 6 days	2012-11-24 13:11:47 +00:00
Andriy Gapon	4f15bb6730	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days	2012-11-22 10:36:10 +00:00
Attilio Rao	973b795b64	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks	2012-11-19 20:43:19 +00:00
Andriy Gapon	ab49c952d9	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days	2012-11-19 11:35:56 +00:00
Andriy Gapon	c496727c54	vnode_if: fix locking protocol description for lookup and cachedlookup Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days	2012-11-19 11:32:56 +00:00
Attilio Rao	bc2258da88	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.	2012-11-09 18:02:25 +00:00

1 2 3 4 5 ...

895 Commits