Commit Graph

903 Commits

Author SHA1 Message Date
Ed Schouten
2433a4eb04 Make it possible to implement poll(2) on top of kqueue(2).
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.

Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.

Reviewed by:	jmg
Differential Revision:	https://reviews.freebsd.org/D3303
2015-08-05 07:34:29 +00:00
Edward Tomasz Napierala
57a73b26e0 Mark vgonel() as static. It was already declared static earlier;
no idea why compilers don't warn about this.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-04 08:51:56 +00:00
Mateusz Guzik
752fc07d33 vfs: implement v_holdcnt/v_usecount manipulation using atomic ops
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by:	kib (earlier version)
Tested by:	pho
2015-07-16 13:57:05 +00:00
Mateusz Guzik
c634b75204 vfs: always clear VI_OWEINACT in consumers bumping v_usecount
Previously vputx would detect the condition and clear the flag.

With this change it is invalid to have both v_usecount > 0 and the flag
set. Assert the condition is met in all revlevant places.

Reviewed by:	kib
2015-07-11 16:28:55 +00:00
Mateusz Guzik
2d1ca3cdff vfs: move si_usecount manipulation to dedicated functions
Reviewed by:	kib
2015-07-11 16:28:12 +00:00
Konstantin Belousov
cf88021ab1 Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed.  Buffer
cache cannot write such buffers.  Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks.  Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-11 11:21:56 +00:00
Mark Johnston
8bbd1f25b1 Remove a stale descriptive comment for gbincore().
The splay trees referenced in the comment were converted to
path-compressed tries in r250551.

MFC after:	3 days
2015-07-05 22:44:41 +00:00
John-Mark Gurney
1977bd233a zero this struct as it depends upon it...
Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D2890
2015-06-23 18:40:20 +00:00
Konstantin Belousov
1eabd96728 vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active.  This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list.  Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation.  In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list.  When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-06-17 04:46:58 +00:00
Konstantin Belousov
780dca1b1e Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode.  Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced.  dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:22:50 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Konstantin Belousov
08189ed667 The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems.  Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by:  EMC / Isilon Storage Division
Submitted by:   Conrad Meyer
MFC after:	1 week
2015-02-27 16:43:50 +00:00
Enji Cooper
c514f051b7 Add the mnt_lockref field to the ddb(4) 'show mount' command
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D1688
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC / Isilon Storage Division
2015-02-17 09:31:58 +00:00
John Baldwin
1b76e0b732 Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
  exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
  calls to getnewvnode().

Differential Revision:	https://reviews.freebsd.org/D1671
Reviewed by:	kib
MFC after:	1 week
2015-02-14 17:02:51 +00:00
John Baldwin
a77a12340a Change the default VFS timestamp precision from seconds to microseconds.
Discussed on:	arch@
MFC after:	2 weeks
2015-01-25 19:56:45 +00:00
Konstantin Belousov
ea117d1735 The vinactive() call in vgonel() may start writes for the dirty pages,
creating delayed write buffers belonging to the reclaimed vnode.  Put
the buffer cleanup code after inactivation.

Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-13 16:02:37 +00:00
Konstantin Belousov
a77c72f5ae Apply chunk forgotten in r275620. Remove local variable for real.
CID:	1257462
Sponsored by:	The FreeBSD Foundation
2014-12-09 09:36:28 +00:00
Konstantin Belousov
a25100c539 Add functions syncer_suspend() and syncer_resume(), which are supposed
to be called before suspension and after resume, correspondingly.  The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems.  The syncer_resume() restores stopped
threads.

For now, only syncer is stopped.  This is needed, because each sync
loop causes superblock updates for UFS.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:48:57 +00:00
Mateusz Guzik
32e7f8e4d5 Don't take devmtx unnecessarily in vn_isdisk.
MFC after:	1 week
2014-10-15 05:17:36 +00:00
Will Andrews
9832a24d27 In the syncer, drop the sync mutex while patting the watchdog.
Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by:	asomers
MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	637548 on 2012/10/04
2014-10-01 15:32:28 +00:00
Konstantin Belousov
168f4ee0a8 Remove Giant acquisition from the mount and unmount pathes.
It could be claimed that two things were reasonable protected by
Giant.  One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx.  Another is vfsconf.vfc_refcount, which is
now updated with atomics.

Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-08-03 03:27:54 +00:00
Konstantin Belousov
634012b917 Remove one-time use macros which check for the vnode lifecycle. More,
some parts of the checks are in fact redundand in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags.  Two of the three macros were only used in assertions.

In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself.  Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.

In v_incr_usecount(), call vholdl() before incrementing other ref
counters.  The change is no-op, but it makes less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-29 16:42:34 +00:00
Alexander Motin
781c93d405 Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions.  But the original traversal code in that case should also
behave much worse, so we are not loosing much.

Reviewed by:	attilio
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 12:43:48 +00:00
Alexander Motin
4f655310bf Remove unneeded mountlist_mtx acquisition from sync_fsync().
All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
2014-06-11 12:56:49 +00:00
Alexander Motin
3345d73ca8 Remove extra branching from r267232.
MFC after:	2 weeks
2014-06-08 19:01:37 +00:00
Alexander Motin
590d636321 Use atomics to modify numvnodes variable.
This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-08 15:38:40 +00:00
Benjamin Kaduk
bf09eca2cb Check for mismatched vref()/vdrop()
Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop().  This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.

Reviewed by:	kib
Approved by:	hrs (mentor)
2014-05-21 03:11:27 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Konstantin Belousov
2c1531e746 Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by:    John Marshall <john.marshall@riverwillow.com.au>
Tested by:      John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-09 18:43:29 +00:00
Konstantin Belousov
d6498b153e When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-01 20:18:33 +00:00
Konstantin Belousov
fe39412e99 For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked.  If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:07:14 +00:00
Konstantin Belousov
27884e3bd1 Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by:	many
Tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-26 13:14:51 +00:00
Pawel Jakub Dawidek
4593c0ad6b In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated.
Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().
2013-08-17 14:13:45 +00:00
Konstantin Belousov
2c38cc792e When creation of the v_pollinfo raced and our instance of vpollinfo
must be destroyed, knlist_clear() and seldrain() calls could be
avoided, since vpollinfo was not used.  More, the knlist_clear()
calling protocol requires the knlist locked, which is not true at the
call site.

Split the destruction into the helper destroy_vpollinfo_free(), and
call it when raced, instead of destroy_vpollinfo().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:   3 days
2013-07-28 06:59:29 +00:00
Konstantin Belousov
a8b0523ae7 Clear the vnode knotes before destroying vpollinfo.
Reported and tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-17 10:56:21 +00:00
Konstantin Belousov
d39116f5d5 Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list.  Instead of
yielding, pause for 1 tick.  The change is reported to help in some
virtualized environments.

Submitted by:	Roger Pau Monn? <roger.pau@citrix.com>
Discussed with:	jilles
Tested by:	pho
MFC after:	2 weeks
2013-06-03 17:36:43 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Jeff Roberson
f2cc1285c2 - Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index.  The tree code
   auto-generates type safe wrappers.
 - Eliminate the buf splay and replace it with pctrie.  This is not only
   significantly faster with large files but also allows for the possibility
   of shared locking.

Reviewed by:    alc, attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-12 04:05:01 +00:00
Konstantin Belousov
0fc6daa72d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
Marcel Moolenaar
e63091ea6c Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.
2013-05-09 16:28:18 +00:00
Matthew D Fleming
770b41b349 Add missing vdrop() in error case.
Submitted by:	Fahad (mohd.fahadullah@isilon.com)
MFC after:	1 week
2013-05-04 18:38:16 +00:00
Rick Macklem
64fa8df6e0 Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Konstantin Belousov
10b4bb0b33 Add a trivial comment to record the proper commit log for r245407:
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address.  Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed).  For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:52:23 +00:00
Konstantin Belousov
a41df84820 diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
 #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

 /*
  * Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 static void
 vntblinit(void *dummy __unused)
 {
+	u_int i;
 	int physvnodes, virtvnodes;

 	/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
 	syncer_maxdelay = syncer_mask + 1;
 	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
 	cv_init(&sync_wakeup, "syncer");
+	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+		vnsz2log++;
+	vnsz2log--;
 }
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
 	}
 	rangelock_init(&vp->v_rl);

+	/*
+	 * For the filesystems which do not use vfs_hash_insert(),
+	 * still initialize v_hash to have vfs_hash_index() useful.
+	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+	 * its own hashing.
+	 */
+	vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
 	*vpp = vp;
 	return (0);
 }
2013-01-14 05:42:54 +00:00
Attilio Rao
c92c859b7b Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by:	EMC / Isilon storage division
MFC after:	3 days
2012-12-26 15:20:32 +00:00
Attilio Rao
b1308d72c2 Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by:	pho
Reviewed by:	kib, mdf
MFC after:	1 week
2012-12-21 13:14:12 +00:00
Konstantin Belousov
14df601e47 When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield.  Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication.  Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by:	hrs, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2012-12-15 02:04:46 +00:00
Konstantin Belousov
686ffcaceb Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic than.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by:	pho
Reviewed by:	attilio
Submitted by:	attilio [1]
MFC after:	3 days
2012-12-10 20:44:09 +00:00