freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	634012b917	Remove one-time use macros which check for the vnode lifecycle. More, some parts of the checks are in fact redundand in the surrounding code, and it is more clear what the conditions are by direct testing of the flags. Two of the three macros were only used in assertions. In vnlru_free(), all relevant parts of vholdl() were already inlined, except the increment of v_holdcnt itself. Do not call vholdl() to do the increment as well, this allows to make assertions in vholdl()/vhold() more strict. In v_incr_usecount(), call vholdl() before incrementing other ref counters. The change is no-op, but it makes less surprising to see the vnode state in debugger if interrupted inside v_incr_usecount(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-29 16:42:34 +00:00
Alexander Motin	781c93d405	Implement simple direct-mapped cache for popular filesystem identifiers to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while traversing through the list of mount points. This change significantly improves NFS server scalability, since it had to do this translation for every request, and the global lock becomes quite congested. This code is more optimized for relatively small number of mount points. On systems with hundreds of active mount points this simple cache may have many collisions. But the original traversal code in that case should also behave much worse, so we are not loosing much. Reviewed by: attilio MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-12 12:43:48 +00:00
Alexander Motin	4f655310bf	Remove unneeded mountlist_mtx acquisition from sync_fsync(). All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.	2014-06-11 12:56:49 +00:00
Alexander Motin	3345d73ca8	Remove extra branching from r267232. MFC after: 2 weeks	2014-06-08 19:01:37 +00:00
Alexander Motin	590d636321	Use atomics to modify numvnodes variable. This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(), that reduces number of global vnode_free_list_mtx mutex acquisitions from 4 to 2 per NFS request on ZFS, improving SMP scalability. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-06-08 15:38:40 +00:00
Benjamin Kaduk	bf09eca2cb	Check for mismatched vref()/vdrop() Assert that the hold count has not fallen below the use count, a situation that would only happen when a vref() (or similar) is erroneously paired with a vdrop(). This situation has not been observed in the wild, but could be helpful for someone implementing a new filesystem. Reviewed by: kib Approved by: hrs (mentor)	2014-05-21 03:11:27 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	2c1531e746	Do not flush buffers when the v_object of the passed vnode does not really belong to it. Such vnodes, with the pointers to other vnodes v_objects, are typically instantiated by the bypass filesystems. Invalidating mappings of other vnode pages and the pages is wrong, since reclamation of the upper vnode does not imply that lower vnode is reclaimed too. One of the consequences of the improper reclamation was destruction of the wired mappings of the lower vnode pages, triggering miscellaneous assertions in the VM system. Reported by: John Marshall <john.marshall@riverwillow.com.au> Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-09 18:43:29 +00:00
Konstantin Belousov	d6498b153e	When printing the vnode information from ddb, print the lengths of the dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)	2013-10-01 20:18:33 +00:00
Konstantin Belousov	fe39412e99	For vunref(), try to upgrade the vnode lock if the function was called with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)	2013-09-29 18:07:14 +00:00
Konstantin Belousov	27884e3bd1	Acquire a hold reference on the vnode when a knote is instantiated. Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)	2013-09-26 13:14:51 +00:00
Pawel Jakub Dawidek	4593c0ad6b	In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().	2013-08-17 14:13:45 +00:00
Konstantin Belousov	2c38cc792e	When creation of the v_pollinfo raced and our instance of vpollinfo must be destroyed, knlist_clear() and seldrain() calls could be avoided, since vpollinfo was not used. More, the knlist_clear() calling protocol requires the knlist locked, which is not true at the call site. Split the destruction into the helper destroy_vpollinfo_free(), and call it when raced, instead of destroy_vpollinfo(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2013-07-28 06:59:29 +00:00
Konstantin Belousov	a8b0523ae7	Clear the vnode knotes before destroying vpollinfo. Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-17 10:56:21 +00:00
Konstantin Belousov	d39116f5d5	Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments. Submitted by: Roger Pau Monn? <roger.pau@citrix.com> Discussed with: jilles Tested by: pho MFC after: 2 weeks	2013-06-03 17:36:43 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Jeff Roberson	f2cc1285c2	- Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers. - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking. Reviewed by: alc, attilio Sponsored by: EMC / Isilon Storage Division	2013-05-12 04:05:01 +00:00
Konstantin Belousov	0fc6daa72d	- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-05-11 11:17:44 +00:00
Marcel Moolenaar	e63091ea6c	Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.	2013-05-09 16:28:18 +00:00
Matthew D Fleming	770b41b349	Add missing vdrop() in error case. Submitted by: Fahad (mohd.fahadullah@isilon.com) MFC after: 1 week	2013-05-04 18:38:16 +00:00
Rick Macklem	64fa8df6e0	Allow the vnode to be unlocked for the weird case of LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a usecount on a vnode during NFSv4 recovery from an expired lease. Reported and tested by: pho MFC after: 2 weeks	2013-04-16 14:22:16 +00:00
Jeff Roberson	26089666b6	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division	2013-04-06 22:21:23 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Konstantin Belousov	10b4bb0b33	Add a trivial comment to record the proper commit log for r245407: Set the v_hash for a new vnode in the getnewvnode() to the value calculated based on the vnode structure address. Filesystems using vfs_hash_insert() override the v_hash using the standard formula of (inode_number + mnt_hashseed). For other filesystems, the initialization allows the vfs_hash_index() to provide useful hash too. Suggested, reviewed and tested by: peter Sponsored by: The FreeBSD Foundation MFC after: 5 days	2013-01-14 05:52:23 +00:00
Konstantin Belousov	a41df84820	diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c index 7c243b6..0bdaf36 100644 --- a/sys/kern/vfs_subr.c +++ b/sys/kern/vfs_subr.c @@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt) #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt) +static int vnsz2log; /* * Initialize the vnode management data structures. @@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, static void vntblinit(void dummy __unused) { + u_int i; int physvnodes, virtvnodes; / @@ -332,6 +334,9 @@ vntblinit(void dummy __unused) syncer_maxdelay = syncer_mask + 1; mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF); cv_init(&sync_wakeup, "syncer"); + for (i = 1; i <= sizeof(struct vnode); i <<= 1) + vnsz2log++; + vnsz2log--; } SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL); @@ -1067,6 +1072,14 @@ alloc: } rangelock_init(&vp->v_rl); + / + * For the filesystems which do not use vfs_hash_insert(), + * still initialize v_hash to have vfs_hash_index() useful. + * E.g., nullfs uses vfs_hash_index() on the lower vnode for + * its own hashing. + / + vp->v_hash = (uintptr_t)vp >> vnsz2log; + vpp = vp; return (0); }	2013-01-14 05:42:54 +00:00
Attilio Rao	c92c859b7b	Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1 case. There is no point in optimizing further the code and use a TRUE litteral for a path that does heavyweight stuff anyway (like lock acq), at the price of obfuscated code. Use the appropriate check where necessary and remove a macro. Sponsored by: EMC / Isilon storage division MFC after: 3 days	2012-12-26 15:20:32 +00:00
Attilio Rao	b1308d72c2	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There are no evidences that consumers could bear with such change in semantic and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week	2012-12-21 13:14:12 +00:00
Konstantin Belousov	14df601e47	When mnt_vnode_next_active iterator cannot lock the next vnode and yields, specify the user priority for the yield. Otherwise, a higher-priority (kernel) thread could fall into the priority-inversion with the thread owning the mutex lock. On single-processor machines or UP kernels, do not loop adaptively when the next vnode cannot be locked, instead yield unconditionally. Restructure the iteration initializer and the iterator to remove code duplication. Put the code to fetch and lock a vnode next to the current marker, into the mnt_vnode_next_active() function, and use it instead of repeating the loop. Reported by: hrs, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days	2012-12-15 02:04:46 +00:00
Konstantin Belousov	686ffcaceb	Do not yield while owning a mutex. The Giant reacquire in the kern_yield() is problematic than. The owned mutex is the mount interlock, and it is in fact not needed to guarantee the stability of the mount list of active vnodes, so fix the the issue by only taking the mount interlock for MNT_REF and MNT_REL operations. While there, augment the unconditional yield by some amount of spinning [1]. Reported and tested by: pho Reviewed by: attilio Submitted by: attilio [1] MFC after: 3 days	2012-12-10 20:44:09 +00:00
Konstantin Belousov	07840861b1	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active lists iterators helper functions did not provided the neccessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week	2012-12-03 22:15:16 +00:00
David Xu	3da9ab75f4	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days	2012-11-27 06:07:58 +00:00
Andriy Gapon	6b991098a7	assert_vop_locked: make the assertion race-free and more efficient this is really a minor improvement for the sake of correctness MFC after: 6 days	2012-11-24 13:11:47 +00:00
Andriy Gapon	4f15bb6730	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days	2012-11-22 10:36:10 +00:00
Attilio Rao	973b795b64	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks	2012-11-19 20:43:19 +00:00
Andriy Gapon	ab49c952d9	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days	2012-11-19 11:35:56 +00:00
Andriy Gapon	c496727c54	vnode_if: fix locking protocol description for lookup and cachedlookup Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days	2012-11-19 11:32:56 +00:00
Attilio Rao	bc2258da88	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.	2012-11-09 18:02:25 +00:00
Konstantin Belousov	76fd782cd9	A clarification to the behaviour of the active vnode list management regarding the vnode page cleaning. In collaboration with: pho MFC after: 1 week	2012-11-05 16:40:42 +00:00
Konstantin Belousov	90af57930c	Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command. MFC after: 3 weeks	2012-11-04 13:33:13 +00:00
Konstantin Belousov	fb81941575	Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command. MFC after: 3 days	2012-11-04 13:32:45 +00:00
Konstantin Belousov	df3161c7df	Order the enumeration of the MNT_ flags to be the same as the order of their definitions. MFC after: 3 days	2012-11-04 13:31:41 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Konstantin Belousov	9b233e2307	Add a KPI to allow to reserve some amount of space in the numvnodes counter, without actually allocating the vnodes. The supposed use of the getnewvnode_reserve(9) is to reclaim enough free vnodes while the code still does not hold any resources that might be needed during the reclamation, and to consume the slack later for getnewvnode() calls made from the innards. After the critical block is finished, the caller shall free any reserve left, by getnewvnode_drop_reserve(9). Reviewed by: avg Tested by: pho MFC after: 1 week	2012-10-14 19:43:37 +00:00
Attilio Rao	0a15e5d30d	Remove all the checks on curthread != NULL with the exception of some MD trap checks (eg. printtrap()). Generally this check is not needed anymore, as there is not a legitimate case where curthread != NULL, after pcpu 0 area has been properly initialized. Reviewed by: bde, jhb MFC after: 1 week	2012-09-13 22:26:22 +00:00
Konstantin Belousov	bcd5bb8e57	Add a facility for vgone() to inform the set of subscribed mounts about vnode reclamation. Typical use is for the bypass mounts like nullfs to get a notification about lower vnode going away. Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument lowervp which is reclaimed. It is possible to register several reclamation event listeners, to correctly handle the case of several nullfs mounts over the same directory. For the filesystem not having nullfs mounts over it, the overhead added is a single mount interlock lock/unlock in the vnode reclamation path. In collaboration with: pho MFC after: 3 weeks	2012-09-09 19:17:15 +00:00
Konstantin Belousov	258f94423b	Provide some compat32 shims for sysctl vfs.conflist. It is required for getvfsbyname(3) operation when called from 32bit process, and getvfsbyname(3) is used by recent bsdtar import. Reported by: many Tested by: David Naylor <naylor.b.david@gmail.com> MFC after: 5 days	2012-08-22 20:05:34 +00:00
Andriy Gapon	7adc598a15	free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG Those calls are useful with hardware watchdog drivers too. MFC after: 3 weeks	2012-06-03 08:01:12 +00:00
Konstantin Belousov	8f0e91308a	Add a rangelock implementation, intended to be used to range-locking the i/o regions of the vnode data space. The implementation is quite simple-minded, it uses the list of the lock requests, ordered by arrival time. Each request may be for read or for write. The implementation is fair FIFO. MFC after: 2 month	2012-05-30 16:06:38 +00:00
Edward Tomasz Napierala	af6e6b87ad	Remove unused thread argument to vrecycle(). Reviewed by: kib	2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala	c52fd858ae	Remove unused thread argument from vtruncbuf(). Reviewed by: kib	2012-04-23 13:21:28 +00:00
Kirk McKusick	dca5e0ec50	This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines. The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped. The ffs_lazy_sync and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file. In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-20 07:00:28 +00:00
Kirk McKusick	f257ebbb2e	This change creates a new list of active vnodes associated with a mount point. Active vnodes are those with a non-zero use or hold count, e.g., those vnodes that are not on the free list. Note that this list is in addition to the list of all the vnodes associated with a mount point. To avoid adding another set of linkage pointers to the vnode structure, the active list uses the existing linkage pointers used by the free list (previously named v_freelist, now renamed v_actfreelist). This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-20 06:50:44 +00:00
Kirk McKusick	16165feec4	Delete a no longer useful VNASSERT missed during changes in 234400. Suggested by: kib	2012-04-18 19:34:20 +00:00
Kirk McKusick	60005d66ab	Fix a memory leak of M_VNODE_MARKER introduced in 234386. Found by: Peter Holm	2012-04-18 19:30:22 +00:00
Kirk McKusick	73305eb826	Drop export of vdestroy() function from kern/vfs_subr.c as it is used only as a helper function in that file. Replace sole call to vbusy() with inline code in vholdl(). Replace sole calls to vfree() and vdestroy() with inline code in vdropl(). The Clang compiler already inlines these functions, so they do not show up in a kernel backtrace which is confusing. Also you cannot set their frame in kgdb which means that it is impossible to view their local variables. So, while the produced code is unchanged, the debugging should be easier. Discussed with: kib MFC after: 2 weeks	2012-04-17 21:46:59 +00:00
Kirk McKusick	71469bb38f	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if its is DOOMED or other possible end of life senarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks	2012-04-17 16:28:22 +00:00
Kirk McKusick	ecb6e528c5	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks	2012-04-11 23:01:11 +00:00
Konstantin Belousov	38ddb5725b	Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week	2012-03-09 00:12:05 +00:00
Mikolaj Golub	662c901c54	When detaching an unix domain socket, uipc_detach() checks unp->unp_vnode pointer to detect if there is a vnode associated with (binded to) this socket and does necessary cleanup if there is. The issue is that after forced unmount this check may be too late as the unp_vnode is reclaimed and the reference is stale. To fix this provide a helper function that is called on a socket vnode reclamation to do necessary cleanup. Pointed by: kib Reviewed by: kib MFC after: 2 weeks	2012-02-25 10:15:41 +00:00
Konstantin Belousov	c480f781ea	Current implementations of sync(2) and syncer vnode fsync() VOP uses mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks	2012-02-06 11:04:36 +00:00
Konstantin Belousov	abc942b56c	When doing vflush(WRITECLOSE), clean vnode pages. Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is still a race allowing a process to dirty pages after msync finished. Remounts rw->ro just left dirty pages in system. Reviewed by: alc, tegge (long time ago) Tested by: pho MFC after: 2 weeks	2012-01-25 20:54:09 +00:00
Kirk McKusick	cc672d3599	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks	2012-01-17 01:08:01 +00:00
John Baldwin	908cac07ce	Use proper argument structure types for the extattr post-VOP hooks. The wrong structure happened to work since the only argument used was the vnode which is in the same place in both VOP_SETATTR() and the two extattr VOPs. MFC after: 3 days	2012-01-06 20:05:48 +00:00
John Baldwin	f0d6c5caf0	Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended attributes of a vnode are changed. Note that OS X already implements this behavior. Reviewed by: rwatson MFC after: 2 weeks	2011-12-23 20:11:37 +00:00
John Baldwin	936c09ac0f	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month	2011-11-04 04:02:50 +00:00
John Baldwin	62238a6791	Whitespace fix.	2011-10-27 17:43:36 +00:00
Pawel Jakub Dawidek	4c11f091df	The v_data field is a pointer, so set it to NULL, not 0. MFC after: 3 days	2011-10-25 14:01:17 +00:00
Jonathan Anderson	25e33e625f	Change one printf() to log(). As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console. Approved by: mentor (rwatson) Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru> MFC after: 5 days	2011-10-07 09:51:12 +00:00
Konstantin Belousov	837b4d462d	Move parts of the commit log for r166167, where Tor explained the interaction between vnode locks and vfs_busy(), into comment. MFC after: 1 week	2011-10-04 18:45:29 +00:00
Attilio Rao	6aba400a70	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks	2011-08-25 15:51:54 +00:00
Kirk McKusick	d716efa9f7	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)	2011-07-24 18:27:09 +00:00
Alan Cox	6bbee8e28a	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib	2011-06-29 16:40:41 +00:00
Jonathan Anderson	5e26234ff1	Tidy up a capabilities-related comment. This comment refers to an #ifdef that hasn't been merged [yet?]; remove it. Approved by: rwatson	2011-06-24 14:40:22 +00:00
Matthew D Fleming	3d08a76bbc	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >	2011-05-13 05:27:58 +00:00
Attilio Rao	2be767e069	Add the watchdogs patting during the (shutdown time) disk syncing and disk dumping. With the option SW_WATCHDOG on, these operations are doomed to let watchdog fire, fi they take too long. I implemented the stubs this way because I really want wdog_kern_* KPI to not be dependant by SW_WATCHDOG being on (and really, the option only enables watchdog activation in hardclock) and also avoid to call them when not necessary (avoiding not-volountary watchdog activations). Sponsored by: Sandvine Incorporated Discussed with: emaste, des MFC after: 2 weeks	2011-04-28 16:02:05 +00:00
Rick Macklem	c65c068a5f	Fix a LOR in vfs_busy() where, after msleeping, it would lock the mutexes in the wrong order for the case where the MBF_MNTLSTLOCK is set. I believe this did have the potential for deadlock. For example, if multiple nfsd threads called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK. Thanks go to pho for catching this during his testing. Tested by: pho Submitted by: kib MFC after: 2 weeks	2011-04-23 11:22:48 +00:00
Sergey Kandaurov	443db69597	Remove malloc type M_NETADDR unused since splitting into vfs_subr.c and vfs_export.c. MFC after: 1 week	2011-04-04 16:23:01 +00:00
Konstantin Belousov	fd7032e1b3	Do not assert buffer lock in VFS_STRATEGY() when kernel already paniced. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2011-03-08 11:50:59 +00:00
Matthew D Fleming	e7ceb1e99b	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yeild() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.	2011-02-08 00:16:36 +00:00
Matthew D Fleming	08b163fa51	Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield(). Change several checks for a magic number of iterations to use should_yield() instead. MFC after: 1 week	2011-02-02 16:35:10 +00:00
Konstantin Belousov	dbccdf7684	When vtruncbuf() iterates over the vnode buffer list, lock buffer object before checking the validity of the next buffer pointer. Otherwise, the buffer might be reclaimed after the check, causing iteration to run into wrong buffer. Reported and tested by: pho MFC after: 1 week	2011-01-25 14:04:02 +00:00
Matthew D Fleming	2fee06f087	Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string.	2011-01-18 21:14:18 +00:00
Matthew D Fleming	fbbb13f962	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the kernel changes.	2011-01-12 19:54:19 +00:00
John Baldwin	a8f4344f08	- Restore dropping the priority of syncer down to PPAUSE when it is idle. This was lost when it was converted to using a condition variable instead of lbolt. - Drop the priority of flowtable down to PPAUSE when it is idle as well since it is a similar background task. MFC after: 2 weeks	2011-01-06 22:17:07 +00:00
Konstantin Belousov	7dbb59c7ce	Teach ddb "show mount" about MNTK_SUJ flag.	2010-12-27 12:06:38 +00:00
Konstantin Belousov	f5eb95b1fc	Allow shared-locked vnode to be passed to vunref(9). When shared-locked vnode is supplied as an argument to vunref(9) and resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode lock. The later could cause vnode unlock, allowing the vnode to be reclaimed meantime. Tested by: pho MFC after: 1 week	2010-11-24 12:30:41 +00:00
Konstantin Belousov	730b63b0c2	Remove prtactive variable and related printf()s in the vop_inactive and vop_reclaim() methods. They seems to be unused, and the reported situation is normal for the forced unmount. MFC after: 1 week X-MFC-note: keep prtactive symbol in vfs_subr.c	2010-11-19 21:17:34 +00:00
Rebecca Cran	8d065a3914	Fix some more style(9) issues.	2010-11-14 16:10:15 +00:00
Rebecca Cran	b389be97db	Fix style(9) issues from r215281 and r215282. MFC after: 1 week	2010-11-14 08:06:29 +00:00
Rebecca Cran	5d7abc8777	Add descriptions to some more sysctls. PR: kern/148510 MFC after: 1 week	2010-11-14 07:38:42 +00:00
Konstantin Belousov	9a24dc0760	Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak when mount and update are executed in parallel. Encapsulate syncer vnode deallocation into the helper function vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c. Found and reviewed by: jh (previous version of the patch) Tested by: pho MFC after: 3 weeks	2010-09-11 13:06:06 +00:00
Ed Maste	e5ddf11581	As long as we are going to panic anyway, there's no need to hide additional information behind DIAGNOSTIC.	2010-09-01 13:47:11 +00:00
Jaakko Heinonen	de478dd4b4	execve(2) has a special check for file permissions: a file must have at least one execute bit set, otherwise execve(2) will return EACCES even for an user with PRIV_VFS_EXEC privilege. Add the check also to vaccess(9), vaccess_acl_nfs4(9) and vaccess_acl_posix1e(9). This makes access(2) to better agree with execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check to zfs_freebsd_access() too. There may be other file systems which are not using vaccess*() functions and need to be handled separately. PR: kern/125009 Reviewed by: bde, trasz Approved by: pjd (ZFS part)	2010-08-30 16:30:18 +00:00
Pawel Jakub Dawidek	c87f1ad43c	There is a bug in vfs_allocate_syncvnode() failure handling in mount code. Actually it is hard to properly handle such a failure, especially in MNT_UPDATE case. The only reason for the vfs_allocate_syncvnode() function to fail is getnewvnode() failure. Fortunately it is impossible for current implementation of getnewvnode() to fail, so we can assert this and make vfs_allocate_syncvnode() void. This in turn free us from handling its failures in the mount code. Reviewed by: kib MFC after: 1 month	2010-08-28 08:57:15 +00:00
Konstantin Belousov	3beb1b723f	The buffers b_vflags field is not always properly protected by bufobj lock. If b_bufobj is not NULL, then bufobj lock should be held when manipulating the flags. Not doing this sometimes leaves BV_BKGRDINPROG to be erronously set, causing softdep' getdirtybuf() to stuck indefinitely in "getbuf" sleep, waiting for background write to finish which is not actually performed. Add BO_LOCK() in the cases where it was missed. In collaboration with: pho Tested by: bz Reviewed by: jeff MFC after: 1 month	2010-08-12 08:36:23 +00:00
Alan Cox	3b156706c4	In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other architectures, define PAGE_SIZE as a "long".) Submitted by: Andreas Tobler	2010-08-04 05:09:02 +00:00
Alan Cox	1d7fe4b515	Update the "desiredvnodes" calculation. In particular, make the part of the calculation that is based on the kernel's heap size more conservative. Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the time being set MAXVNODES_MAX to a large value. Reviewed by: jhb@ MFC after: 6 weeks	2010-08-02 21:33:36 +00:00
Ed Schouten	60ae52f785	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.	2010-06-21 09:55:56 +00:00
Pawel Jakub Dawidek	d32ef791eb	Backout r207970 for now, it can lead to deadlocks. Reported by: kan MFC after: 3 days	2010-06-17 17:39:51 +00:00
Konstantin Belousov	882da14c3d	Sometimes vnodes share the lock despite being different vnodes on different mount points, e.g. the nullfs vnode and the covered vnode from the lower filesystem. In this case, existing assertion in vop_rename_pre() may be triggered. Check for vnode locks equiality instead of the vnodes itself to not trip over the situation. Submitted by: Mikolaj Golub <to.my.trociny@gmail.com> Tested by: pho MFC after: 2 weeks	2010-06-03 10:20:08 +00:00
Zachary Loafman	7fd32ea923	Add VOP_ADVLOCKPURGE so that the file system is called when purging locks (in the case where the VFS impl isn't using lf_*) Submitted by: Matthew Fleming <matthew.fleming@isilon.com> Reviewed by: zml, dfr	2010-05-12 21:24:46 +00:00
Pawel Jakub Dawidek	408a7c5093	When there is no memory or KVA, try to help by reclaiming some vnodes. This helps with 'kmem_map too small' panics. No objections from: kib Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua> MFC after: 1 week	2010-05-12 16:42:28 +00:00
Pawel Jakub Dawidek	c60c36a745	I added vfs_lowvnodes event, but it was only used for a short while and now it is totally unused. Remove it. MFC after: 3 days	2010-05-11 22:46:36 +00:00
Jeff Roberson	113db2dddb	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm	2010-04-24 07:05:35 +00:00
Jaakko Heinonen	0e9bd4171f	Add missing MNT_NFS4ACLS.	2010-04-04 14:48:43 +00:00
Pawel Jakub Dawidek	b9d8d69108	Fix some whitespace nits.	2010-04-03 11:19:20 +00:00
Pawel Jakub Dawidek	000026c809	Add missing mnt_kern_flag flags in 'show mount' output.	2010-04-03 11:15:55 +00:00
Konstantin Belousov	ea01588095	Add function vop_rename_fail(9) that performs needed cleanup for locks and references of the VOP_RENAME(9) arguments. Use vop_rename_fail() in deadfs_rename(). Tested by: Mikolaj Golub MFC after: 1 week	2010-04-02 14:03:01 +00:00
Konstantin Belousov	d2f334bfc9	Add new function vunref(9) that decrements vnode use count (and hold count) while vnode is exclusively locked. The code for vput(9), vrele(9) and vunref(9) is merged. In collaboration with: pho Reviewed by: alc MFC after: 3 weeks	2010-01-17 21:24:27 +00:00
Konstantin Belousov	a411786576	Add a knob to allow reclaim of the directory vnodes that are source of the namecache records. The reclamation is not enabled by default because for typical workload it would make namecache unusable, but large nested directory tree easily puts any process that accesses filesystem into 1 second wait for vlru. Reported by: yar (long time ago) MFC after: 3 days	2009-12-28 15:35:39 +00:00
Edward Tomasz Napierala	558e9b5c95	Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND is not being used improperly.	2009-12-26 11:36:10 +00:00
Konstantin Belousov	49e3050e6c	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing the redundand information, need to update both vnode and object flags causes more acquisition of vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks	2009-12-21 12:29:38 +00:00
Jaakko Heinonen	10d843a446	Extend ddb(4) "show mount" command to print active string mount options. Note that only option names are printed, not values. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 2 weeks	2009-11-19 14:33:03 +00:00
Edward Tomasz Napierala	2c29cfa083	Provide default implementation for VOP_ACCESS(9), so that filesystems which want to provide VOP_ACCESSX(9) don't have to implement both. Note that this commit makes implementation of either of these two mandatory. Reviewed by: kib	2009-10-01 17:22:03 +00:00
Robert Watson	e76d823b81	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks	2009-09-12 20:03:45 +00:00
Konstantin Belousov	427992ecdb	In vfs_mark_atime(9), be resistent against reclaimed vnodes. Assert that neccessary locks are taken, since vop might not be called. Tested by: pho MFC after: 3 days	2009-09-09 10:51:50 +00:00
Jamie Gritton	f0899a3460	Call prison_check from vfs_suser rather than re-implementing it. Approved by: re (kib), bz (mentor)	2009-07-02 14:19:33 +00:00
Konstantin Belousov	d8b0556c6d	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland	2009-06-10 20:59:32 +00:00
Robert Watson	bcf11e8d00	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd	2009-06-05 14:55:22 +00:00
Attilio Rao	faef64cc39	Remove the now invalid (and possibly unused) debug.mpsafevfs sysctl/tunable. Reviewed by: emaste Sponsored by: Sandvine Incorporated	2009-05-30 23:52:23 +00:00
Edward Tomasz Napierala	c97fcdba57	Add VOP_ACCESSX, which can be used to query for newly added V* permissions, such as VWRITE_ACL. For a filsystems that don't implement it, there is a default implementation, which works as a wrapper around VOP_ACCESS. Reviewed by: rwatson@	2009-05-30 13:59:05 +00:00
Jamie Gritton	0304c73163	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)	2009-05-27 14:11:23 +00:00
Attilio Rao	dfd233edd5	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and relating parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavilly changed by this commit so thirdy parts modules needs to be recompiled. Bump __FreeBSD_version in order to signal such situation.	2009-05-11 15:33:26 +00:00
Alexander Kabaev	607fc40b04	Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache in directory vnodes. Allow namecache dotdot entry to be created pointing from child vnode to parent vnode if no existing links in opposite direction exist. Use direct link from parent to child for dotdot lookups otherwise. This restores more efficient dotdot caching in NFS filesystems which was lost when vnodes stoppped being type stable. Reviewed by: kib	2009-03-29 21:25:40 +00:00
Alexander Kabaev	5ab4bb35fb	Change vfs_busy to wait until an outcome of pending unmount operation is known and to retry or fail accordingly to that outcome. This fixes the problem with namespace traversing programs failing with random ENOENT errors if someone just happened to try to unmount that same filesystem at the same time. Reported by: dhw Reviewed by: kib, attilio Sponsored by: Juniper Networks, Inc.	2009-03-02 20:51:39 +00:00
John Baldwin	8941aad19b	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().	2009-02-06 20:06:48 +00:00
Edward Tomasz Napierala	ec48c16f14	Add KASSERTs to make it easier to debug problems like the one fixed in r188141. Reviewed by: kib,attilio Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation	2009-02-06 18:16:01 +00:00
Attilio Rao	feabc903d9	Add more KTR_VFS logging point in order to have a more effective tracing. Reviewed by: brueffer, kib Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>	2009-02-05 15:03:35 +00:00
John Baldwin	910826246c	Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not updating va_atime via VOP_SETATTR() isn't always synchronous. For some filesystems it is asynchronous. Suggested by: bde	2009-01-23 22:13:00 +00:00
John Baldwin	645f1f4ea3	Push down Giant in the vlnru kproc main loop so that it is only acquired around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically, vnlru no longer needs Giant for the common case of waking up and deciding there is nothing for it to do. MFC after: 2 weeks	2009-01-23 22:08:54 +00:00
John Baldwin	1c570a0c09	Fix a few style bogons. Submitted by: bde	2009-01-21 20:08:17 +00:00
John Baldwin	beace17649	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho	2009-01-21 14:42:00 +00:00
Konstantin Belousov	9316467d05	FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data. Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already. Reported by: csjp Reviewed by: ups MFC after: 3 weeks	2009-01-20 11:27:45 +00:00
Attilio Rao	4a0f807602	1) Fix a deadlock in the VFS: - threadA runs vfs_rel(mp1) - threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK() - threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps waiting for threadB to complete the unmount - threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting for the refcount to expire. Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the unmounter is waiting for mnt_ref to expire. The vfs_busy contenders got awake, fails, and if they retry the MNTK_REFEXPIRE won't allow them to sleep again. 2) Simplify significantly the code of vfs_mount_destroy() trimming unnecessary codes: - as long as any reference exited, it is no-more possible to have write-op (primarty and secondary) in progress. - it is no needed to drop and reacquire the mount lock. - filling the structures with dummy values is unuseful as long as it is going to be freed. Tested by: pho, Andrea Barberio <insomniac at slackware dot it> Discussed with: kib	2008-12-16 23:16:10 +00:00
Konstantin Belousov	6179164448	In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer to the fs, but before a vnode on the fs is locked, unmount may free fs structures, causing access to destroyed data and freed memory. Introduce a vfs_busymp() function that looks up and busies found fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the implementation of the handle syscalls. Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular, sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl handler, that prevents Giant-locked unmount code to interfere with it. Noted by: tegge Reviewed by: dfr Tested by: pho MFC after: 1 month	2008-11-29 13:34:59 +00:00
Pawel Jakub Dawidek	1ba4a712dd	Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes. This bring huge amount of changes, I'll enumerate only user-visible changes: - Delegated Administration Allows regular users to perform ZFS operations, like file system creation, snapshot creation, etc. - L2ARC Level 2 cache for ZFS - allows to use additional disks for cache. Huge performance improvements mostly for random read of mostly static content. - slog Allow to use additional disks for ZFS Intent Log to speed up operations like fsync(2). - vfs.zfs.super_owner Allows regular users to perform privileged operations on files stored on ZFS file systems owned by him. Very careful with this one. - chflags(2) Not all the flags are supported. This still needs work. - ZFSBoot Support to boot off of ZFS pool. Not finished, AFAIK. Submitted by: dfr - Snapshot properties - New failure modes Before if write requested failed, system paniced. Now one can select from one of three failure modes: - panic - panic on write error - wait - wait for disk to reappear - continue - serve read requests if possible, block write requests - Refquota, refreservation properties Just quota and reservation properties, but don't count space consumed by children file systems, clones and snapshots. - Sparse volumes ZVOLs that don't reserve space in the pool. - External attributes Compatible with extattr(2). - NFSv4-ACLs Not sure about the status, might not be complete yet. Submitted by: trasz - Creation-time properties - Regression tests for zpool(8) command. Obtained from: OpenSolaris	2008-11-17 20:49:29 +00:00
Attilio Rao	30f60d8c31	Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless. Really, the concept of holdcnt in the struct mount is rappresented by the mnt_ref (which prevents the type-stable structure from being "recycled) handled through vfs_ref() and vfs_rel(). On this optic, switch the holdcnt acquisition into an emulated vfs_ref() (and subsequent release into vfs_rel()). Discussed with: kib Tested by: pho	2008-11-03 20:00:35 +00:00
Attilio Rao	83b3bdbc8a	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no more linked to a lockmgr. Change so its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr alredy) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used into vfs_mount_destroy(), that allows to override the mnt_ref if running for more than 3 seconds, make it totally useless. Remove it as it was thought to work into older versions. If a problem of "refcount held never going away" should appear, we will need to fix properly instead than trust on such hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if it the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho	2008-11-02 10:15:42 +00:00
Edward Tomasz Napierala	15bc6b2bd8	Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)	2008-10-28 13:44:11 +00:00
Konstantin Belousov	7cd5a03a8e	Style return statements in vn_pollrecord().	2008-10-28 12:22:33 +00:00
Konstantin Belousov	ae53539e21	Protect check for v_pollinfo == NULL and assignment of the newly allocated vpollinfo with vnode interlock. Fully initialize vpollinfo before putting pointer to it into vp->v_pollinfo. Discussed with: dwhite Tested by: pho MFC after: 1 week	2008-10-28 12:08:36 +00:00
Konstantin Belousov	3cfc308922	In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always tries to obtain a shared lock on mnt_lock, the other user is unmount who tries to drain it, setting MNTK_UNMOUNT before. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks	2008-10-20 10:07:28 +00:00
Attilio Rao	0d7935fd01	Remove the struct thread unuseful argument from bufobj interface. In particular following functions KPI results modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider just temporary the 'curthread' argument passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>	2008-10-10 21:23:50 +00:00
Attilio Rao	59d4932531	Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions. Manpages are updated accordingly. Tested by: Diego Sardina <siarodx at gmail dot com>	2008-08-31 14:26:08 +00:00
Attilio Rao	0359a12ead	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>	2008-08-28 15:23:18 +00:00
Konstantin Belousov	a888d54d39	Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function to ignore the unmounting and forces insertion of the vnode into the mount vnode list. Change insmntque() to fail when forced unmount is in progress and VV_FORCEINSMQ is not specified. Add an assertion to the insmntque(), requiring the vnode to be exclusively locked for mp-safe filesystems. Use the VV_FORCEINSMQ for the creation of the syncvnode. Tested by: pho Reviewed by: tegge MFC after: 1 month	2008-08-28 09:08:15 +00:00
Christian S.J. Peron	e451733718	Remove worrying printf warning on bootup when processing vnodes which have NULL mount-points. This is the case for special vnodes, such as the one used in nameiinit() which is used for crossing mount points in lookup() to avoid lock ordering issues. MFC after: 2 weeks Discussed with: rwatson, kib	2008-08-24 20:16:44 +00:00
Ed Schouten	e7ea30e404	Remove the use of lbolt from the VFS syncer. It seems we only use `lbolt' inside the VFS syncer and the TTY layer now. Because I'm planning to replace the TTY layer next month, there's no reason to keep `lbolt' if it's only used in a single thread inside the kernel. Because the syncer code wanted to wake up the syncer thread before the timeout, it called sleepq_remove(). Because we now just use a condvar(9) with a timeout value of `hz', we can wake it up using cv_broadcast() without waking up any unrelated threads. Reviewed by: phk	2008-07-30 12:39:18 +00:00
Pawel Jakub Dawidek	5573021d78	Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel() functions. Reviewed by: kib	2008-07-27 11:48:15 +00:00
Pawel Jakub Dawidek	610507ae00	- Move vp test for beeing NULL under IGNORE_LOCK(). - Check if panicstr isn't set, if it is ignore the lock. This helps to avoid confusion, because lockmgr is a no-op when panicstr isn't NULL, so asserting anything at this point doesn't make sense and can just race with other panic. Discussed with: kib	2008-07-27 11:46:42 +00:00

1 2 3 4 5 ...

982 Commits