Commit Graph

1551 Commits

Author SHA1 Message Date
John Baldwin
1b7cf11b00 vdropl() drops the vnode interlock. Thus, the code in the QUOTA case that
upgrades the vnode lock if it is share locked was dropping the interlock
before actually checking VI_DOOMED.  Fix this by do the vdropl() after the
check and relying on it to drop the vnode interlock.

Reported by:	pho
Reviewed by:	kib
MFC after:	1 week
2008-09-16 16:15:38 +00:00
Konstantin Belousov
6fecb4e41e Suspend the write operations on the UFS filesystem being unmounted or
remounted from rw to ro.

Proposed and reviewed by:  tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:55:53 +00:00
Konstantin Belousov
2814d5ba5f When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:51:06 +00:00
Konstantin Belousov
52dfc8d7da Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00
Konstantin Belousov
90446e360c When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce UFS_RDONLY() struct ufsmount' method that reports the value
of the fs_ronly. The later is set to 1 only after the remount is
finished.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 10:59:35 +00:00
Konstantin Belousov
0411d79138 The struct inode *ip supplied to softdep_freefile is not neccessary the
inode having number ino. In r170991, the ip was marked IN_MODIFIED, that
is not quite correct.

Mark only the right inode modified by checking inode number.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 10:52:25 +00:00
Edward Tomasz Napierala
86a0c0aa7b When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}.
Approved by:	rwatson (mentor)
2008-09-03 12:46:09 +00:00
Attilio Rao
59d4932531 Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.

Tested by:	Diego Sardina <siarodx at gmail dot com>
2008-08-31 14:26:08 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Konstantin Belousov
acd05e468a In ffs_valloc(), ffs_vget() may fail because insmntque() refused to
insert new vnode into the mount vnode list. Then, for the SU-enabled
mount, ffs_vfree could create freefile dependency. This dependency can
hang around forever since inode is not marked as IN_MODIFIED and
correspondingly inodeblock may be not marked as dirty.

After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as
modified, and vput() it immediately. Take care of the dup alloc.

Tested by:	pho
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:19:50 +00:00
Konstantin Belousov
7b7ed832e4 Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by:	pho, kris
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:18:20 +00:00
Konstantin Belousov
f2228325de Put the relocked variable from the r182111 into the #ifdef QUOTA braces
to prevent warning about unused var on the !QUOTA kernels.

Reported by:	ed
MFC after:	1 week
2008-08-24 19:06:19 +00:00
Konstantin Belousov
689eae1d90 Revert the r167541: "Remove unneeded getinoquota() call in the
ufs_access()." The call to getinoquota in ufs_access() serves the
purpose of instantiating inode dquot from the vn_open(). Since quotas
are accounted only for the inodes with already attached dquot, removal
of the call prevented opened inodes from participation in the quota
calculations.

Since ufs_access() may be called with the vnode being only shared
locked, upgrade (and then downgrade) vnode lock if calling
getinoquota().

Reported by:	simon at optinet com
In collaboration with:	pho
MFC after:	1 week
2008-08-24 17:24:22 +00:00
Konstantin Belousov
e792b09be2 Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with:	rodrigc
MFC after:	3 days
2008-08-10 12:15:36 +00:00
Konstantin Belousov
a1a917e029 User may do "mount -o snapshot ...", that causes new FFS mount to be
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.

Noted by:	Mateusz Guzik <mjguzik gmail com>
MFC after:	3 days
2008-08-06 14:47:19 +00:00
Dag-Erling Smørgrav
20ed1beeb5 ufsmount.h uses "struct\tfoo *bar;", except where it doesn't.
quota.h uses "struct foo\t*bar;", except where it doesn't.
Try to make them both agree with themselves (though not with eachother)
2008-08-05 15:24:07 +00:00
Dag-Erling Smørgrav
1ac541a69a Whitespace, prototypes 2008-08-05 10:25:55 +00:00
John Baldwin
4a67a0d994 Whitespace tweak. 2008-07-30 21:07:56 +00:00
Konstantin Belousov
89672c6337 The ffs_balloc_ufs{1,2} functions call bdwrite() while having several
vnode buffers locked at once. In particular, there are indirect buffers
among locked ones. The bdwrite() may start the flushing to keep dirty
buffer list at the bounds. If any buffer on the dirty list requires
translation from logical to physical block number, code may ends up
trying to lock an indirect buffer already locked in ffs_balloc_ufsX.

Prevent the bdflush() activity when several buffers are locked at once
by setting the TDP_INBDFUSH for the problematic code blocks.

Reported and tested by:	pho, Josef Buchsteiner at Juniper
In collaboration with:	kan
MFC after:	1 month
2008-07-23 14:32:44 +00:00
Pawel Jakub Dawidek
a80d8caa74 Say hi to svn, by simplifing ffs_vget() function a bit - there is no need for
a variable that is used only once.
2008-07-19 22:29:44 +00:00
Craig Rodrigues
8e7a2353ec Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE
was renamed to SBLOCKSIZE in version 1.33

Reviewed by:	mckusick
2008-05-24 20:44:14 +00:00
Craig Rodrigues
fb77e0af12 After converting the "snapshot" mount option to the MNT_SNAPSHOT flag,
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.

PR:		122833
Reported by:	Leon Kos <leon.kos lecad fs uni-lj si>,
		Jaakko Heinonen <jh saunalahti fi>
MFC after:	1 month
2008-05-24 00:41:32 +00:00
Craig Rodrigues
02a871f1ea For the following mount options, do not perform the string to flag conversions
here, because we already do them further up in vfs_donmount() in vfs_mount.c

async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW

MFC after:  1 month
2008-05-24 00:02:12 +00:00
Stephan Uphoff
2ac78f0e1a Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
Jeff Roberson
721cc5664f - Use a local variable for i_ino in ufs_lookup. It is only used to
communicate between two parts of this one function.  This was causing
   problems with shared lookups as each would trash the ino value in the
   inode.
 - Remove the unused i_ino field from the inode structure.
2008-04-22 12:34:16 +00:00
Konstantin Belousov
eab626f110 Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
Jeff Roberson
b300d706ea - Use a lockmgr lock rather than a mtx to protect dirhash. This lock
may be held for the duration of the various dirhash operations which
   avoids many complex unlock/lock/revalidate sequences.
 - Permit shared locks on lookup.  To protect the ip->i_dirhash pointer we
   use the vnode interlock in the shared case.  Callers holding the
   exclusive vnode lock can run without fear of concurrent modification to
   i_dirhash.
 - Hold an exclusive dirhash lock when creating the dirhash structure for
   the first time or when re-creating a dirhash structure which has been
   recycled.

Tested by:	kris, pho
2008-04-11 09:48:12 +00:00
Jeff Roberson
eb1314a249 - cache dp->i_offset in the local 'i_offset' variable for use in loop
indexes so directory lookup becomes shared lock safe.  In the modifying
   cases an exclusive lock is held here so the commit routine may
   rely on the state of i_offset.
 - Similarly handle i_diroff by fetching at the start and setting only once
   the operation is complete.  Without the exclusive lock these are only
   considered hints.
 - Assert that an exclusive lock is held when we're preparing for a commit
   routine.
 - Honor the lock type request from lookup instead of always using exclusive
   locking.

Tested by:	pho, kris
2008-04-11 09:44:25 +00:00
Pawel Jakub Dawidek
58e9afacb4 Correct function name in panic().
Reported by:	kensmith
2008-04-07 18:12:37 +00:00
Attilio Rao
047dd67e96 Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw().  These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock.  In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer.  This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.

In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI.  In this way, newly added _lockmgr.h can
just include _stack.h.

Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by:      kris, pho, jeff, danger
Reviewed by:    jeff
Sponsored by:   Google, Summer of Code program 2007
2008-04-06 20:08:51 +00:00
Konstantin Belousov
57b4252e45 Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:01:21 +00:00
Jeff Roberson
d04963d0f4 - Since rev 1.142 of ffs_snapshot.c the interlock has not been required
to protect the v_lock pointer.  Removing the interlock acquisition
   here allows vn_lock() to proceed without requiring the interlock
   at all.
 - If the lock mutated while we were sleeping on it the interlock has
   been dropped.  It is conceivable that the upper layer code was
   relying on the interlock and LK_NOWAIT to protect the identity or
   state of the vnode while acquiring the lock.  In this case return
   EBUSY rather than trying the new lock to prevent potential races.

Reviewed by:	tegge
2008-03-31 07:55:45 +00:00
Jeff Roberson
9c0cdb8253 - Don't free snapdata structures when they are no longer in use.
Keeping the lockmgr lock valid allows us to switch the v_lock pointer
   in snapshot vnodes between the embedded lockmgr lock and snapdata
   lock without needing the vnode interlock to protect against races
 - Keep unused snapdata structures in a list.
 - Add a function to lock the devvp and allocate a snapdata to it or
   acquire a new one without races.  The old function was safe from
   creation races because we set the mount flag when creating snapshots
   and thus serializing them.  However, it might have been subject to
   destroying races.

Reviewed by:	tegge
2008-03-31 07:47:08 +00:00
John Baldwin
d952ba1bd5 Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo'
(such as 'atime' vs 'noatime').  The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'.  As such, their list of valid
mount options should include 'nofoo' instead of 'foo'.  With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error.  You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).

MFC after:	1 week
Reviewed by:	crodig
2008-03-26 20:48:07 +00:00
Doug Rabson
dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
Konstantin Belousov
1be222e9df Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
Jeff Roberson
698b1a6643 - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
Konstantin Belousov
0e2c6b177f Reduce the acquisition of the vnode interlock in the ffs_read() and
ffs_extread() when setting the IN_ACCESS flag by checking whether the
IN_ACCESS is already set. The possible race there is admissible.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:33:00 +00:00
Jeff Roberson
374ae2a393 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Coleman Kane
6c62df7e49 Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE
callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous
items on the http://wiki.freebsd.org/SMPTODO list.

Reviewed by:	imp, obrien, jhb
MFC after:	1 week
2008-03-13 20:15:48 +00:00
Ed Maste
3eb8098d2b Remove include of opt_quota.h; as of revision 1.205 there is no longer
any #ifdef QUOTA conditional code.
2008-03-10 18:44:07 +00:00
Konstantin Belousov
e7fd887711 Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs.
It is normally initialized by ffs_statfs() after ffs_mount finished.

The extattr autostart code calls the ufs_lookup(), that uses value above
to iterate over the directory blocks, see bmask initialization in the
ufs_lookup() and ufsdirhash. Having the filesystem with root directory
spanning more then one block would result in reading a random kernel
memory.

PR:	kern/120781
Test case provided by:	rwatson
MFC after:	1 week
2008-03-05 16:34:03 +00:00
Robert Watson
631ea79e3f Continue on-going campaign to replace lockmgr locks with sx locks where
the specific semantics of ockmgr aren't required: update UFS1 extended
attributes to protect its data structures using an sx lock.

While here, update comments on lock granularity.

MFC after:	2 weeks
2008-03-04 12:50:11 +00:00
Robert Watson
6cf7bc60ec Move setting of MNTK_MPSAFE flag before UFS1 extended attribute
auto-start so that the flag is set before we start performing I/O
in the auto-start routine.

MFC after:	2 weeks
Suggested by:	kib
2008-03-04 12:10:03 +00:00
Robert Watson
874f7ae331 Don't auto-start or allow extattrctl for UFS2 file systems, as UFS2 has
native extended attributes.  This didn't interfere with the operation of
UFS2 extended attributes, but the code shouldn't be running for UFS2.

MFC after:	2 weeks
2008-03-02 22:52:14 +00:00
Giorgos Keramidas
53a5cd3485 Minor typo nit. 2008-02-25 19:31:44 +00:00
Attilio Rao
81c794f998 Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-25 18:45:57 +00:00
Attilio Rao
628f51d275 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
Attilio Rao
24463dbbee - Introduce lockmgr_args() in the lockmgr space. This function performs
the same operation of lockmgr() but accepting a custom wmesg, prio and
  timo for the particular lock instance, overriding default values
  lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is nomore used in the lockmgr namespace

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-15 21:04:36 +00:00