Commit Graph

469 Commits

Author SHA1 Message Date
John-Mark Gurney
e9838c1140 don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags..
If you had a UFS2 FS that didn't have it's super block at SBLOCK_UFS2,
you'll end up corrupting your FS as the superblock is updated and written
to a different location...

makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing
this issue...

Reviewed by:	silience from mckusick
MFC after:	1 week
2014-06-03 21:46:13 +00:00
Pedro F. Giffuni
4896af9f55 ufs: small formatting fixes.
Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

MFC after:	3 days
Reviewed by:	mckusick
2014-03-02 02:52:34 +00:00
Brooks Davis
cf058082cd Allow kernels without options SOFTUPDATES to build. This should fix the
embedded tinderboxes.

Reviewed by:	emaste
2013-10-21 20:51:08 +00:00
Kirk McKusick
519e3c3b9f Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by:    Peter Holm and Scott Long
Sponsored by: Netflix
2013-10-20 21:11:40 +00:00
Konstantin Belousov
cc3d8c35f5 There are several code sequences like
vfs_busy(mp);
      vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls.  The unmount starts a write, while vfs_write_suspend() drain
writers.  On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked.  The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2013-07-09 20:49:32 +00:00
Pedro F. Giffuni
5f3a9c4000 Style fix: spaces.
Cleanup the incomplete revert.

Reported by:	bde
MFC after:	4 weeks
2013-07-02 18:45:37 +00:00
Pedro F. Giffuni
9a0aea4625 Change i_gen in UFS to an unsigned type.
Revert the simplification of the i_gen calculation.
It is still a good idea to avoid zero values and for the case
of old filesystems there is probably no advantage in using
the complete 32 bits anyways.

Discussed with:	bde
MFC after:	4 weeks
2013-07-01 21:43:40 +00:00
Pedro F. Giffuni
bcb2f550be Change i_gen in UFS to an unsigned type.
Further simplify the i_gen calculation for older disks.
Having a zero here is not really a problem and this is more
similar to what is done in newfs_random().

Reported by:	Xin Li
MFC after:	4 weeks
2013-07-01 14:49:23 +00:00
Pedro F. Giffuni
eee4072f13 Change i_gen in UFS to an unsigned type.
In UFS, i_gen is a random generated value and there is not way for
it to be negative. Actually, the value of i_gen is just used to
match bit patterns and it is of not consequence if the values are
signed or not.

Following other filesystems, set it to unsigned and use it as such,

Discussed by:	mckusick
Reviewed by:	mckusick (previous version)
MFC after:	4 weeks
2013-07-01 03:00:15 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Konstantin Belousov
59a01b70af UFS support of the unmapped i/o for the user data buffers.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho, scottl, jhb, bf
2013-03-19 15:08:15 +00:00
Konstantin Belousov
ba05dec5a4 The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered.  The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead.  Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration.  At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Reviewed by:	mckusick
MFC after:	2 weeks
2013-02-27 07:32:39 +00:00
Konstantin Belousov
ddd6b3fc33 Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by:	The FreeBSD Foundation
2013-01-11 06:08:32 +00:00
Attilio Rao
c6e0355cee r16312 is not any longer real since many years (likely since when VFS
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.

Removes comments that makes no sense now.

Reviewed by:	kib
MFC after:	3 days
2012-11-19 22:43:45 +00:00
Edward Tomasz Napierala
1848286ada Add UFS writesuspension mechanism, designed to allow userland processes
to modify on-disk metadata for filesystems mounted for write.

Reviewed by:	kib, mckusick
Sponsored by:	FreeBSD Foundation
2012-11-18 18:57:19 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Kevin Lo
f7a3729c91 Use NULL instead of 0 for pointers 2012-07-22 15:40:31 +00:00
Konstantin Belousov
b569050a78 Enable vn_io_fault() lock avoidance for UFS.
Tested by:	pho
MFC after:	2 months
2012-05-30 16:45:41 +00:00
Edward Tomasz Napierala
72b8ff1c74 Fix use-after-free introduced in r234036.
Reviewed by:	mckusick
Tested by:	pho
2012-04-21 10:45:46 +00:00
Kirk McKusick
dca5e0ec50 This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
Kirk McKusick
71469bb38f Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
Edward Tomasz Napierala
2b028c25d3 Fix panic in ffs_reload(), which may happen when read-only filesystem
gets resized and then reloaded.

Reviewed by:	kib, mckusick (earlier version)
Sponsored by:	The FreeBSD Foundation
2012-04-08 13:44:55 +00:00
Kirk McKusick
b73ffa31d4 Drop an unnecessary setting of si_mountpt when updating a UFS mount point.
Clearly it must have been set when the mount was done.

Reviewed by: kib
2012-04-08 06:14:49 +00:00
Kirk McKusick
1faacf5d09 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
Konstantin Belousov
ea573a50b3 Do trivial reformatting of the comment to record the missed commit
message for r233609:
Restore the writes of atimes, quotas and superblock from syncer vnode.

Noted by:   rdivacky
2012-03-28 14:16:15 +00:00
Konstantin Belousov
a988a5c609 Reviewed by: bde, mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-28 14:06:47 +00:00
Konstantin Belousov
e0c1740853 Update comment.
MFC after:	3 days
2012-03-28 13:47:07 +00:00
Kirk McKusick
75a5838904 Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib
2012-03-25 00:02:37 +00:00
Kirk McKusick
19c87af0fd In the original days of BSD, a sync was issued on every filesystem
every 30 seconds. This spike in I/O caused the system to pause every
30 seconds which was quite annoying. So, the way that sync worked
was changed so that when a vnode was first dirtied, it was put on
a 30-second cleaning queue (see the syncer_workitem_pending queues
in kern/vfs_subr.c). If the file has not been written or deleted
after 30 seconds, the syncer pushes it out. As the syncer runs once
per second, dirty files are trickled out slowly over the 30-second
period instead of all at once by a call to sync(2).

The one drawback to this is that it does not cover the filesystem
metadata. To handle the metadata, vfs_allocate_syncvnode() is called
to create a "filesystem syncer vnode" at mount time which cycles
around the cleaning queue being sync'ed every 30 seconds. In the
original design, the only things it would sync for UFS were the
filesystem metadata: inode blocks, cylinder group bitmaps, and the
superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from
which the filesystem is mounted).

Somewhere in its path to integration with FreeBSD the flushing of
the filesystem syncer vnode got changed to sync every vnode associated
with the filesystem. The result of this change is to return to the
old filesystem-wide flush every 30-seconds behavior and makes the
whole 30-second delay per vnode useless.

This change goes back to the originally intended trickle out sync
behavior. Key to ensuring that all the intended semantics are
preserved (e.g., that all inode updates get flushed within a bounded
period of time) is that all inode modifications get pushed to their
corresponding inode blocks so that the metadata flush by the
filesystem syncer vnode gets them to the disk in a timely way.
Thanks to Konstantin Belousov (kib@) for doing the audit and commit
-r231122 which ensures that all of these updates are being made.

Reviewed by:    kib
Tested by:      scottl
MFC after:      2 weeks
2012-02-07 20:43:28 +00:00
Kirk McKusick
cc672d3599 Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
2012-01-17 01:08:01 +00:00
Kirk McKusick
b60ee81e3d Convert FFS mount error messages from kernel printf's to using the
vfs_mount_error error message facility provided by the nmount
interface.

Clean up formatting of mount warnings which still need to use
kernel printf's since they do not return errors.

Requested by: Craig Rodrigues <rodrigc@crodrigues.org>
MFC after: 2 weeks
2012-01-14 07:26:16 +00:00
Kirk McKusick
fddf7baebe Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)
2011-07-30 00:43:18 +00:00
Kirk McKusick
d716efa9f7 Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)
2011-07-24 18:27:09 +00:00
Kirk McKusick
927a12ae16 Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.
2011-07-15 16:20:33 +00:00
Kirk McKusick
8795189c98 Allow disk partitions associated with UFS read-only mounted
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.

Discussed with: Jeff Roberson
2011-07-10 00:41:31 +00:00
Kirk McKusick
9420dc62cd Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.
2011-06-12 18:46:48 +00:00
Jeff Roberson
280e091a99 Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

 - Append partial truncation freework structures to indirdeps while
   truncation is proceeding.  These prevent new block pointers from
   becoming valid until truncation completes and serialize truncations.
 - On completion of a partial truncate journal work waits for zeroed
   pointers to hit indirects.
 - softdep_journal_freeblocks() handles last frag allocation and last
   block zeroing.
 - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
   is only implemented in one place.
 - Block allocation failure handling moved up one level so it does not
   proceed with buf locks held.  This permits us to do more extensive
   reclaims when filesystem space is exhausted.
 - softdep_sync_metadata() is broken into two parts, the first executes
   once at the start of ffs_syncvnode() and flushes truncations and
   inode dependencies.  The second is called on each locked buf.  This
   eliminates excessive looping and rollbacks.
 - Improve the mechanism in process_worklist_item() that handles
   acquiring vnode locks for handle_workitem_remove() so that it works
   more generally and does not loop excessively over the same worklist
   items on each call.
 - Don't corrupt directories by zeroing the tail in fsck.  This is only
   done for regular files.
 - Push a fsync complete record for files that need it so the checker
   knows a truncation in the journal is no longer valid.

Discussed with:	mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by:	pho
2011-06-10 22:48:35 +00:00
Rick Macklem
694a586a43 Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
Konstantin Belousov
16b1f68d8c Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by:	bde
MFC after:	1 week
2011-03-20 21:05:09 +00:00
Konstantin Belousov
8c2a54de80 Add kernel side support for BIO_DELETE/TRIM on UFS.
The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.

Based on the patch by:	mckusick
Reviewed by:	mckusick, pjd
Tested by:	pho
MFC after:	1 month
2010-12-29 12:25:28 +00:00
Konstantin Belousov
fddd463dc2 Journal start looks up .sujournal file by doing lookup on the root dvp.
As result, failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.

Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.

Initial report by:	Garrett Cooper
Reproduced and tested by:	pho
2010-12-01 21:19:11 +00:00
Konstantin Belousov
d0cc54f3b4 The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by:	bde
MFC after:    2 weeks
2010-10-10 07:05:47 +00:00
David E. O'Brien
59b3a4ebb5 Correct some non-code typos. 2010-09-17 09:14:40 +00:00
John Baldwin
3634d5b241 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
John Baldwin
61e1c19319 Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with:	attilio
2010-07-16 19:52:03 +00:00
John Baldwin
dbfcf8cfea When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO.  This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock.  If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered.  As a result
the thread blocked on the vnode lock may never get woken up.  Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after:	3 days
2010-07-16 19:20:20 +00:00
Andriy Gapon
0b9626482b ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options.  Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR:		kern/141050
Reported by:	many
Reviewed by:	jh, bde
MFC after:	4 days
2010-05-19 09:32:11 +00:00
Andriy Gapon
deb3b115e2 ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after:	1 week
2010-04-29 10:04:00 +00:00
Jeff Roberson
113db2dddb - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
Andriy Gapon
ecaf3257be ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after:	1 week
2010-04-03 08:25:04 +00:00
Edward Tomasz Napierala
619961810c Remove unused variable. 2010-02-10 18:56:49 +00:00
Edward Tomasz Napierala
9340fc72e6 Implement NFSv4 ACL support for UFS.
Reviewed by:	rwatson
2009-12-21 19:39:10 +00:00
Konstantin Belousov
6cc745d2d7 insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by:	tegge
Approved by:	des (pseudofs part)
MFC after:	3 days
2009-09-07 11:55:34 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Attilio Rao
dfd233edd5 Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
Konstantin Belousov
c1d8b5e82c Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf().  The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
Konstantin Belousov
e65f5a4ead The non-modifying EA VOPs are executed with only shared vnode lock taken.
Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.

Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

Noted by:	YAMAMOTO, Taku <taku tackymt homeip net>
Tested by:	pho (previous version)
MFC after:	2 weeks
2009-03-12 12:43:56 +00:00
Konstantin Belousov
a9d9537110 Do not double-free the struct inode when insmntque failed. Default
insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.

Reviewed by:	tegge
MFC after:	3 days
2009-03-11 19:45:52 +00:00
John Baldwin
33fc362512 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
John Baldwin
5bd65606f4 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
Edward Tomasz Napierala
4f560d7595 Right now, when trying to unmount a device that's already gone,
msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO.
However, dounmount() treats ENXIO as a success and proceeds with
unmounting.  In effect, the filesystem gets unmounted without closing
GEOM provider etc.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Tested by:	dho
Sponsored by:	FreeBSD Foundation
2009-02-23 21:09:28 +00:00
Edward Tomasz Napierala
3c140b2df4 Refactor, moving error checking outside of the
'if (mp->mnt_flag & MNT_SOFTDEP)' conditional.  No functional
changes.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Tested by:	pho
Sponsored by:	FreeBSD Foundation
2009-02-23 20:56:27 +00:00
John Baldwin
ee445a69c5 - If the g_access() call for the initial root mount fails, then fully
cleanup.  Before the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the
  associated devfs vnode is locked.

Reviewed by:	kib
2009-02-11 22:19:54 +00:00
Edward Tomasz Napierala
49c4791ccc Make sure the cdev doesn't go away while the filesystem is still mounted.
Otherwise dev2udev() could return garbage.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Sponsored by:	FreeBSD Foundation
2009-01-29 16:47:15 +00:00
Konstantin Belousov
df86ccf642 If unmount of the ffs mp failed, reinitialize the extended attributes
for the mp, and restart them if autostart is enabled.

Reported and tested by:	pho
Reviewed by:	rwatson
MFC after:	3 weeks
2009-01-08 12:48:27 +00:00
Edward Tomasz Napierala
15bc6b2bd8 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Attilio Rao
0d7935fd01 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
John Baldwin
f888634792 Enable shared lookups on UFS. There are some remaining issues with forced
unmounts, but those are in the VFS lookup code are not UFS specific.

Tested by:	pho, kris
2008-09-24 18:53:04 +00:00
Konstantin Belousov
6fecb4e41e Suspend the write operations on the UFS filesystem being unmounted or
remounted from rw to ro.

Proposed and reviewed by:  tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:55:53 +00:00
Konstantin Belousov
2814d5ba5f When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:51:06 +00:00
Konstantin Belousov
52dfc8d7da Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00
Konstantin Belousov
90446e360c When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce UFS_RDONLY() struct ufsmount' method that reports the value
of the fs_ronly. The later is set to 1 only after the remount is
finished.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 10:59:35 +00:00
Konstantin Belousov
7b7ed832e4 Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by:	pho, kris
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:18:20 +00:00
Konstantin Belousov
e792b09be2 Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with:	rodrigc
MFC after:	3 days
2008-08-10 12:15:36 +00:00
Konstantin Belousov
a1a917e029 User may do "mount -o snapshot ...", that causes new FFS mount to be
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.

Noted by:	Mateusz Guzik <mjguzik gmail com>
MFC after:	3 days
2008-08-06 14:47:19 +00:00
Pawel Jakub Dawidek
a80d8caa74 Say hi to svn, by simplifing ffs_vget() function a bit - there is no need for
a variable that is used only once.
2008-07-19 22:29:44 +00:00
Craig Rodrigues
fb77e0af12 After converting the "snapshot" mount option to the MNT_SNAPSHOT flag,
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.

PR:		122833
Reported by:	Leon Kos <leon.kos lecad fs uni-lj si>,
		Jaakko Heinonen <jh saunalahti fi>
MFC after:	1 month
2008-05-24 00:41:32 +00:00
Craig Rodrigues
02a871f1ea For the following mount options, do not perform the string to flag conversions
here, because we already do them further up in vfs_donmount() in vfs_mount.c

async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW

MFC after:  1 month
2008-05-24 00:02:12 +00:00
John Baldwin
d952ba1bd5 Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo'
(such as 'atime' vs 'noatime').  The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'.  As such, their list of valid
mount options should include 'nofoo' instead of 'foo'.  With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error.  You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).

MFC after:	1 week
Reviewed by:	crodig
2008-03-26 20:48:07 +00:00
Jeff Roberson
698b1a6643 - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
Konstantin Belousov
e7fd887711 Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs.
It is normally initialized by ffs_statfs() after ffs_mount finished.

The extattr autostart code calls the ufs_lookup(), that uses value above
to iterate over the directory blocks, see bmask initialization in the
ufs_lookup() and ufsdirhash. Having the filesystem with root directory
spanning more then one block would result in reading a random kernel
memory.

PR:	kern/120781
Test case provided by:	rwatson
MFC after:	1 week
2008-03-05 16:34:03 +00:00
Robert Watson
6cf7bc60ec Move setting of MNTK_MPSAFE flag before UFS1 extended attribute
auto-start so that the flag is set before we start performing I/O
in the auto-start routine.

MFC after:	2 weeks
Suggested by:	kib
2008-03-04 12:10:03 +00:00
Attilio Rao
628f51d275 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
Attilio Rao
0e9eb108f0 Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
  always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
  Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
  present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo
2008-01-24 12:34:30 +00:00
Attilio Rao
d638e093d6 - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
  locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
  VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)
2008-01-19 17:36:23 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Alfred Perlstein
77465d9390 Get rid of qaddr_t.
Requested by: bde
2007-10-16 10:54:55 +00:00
Xin LI
04533fc68e Use *_EMPTY macros when appropriate. 2007-04-04 07:29:53 +00:00
Konstantin Belousov
36d4667907 Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no more
neccessary Giant acquisions in softdepend processing code.

Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)
2007-03-20 10:51:45 +00:00
Konstantin Belousov
088ffd2086 Implement fine-grained locking for UFS quotas.
Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.

i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.

Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)

This work were not possible without enormous amount of help given by
Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out
numerous errors and provided invaluable suggestions. Peter did tireless
testing of the patch as it was developed.
2007-03-14 08:54:08 +00:00
Tor Egge
61b9d89ff0 Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
Pawel Jakub Dawidek
10bcafe9ab Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by:	mckusick
Discussed with:	many (on IRC)
Tested with:	ufs, msdosfs, cd9660, nullfs and zfs
2007-02-15 22:08:35 +00:00
Konstantin Belousov
2cc7d26f7f Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
Robert Watson
acd3428b7d Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
Pawel Jakub Dawidek
1a60c7fc8e Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
  number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
  total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
  object is still open), increase fs_unrefs and cg_unrefs fields,
  which is a hint for fsck in which cylinder groups looks for such
  (orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by:	home.pl
2006-10-31 21:48:54 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00