Commit Graph

1903 Commits

Author SHA1 Message Date
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Konstantin Belousov
ba05dec5a4 The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered.  The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead.  Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration.  At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Reviewed by:	mckusick
MFC after:	2 weeks
2013-02-27 07:32:39 +00:00
Konstantin Belousov
94f4ac214c An inode block must not be blockingly read while cg block is owned.
The order is inode buffer lock -> snaplk -> cg buffer lock, reversing
the order causes deadlocks.

Inode block must not be written while cg block buffer is owned. The
FFS copy on write needs to allocate a block to copy the content of the
inode block, and the cylinder group selected for the allocation might
be the same as the owned cg block.  The reserved block detection code
in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the
situation, because the locked cg buffer is not exposed to it.

In order to maintain the dependency between initialized inode block
and the cg_initediblk pointer, look up the inode buffer in
non-blocking mode. If succeeded, brelse cg block, initialize the inode
block and write it.  After the write is finished, reread cg block and
update the cg_initediblk.

If inode block is already locked by another thread, let the another
thread initialize it.  If another thread raced with us after we
started writing inode block, the situation is detected by an update of
cg_initediblk.  Note that double-initialization of the inode block is
harmless, the block cannot be used until cg_initediblk is incremented.

Sponsored by:	The FreeBSD Foundation
In collaboration with:	pho
Reviewed by:	mckusick
MFC after:	1 month
X-MFC-note:	after r246877
2013-02-27 07:31:23 +00:00
Kirk McKusick
7839b23f00 The UFS2 filesystem allocates new blocks of inodes as they are needed.
When a cylinder group runs short of inodes, a new block for inodes is
allocated, zero'ed, and written to the disk. The zero'ed inodes must
be on the disk before the cylinder group can be updated to claim them.
If the cylinder group claiming the new inodes were written before the
zero'ed block of inodes, the system could crash with the filesystem in
an unrecoverable state.

Rather than adding a soft updates dependency to ensure that the new
inode block is written before it is claimed by the cylinder group
map, we just do a barrier write of the zero'ed inode block to ensure
that it will get written before the updated cylinder group map can
be written. This change should only slow down bulk loading of newly
created filesystems since that is the primary time that new inode
blocks need to be created.

Reported by: Robert Watson
Reviewed by: kib
Tested by:   Peter Holm
2013-02-16 15:11:40 +00:00
Konstantin Belousov
9604a7f1b8 Fix several unsafe pointer dereferences in the buffered_write()
function, implementing the sysctl vfs.ffs.set_bufoutput (not used in
the tree yet).

- The current directory vnode dereference is unsafe since fd_cdir
  could be changed and unreferenced, lock the filedesc around and vref
  the fd_cdir.
- The VTOI() conversion of the fd_cdir is unsafe without first
  checking that the vnode is indeed from an FFS mount, otherwise
  the code dereferences a random memory.
- The cdir could be reclaimed from under us, lock it around the
  checks.
- The type of the fp vnode might be not a disk, or it might have
  changed while the thread was in flight, check the type.

Reviewed and tested by:	mckusick
MFC after:	2 weeks
2013-02-10 10:17:33 +00:00
Pedro F. Giffuni
a940ce65cd Remove unused MAXSYMLINKLEN macro.
Reviewed by:	mckusick
PR:		kern/175794
MFC after:	1 week
2013-02-08 20:30:19 +00:00
Pedro F. Giffuni
be0c475e56 UFS: Remove dead assignment.
Submitted by:	Christoph Mallon
MFC after:	3 days
2013-02-03 21:30:02 +00:00
Kirk McKusick
fe85d98a5b For UFS2 i_blocks is unsigned. The current "sanity" check that it
has gone below zero after the blocks in its inode are freed is a
no-op which the compiler fails to warn about because of the use of
the DIP macro. Change the sanity check to compare the number of
blocks being freed against the value i_blocks. If the number of
blocks being freed exceeds i_blocks, just set i_blocks to zero.

Reported by: Pedro Giffuni (pfg@)
MFC after:   2 weeks
2013-02-03 17:16:32 +00:00
Konstantin Belousov
ddd6b3fc33 Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by:	The FreeBSD Foundation
2013-01-11 06:08:32 +00:00
Konstantin Belousov
f99cb34c4f The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned.  The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-01-01 16:14:48 +00:00
Konstantin Belousov
91e9474552 Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount.  The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock.  If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Reviewed by:	mckusick
MFC after:	2 weeks
X-MFC-note:	make the vfs_write_resume(9) function a macro after the MFC,
	in HEAD
2012-12-28 23:08:30 +00:00
Attilio Rao
b1308d72c2 Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by:	pho
Reviewed by:	kib, mdf
MFC after:	1 week
2012-12-21 13:14:12 +00:00
Konstantin Belousov
9f37ee804a Fix a typo, resulting in the NULL pointer dereference.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2012-12-15 02:03:59 +00:00
Attilio Rao
c6e0355cee r16312 is not any longer real since many years (likely since when VFS
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.

Removes comments that makes no sense now.

Reviewed by:	kib
MFC after:	3 days
2012-11-19 22:43:45 +00:00
Edward Tomasz Napierala
213a71c68b Fix build of kdump(1). 2012-11-18 22:03:31 +00:00
Edward Tomasz Napierala
1848286ada Add UFS writesuspension mechanism, designed to allow userland processes
to modify on-disk metadata for filesystems mounted for write.

Reviewed by:	kib, mckusick
Sponsored by:	FreeBSD Foundation
2012-11-18 18:57:19 +00:00
Jeff Roberson
ad9cdc05ba - Fix a truncation bug with softdep journaling that could leak blocks on
crash.  When truncating a file that never made it to disk we use the
   canceled allocation dependencies to hold the journal records until
   the truncation completes.  Previously allocdirect dependencies on
   the id_bufwait list were not considered and their journal space
   could expire before the bitmaps were written.  Cancel them and attach
   them to the freeblks as we do for other allocdirects.
 - Add KTR traces that were used to debug this problem.
 - When adding jsegdeps, always use jwork_insert() so we don't have more
   than one segdep on a given jwork list.

Sponsored by:	EMC / Isilon Storage Division
2012-11-14 06:37:43 +00:00
Jeff Roberson
b2c29d39cd - Fix a bug that has existed since the original softdep implementation.
When a background copy of a cg is written we complete any work associated
   with that bmsafemap.  If new work has been added to the non-background
   copy of the buffer it will be completed before the next write happens.
   The solution is to do the rollbacks when we make the copy so only those
   dependencies that were present at the time of writing will be completed
   when the background write completes.  This would've resulted in various
   bitmap related corruptions and panics.  It also would've expired journal
   entries early causing journal replay to miss some records.

MFC after:	2 weeks
2012-11-12 19:53:55 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Jeff Roberson
53cc0bebb9 - Correct rev 242734, segments can sometimes get stuck. Be a bit more
defensive with segment state.

Reported by:	 b. f. <bf1783@googlemail.com>
2012-11-09 04:04:25 +00:00
Jeff Roberson
40b43503c0 - Implement BIO_FLUSH support around journal entries. This will not 100%
solve power loss problems with dishonest write caches.  However, it
   should improve the situation and force a full fsck when it is unable
   to resolve with the journal.
 - Resolve a case where the journal could wrap in an unsafe way causing
   us to prematurely lose journal entries in very specific scenarios.

Discussed with:	mckusick
MFC after:	1 month
2012-11-08 01:41:04 +00:00
Kirk McKusick
aa7ddc85c7 When a file is first being written, the dynamic block reallocation
(implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.

When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block.  Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous.  This change compensates
for this problem by noting that the first indirect block should be
left immediately following the last direct block.  It then tries
to start a new cluster of contiguous blocks (referenced by the
indirect block) immediately following the indirect block.

We should also do this for other indirect block boundaries, but it
is only important for the first one.

Suggested by: Bruce Evans
MFC:          2 weeks
2012-11-03 18:55:55 +00:00
Jeff Roberson
6d95eb4c5f - In cancel_mkdir_dotdot don't panic if the inodedep is not available. If
the previous diradd had already finished it could have been reclaimed
   already.  This would only happen under heavy dependency pressure.

Reported by:	Andrey Zonov <zont@FreeBSD.org>
Discussed with:	mckusick
MFC after:	1 week
2012-11-02 21:04:06 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Edward Tomasz Napierala
549f62fa42 Fix problem with geom_label(4) not recognizing UFS labels on filesystems
extended using growfs(8).  The problem here is that geom_label checks if
the filesystem size recorded in UFS superblock is equal to the provider
(i.e. device) size.  This check cannot be removed due to backward
compatibility.  On the other hand, in most cases growfs(8) cannot set
fs_size in the superblock to match the provider size, because, differently
from newfs(8), it cannot recompute cylinder group sizes.

To fix this problem, add another superblock field, fs_providersize, used
only for this purpose.  The geom_label(4) will attach if either fs_size
(filesystem created with newfs(8)) or fs_providersize (filesystem expanded
using growfs(8)) matches the device size.

PR:		kern/165962
Reviewed by:	mckusick
Sponsored by:	FreeBSD Foundation
2012-10-30 21:32:10 +00:00
Edward Tomasz Napierala
f1988d463c Fix two problems that caused instant panic when the device mounted
with softupdates went away.  Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.

Reviewed by:	mckusick
2012-10-28 18:53:28 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Matthew D Fleming
fc8fdae0df Fix up kernel sources to be ready for a 64-bit ino_t.
Original code by:	Gleb Kurtsou
2012-09-27 23:30:49 +00:00
Mateusz Guzik
1ec9bedabe Remove unused member of struct indir (in_exists) from UFS and EXT2 code.
Reviewed by:	mckusick
Approved by:	trasz (mentor)
MFC after:	1 week
2012-08-17 17:45:27 +00:00
Konstantin Belousov
1c771f9222 After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
Kevin Lo
f7a3729c91 Use NULL instead of 0 for pointers 2012-07-22 15:40:31 +00:00
Konstantin Belousov
c5c1199c83 Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by:	pho
No objections from:	jhb
MFC after:    3 weeks
2012-07-02 21:01:03 +00:00
Konstantin Belousov
7aac7bc18a Fix unbounded-length malloc, controlled from usermode. The added check
is performed before exact size of the buffer is calculated, but the
buffer cannot have size greater then the total space allocated for
extended attributes. The existing check is executing with precise
size, but it is too late, since buffer needs to be allocated in
advance.

Also, adapt to uio_resid being of ssize_t type.  Use lblktosize instead of
multiplying by fs block size by hand as well.

Reported and tested by:	  pho
MFC after:   1 week
2012-06-21 09:20:07 +00:00
Kirk McKusick
aa445c9d7c In softdep_setup_inomapdep() we may have to allocate both inodedep
and bmsafemap dependency structures in inodedep_lookup() and
bmsafemap_lookup() respectively. The setup of these structures must
be done while holding the soft-dependency mutex. If the inodedep is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the bmsafemap. If the bmsafemap is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the inodedep.

To resolve this problem, bmsafemap_lookup has had a parameter added
that allows a pre-malloc'ed bmsafemap to be passed in so that it does
not need to release the mutex to create a new bmsafemap. The
softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency
before acquiring the mutex and starting to build the inodedep with a
call to inodedep_lookup(). The subsequent call to bmsafemap_lookup()
is passed this pre-allocated bmsafemap entry so that it need not
release the mutex if it needs to create a new one.

Reported by: Peter Holm
Tested by:   Peter Holm
MFC after:   1 week
2012-06-11 23:07:21 +00:00
Konstantin Belousov
b569050a78 Enable vn_io_fault() lock avoidance for UFS.
Tested by:	pho
MFC after:	2 months
2012-05-30 16:45:41 +00:00
Konstantin Belousov
6ee10a96c0 Implement SEEK_HOLE/SEEK_DATA for UFS.
MFC after:	2 weeks
2012-05-26 05:29:53 +00:00
Kirk McKusick
8b6207110d Add missing `continue' statement at end of case.
Found by:  Kevin Lo (kevlo@)
MFC after: 1 week
2012-05-18 15:20:21 +00:00
Edward Tomasz Napierala
06c65b6b41 Remove unused thread argument from ufs_extattr_uepm_lock()/ufs_extattr_uepm_unlock(). 2012-04-23 17:56:35 +00:00
Edward Tomasz Napierala
05cc75de83 Fix build. 2012-04-23 17:54:49 +00:00
Edward Tomasz Napierala
26621e1f06 Remove unused thread argument from clear_inodeps() and clear_remove(). 2012-04-23 14:44:18 +00:00
Edward Tomasz Napierala
af6e6b87ad Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala
c52fd858ae Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
Edward Tomasz Napierala
72b8ff1c74 Fix use-after-free introduced in r234036.
Reviewed by:	mckusick
Tested by:	pho
2012-04-21 10:45:46 +00:00
Kirk McKusick
dca5e0ec50 This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
Jaakko Heinonen
808dd116cb The part about exec atime no longer applies in the comment.
Pointed out by:	bde
2012-04-18 15:19:00 +00:00
Kirk McKusick
71469bb38f Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
Kirk McKusick
ecb6e528c5 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
Jaakko Heinonen
fce74feae1 - Return EPERM from ufs_setattr() when an user without PRIV_VFS_SYSFLAGS
privilege attempts to toggle SF_SETTABLE flags.
- Use the '^' operator in the SF_SNAPSHOT anti-toggling check.

Flags are now stored to ip->i_flags in one place after all checks.

Submitted by:	bde
2012-04-10 15:59:37 +00:00
Edward Tomasz Napierala
2b028c25d3 Fix panic in ffs_reload(), which may happen when read-only filesystem
gets resized and then reloaded.

Reviewed by:	kib, mckusick (earlier version)
Sponsored by:	The FreeBSD Foundation
2012-04-08 13:44:55 +00:00
Kirk McKusick
b73ffa31d4 Drop an unnecessary setting of si_mountpt when updating a UFS mount point.
Clearly it must have been set when the mount was done.

Reviewed by: kib
2012-04-08 06:14:49 +00:00
Jaakko Heinonen
63dfd849a1 Add a check for unsupported file flags to ufs_setattr().
Discussed with:	bde
MFC after:	2 weeks
2012-04-04 14:50:21 +00:00
Kirk McKusick
23d6e518da A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).

With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been commited to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.

This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).

While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.

Reported by:      Kirk Russell
Fix submitted by: Jeff Roberson
Tested by:        Peter Holm
PR:               kern/159971
MFC (to 9 only):  2 weeks
2012-04-02 21:58:37 +00:00
Jaakko Heinonen
fccf286d24 - Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Add a comment for the immutable/append check done after handling of
  the flags.
- Style improvements.

No functional change intended.

Submitted by:	bde
MFC after:	2 weeks
2012-04-02 16:33:21 +00:00
Kirk McKusick
6c09f4a27c A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.

This change will be included along with 232351 when it is MFC'ed to 9.

Spotted by:  kib
Reviewed by: kib
2012-03-28 21:21:19 +00:00
Kirk McKusick
1faacf5d09 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
Konstantin Belousov
ea573a50b3 Do trivial reformatting of the comment to record the missed commit
message for r233609:
Restore the writes of atimes, quotas and superblock from syncer vnode.

Noted by:   rdivacky
2012-03-28 14:16:15 +00:00
Konstantin Belousov
a988a5c609 Reviewed by: bde, mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-28 14:06:47 +00:00
Konstantin Belousov
64c8ead942 Microoptimize: in qsync loop over mount vnodes, only unlock mount
interlock after we committed to try to vget() the vnode.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
2012-03-28 13:56:18 +00:00
Konstantin Belousov
e0c1740853 Update comment.
MFC after:	3 days
2012-03-28 13:47:07 +00:00
Kirk McKusick
75a5838904 Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib
2012-03-25 00:02:37 +00:00
Konstantin Belousov
064f517d2b Supply boolean as the second argument to ffs_update(), and not a
MNT_[NO]WAIT constants, which in fact always caused sync operation.

Based on the submission by:	bde
Reviewed by:	mckusick
MFC after:	2 weeks
2012-03-13 22:04:27 +00:00
Konstantin Belousov
92ccae0399 Remove superfluous brackets.
Submitted by:	alc
MFC after:	2 weeks
2012-03-11 21:25:42 +00:00
Konstantin Belousov
dd522d76dc Do schedule delayed writes for async mounts.
While there, make some style adjustments, like missed () around
return values.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:26:19 +00:00
Konstantin Belousov
2fd2c0b1e3 Do not fall back to slow synchronous i/o when low on memory or buffers.
The bawrite() schedules the write to happen immediately, and its use
frees the current thread to do more cleanups.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:23:46 +00:00
Konstantin Belousov
4cd74eecda In ffs_syncvnode(), pass boolean false as second argument of ffs_update().
Synchronous inode block update is not needed for MNT_LAZY callers (syncer),
and since waitfor values are not zero, code did unneccessary synchronous
update.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:18:14 +00:00
Konstantin Belousov
18ef3670e5 Remove not needed ARGSUSED lint command.
Submitted by:	bde
MFC after:	3 days
2012-03-11 20:15:12 +00:00
Konstantin Belousov
b80dcb55aa Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
Peter Holm
e521b5288a Revert r232692 as the correct place to fix this is at the syscall level. 2012-03-09 17:19:50 +00:00
Konstantin Belousov
38ddb5725b Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
Peter Holm
80042581a5 syscall() fuzzing can trigger this panic. Return EINVAL instead.
MFC after:	1 week
2012-03-08 12:49:08 +00:00
John Baldwin
58d65e8031 Similar to the fixes in 226967 and 226987, purge any name cache entries
associated with the previous vnode (if any) associated with the target of
a rename().  Otherwise, a lookup of the target pathname concurrent with a
rename() could re-add a name cache entry after the namei(RENAME) lookup
in kern_renameat() had purged the target pathname.

MFC after:	2 weeks
2012-03-02 18:55:19 +00:00
Kirk McKusick
35338e6091 This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
Konstantin Belousov
2b19466836 Properly lock DQREF() with dqhlock. Missed locking caused counter
corruption.

Assert that the dq reference value is sane before decrementing it.

Reported and tested by:	pho
MFC after:	1 week
2012-02-22 20:03:51 +00:00
Konstantin Belousov
526d0bd547 Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
Kirk McKusick
e8e848ef8e Missing conditions in checking whether an inode has been written.
Found and tested by: Peter Holm
MFC after:           2 weeks (to 9 only)
2012-02-13 01:33:39 +00:00
Kirk McKusick
5ecba76999 Historically when an application wrote an entire block of a file,
the kernel allocated a buffer but did not zero it as it was about
to be completely filled by a uiomove() from the user's buffer.
However, if the uiomove() failed, the old contents of the buffer
could be exposed especially if the file was being mmap'ed. The
fix was to always zero the buffer when it was allocated.

This change first attempts the uiomove() to the newly allocated
(and dirty) buffer and only zeros it if the uiomove() fails. The
effect is to eliminate the gratuitous zeroing of the buffer in
the usual case where the uiomove() successfully fills it.

Reviewed by:    kib
Tested by:      scottl
MFC after:      2 weeks (to 9 only)
2012-02-09 22:34:16 +00:00
Kirk McKusick
19c87af0fd In the original days of BSD, a sync was issued on every filesystem
every 30 seconds. This spike in I/O caused the system to pause every
30 seconds which was quite annoying. So, the way that sync worked
was changed so that when a vnode was first dirtied, it was put on
a 30-second cleaning queue (see the syncer_workitem_pending queues
in kern/vfs_subr.c). If the file has not been written or deleted
after 30 seconds, the syncer pushes it out. As the syncer runs once
per second, dirty files are trickled out slowly over the 30-second
period instead of all at once by a call to sync(2).

The one drawback to this is that it does not cover the filesystem
metadata. To handle the metadata, vfs_allocate_syncvnode() is called
to create a "filesystem syncer vnode" at mount time which cycles
around the cleaning queue being sync'ed every 30 seconds. In the
original design, the only things it would sync for UFS were the
filesystem metadata: inode blocks, cylinder group bitmaps, and the
superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from
which the filesystem is mounted).

Somewhere in its path to integration with FreeBSD the flushing of
the filesystem syncer vnode got changed to sync every vnode associated
with the filesystem. The result of this change is to return to the
old filesystem-wide flush every 30-seconds behavior and makes the
whole 30-second delay per vnode useless.

This change goes back to the originally intended trickle out sync
behavior. Key to ensuring that all the intended semantics are
preserved (e.g., that all inode updates get flushed within a bounded
period of time) is that all inode modifications get pushed to their
corresponding inode blocks so that the metadata flush by the
filesystem syncer vnode gets them to the disk in a timely way.
Thanks to Konstantin Belousov (kib@) for doing the audit and commit
-r231122 which ensures that all of these updates are being made.

Reviewed by:    kib
Tested by:      scottl
MFC after:      2 weeks
2012-02-07 20:43:28 +00:00
Konstantin Belousov
36fc415d5d Sprinkle missed calls to asynchronous UFS_UPDATE() in attempt to
guarantee that all UFS inode metadata changes results in the dirtiness
of the inodeblock.  Due to missed inodeblock updates, syncer was
required to fsync each mount point' vnode to guarantee periodic
metadata flush.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-07 09:51:41 +00:00
Konstantin Belousov
752a98b13e Add missing opt_quota.h include to activate #ifdef QUOTA blocks,
apparently a step in unbreaking QUOTA support.

Reported and tested by:	Adam Strohl <adams-freebsd ateamsystems com>
MFC after:	1 week
2012-02-06 17:59:14 +00:00
Konstantin Belousov
b313a71044 JNEWBLK dependency may legitimately appear on the buf dependency
list. If softdep_sync_buf() discovers such dependency, it should do
nothing, which is safe as it is only waiting on the parent buffer to
be written, so it can be removed.

Committed on behalf of:	 jeff
MFC after:   1 week
2012-02-06 11:47:24 +00:00
Konstantin Belousov
c480f781ea Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
Kirk McKusick
86b571509a There are several bugs/hangs when trying to take a snapshot on a UFS/FFS
filesystem running with journaled soft updates. Until these problems
have been tracked down, return ENOTSUPP when an attempt is made to
take a snapshot on a filesystem running with journaled soft updates.

MFC after: 2 weeks
2012-01-17 01:14:56 +00:00
Kirk McKusick
cc672d3599 Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
2012-01-17 01:08:01 +00:00
Ivan Voras
d83064751a Add a bit of verbosity to the comment. 2012-01-16 15:47:42 +00:00
Kirk McKusick
b60ee81e3d Convert FFS mount error messages from kernel printf's to using the
vfs_mount_error error message facility provided by the nmount
interface.

Clean up formatting of mount warnings which still need to use
kernel printf's since they do not return errors.

Requested by: Craig Rodrigues <rodrigc@crodrigues.org>
MFC after: 2 weeks
2012-01-14 07:26:16 +00:00
Konstantin Belousov
3ab0160340 Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by:	pho
MFC after:	1 week
2012-01-08 23:06:53 +00:00
Ed Schouten
8f8d30274a Migrate ufs and ext2fs from skpc() to memcchr().
While there, remove a useless check from the code. memcchr() always
returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff
can never be equal to zero. Also, the fact that memcchr() returns a
pointer instead of the number of bytes until the end, makes conversion
to an offset far more easy.
2012-01-01 20:47:33 +00:00
Gleb Kurtsou
58b1333ae5 Use implementation independent inoNN_t scalars for on-disk UFS structures
Approved by:	mdf (mentor)
2011-11-09 07:48:48 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Ed Schouten
8f80f103b4 Remove MALLOC_DECLAREs of nonexisting malloc-pools.
After careful grepping, it seems none of these pools can be found in our
source tree. They are not in use, nor are they defined.
2011-11-06 20:16:50 +00:00
Peter Holm
b76b6150d5 Fix the wrong commit log message for r226967: "Added missing cache purge
of from argument" and fix the comment.
2011-10-31 20:24:33 +00:00
Peter Holm
6890c3a990 The kern_renameat() looks up the fvp using the DELETE flag, which causes
the removal of the name cache entry for fvp.

Reported by:	Anton Yuzhaninov <citrin citrin ru>
In collaboration with:	kib
MFC after:	1 week
2011-10-31 15:01:47 +00:00
Kirk McKusick
8cd680f8c3 This update eliminates a lock-order reversal warning discovered
whle tracking down the system hang reported in kern/160662 and
corrected in revision 225806. The LOR is not the cause of the system
hang and indeed cannot cause an actual deadlock. However, it can
be easily eliminated by defering the acquisition of a buflock until
after all the vnode locks have been acquired.

Reported by:     Hans Ottevanger
PR:              kern/160662
2011-09-27 17:41:48 +00:00
Kirk McKusick
6b3b8a2109 This update eliminates the system hang reported in kern/160662 when
taking a snapshot on a filesystem running with journaled soft updates.

Reported by:     Hans Ottevanger
Fix verified by: Hans Ottevanger
PR:              kern/160662
2011-09-27 17:34:02 +00:00
Konstantin Belousov
b296414c62 Use nowait sync request for a vnode when doing softdep cleanup. We possibly
own the unrelated vnode lock, doing waiting sync causes deadlocks.

Reported and tested by:	pho
Approved by:	re (bz)
2011-09-20 21:53:26 +00:00
Martin Matuska
82378711f9 Generalize ffs_pages_remove() into vn_pages_remove().
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR:		kern/160035, kern/156933
Reviewed by:	kib, pjd
Approved by:	re (kib)
MFC after:	1 week
2011-08-25 08:17:39 +00:00
Andrey V. Elsukov
5373cf4e34 Fix lock leak.
Reported by:	Alex Lyashkov
Approved by:	re (kib)
MFC after:	1 week
2011-08-23 08:47:27 +00:00
Robert Watson
4b3a6fb933 Fix two cases involving opt_capsicum.h and module builds:
(1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the
   #include.

(2) portalfs depends on opt_capsicum.h, so have the Makefile generate one
   if required.

These affect only modules built without a kernel (i.e, not buildkernel,
but yes buildworld if the dubious MODULES_WITH_WORLD is used).

Approved by:	re (bz)
Sponsored by:	Google Inc
2011-08-15 07:32:44 +00:00