Commit Graph

1967 Commits

Author SHA1 Message Date
Konstantin Belousov
6ee10a96c0 Implement SEEK_HOLE/SEEK_DATA for UFS.
MFC after:	2 weeks
2012-05-26 05:29:53 +00:00
Kirk McKusick
8b6207110d Add missing `continue' statement at end of case.
Found by:  Kevin Lo (kevlo@)
MFC after: 1 week
2012-05-18 15:20:21 +00:00
Edward Tomasz Napierala
06c65b6b41 Remove unused thread argument from ufs_extattr_uepm_lock()/ufs_extattr_uepm_unlock(). 2012-04-23 17:56:35 +00:00
Edward Tomasz Napierala
05cc75de83 Fix build. 2012-04-23 17:54:49 +00:00
Edward Tomasz Napierala
26621e1f06 Remove unused thread argument from clear_inodeps() and clear_remove(). 2012-04-23 14:44:18 +00:00
Edward Tomasz Napierala
af6e6b87ad Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
Edward Tomasz Napierala
c52fd858ae Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
Edward Tomasz Napierala
72b8ff1c74 Fix use-after-free introduced in r234036.
Reviewed by:	mckusick
Tested by:	pho
2012-04-21 10:45:46 +00:00
Kirk McKusick
dca5e0ec50 This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
Jaakko Heinonen
808dd116cb The part about exec atime no longer applies in the comment.
Pointed out by:	bde
2012-04-18 15:19:00 +00:00
Kirk McKusick
71469bb38f Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
Kirk McKusick
ecb6e528c5 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
Jaakko Heinonen
fce74feae1 - Return EPERM from ufs_setattr() when an user without PRIV_VFS_SYSFLAGS
privilege attempts to toggle SF_SETTABLE flags.
- Use the '^' operator in the SF_SNAPSHOT anti-toggling check.

Flags are now stored to ip->i_flags in one place after all checks.

Submitted by:	bde
2012-04-10 15:59:37 +00:00
Edward Tomasz Napierala
2b028c25d3 Fix panic in ffs_reload(), which may happen when read-only filesystem
gets resized and then reloaded.

Reviewed by:	kib, mckusick (earlier version)
Sponsored by:	The FreeBSD Foundation
2012-04-08 13:44:55 +00:00
Kirk McKusick
b73ffa31d4 Drop an unnecessary setting of si_mountpt when updating a UFS mount point.
Clearly it must have been set when the mount was done.

Reviewed by: kib
2012-04-08 06:14:49 +00:00
Jaakko Heinonen
63dfd849a1 Add a check for unsupported file flags to ufs_setattr().
Discussed with:	bde
MFC after:	2 weeks
2012-04-04 14:50:21 +00:00
Kirk McKusick
23d6e518da A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).

With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been commited to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.

This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).

While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.

Reported by:      Kirk Russell
Fix submitted by: Jeff Roberson
Tested by:        Peter Holm
PR:               kern/159971
MFC (to 9 only):  2 weeks
2012-04-02 21:58:37 +00:00
Jaakko Heinonen
fccf286d24 - Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Add a comment for the immutable/append check done after handling of
  the flags.
- Style improvements.

No functional change intended.

Submitted by:	bde
MFC after:	2 weeks
2012-04-02 16:33:21 +00:00
Kirk McKusick
6c09f4a27c A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.

This change will be included along with 232351 when it is MFC'ed to 9.

Spotted by:  kib
Reviewed by: kib
2012-03-28 21:21:19 +00:00
Kirk McKusick
1faacf5d09 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
Konstantin Belousov
ea573a50b3 Do trivial reformatting of the comment to record the missed commit
message for r233609:
Restore the writes of atimes, quotas and superblock from syncer vnode.

Noted by:   rdivacky
2012-03-28 14:16:15 +00:00
Konstantin Belousov
a988a5c609 Reviewed by: bde, mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-28 14:06:47 +00:00
Konstantin Belousov
64c8ead942 Microoptimize: in qsync loop over mount vnodes, only unlock mount
interlock after we committed to try to vget() the vnode.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
2012-03-28 13:56:18 +00:00
Konstantin Belousov
e0c1740853 Update comment.
MFC after:	3 days
2012-03-28 13:47:07 +00:00
Kirk McKusick
75a5838904 Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib
2012-03-25 00:02:37 +00:00
Konstantin Belousov
064f517d2b Supply boolean as the second argument to ffs_update(), and not a
MNT_[NO]WAIT constants, which in fact always caused sync operation.

Based on the submission by:	bde
Reviewed by:	mckusick
MFC after:	2 weeks
2012-03-13 22:04:27 +00:00
Konstantin Belousov
92ccae0399 Remove superfluous brackets.
Submitted by:	alc
MFC after:	2 weeks
2012-03-11 21:25:42 +00:00
Konstantin Belousov
dd522d76dc Do schedule delayed writes for async mounts.
While there, make some style adjustments, like missed () around
return values.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:26:19 +00:00
Konstantin Belousov
2fd2c0b1e3 Do not fall back to slow synchronous i/o when low on memory or buffers.
The bawrite() schedules the write to happen immediately, and its use
frees the current thread to do more cleanups.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:23:46 +00:00
Konstantin Belousov
4cd74eecda In ffs_syncvnode(), pass boolean false as second argument of ffs_update().
Synchronous inode block update is not needed for MNT_LAZY callers (syncer),
and since waitfor values are not zero, code did unneccessary synchronous
update.

Submitted by:	bde
Reviewed by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2012-03-11 20:18:14 +00:00
Konstantin Belousov
18ef3670e5 Remove not needed ARGSUSED lint command.
Submitted by:	bde
MFC after:	3 days
2012-03-11 20:15:12 +00:00
Konstantin Belousov
b80dcb55aa Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
Peter Holm
e521b5288a Revert r232692 as the correct place to fix this is at the syscall level. 2012-03-09 17:19:50 +00:00
Konstantin Belousov
38ddb5725b Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
Peter Holm
80042581a5 syscall() fuzzing can trigger this panic. Return EINVAL instead.
MFC after:	1 week
2012-03-08 12:49:08 +00:00
John Baldwin
58d65e8031 Similar to the fixes in 226967 and 226987, purge any name cache entries
associated with the previous vnode (if any) associated with the target of
a rename().  Otherwise, a lookup of the target pathname concurrent with a
rename() could re-add a name cache entry after the namei(RENAME) lookup
in kern_renameat() had purged the target pathname.

MFC after:	2 weeks
2012-03-02 18:55:19 +00:00
Kirk McKusick
35338e6091 This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
Konstantin Belousov
2b19466836 Properly lock DQREF() with dqhlock. Missed locking caused counter
corruption.

Assert that the dq reference value is sane before decrementing it.

Reported and tested by:	pho
MFC after:	1 week
2012-02-22 20:03:51 +00:00
Konstantin Belousov
526d0bd547 Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
Kirk McKusick
e8e848ef8e Missing conditions in checking whether an inode has been written.
Found and tested by: Peter Holm
MFC after:           2 weeks (to 9 only)
2012-02-13 01:33:39 +00:00
Kirk McKusick
5ecba76999 Historically when an application wrote an entire block of a file,
the kernel allocated a buffer but did not zero it as it was about
to be completely filled by a uiomove() from the user's buffer.
However, if the uiomove() failed, the old contents of the buffer
could be exposed especially if the file was being mmap'ed. The
fix was to always zero the buffer when it was allocated.

This change first attempts the uiomove() to the newly allocated
(and dirty) buffer and only zeros it if the uiomove() fails. The
effect is to eliminate the gratuitous zeroing of the buffer in
the usual case where the uiomove() successfully fills it.

Reviewed by:    kib
Tested by:      scottl
MFC after:      2 weeks (to 9 only)
2012-02-09 22:34:16 +00:00
Kirk McKusick
19c87af0fd In the original days of BSD, a sync was issued on every filesystem
every 30 seconds. This spike in I/O caused the system to pause every
30 seconds which was quite annoying. So, the way that sync worked
was changed so that when a vnode was first dirtied, it was put on
a 30-second cleaning queue (see the syncer_workitem_pending queues
in kern/vfs_subr.c). If the file has not been written or deleted
after 30 seconds, the syncer pushes it out. As the syncer runs once
per second, dirty files are trickled out slowly over the 30-second
period instead of all at once by a call to sync(2).

The one drawback to this is that it does not cover the filesystem
metadata. To handle the metadata, vfs_allocate_syncvnode() is called
to create a "filesystem syncer vnode" at mount time which cycles
around the cleaning queue being sync'ed every 30 seconds. In the
original design, the only things it would sync for UFS were the
filesystem metadata: inode blocks, cylinder group bitmaps, and the
superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from
which the filesystem is mounted).

Somewhere in its path to integration with FreeBSD the flushing of
the filesystem syncer vnode got changed to sync every vnode associated
with the filesystem. The result of this change is to return to the
old filesystem-wide flush every 30-seconds behavior and makes the
whole 30-second delay per vnode useless.

This change goes back to the originally intended trickle out sync
behavior. Key to ensuring that all the intended semantics are
preserved (e.g., that all inode updates get flushed within a bounded
period of time) is that all inode modifications get pushed to their
corresponding inode blocks so that the metadata flush by the
filesystem syncer vnode gets them to the disk in a timely way.
Thanks to Konstantin Belousov (kib@) for doing the audit and commit
-r231122 which ensures that all of these updates are being made.

Reviewed by:    kib
Tested by:      scottl
MFC after:      2 weeks
2012-02-07 20:43:28 +00:00
Konstantin Belousov
36fc415d5d Sprinkle missed calls to asynchronous UFS_UPDATE() in attempt to
guarantee that all UFS inode metadata changes results in the dirtiness
of the inodeblock.  Due to missed inodeblock updates, syncer was
required to fsync each mount point' vnode to guarantee periodic
metadata flush.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-07 09:51:41 +00:00
Konstantin Belousov
752a98b13e Add missing opt_quota.h include to activate #ifdef QUOTA blocks,
apparently a step in unbreaking QUOTA support.

Reported and tested by:	Adam Strohl <adams-freebsd ateamsystems com>
MFC after:	1 week
2012-02-06 17:59:14 +00:00
Konstantin Belousov
b313a71044 JNEWBLK dependency may legitimately appear on the buf dependency
list. If softdep_sync_buf() discovers such dependency, it should do
nothing, which is safe as it is only waiting on the parent buffer to
be written, so it can be removed.

Committed on behalf of:	 jeff
MFC after:   1 week
2012-02-06 11:47:24 +00:00
Konstantin Belousov
c480f781ea Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
Kirk McKusick
86b571509a There are several bugs/hangs when trying to take a snapshot on a UFS/FFS
filesystem running with journaled soft updates. Until these problems
have been tracked down, return ENOTSUPP when an attempt is made to
take a snapshot on a filesystem running with journaled soft updates.

MFC after: 2 weeks
2012-01-17 01:14:56 +00:00
Kirk McKusick
cc672d3599 Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
2012-01-17 01:08:01 +00:00
Ivan Voras
d83064751a Add a bit of verbosity to the comment. 2012-01-16 15:47:42 +00:00
Kirk McKusick
b60ee81e3d Convert FFS mount error messages from kernel printf's to using the
vfs_mount_error error message facility provided by the nmount
interface.

Clean up formatting of mount warnings which still need to use
kernel printf's since they do not return errors.

Requested by: Craig Rodrigues <rodrigc@crodrigues.org>
MFC after: 2 weeks
2012-01-14 07:26:16 +00:00
Konstantin Belousov
3ab0160340 Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by:	pho
MFC after:	1 week
2012-01-08 23:06:53 +00:00
Ed Schouten
8f8d30274a Migrate ufs and ext2fs from skpc() to memcchr().
While there, remove a useless check from the code. memcchr() always
returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff
can never be equal to zero. Also, the fact that memcchr() returns a
pointer instead of the number of bytes until the end, makes conversion
to an offset far more easy.
2012-01-01 20:47:33 +00:00
Gleb Kurtsou
58b1333ae5 Use implementation independent inoNN_t scalars for on-disk UFS structures
Approved by:	mdf (mentor)
2011-11-09 07:48:48 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Ed Schouten
8f80f103b4 Remove MALLOC_DECLAREs of nonexisting malloc-pools.
After careful grepping, it seems none of these pools can be found in our
source tree. They are not in use, nor are they defined.
2011-11-06 20:16:50 +00:00
Peter Holm
b76b6150d5 Fix the wrong commit log message for r226967: "Added missing cache purge
of from argument" and fix the comment.
2011-10-31 20:24:33 +00:00
Peter Holm
6890c3a990 The kern_renameat() looks up the fvp using the DELETE flag, which causes
the removal of the name cache entry for fvp.

Reported by:	Anton Yuzhaninov <citrin citrin ru>
In collaboration with:	kib
MFC after:	1 week
2011-10-31 15:01:47 +00:00
Kirk McKusick
8cd680f8c3 This update eliminates a lock-order reversal warning discovered
whle tracking down the system hang reported in kern/160662 and
corrected in revision 225806. The LOR is not the cause of the system
hang and indeed cannot cause an actual deadlock. However, it can
be easily eliminated by defering the acquisition of a buflock until
after all the vnode locks have been acquired.

Reported by:     Hans Ottevanger
PR:              kern/160662
2011-09-27 17:41:48 +00:00
Kirk McKusick
6b3b8a2109 This update eliminates the system hang reported in kern/160662 when
taking a snapshot on a filesystem running with journaled soft updates.

Reported by:     Hans Ottevanger
Fix verified by: Hans Ottevanger
PR:              kern/160662
2011-09-27 17:34:02 +00:00
Konstantin Belousov
b296414c62 Use nowait sync request for a vnode when doing softdep cleanup. We possibly
own the unrelated vnode lock, doing waiting sync causes deadlocks.

Reported and tested by:	pho
Approved by:	re (bz)
2011-09-20 21:53:26 +00:00
Martin Matuska
82378711f9 Generalize ffs_pages_remove() into vn_pages_remove().
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR:		kern/160035, kern/156933
Reviewed by:	kib, pjd
Approved by:	re (kib)
MFC after:	1 week
2011-08-25 08:17:39 +00:00
Andrey V. Elsukov
5373cf4e34 Fix lock leak.
Reported by:	Alex Lyashkov
Approved by:	re (kib)
MFC after:	1 week
2011-08-23 08:47:27 +00:00
Robert Watson
4b3a6fb933 Fix two cases involving opt_capsicum.h and module builds:
(1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the
   #include.

(2) portalfs depends on opt_capsicum.h, so have the Makefile generate one
   if required.

These affect only modules built without a kernel (i.e, not buildkernel,
but yes buildworld if the dubious MODULES_WITH_WORLD is used).

Approved by:	re (bz)
Sponsored by:	Google Inc
2011-08-15 07:32:44 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
Kirk McKusick
fddf7baebe Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)
2011-07-30 00:43:18 +00:00
Kirk McKusick
d716efa9f7 Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)
2011-07-24 18:27:09 +00:00
Kirk McKusick
2621f2c43f Default debugging error messages to off for journaled soft updates sysctls.
Delete limiting on output of these sysctls.

Approved by: re (kib)
2011-07-22 18:03:33 +00:00
Kirk McKusick
927a12ae16 Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.
2011-07-15 16:20:33 +00:00
Kirk McKusick
b8ea56d7e4 Consistently check mount flag (MNTK_SUJ) rather than superblock
flag (FS_SUJ) when determining whether to do journaling-based
operations. The mount flag is set only when journaling is active
while the superblock flag is set to indicate that journaling is to
be used. For example, when the filesystem is mounted read-only, the
journaling may be present (FS_SUJ) but not active (MNTK_SUJ).
Inappropriate checking of the FS_SUJ flag was causing some
journaling actions to be attempted at inappropriate times.
2011-07-14 18:06:13 +00:00
Kirk McKusick
17ff0cf70f When first creating snapshots, we may free some blocks within it.
These blocks should not have TRIM applied to them.

Submitted by: Kostik Belousov
2011-07-10 05:34:49 +00:00
Kirk McKusick
8795189c98 Allow disk partitions associated with UFS read-only mounted
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.

Discussed with: Jeff Roberson
2011-07-10 00:41:31 +00:00
Konstantin Belousov
58f9394c50 Use 'curthread_pflags' instead of 'thread_pflags' to signify that only
curthread can be operated upon.

Requested by:	attilio
MFC after:	1 week
2011-07-09 15:16:07 +00:00
Konstantin Belousov
acf5d7101c Use helper functions instead of manually managing TDP_INBDFLUSH.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc (previous version)
MFC after:	1 week
2011-07-09 14:42:45 +00:00
Jeff Roberson
e9b4d8327f - Speed up pendingblock processing again. Having too much delay between
ffs_blkfree() and the pending adjustment causes all kinds of
   space related problems.
2011-07-04 22:08:04 +00:00
Jeff Roberson
f2803e61fa - Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now
find themselves on snapshot vnodes.

Reported by:	pho
2011-07-04 21:04:25 +00:00
Jeff Roberson
8e4f5b70b0 - It is impossible to run request_cleanup() while doing a copyonwrite.
This will most likely cause new block allocations which can recurse
   into request cleanup.
 - While here optimize the ufs locking slightly.  We need only acquire and
   drop once.
 - process_removes() and process_truncates() also is only needed once.
 - Attempt to flush each item on the worklist once but do not loop forever
   if some can not be completed.

Discussed with:	mckusick
2011-07-04 20:53:55 +00:00
Jeff Roberson
c0f4e7afa4 - Fix an inode quota leak. We need to decrement the quota once and only
once.

Tested by:	pho
Reviewed by:	mckusick
2011-07-04 20:52:23 +00:00
Kirk McKusick
08af0c8b8d Handle the FREEDEP case in softdep_sync_buf().
This fix failed to get added in -r223325.

Submitted by:	Peter Holm
2011-06-29 22:12:43 +00:00
Alan Cox
6bbee8e28a Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
Jeff Roberson
16f7d82285 - Fix directory count rollbacks by passing the mode to the journal dep
earlier.
 - Add rollback/forward code for frag and cluster accounting.
 - Handle the FREEDEP case in softdep_sync_buf().  (submitted by pho)
2011-06-20 03:25:09 +00:00
Kirk McKusick
9957ac07b2 Fixed dereference of a NULL pointer.
Reported by:	Peter Holm
2011-06-18 21:10:03 +00:00
Kirk McKusick
ff13f23f84 Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c
and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that
file.  No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor
should they as it is totally in-kernel interfaces). For added protection
I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL.

Feedback from:	Bruce Evans and Tai-hwa Liang
2011-06-16 23:40:10 +00:00
Tai-hwa Liang
09108f76fa Fixing compilation bustage by introducing another forward declaration. 2011-06-16 05:26:03 +00:00
Kirk McKusick
43a3cc7796 Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by:	Jeff Roberson
2011-06-15 23:19:09 +00:00
Kirk McKusick
2191e465cc With the restructuring of the block reclaimation code, the notification
messages for a filesystem being out of space need to be moved so that
they do not print out until after a failed cleanup attempt.

Suggested by:	Jeff Roberson
2011-06-15 18:05:08 +00:00
Kirk McKusick
e34a713594 Missing cleanup case after completion of a snapshot vnode write
claiming a released block.

Submitted by:	Jeff Roberson
Tested by:	Peter Holm
2011-06-15 06:13:08 +00:00
Dimitry Andric
222ef43340 Use alternative, less messy solution to avoid breakage after r223020:
put the snapdata structure between #ifdef _KERNEL guards.

Suggested by:	kib
2011-06-13 16:05:41 +00:00
Kirk McKusick
9eb8728aa5 Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by:	Jeff Roberson
Tested by:	Peter Holm
2011-06-12 19:27:05 +00:00
Kirk McKusick
9420dc62cd Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.
2011-06-12 18:46:48 +00:00
Jeff Roberson
280e091a99 Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

 - Append partial truncation freework structures to indirdeps while
   truncation is proceeding.  These prevent new block pointers from
   becoming valid until truncation completes and serialize truncations.
 - On completion of a partial truncate journal work waits for zeroed
   pointers to hit indirects.
 - softdep_journal_freeblocks() handles last frag allocation and last
   block zeroing.
 - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
   is only implemented in one place.
 - Block allocation failure handling moved up one level so it does not
   proceed with buf locks held.  This permits us to do more extensive
   reclaims when filesystem space is exhausted.
 - softdep_sync_metadata() is broken into two parts, the first executes
   once at the start of ffs_syncvnode() and flushes truncations and
   inode dependencies.  The second is called on each locked buf.  This
   eliminates excessive looping and rollbacks.
 - Improve the mechanism in process_worklist_item() that handles
   acquiring vnode locks for handle_workitem_remove() so that it works
   more generally and does not loop excessively over the same worklist
   items on each call.
 - Don't corrupt directories by zeroing the tail in fsck.  This is only
   done for regular files.
 - Push a fsync complete record for files that need it so the checker
   knows a truncation in the journal is no longer valid.

Discussed with:	mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by:	pho
2011-06-10 22:48:35 +00:00
Jeff Roberson
e84fa3ba71 - Add support for referencing quota structures without needing the inode
pointer for softupdates.

Submitted by:	mckusick
2011-06-10 22:19:44 +00:00
Jeff Roberson
5aa336ed20 - If the fsync in ufs_direnter fails SUJ can later panic because we have
partially added a name.  Allow ufs_direnter() to continue in the
   hopes that it is a transient error.  If it is not, the directory
   is corrupted already from IO errors and writing this new block
   is not likely to make things worse.
2011-06-10 22:18:25 +00:00
Kirk McKusick
9f62b10cb3 Grammer fix in comment.
Eliminate one (of several) possible conflicting buffer locks when
trying to reclaim blocks. Rest of fix to be incorporated as part
of SUJ update by jeff.

Pointed out by: Kostik Belousov
2011-06-05 22:36:30 +00:00
Kirk McKusick
1508294bb6 Due to a lag in updating the fs_pendinginodes count, we cannot depend
on it to decide whether we should try to reclaim inodes when we run
short.

Discovered by: Peter Holm
2011-05-28 15:07:29 +00:00
Kirk McKusick
99f6ac66ad The check for whether a block is going to be claimed by a snapshot
needs to happen before we notify the underlying layer that it is
being freed.
2011-05-26 23:56:58 +00:00
Rick Macklem
dbed8d1fc8 Fix the ufs/ffs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by:	mckusick
2011-05-22 20:39:07 +00:00
Rick Macklem
694a586a43 Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
Matthew D Fleming
3d08a76bbc Use a name instead of a magic number for kern_yield(9) when the priority
should not change.  Fetch the td_user_pri under the thread lock.  This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by:	Jason Behmer < jason DOT behmer AT isilon DOT com >
2011-05-13 05:27:58 +00:00
Konstantin Belousov
d3e4b05d20 Fix typos.
Noted by:	Fabian Keil <freebsd-listen fabiankeil de>
Pointy hat to:	kib
MFC after:	1 week
2011-04-30 22:46:02 +00:00
Konstantin Belousov
4417ac326a Clarify the comment.
MFC after:	1 week
2011-04-30 13:49:03 +00:00
Konstantin Belousov
d9ca1af7ed VFS sometimes is unable to inactivate a vnode when vnode use count
goes to zero. E.g., the vnode might be only shared-locked at the time of
vput() call. Such vnodes are kept in the hash, so they can be found later.

If ffs_valloc() allocated an inode that has its vnode cached in hash, and
still owing the inactivation, then vget() call from ffs_valloc() clears
VI_OWEINACT, and then the vnode is reused for the newly allocated inode.

The problem is, the vnode is not reclaimed before it is put to the new
use. ffs_valloc() recycles vnode vm object, but this is not enough.
In particular, at least v_vflag should be cleared, and several bits of
UFS state need to be removed.

It is very inconvenient to call vgone() at this point. Instead, move
some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(),
and call the helper from VOP_RECLAIM and ffs_valloc().

Reviewed by:	mckusick
Tested by:	pho
MFC after:	3 weeks
2011-04-24 10:47:56 +00:00
Jeff Roberson
273ca85137 - Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
 - Use dep_current[] in place of specific dependency counts.  This is
   automatically maintained when workitems are allocated and has
   less risk of becoming incorrect.
2011-04-11 01:43:59 +00:00
Jeff Roberson
4ac80906c3 Fix a long standing SUJ performance problem:
- Keep a hash of indirect blocks that have recently been freed and are
   still referenced in the journal.
 - Lookup blocks in this hash before forcing a new block write to wait on
   the journal entry to hit the disk.  This is only necessary to avoid
   confusion between old identities as indirects and new identities as
   file blocks.
 - Don't free jseg structures until the journal has written a record that
   invalidates it.  This keeps the indirect block information around for
   as long as is required to be safe.
 - Force an empty journal block write when required to flush out stale
   journal data that is simply waiting for the oldest valid sequence
   number to advance beyond it.
2011-04-10 03:49:53 +00:00
Jeff Roberson
59343c7b98 - Don't invalidate jnewblks immediately upon discovering that the block
will be removed.  Permit the journal to proceed so that we don't leave
   a rollback in a cg for a very long time as this can cause terrible perf
   problems in low memory situations.

Tested by:      pho
2011-04-07 03:19:10 +00:00
Kirk McKusick
4c821a3978 Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks
2011-04-05 21:26:05 +00:00
Jeff Roberson
f79d4144ab Fix problems that manifested from filesystem full conditions:
- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
   the jaddref so we can make assumptions about where the dotaddref is on
   the list.  cancel_jaddref() does not always remove items from the list
   anymore.
 - Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
   was never set.  This ensures that dependencies will continue to be
   processed on the inowait/bufwait list and is more an artifact of
   the structure of the code than a pure ordering problem.
 - Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
   appropriately.  This normally occurs when the refs are added to the
   journal but if they are canceled before this point the state would
   never be set and the dependency could never be freed.

Reported by:	pho
Tested by:	pho
2011-04-02 21:52:58 +00:00
Konstantin Belousov
861ed1162b Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.
Submitted by:	Aleksandr Rybalko <ray dlink ua>
2011-03-28 12:39:48 +00:00
Kirk McKusick
0a809056ce Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm
2011-03-23 05:13:54 +00:00
Konstantin Belousov
16b1f68d8c Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by:	bde
MFC after:	1 week
2011-03-20 21:05:09 +00:00
Konstantin Belousov
ffda66c299 Remove the #if defined(FFS) || defined(IFS) braces around the calls to
ffs_snapgone(). ufs.ko module is not build with FFS define, causing
snapshot inode number slots in superblock never be freed, as well as a
reference on the snapshot vnode.

IFS was removed several years ago, and UFS/FFS separation was not
maintained for real.

Reported, analyzed and tested by:	Yamagi Burmeister <lists yamagi org>
MFC after:	3 days
2011-03-17 11:23:12 +00:00
Konstantin Belousov
0714775845 Simplify uses of the web of pointers.
Reviewed by:	mckusick
MFC after:	1 week
2011-03-07 22:36:11 +00:00
John Baldwin
8587289fb8 The UFS dirhash code was attempting to update shared state in the dirhash
from multiple threads while holding a shared lock during a lookup operation.
This could result in incorrect ENOENT failures which could then be
permanently stored in the name cache.

Specifically, the dirhash code optimizes the case that a single thread is
walking a directory sequentially opening (or stat'ing) each file.  It uses
state in the dirhash structure to determine if a given lookup is using the
optimization.  If the optimization fails, it disables it and restarts the
lookup.  The problem arises when two threads both attempt the optimization
and fail.  The first thread will restart the loop, but the second thread
will incorrectly think that it did not try the optimization and will only
examine a subset of the directory entires in its hash chain.  As a result,
it may fail to find its directory entry and incorrectly fail with ENOENT.

To make this safe for use with shared locks, simplify the state stored in
the dirhash and move some of the state (the part that determines if the
current thread is trying the optimization) into a local variable.  One
result is that we will now try the optimization more often.  We still
update the value under the shared lock, but it is a single atomic store
similar to i_diroff that is stored in UFS directory i-nodes for the
non-dirhash lookup.

Reviewed by:	kib
MFC after:	1 week
2011-03-07 18:33:29 +00:00
John Baldwin
96e1934a43 Use ffs() to locate free bits in the inode bitmap rather than a loop with
bit shifts.

Reviewed by:	mckusick
MFC after:	1 month
2011-03-04 22:26:41 +00:00
Konstantin Belousov
c30c6a2311 v_mountedhere is a member of the union. Check that the vnodes have
proper type before using the member.

Reported and tested by:	Michael Butler <imb protected-networks net>
2011-02-19 07:47:25 +00:00
Konstantin Belousov
455a6e0ff3 Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. Journal need
to atomically write the block, that can only be guaranteed at the device
sector size, not larger. Attempt to write less then sector size results in
driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with:	pho
Reviewed by:	jeff
Tested by:	bz, pho
2011-02-12 12:52:12 +00:00
Alexander Leidinger
3502f01f20 Wrap long line.
Noticed by:	bz
2011-02-10 08:06:56 +00:00
Alexander Leidinger
3eb6e1317c Add some FEATURE macros for some UFS features.
SU+J is not included as a FEATURE macro:
 - it was not in the tree during the GSoC
 - I do not see an option to en-/disable it in NOTES

Two minor changes where made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by:	Google Summer of Code 2010
Submitted by:	kibab
Reviewed by:	kib
X-MFC after:	to be determined in last commit with code from this project
2011-02-09 15:33:13 +00:00
Matthew D Fleming
e7ceb1e99b Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yeild() as being unnecessary,
   such as in a sysctl handler.

 - move should_yield() and maybe_yield() to kern_synch.c and move the
   prototypes from sys/uio.h to sys/proc.h

 - add a slightly more generic kern_yield() that can replace the
   functionality of uio_yield().

 - replace source uses of uio_yield() with the functional equivalent,
   or in some cases do not change the thread priority when switching.

 - fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

 - instead of using the per-cpu last switched ticks, use a per thread
   variable for should_yield().  With PREEMPTION, the only reasonable
   use of this is to determine if a lock has been held a long time and
   relinquish it.  Without PREEMPTION, this is essentially the same as
   the per-cpu variable.
2011-02-08 00:16:36 +00:00
Matthew D Fleming
08b163fa51 Put the general logic for being a CPU hog into a new function
should_yield().  Use this in various places.  Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after:	1 week
2011-02-02 16:35:10 +00:00
Sergey Kandaurov
b04409f5fc Embed a quota error message (C string) into uprintf() fmt.
While here, fix whitespaces.

Approved by:	kib (mentor)
2011-01-13 16:29:27 +00:00
Matthew D Fleming
fbbb13f962 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
2011-01-12 19:54:19 +00:00
Konstantin Belousov
ac32f1176b Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by:	jeff
Tested by:	pho
2011-01-04 10:25:55 +00:00
Konstantin Belousov
465e3ccdbb Handle missing jremrefs when a directory is renamed overtop of
another, deleting it.  If the directory is removed, UFS always need to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by:	many
Submitted by:	jeff
Tested by:	pho
2010-12-30 10:52:07 +00:00
Konstantin Belousov
42a6fc4385 In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with:	pho
Reviewed by:	jeff, mckusick
Tested by:	mckusick, pho
2010-12-30 10:41:17 +00:00
Konstantin Belousov
8c2a54de80 Add kernel side support for BIO_DELETE/TRIM on UFS.
The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.

Based on the patch by:	mckusick
Reviewed by:	mckusick, pjd
Tested by:	pho
MFC after:	1 month
2010-12-29 12:25:28 +00:00
Konstantin Belousov
d2d6c59245 Move the definition of mkdirlisthd from header to C file.
Reviewed by:	mckusick
Tested by:	pho
2010-12-29 12:16:06 +00:00
Konstantin Belousov
abf6c181e4 Use a proper type for the variable holding the summary size of the inode
data. Otherwise, on 32bit systems, unlinked inode which size is the
multiple of 4GB was not truncated, causing corruption.

Reported by:	brucec
Reviewed by:	mckusick
Tested by:	pho
2010-12-29 11:19:39 +00:00
Kirk McKusick
84ad0a66d0 This patch fixes a soft update panic while running perl 5.12 tests
which produced:

    panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson
2010-12-23 00:38:57 +00:00
Konstantin Belousov
fddd463dc2 Journal start looks up .sujournal file by doing lookup on the root dvp.
As result, failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.

Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.

Initial report by:	Garrett Cooper
Reproduced and tested by:	pho
2010-12-01 21:19:11 +00:00
Peter Holm
bcc5c95b6b First step in fixing the handle_workitem_freeblocks panic.
In collaboration with:	 kib
2010-11-27 20:27:07 +00:00
Kirk McKusick
18709a09ed Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant.
Drop reference to it in mount(8).

MFC:	3 days
2010-11-20 18:40:50 +00:00
Konstantin Belousov
730b63b0c2 Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after:   1 week
X-MFC-note:  keep prtactive symbol in vfs_subr.c
2010-11-19 21:17:34 +00:00
Konstantin Belousov
be913821af The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
  A write of the indirect block causes the freeblks to become
  ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
  then, softdep_disk_write_complete() would reassign the workitem to the
  mount point worklist, causing premature processing of the workitem, or
  journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
  the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by:	jeff
Tested by:	pho
2010-11-11 11:54:01 +00:00
Konstantin Belousov
496fd81362 Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:41:52 +00:00
Konstantin Belousov
d23c72cdb5 In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:38:57 +00:00
Konstantin Belousov
fae5c47dd4 Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:35:42 +00:00
Konstantin Belousov
4e4ff01629 Fix typo. Function is called ffs_blkfree. 2010-11-11 11:26:59 +00:00
John Baldwin
b3e3402d3a Remove unused includes of <sys/mutex.h> and <machine/mutex.h>. 2010-11-09 20:41:10 +00:00
Ivan Voras
8e431dd6f1 Bring vfs.ufs.dirhash_maxmem into the age of the fruitbat and make it
autotuned. It is only an upper bound (the memory is not always allocated)
and the system contains a vm_lowmem handler so nothing will crash and burn
if it's tuned too high.

Reviewed by:	mckusick
2010-10-25 21:46:23 +00:00
Konstantin Belousov
d0cc54f3b4 The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by:	bde
MFC after:    2 weeks
2010-10-10 07:05:47 +00:00
Alan Cox
a03e344a7f M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.
2010-10-02 17:58:57 +00:00
Kirk McKusick
e69bed360f Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
2010-09-29 14:46:57 +00:00
Konstantin Belousov
063045a555 Fix typo in comment. 2010-09-29 07:40:11 +00:00
David E. O'Brien
59b3a4ebb5 Correct some non-code typos. 2010-09-17 09:14:40 +00:00
Kirk McKusick
c0b2efce9e Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by:	jeff Roberson
2010-09-14 18:04:05 +00:00
John Baldwin
3634d5b241 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
Konstantin Belousov
691401eef8 Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
2010-08-12 08:35:24 +00:00
John Baldwin
61e1c19319 Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with:	attilio
2010-07-16 19:52:03 +00:00
John Baldwin
dbfcf8cfea When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO.  This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock.  If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered.  As a result
the thread blocked on the vnode lock may never get woken up.  Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after:	3 days
2010-07-16 19:20:20 +00:00
Jeff Roberson
9f9c8c59ae - Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count.  Previously
   all truncation as a result of unlink happened in the softdep flush
   thread.  This had the effect of being impossible to rate limit properly
   with the journal code.  Now the process issuing unlinks is suspended
   when the journal files.  This has a side-effect of improving rm
   performance by allowing more concurrent work.
 - Handle two cases in inactive, one for effnlink == 0 and another when
   nlink finally reaches 0.
 - Eliminate the SPACECOUNTED related code since the truncation is no
   longer delayed.

Discussed with:	mckusick
2010-07-06 07:11:04 +00:00
Konstantin Belousov
427ef27ec7 Ensure that VOP_ACCESSX is called with exclusively locked vnode for
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since
upgrade drops shared lock when non-blocked upgrade failed, LOR is there.

Reported and tested by:	Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by:	pho
PR:	kern/147890
MFC after:	1 week
2010-06-20 13:35:16 +00:00
Andriy Gapon
d89c217f30 ffs_softdep: change K&R in function defintions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its defintion.

Complaint from:	clang
MFC after:	2 weeks
2010-06-11 18:26:53 +00:00
Konstantin Belousov
db875cbd74 Extend the scope of the lock on the quota file vnode in quotaon() to
cover the initial read by dqopen(). Assert that vnode is locked in
dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps
Giant locked if needed around the call.
2010-06-03 10:24:53 +00:00
Andriy Gapon
0b9626482b ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options.  Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR:		kern/141050
Reported by:	many
Reviewed by:	jh, bde
MFC after:	4 days
2010-05-19 09:32:11 +00:00
Jeff Roberson
f0268739c7 - Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration.  This can lead to a deadlock when we have
   worklist items that cannot be immediately satisfied.

Reported by:	uqs, Dimitry Andric <dimitry@andric.com>

 - Remove some unnecessary debugging code and place some other under
   SUJ_DEBUG.
 - Examine the journal state in softdep_slowdown().
 - Re-format some comments so I may more easily add flag descriptions.
2010-05-19 06:18:01 +00:00
Jeff Roberson
8ef48de888 - Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
 - Don't fsync() vnodes in prealloc if copy on write is in progress.  It
   is not safe to recurse back into the write path here.

Reported by:	Vladimir Grebenschikov <vova@fbsd.ru>
2010-05-07 08:45:21 +00:00
Jeff Roberson
2c3ae115b6 - Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not.  This fixes
   a deadlock that can occur with unlinked but referenced files.
   Journal space and inodedeps were not correctly reclaimed because
   the inode block was not left dirty.

Tested/Reported by:	lwindschuh@googlemail.com
2010-05-07 08:20:56 +00:00
Kirk McKusick
e27ed89aef Merger of the quota64 project into head.
This joint work of Dag-Erling Smørgrav and myself updates the
FFS quota system to support both traditional 32-bit and new 64-bit
quotas (for those of you who want to put 2+Tb quotas on your users).

By default quotas are not compiled into the kernel. To include them
in your kernel configuration you need to specify:

options         QUOTA                   # Enable FFS quotas

If you are already running with the current 32-bit quotas, they
should continue to work just as they have in the past. If you
wish to convert to using 64-bit quotas, use `quotacheck -c 64';
if you wish to revert from 64-bit quotas back to 32-bit quotas,
use `quotacheck -c 32'.

There is a new library of functions to simplify the use of the
quota system, do `man quotafile' for details. If your application
is currently using the quotactl(2), it is highly recommended that
you convert your application to use the quotafile interface.
Note that existing binaries will continue to work.

Special thanks to John Kozubik of rsync.net for getting me
interested in pursuing 64-bit quota support and for funding
part of my development time on this project.
2010-05-07 00:41:12 +00:00
Alan Cox
eb00b276ab Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
Kirk McKusick
945f418ab8 Final update to current version of head in preparation for reintegration. 2010-05-06 17:37:23 +00:00
Alan Cox
5ac59343be Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
Edward Tomasz Napierala
b5f770bd86 Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
Andriy Gapon
deb3b115e2 ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after:	1 week
2010-04-29 10:04:00 +00:00
Jeff Roberson
2bd20091e4 - When canceling jaddrefs they may not yet be in the journal if this is via
a revert call.  In this case don't attempt to remove something that
   has not yet been added.  Otherwise this jaddref must hang around
   to prevent the bitmap write as normal.
2010-04-28 07:57:37 +00:00
Jeff Roberson
3b32573a9f - Fix builds without SOFTUPDATES defined in the kernel config. 2010-04-28 07:26:41 +00:00
Kirk McKusick
a4bf5fb987 Update to current version of head. 2010-04-28 05:33:59 +00:00
Pawel Jakub Dawidek
a8750f2dca Fix build for UFS without SOFTUPDATES. 2010-04-24 07:36:33 +00:00
Jeff Roberson
113db2dddb - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
Konstantin Belousov
5673e3cb08 The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporary unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by:	pho
Reviewed by:	kan
MFC after:	2 weeks
2010-04-20 10:19:27 +00:00
Andriy Gapon
ecaf3257be ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after:	1 week
2010-04-03 08:25:04 +00:00
Kirk McKusick
516ad57b74 Debugging nits found while testing the new 64-bit quota code. 2010-03-16 06:12:30 +00:00
Dag-Erling Smørgrav
1a0fda2b54 IFH@204581 2010-03-04 13:35:57 +00:00
Konstantin Belousov
2950ff259c When ffs_realloccg() failed to allocate bigger fragment and, because
pending blocks are scheduled for removal, goes to retry the (re)allocation,
clear the bp pointer. It might happen that meantime free space is really
exhausted and we are entering nospace: label without bread()ing buffer,
causing stale bp value to be brelse()d again.

Tested by:	pho
    (Producing a scenario to reliably reproduce the
     race appeared to be much harder then fixing the bug)
MFC after:	1 week
2010-02-13 10:34:50 +00:00
Kirk McKusick
81479e688b One last pass to get all the unsigned comparisons correct. 2010-02-11 18:14:53 +00:00
Kirk McKusick
e870d1e6f9 This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR:          133980
MFC after:   2 weeks
2010-02-10 20:10:35 +00:00
Edward Tomasz Napierala
619961810c Remove unused variable. 2010-02-10 18:56:49 +00:00
Edward Tomasz Napierala
fa90da2be6 Return proper error code.
Found with:	clang
2010-01-25 16:09:50 +00:00
Edward Tomasz Napierala
780b61c45b Move out code that does POSIX.1e ACL inheritance into separate routines.
Reviewed by:	rwatson
2010-01-24 15:12:27 +00:00
Kirk McKusick
53298164b8 Cast 64-bit quantity to intptr_t rather than int so as to work properly
with 64-bit architectures (such as amd64).

Reported by:	bz
2010-01-11 22:42:06 +00:00
Kirk McKusick
e268f54cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
Martin Blapp
c2ede4b379 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
Kirk McKusick
445e33e4ac KASSERT that condition raised by Coverity cannot happen.
Found by:	Coverity Prevent (tm)
KASSERT by:	sam
2010-01-07 06:20:07 +00:00
Edward Tomasz Napierala
9340fc72e6 Implement NFSv4 ACL support for UFS.
Reviewed by:	rwatson
2009-12-21 19:39:10 +00:00
Konstantin Belousov
49e3050e6c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
Dag-Erling Smørgrav
0fbc5fbedf Sync with head 2009-09-25 22:45:59 +00:00
Dag-Erling Smørgrav
0bb2e5de60 Further improve comments. 2009-09-25 18:50:33 +00:00
Dag-Erling Smørgrav
91b2a3e204 Improve comments, and remove a bogus 0 id check. 2009-09-25 18:44:34 +00:00
Roman Divacky
e0a770a01d Don't build ufs_gjournal.c at all if UFS_GJOURNAL option is not given
instead of building an almost empty C file.

Approved by:	pjd
Approved by:	ed (mentor, implicit)
2009-09-22 16:22:05 +00:00
Dag-Erling Smørgrav
10b3b54548 Merge from head 2009-09-17 16:16:44 +00:00
Dag-Erling Smørgrav
7d4b968b0f Merge from head up to r188941 (last revision before the USB stack switch) 2009-09-17 13:31:39 +00:00
Brooks Davis
5343b3a28c Allocate space for the group array in a static credential used in
the quota code.  One case was correctly handled in r194498, but
this one was missed.

PR:		kern/138657
Tested by:	PR submitter
MFC after:	3 days
2009-09-17 12:35:13 +00:00
Edward Tomasz Napierala
2ff8f63aa6 Remove useless variable assignment. 2009-09-08 17:23:32 +00:00
Konstantin Belousov
6cc745d2d7 insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by:	tegge
Approved by:	des (pseudofs part)
MFC after:	3 days
2009-09-07 11:55:34 +00:00
Konstantin Belousov
5c61c646a3 The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 week
2009-09-06 11:46:51 +00:00
Konstantin Belousov
165a3b418f When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done syncronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by:	pluknet gmail com
Diagnosed and tested by:	pho
Approved by:	re (rwatson)
MFC after:	1 month
2009-08-14 11:00:38 +00:00
Edward Tomasz Napierala
340263c992 Fix fpathconf(3) on fifos, in effect making ls(1) properly
display '+' on them.  Taken from kern/125613, with cosmetic
changes.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Approved by:	re (kib)
2009-07-02 20:05:21 +00:00
Konstantin Belousov
f1eccd05ec In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
Edward Tomasz Napierala
4bc61fd4ec Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-01 22:30:36 +00:00
Konstantin Belousov
cfba50c070 For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:33 +00:00