1850 Commits

Author SHA1 Message Date
mckusick
498accbc2c Default debugging error messages to off for journaled soft updates sysctls.
Delete limiting on output of these sysctls.

Approved by: re (kib)
2011-07-22 18:03:33 +00:00
mckusick
40b131bdaf Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctls to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.
2011-07-15 16:20:33 +00:00
mckusick
ea2a83a8da Consistently check mount flag (MNTK_SUJ) rather than superblock
flag (FS_SUJ) when determining whether to do journaling-based
operations. The mount flag is set only when journaling is active
while the superblock flag is set to indicate that journaling is to
be used. For example, when the filesystem is mounted read-only, the
journaling may be present (FS_SUJ) but not active (MNTK_SUJ).
Inappropriate checking of the FS_SUJ flag was causing some
journaling actions to be attempted at inappropriate times.
2011-07-14 18:06:13 +00:00
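The distinction drawn in ea2a83a8da can be illustrated with a minimal sketch. The flag values, the `ex_mount` structure, and both helper names below are hypothetical stand-ins, not the kernel's definitions; only the idea (journaling configured in the superblock versus journaling active on the mount) comes from the commit message.

```c
/* Placeholder bit values; the real FS_SUJ and MNTK_SUJ constants differ. */
#define EX_FS_SUJ    0x1u   /* superblock: journaling is to be used */
#define EX_MNTK_SUJ  0x2u   /* mount: journaling is currently active */

/* Hypothetical stand-in for the superblock and mount flag words. */
struct ex_mount {
	unsigned int fs_flags;    /* models the superblock flags */
	unsigned int kern_flags;  /* models the mount's kernel flags */
};

/* Journaling is configured on disk, but may not be running. */
static int
ex_journal_configured(const struct ex_mount *mp)
{
	return ((mp->fs_flags & EX_FS_SUJ) != 0);
}

/* Journaling-based operations should be gated on this check. */
static int
ex_journal_active(const struct ex_mount *mp)
{
	return ((mp->kern_flags & EX_MNTK_SUJ) != 0);
}
```

For a read-only mount modeled as `{ EX_FS_SUJ, 0 }`, the first helper reports that a journal is configured while the second reports that it is not active, which is the case the commit message calls out.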
mckusick
0196d96292 When first creating snapshots, we may free some blocks within it.
These blocks should not have TRIM applied to them.

Submitted by: Kostik Belousov
2011-07-10 05:34:49 +00:00
mckusick
c6e1a97eed Allow disk partitions associated with UFS read-only mounted
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.

Discussed with: Jeff Roberson
2011-07-10 00:41:31 +00:00
kib
08e1095b2d Use 'curthread_pflags' instead of 'thread_pflags' to signify that only
curthread can be operated upon.

Requested by:	attilio
MFC after:	1 week
2011-07-09 15:16:07 +00:00
kib
977361a38b Use helper functions instead of manually managing TDP_INBDFLUSH.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc (previous version)
MFC after:	1 week
2011-07-09 14:42:45 +00:00
jeff
74e3f8c8c5 - Speed up pendingblock processing again. Having too much delay between
ffs_blkfree() and the pending adjustment causes all kinds of
   space related problems.
2011-07-04 22:08:04 +00:00
jeff
0a80dd60a6 - Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now
find themselves on snapshot vnodes.

Reported by:	pho
2011-07-04 21:04:25 +00:00
jeff
07731ef1bc - It is impossible to run request_cleanup() while doing a copyonwrite.
This will most likely cause new block allocations which can recurse
   into request cleanup.
 - While here optimize the ufs locking slightly.  We need only acquire and
   drop once.
 - process_removes() and process_truncates() are also only needed once.
 - Attempt to flush each item on the worklist once but do not loop forever
   if some can not be completed.

Discussed with:	mckusick
2011-07-04 20:53:55 +00:00
jeff
4fa1a63e5a - Fix an inode quota leak. We need to decrement the quota once and only
once.

Tested by:	pho
Reviewed by:	mckusick
2011-07-04 20:52:23 +00:00
mckusick
323608505b Handle the FREEDEP case in softdep_sync_buf().
This fix failed to get added in -r223325.

Submitted by:	Peter Holm
2011-06-29 22:12:43 +00:00
alc
21902be08c Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
jeff
045528c230 - Fix directory count rollbacks by passing the mode to the journal dep
earlier.
 - Add rollback/forward code for frag and cluster accounting.
 - Handle the FREEDEP case in softdep_sync_buf().  (submitted by pho)
2011-06-20 03:25:09 +00:00
mckusick
1bcbc6326a Fixed dereference of a NULL pointer.
Reported by:	Peter Holm
2011-06-18 21:10:03 +00:00
mckusick
d97a856c1b Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c
and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that
file.  No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor
should they as it is totally in-kernel interfaces). For added protection
I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL.

Feedback from:	Bruce Evans and Tai-hwa Liang
2011-06-16 23:40:10 +00:00
avatar
f865e36039 Fixing compilation bustage by introducing another forward declaration.
2011-06-16 05:26:03 +00:00
mckusick
ef6ee3faed Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by:	Jeff Roberson
2011-06-15 23:19:09 +00:00
mckusick
f433c75fa7 With the restructuring of the block reclamation code, the notification
messages for a filesystem being out of space need to be moved so that
they do not print out until after a failed cleanup attempt.

Suggested by:	Jeff Roberson
2011-06-15 18:05:08 +00:00
mckusick
be13f04a4a Missing cleanup case after completion of a snapshot vnode write
claiming a released block.

Submitted by:	Jeff Roberson
Tested by:	Peter Holm
2011-06-15 06:13:08 +00:00
dim
accb907d3b Use alternative, less messy solution to avoid breakage after r223020:
put the snapdata structure between #ifdef _KERNEL guards.

Suggested by:	kib
2011-06-13 16:05:41 +00:00
mckusick
5f2600c6a9 Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by:	Jeff Roberson
Tested by:	Peter Holm
2011-06-12 19:27:05 +00:00
mckusick
abfa1c12cc Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.
2011-06-12 18:46:48 +00:00
jeff
6ba8b7f04c Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

 - Append partial truncation freework structures to indirdeps while
   truncation is proceeding.  These prevent new block pointers from
   becoming valid until truncation completes and serialize truncations.
 - On completion of a partial truncate journal work waits for zeroed
   pointers to hit indirects.
 - softdep_journal_freeblocks() handles last frag allocation and last
   block zeroing.
 - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
   is only implemented in one place.
 - Block allocation failure handling moved up one level so it does not
   proceed with buf locks held.  This permits us to do more extensive
   reclaims when filesystem space is exhausted.
 - softdep_sync_metadata() is broken into two parts, the first executes
   once at the start of ffs_syncvnode() and flushes truncations and
   inode dependencies.  The second is called on each locked buf.  This
   eliminates excessive looping and rollbacks.
 - Improve the mechanism in process_worklist_item() that handles
   acquiring vnode locks for handle_workitem_remove() so that it works
   more generally and does not loop excessively over the same worklist
   items on each call.
 - Don't corrupt directories by zeroing the tail in fsck.  This is only
   done for regular files.
 - Push a fsync complete record for files that need it so the checker
   knows a truncation in the journal is no longer valid.

Discussed with:	mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by:	pho
2011-06-10 22:48:35 +00:00
jeff
aee45716ce - Add support for referencing quota structures without needing the inode
pointer for softupdates.

Submitted by:	mckusick
2011-06-10 22:19:44 +00:00
jeff
2cdd0660a5 - If the fsync in ufs_direnter fails SUJ can later panic because we have
partially added a name.  Allow ufs_direnter() to continue in the
   hopes that it is a transient error.  If it is not, the directory
   is corrupted already from IO errors and writing this new block
   is not likely to make things worse.
2011-06-10 22:18:25 +00:00
mckusick
6e1743e29a Grammar fix in comment.
Eliminate one (of several) possible conflicting buffer locks when
trying to reclaim blocks. Rest of fix to be incorporated as part
of SUJ update by jeff.

Pointed out by: Kostik Belousov
2011-06-05 22:36:30 +00:00
mckusick
75b483b1f1 Due to a lag in updating the fs_pendinginodes count, we cannot depend
on it to decide whether we should try to reclaim inodes when we run
short.

Discovered by: Peter Holm
2011-05-28 15:07:29 +00:00
mckusick
6d52186e53 The check for whether a block is going to be claimed by a snapshot
needs to happen before we notify the underlying layer that it is
being freed.
2011-05-26 23:56:58 +00:00
rmacklem
71a9ed7d15 Fix the ufs/ffs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by:	mckusick
2011-05-22 20:39:07 +00:00
rmacklem
fbb8a5e8ec Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
mdf
bbbc4c5455 Use a name instead of a magic number for kern_yield(9) when the priority
should not change.  Fetch the td_user_pri under the thread lock.  This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by:	Jason Behmer < jason DOT behmer AT isilon DOT com >
2011-05-13 05:27:58 +00:00
kib
b58a6812d7 Fix typos.
Noted by:	Fabian Keil <freebsd-listen fabiankeil de>
Pointy hat to:	kib
MFC after:	1 week
2011-04-30 22:46:02 +00:00
kib
4cc947d01c Clarify the comment.
MFC after:	1 week
2011-04-30 13:49:03 +00:00
kib
38508e2c91 VFS sometimes is unable to inactivate a vnode when vnode use count
goes to zero. E.g., the vnode might be only shared-locked at the time of
vput() call. Such vnodes are kept in the hash, so they can be found later.

If ffs_valloc() allocated an inode that has its vnode cached in hash, and
still owing the inactivation, then vget() call from ffs_valloc() clears
VI_OWEINACT, and then the vnode is reused for the newly allocated inode.

The problem is, the vnode is not reclaimed before it is put to the new
use. ffs_valloc() recycles vnode vm object, but this is not enough.
In particular, at least v_vflag should be cleared, and several bits of
UFS state need to be removed.

It is very inconvenient to call vgone() at this point. Instead, move
some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(),
and call the helper from VOP_RECLAIM and ffs_valloc().

Reviewed by:	mckusick
Tested by:	pho
MFC after:	3 weeks
2011-04-24 10:47:56 +00:00
jeff
e74a74d593 - Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
 - Use dep_current[] in place of specific dependency counts.  This is
   automatically maintained when workitems are allocated and has
   less risk of becoming incorrect.
2011-04-11 01:43:59 +00:00
jeff
2f8e8fa587 Fix a long standing SUJ performance problem:
- Keep a hash of indirect blocks that have recently been freed and are
   still referenced in the journal.
 - Lookup blocks in this hash before forcing a new block write to wait on
   the journal entry to hit the disk.  This is only necessary to avoid
   confusion between old identities as indirects and new identities as
   file blocks.
 - Don't free jseg structures until the journal has written a record that
   invalidates it.  This keeps the indirect block information around for
   as long as is required to be safe.
 - Force an empty journal block write when required to flush out stale
   journal data that is simply waiting for the oldest valid sequence
   number to advance beyond it.
2011-04-10 03:49:53 +00:00
jeff
86570b7a27 - Don't invalidate jnewblks immediately upon discovering that the block
will be removed.  Permit the journal to proceed so that we don't leave
   a rollback in a cg for a very long time as this can cause terrible perf
   problems in low memory situations.

Tested by:      pho
2011-04-07 03:19:10 +00:00
mckusick
4d2789f22f Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks
2011-04-05 21:26:05 +00:00
jeff
f31f8adf99 Fix problems that manifested from filesystem full conditions:
- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
   the jaddref so we can make assumptions about where the dotaddref is on
   the list.  cancel_jaddref() does not always remove items from the list
   anymore.
 - Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
   was never set.  This ensures that dependencies will continue to be
   processed on the inowait/bufwait list and is more an artifact of
   the structure of the code than a pure ordering problem.
 - Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
   appropriately.  This normally occurs when the refs are added to the
   journal but if they are canceled before this point the state would
   never be set and the dependency could never be freed.

Reported by:	pho
Tested by:	pho
2011-04-02 21:52:58 +00:00
kib
13251e8d3f Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.
Submitted by:	Aleksandr Rybalko <ray dlink ua>
2011-03-28 12:39:48 +00:00
mckusick
becd8ff499 Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm
2011-03-23 05:13:54 +00:00
kib
a174b18d0c Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by:	bde
MFC after:	1 week
2011-03-20 21:05:09 +00:00
kib
329130799d Remove the #if defined(FFS) || defined(IFS) braces around the calls to
ffs_snapgone(). The ufs.ko module is not built with the FFS define, so the
snapshot inode number slots in the superblock were never freed, nor was the
reference on the snapshot vnode.

IFS was removed several years ago, and UFS/FFS separation was not
maintained for real.

Reported, analyzed and tested by:	Yamagi Burmeister <lists yamagi org>
MFC after:	3 days
2011-03-17 11:23:12 +00:00
kib
c0ee0d94dd Simplify uses of the web of pointers.
Reviewed by:	mckusick
MFC after:	1 week
2011-03-07 22:36:11 +00:00
jhb
d219d97f45 The UFS dirhash code was attempting to update shared state in the dirhash
from multiple threads while holding a shared lock during a lookup operation.
This could result in incorrect ENOENT failures which could then be
permanently stored in the name cache.

Specifically, the dirhash code optimizes the case that a single thread is
walking a directory sequentially opening (or stat'ing) each file.  It uses
state in the dirhash structure to determine if a given lookup is using the
optimization.  If the optimization fails, it disables it and restarts the
lookup.  The problem arises when two threads both attempt the optimization
and fail.  The first thread will restart the loop, but the second thread
will incorrectly think that it did not try the optimization and will only
examine a subset of the directory entries in its hash chain.  As a result,
it may fail to find its directory entry and incorrectly fail with ENOENT.

To make this safe for use with shared locks, simplify the state stored in
the dirhash and move some of the state (the part that determines if the
current thread is trying the optimization) into a local variable.  One
result is that we will now try the optimization more often.  We still
update the value under the shared lock, but it is a single atomic store
similar to i_diroff that is stored in UFS directory i-nodes for the
non-dirhash lookup.

Reviewed by:	kib
MFC after:	1 week
2011-03-07 18:33:29 +00:00
jhb
d9f2a53257 Use ffs() to locate free bits in the inode bitmap rather than a loop with
bit shifts.

Reviewed by:	mckusick
MFC after:	1 month
2011-03-04 22:26:41 +00:00
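As a rough illustration of the change described in d9f2a53257, the sketch below contrasts a bit-shift loop with a call to the POSIX `ffs()` function for finding the first free slot in one byte of an in-use bitmap. The helper names and the one-byte granularity are assumptions made for the example; the kernel code itself is not shown in this log.

```c
#include <strings.h>	/* ffs() */

/*
 * Hypothetical helper: return the index (0-7) of the first clear bit in
 * a byte of an "in use" bitmap, or -1 if every slot is taken.  This is
 * the loop-with-shifts form.
 */
static int
ex_first_free_loop(unsigned char used)
{
	int bit;

	for (bit = 0; bit < 8; bit++)
		if ((used & (1u << bit)) == 0)
			return (bit);
	return (-1);
}

/*
 * Same result using ffs().  Inverting the byte turns free slots into
 * set bits; ffs() returns a 1-based index, or 0 when no bit is set,
 * so subtracting one yields the 0-based index or -1.
 */
static int
ex_first_free_ffs(unsigned char used)
{
	return (ffs((~used) & 0xff) - 1);
}
```

Both helpers agree for every possible byte value; the `ffs()` form simply hands the search to a library routine.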
kib
1d3d8b3035 v_mountedhere is a member of the union. Check that the vnodes have
proper type before using the member.

Reported and tested by:	Michael Butler <imb protected-networks net>
2011-02-19 07:47:25 +00:00
kib
c95c81bfad Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard-coding a 512-byte sector size. The journal
needs to write each block atomically, which can only be guaranteed at the
device sector size, not larger. Attempting to write less than the sector size
results in driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with:	pho
Reviewed by:	jeff
Tested by:	bz, pho
2011-02-12 12:52:12 +00:00
netchild
2fc9b4286f Wrap long line.
Noticed by:	bz
2011-02-10 08:06:56 +00:00
netchild
2e128b8ced Add some FEATURE macros for some UFS features.
SU+J is not included as a FEATURE macro:
 - it was not in the tree during the GSoC
 - I do not see an option to en-/disable it in NOTES

Two minor changes were made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availability if
needed.

Sponsored by:	Google Summer of Code 2010
Submitted by:	kibab
Reviewed by:	kib
X-MFC after:	to be determined in last commit with code from this project
2011-02-09 15:33:13 +00:00
mdf
33ee365b55 Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yield() as being unnecessary,
   such as in a sysctl handler.

 - move should_yield() and maybe_yield() to kern_synch.c and move the
   prototypes from sys/uio.h to sys/proc.h

 - add a slightly more generic kern_yield() that can replace the
   functionality of uio_yield().

 - replace source uses of uio_yield() with the functional equivalent,
   or in some cases do not change the thread priority when switching.

 - fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

 - instead of using the per-cpu last switched ticks, use a per thread
   variable for should_yield().  With PREEMPTION, the only reasonable
   use of this is to determine if a lock has been held a long time and
   relinquish it.  Without PREEMPTION, this is essentially the same as
   the per-cpu variable.
2011-02-08 00:16:36 +00:00
mdf
b291e9a365 Put the general logic for being a CPU hog into a new function
should_yield().  Use this in various places.  Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after:	1 week
2011-02-02 16:35:10 +00:00
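The check-and-yield pattern that b291e9a365 describes can be sketched in plain C as follows. Everything here is hypothetical: the `ex_thread` structure, the tick arguments, and the `quantum` threshold are illustrative stand-ins, and the sketch does not reproduce how the kernel decides that a thread has become a CPU hog.

```c
/* Hypothetical state: the tick at which the thread last switched. */
struct ex_thread {
	unsigned long last_switch_tick;
};

/* Report whether the thread has run for at least "quantum" ticks. */
static int
ex_should_yield(const struct ex_thread *td, unsigned long now,
    unsigned long quantum)
{
	return (now - td->last_switch_tick >= quantum);
}

/*
 * Common check-and-yield case.  Returns 1 if a yield happened.  A real
 * kernel would switch threads here; the sketch only records the tick.
 */
static int
ex_maybe_yield(struct ex_thread *td, unsigned long now,
    unsigned long quantum)
{
	if (!ex_should_yield(td, now, quantum))
		return (0);
	td->last_switch_tick = now;
	return (1);
}
```

A long-running loop would call `ex_maybe_yield()` once per iteration instead of counting to a fixed number of iterations before yielding.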
pluknet
393b39d5b3 Embed a quota error message (C string) into uprintf() fmt.
While here, fix whitespaces.

Approved by:	kib (mentor)
2011-01-13 16:29:27 +00:00
mdf
f6a71a40b2 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
2011-01-12 19:54:19 +00:00
kib
86eafd5cfb Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by:	jeff
Tested by:	pho
2011-01-04 10:25:55 +00:00
kib
64276929de Handle missing jremrefs when a directory is renamed overtop of
another, deleting it.  If the directory is removed, UFS always needs to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by:	many
Submitted by:	jeff
Tested by:	pho
2010-12-30 10:52:07 +00:00
kib
44467ec2f8 In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with:	pho
Reviewed by:	jeff, mckusick
Tested by:	mckusick, pho
2010-12-30 10:41:17 +00:00
kib
17dccd1898 Add kernel side support for BIO_DELETE/TRIM on UFS.
The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since the disk queue is reordered, a data block is marked as free in the
bitmap only after the TRIM command has completed. Due to the need to sleep
waiting for I/O to finish, the TRIM bio_done routine schedules a taskqueue
task to set the bitmap bit.

Based on the patch by:	mckusick
Reviewed by:	mckusick, pjd
Tested by:	pho
MFC after:	1 month
2010-12-29 12:25:28 +00:00
kib
e8862739b9 Move the definition of mkdirlisthd from header to C file.
Reviewed by:	mckusick
Tested by:	pho
2010-12-29 12:16:06 +00:00
kib
305388e45b Use a proper type for the variable holding the summary size of the inode
data. Otherwise, on 32-bit systems, an unlinked inode whose size is a
multiple of 4GB was not truncated, causing corruption.

Reported by:	brucec
Reviewed by:	mckusick
Tested by:	pho
2010-12-29 11:19:39 +00:00
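The failure mode fixed by 305388e45b can be reproduced in miniature. In the sketch below, the helper names and the choice of `uint32_t` versus `uint64_t` are assumptions for illustration (the log does not show the actual variable or its type); the point is only that a size that is an exact multiple of 4GB collapses to zero when stored in a 32-bit variable.

```c
#include <stdint.h>

/*
 * Hypothetical check: does the inode hold data that must be truncated?
 * Storing the size in a 32-bit variable discards the upper bits, so a
 * size that is a multiple of 4GB looks like zero.
 */
static int
ex_has_data_narrow(uint64_t inode_size)
{
	uint32_t total = (uint32_t)inode_size;

	return (total != 0);
}

/* The same check with a variable wide enough for the full size. */
static int
ex_has_data_wide(uint64_t inode_size)
{
	uint64_t total = inode_size;

	return (total != 0);
}
```

With a size of exactly 4GB, the narrow version reports that there is nothing to truncate while the wide version does not, which matches the symptom described in the commit message.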
mckusick
e95566a6fe This patch fixes a soft update panic while running perl 5.12 tests
which produced:

    panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson
2010-12-23 00:38:57 +00:00
kib
e3e1ae2c87 Journal start looks up .sujournal file by doing lookup on the root dvp.
As a result, a failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.

Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.

Initial report by:	Garrett Cooper
Reproduced and tested by:	pho
2010-12-01 21:19:11 +00:00
pho
287a38757f First step in fixing the handle_workitem_freeblocks panic.
In collaboration with:	 kib
2010-11-27 20:27:07 +00:00
mckusick
f26669fe8c Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant.
Drop reference to it in mount(8).

MFC:	3 days
2010-11-20 18:40:50 +00:00
kib
7980fb6d3a Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seem to be unused, and the reported
situation is normal for the forced unmount.

MFC after:   1 week
X-MFC-note:  keep prtactive symbol in vfs_subr.c
2010-11-19 21:17:34 +00:00
kib
dadf5cd065 The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
  A write of the indirect block causes the freeblks to become
  ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
  then, softdep_disk_write_complete() would reassign the workitem to the
  mount point worklist, causing premature processing of the workitem, or
  journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
  the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by:	jeff
Tested by:	pho
2010-11-11 11:54:01 +00:00
kib
ac0984a653 Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:41:52 +00:00
kib
90d9ba4d7c In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:38:57 +00:00
kib
a16cecf311 Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:35:42 +00:00
kib
57488f7bec Fix typo. Function is called ffs_blkfree.
2010-11-11 11:26:59 +00:00
jhb
c016e5df49 Remove unused includes of <sys/mutex.h> and <machine/mutex.h>.
2010-11-09 20:41:10 +00:00
ivoras
4dfcef74e1 Bring vfs.ufs.dirhash_maxmem into the age of the fruitbat and make it
autotuned. It is only an upper bound (the memory is not always allocated)
and the system contains a vm_lowmem handler so nothing will crash and burn
if it's tuned too high.

Reviewed by:	mckusick
2010-10-25 21:46:23 +00:00
kib
4036cd070d The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by:	bde
MFC after:    2 weeks
2010-10-10 07:05:47 +00:00
alc
e9dc33bfce M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.
2010-10-02 17:58:57 +00:00
mckusick
bca797c285 Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
2010-09-29 14:46:57 +00:00
kib
bcef52d3bf Fix typo in comment.
2010-09-29 07:40:11 +00:00
obrien
55122bfc2d Correct some non-code typos.
2010-09-17 09:14:40 +00:00
mckusick
dd70ac636a Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by:	Jeff Roberson
2010-09-14 18:04:05 +00:00
jhb
d4890c88b0 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
kib
60a46e5ff9 Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
2010-08-12 08:35:24 +00:00
jhb
7f218ea7f4 Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with:	attilio
2010-07-16 19:52:03 +00:00
jhb
ea417bf09a When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO.  This occurs after the vnode has been inserted into a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock.  If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered.  As a result
the thread blocked on the vnode lock may never get woken up.  Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after:	3 days
2010-07-16 19:20:20 +00:00
jeff
285c3f355c - Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count.  Previously
   all truncation as a result of unlink happened in the softdep flush
   thread.  This had the effect of being impossible to rate limit properly
   with the journal code.  Now the process issuing unlinks is suspended
   when the journal fills.  This has a side-effect of improving rm
   performance by allowing more concurrent work.
 - Handle two cases in inactive, one for effnlink == 0 and another when
   nlink finally reaches 0.
 - Eliminate the SPACECOUNTED related code since the truncation is no
   longer delayed.

Discussed with:	mckusick
2010-07-06 07:11:04 +00:00
kib
e735ee5c8d Ensure that VOP_ACCESSX is called with exclusively locked vnode for
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since the
upgrade drops the shared lock when a non-blocking upgrade fails, a LOR results.

Reported and tested by:	Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by:	pho
PR:	kern/147890
MFC after:	1 week
2010-06-20 13:35:16 +00:00
avg
13985611dd ffs_softdep: change K&R in function definitions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its definition.

Complaint from:	clang
MFC after:	2 weeks
2010-06-11 18:26:53 +00:00
kib
881c9b1a5c Extend the scope of the lock on the quota file vnode in quotaon() to
cover the initial read by dqopen(). Assert that vnode is locked in
dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps
Giant locked if needed around the call.
2010-06-03 10:24:53 +00:00
avg
4e8fc6f387 ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options.  Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR:		kern/141050
Reported by:	many
Reviewed by:	jh, bde
MFC after:	4 days
2010-05-19 09:32:11 +00:00
jeff
ebb7d74dae - Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration.  This can lead to a deadlock when we have
   worklist items that cannot be immediately satisfied.

Reported by:	uqs, Dimitry Andric <dimitry@andric.com>

 - Remove some unnecessary debugging code and place some other under
   SUJ_DEBUG.
 - Examine the journal state in softdep_slowdown().
 - Re-format some comments so I may more easily add flag descriptions.
2010-05-19 06:18:01 +00:00
jeff
8fb90eedbc - Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
 - Don't fsync() vnodes in prealloc if copy on write is in progress.  It
   is not safe to recurse back into the write path here.

Reported by:	Vladimir Grebenschikov <vova@fbsd.ru>
2010-05-07 08:45:21 +00:00
jeff
0b3e023908 - Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not.  This fixes
   a deadlock that can occur with unlinked but referenced files.
   Journal space and inodedeps were not correctly reclaimed because
   the inode block was not left dirty.

Tested/Reported by:	lwindschuh@googlemail.com
2010-05-07 08:20:56 +00:00
mckusick
e95ff34dac Merger of the quota64 project into head.
This joint work of Dag-Erling Smørgrav and myself updates the
FFS quota system to support both traditional 32-bit and new 64-bit
quotas (for those of you who want to put 2+Tb quotas on your users).

By default quotas are not compiled into the kernel. To include them
in your kernel configuration you need to specify:

options         QUOTA                   # Enable FFS quotas

If you are already running with the current 32-bit quotas, they
should continue to work just as they have in the past. If you
wish to convert to using 64-bit quotas, use `quotacheck -c 64';
if you wish to revert from 64-bit quotas back to 32-bit quotas,
use `quotacheck -c 32'.

There is a new library of functions to simplify the use of the
quota system, do `man quotafile' for details. If your application
is currently using the quotactl(2), it is highly recommended that
you convert your application to use the quotafile interface.
Note that existing binaries will continue to work.

Special thanks to John Kozubik of rsync.net for getting me
interested in pursuing 64-bit quota support and for funding
part of my development time on this project.
2010-05-07 00:41:12 +00:00
alc
fecc56fac1 Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
mckusick
b25e55dcc5 Final update to current version of head in preparation for reintegration. 2010-05-06 17:37:23 +00:00
alc
5c7ca3ee73 Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
trasz
402e3baade Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
avg
043deeb564 ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after:	1 week
2010-04-29 10:04:00 +00:00
jeff
47b1b89a95 - When canceling jaddrefs they may not yet be in the journal if this is via
a revert call.  In this case don't attempt to remove something that
   has not yet been added.  Otherwise this jaddref must hang around
   to prevent the bitmap write as normal.
2010-04-28 07:57:37 +00:00
jeff
564f436237 - Fix builds without SOFTUPDATES defined in the kernel config. 2010-04-28 07:26:41 +00:00
mckusick
3a0f5972a0 Update to current version of head. 2010-04-28 05:33:59 +00:00
pjd
57ec1f2624 Fix build for UFS without SOFTUPDATES. 2010-04-24 07:36:33 +00:00
jeff
a574495410 - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
kib
d5b92466a9 The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporarily unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by:	pho
Reviewed by:	kan
MFC after:	2 weeks
2010-04-20 10:19:27 +00:00
avg
d488f0b549 ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after:	1 week
2010-04-03 08:25:04 +00:00
mckusick
f63b97928b Debugging nits found while testing the new 64-bit quota code. 2010-03-16 06:12:30 +00:00
des
834fb25a9e IFH@204581 2010-03-04 13:35:57 +00:00
kib
d068222571 When ffs_realloccg() fails to allocate a bigger fragment and, because
pending blocks are scheduled for removal, goes to retry the (re)allocation,
clear the bp pointer. It might happen that in the meantime free space is really
exhausted and we enter the nospace: label without bread()ing the buffer,
causing the stale bp value to be brelse()d again.

Tested by:	pho
    (Producing a scenario to reliably reproduce the
     race appeared to be much harder than fixing the bug)
MFC after:	1 week
2010-02-13 10:34:50 +00:00
mckusick
e7471d443b One last pass to get all the unsigned comparisons correct. 2010-02-11 18:14:53 +00:00
mckusick
d533f2ac8c This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR:          133980
MFC after:   2 weeks
2010-02-10 20:10:35 +00:00
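The signed-versus-unsigned problem described in d533f2ac8c can be illustrated with a minimal sketch. This is not the FFS code: the two helper functions and their names are hypothetical, and the conversion of a large unsigned value to `int32_t` is assumed to wrap to a negative number, as it does on common compilers.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not taken from the FreeBSD sources.
 * Both answer "is this inode number below the filesystem's limit?".
 */

/* Buggy variant: the inode number passes through a signed 32-bit int. */
int ino_in_range_signed(uint32_t ino, uint32_t max_ino)
{
    /* Values of 2^31 and above become negative here (assumed wraparound). */
    int32_t signed_ino = (int32_t)ino;

    return signed_ino >= 0 && (int64_t)signed_ino < (int64_t)max_ino;
}

/* Fixed variant: the comparison stays unsigned throughout. */
int ino_in_range_unsigned(uint32_t ino, uint32_t max_ino)
{
    return ino < max_ino;
}
```

With an inode number just above 2^31, the signed variant rejects a value that the unsigned variant accepts, which is the kind of discrepancy that only appears on very large filesystems.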
trasz
bf6995c4bb Remove unused variable. 2010-02-10 18:56:49 +00:00
trasz
0383f8d8bb Return proper error code.
Found with:	clang
2010-01-25 16:09:50 +00:00
trasz
f346ef85e4 Move out code that does POSIX.1e ACL inheritance into separate routines.
Reviewed by:	rwatson
2010-01-24 15:12:27 +00:00
mckusick
94b44c0969 Cast 64-bit quantity to intptr_t rather than int so as to work properly
with 64-bit architectures (such as amd64).

Reported by:	bz
2010-01-11 22:42:06 +00:00
mckusick
0cddeb2cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate privilege.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
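The verify-then-change behaviour that 0cddeb2cb4 describes for setdotdot(oldvalue, newvalue) can be sketched as a small in-memory model. This is only an illustration: `struct dir_model`, `model_setdotdot()` and the choice of EINVAL as the failure code are assumptions, not the kernel's sysctl interface.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a directory: only its ".." inode number. */
struct dir_model {
    uint32_t dotdot;
};

/*
 * Change ".." from oldvalue to newvalue, but only if the directory
 * still holds oldvalue. Returns 0 on success; EINVAL (an assumed
 * error code) if the directory changed since the checker looked at it.
 */
int model_setdotdot(struct dir_model *d, uint32_t oldvalue, uint32_t newvalue)
{
    if (d->dotdot != oldvalue)
        return EINVAL;      /* state no longer matches: refuse to modify */
    d->dotdot = newvalue;
    return 0;
}
```

Passing the expected old value means a checker working from a snapshot cannot overwrite a directory that the live filesystem has modified in the meantime.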
mbr
7450f52a57 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
mckusick
3d4c810fbe KASSERT that condition raised by Coverity cannot happen.
Found by:	Coverity Prevent (tm)
KASSERT by:	sam
2010-01-07 06:20:07 +00:00
trasz
f04a989f2d Implement NFSv4 ACL support for UFS.
Reviewed by:	rwatson
2009-12-21 19:39:10 +00:00
kib
b79e14054c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
des
bf5117185e Sync with head 2009-09-25 22:45:59 +00:00
des
b79ff8160a Further improve comments. 2009-09-25 18:50:33 +00:00
des
2c6fa42d07 Improve comments, and remove a bogus 0 id check. 2009-09-25 18:44:34 +00:00
rdivacky
f3b70d313a Don't build ufs_gjournal.c at all if UFS_GJOURNAL option is not given
instead of building an almost empty C file.

Approved by:	pjd
Approved by:	ed (mentor, implicit)
2009-09-22 16:22:05 +00:00
des
9ed1a4b5eb Merge from head 2009-09-17 16:16:44 +00:00
des
7ee29ca499 Merge from head up to r188941 (last revision before the USB stack switch) 2009-09-17 13:31:39 +00:00
brooks
e19a3fa312 Allocate space for the group array in a static credential used in
the quota code.  One case was correctly handled in r194498, but
this one was missed.

PR:		kern/138657
Tested by:	PR submitter
MFC after:	3 days
2009-09-17 12:35:13 +00:00
trasz
a7104567d1 Remove useless variable assignment. 2009-09-08 17:23:32 +00:00
kib
30f476628e insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by:	tegge
Approved by:	des (pseudofs part)
MFC after:	3 days
2009-09-07 11:55:34 +00:00
kib
2e1ddcb566 The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 week
2009-09-06 11:46:51 +00:00
kib
3658df033e When a UFS node is truncated to zero length, e.g. by an explicit
truncate(2) call, or by being removed or truncated on open, either a
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done synchronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do a huge
number of truncations to exhaust kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by:	pluknet gmail com
Diagnosed and tested by:	pho
Approved by:	re (rwatson)
MFC after:	1 month
2009-08-14 11:00:38 +00:00
trasz
7ce4ab7ff8 Fix fpathconf(3) on fifos, in effect making ls(1) properly
display '+' on them.  Taken from kern/125613, with cosmetic
changes.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Approved by:	re (kib)
2009-07-02 20:05:21 +00:00
kib
350f96b4bf In vn_vget_ino() and its inline equivalents, mnt_ref() the mount point
around the sequence that drops the vnode lock and then busies the mount point.
Not having a locked vnode or a direct reference to the mp allows a
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
trasz
dcdba7b2e3 Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-01 22:30:36 +00:00
kib
4cf230ed17 For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:33 +00:00
kib
c424611d79 Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:00 +00:00
snb
5d2850ae03 Fix a bug reported by pho@ where one can induce a panic by decreasing
vfs.ufs.dirhash_maxmem below the current amount of memory used by dirhash. When
ufsdirhash_build() is called with the memory in use greater than dirhash_maxmem,
it attempts to free up memory by calling ufsdirhash_recycle(). If successful in
freeing enough memory, ufsdirhash_recycle() leaves the dirhash list locked. But
at this point in ufsdirhash_build(), the list is not explicitly unlocked after
the call(s) to ufsdirhash_recycle(). When we next attempt to lock the dirhash
list, we will get a "panic: _mtx_lock_sleep: recursed on non-recursive mutex
dirhash list".

Tested by:	pho
Approved by:	dwmalone (mentor)
MFC after:	3 weeks
2009-06-25 20:40:13 +00:00
brooks
f53c1c309d Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care of the details of allocating
the groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for ensuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
snb
af1efe0490 Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix
a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over
dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a
dirhash is destroyed.

Suggested by:	pjd
Tested by:      pho
Approved by:	dwmalone (mentor)
2009-06-17 18:55:29 +00:00
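The reason af1efe0490 switches to TAILQ_FOREACH_SAFE can be shown with a small userland sketch. The `struct entry` type and the `add_entry()` and `prune()` helpers are hypothetical and unrelated to the dirhash structures; the fallback macro definition is an assumption for systems whose `<sys/queue.h>` lacks the _SAFE variant.

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for <sys/queue.h> versions without the _SAFE iterator. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)              \
    for ((var) = TAILQ_FIRST(head);                             \
         (var) && ((tvar) = TAILQ_NEXT(var, field), 1);         \
         (var) = (tvar))
#endif

/* Hypothetical list element. */
struct entry {
    int score;
    TAILQ_ENTRY(entry) link;
};
TAILQ_HEAD(entry_list, entry);

/* Append a new entry; returns 0 on success, -1 if allocation fails. */
int add_entry(struct entry_list *head, int score)
{
    struct entry *e = malloc(sizeof(*e));

    if (e == NULL)
        return -1;
    e->score = score;
    TAILQ_INSERT_TAIL(head, e, link);
    return 0;
}

/*
 * Free every entry whose score is below threshold and return how many
 * were freed. The next pointer is saved in tmp before the body runs,
 * so removing and freeing e does not break the iteration.
 */
int prune(struct entry_list *head, int threshold)
{
    struct entry *e, *tmp;
    int freed = 0;

    TAILQ_FOREACH_SAFE(e, head, link, tmp) {
        if (e->score < threshold) {
            TAILQ_REMOVE(head, e, link);
            free(e);
            freed++;
        }
    }
    return freed;
}
```

A plain TAILQ_FOREACH would read the next pointer out of the element that was just freed, which is why the safe variant is needed whenever the loop body may destroy the current element.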
kib
b8351fcda2 Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by:	jhb
MFC after:	1 week
2009-06-16 15:13:45 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
snb
7f32f3f2a1 Add vm_lowmem event handler for dirhash. This will cause dirhashes to be
deleted when the system is low on memory. This ought to allow an increase to
vfs.ufs.dirhash_maxmem on machines that have lots of memory, without
degrading performance by having too much memory reserved for dirhash when
other things need it. The default value for dirhash_maxmem is being kept at
2MB for now, though.

This work was mostly done during the 2008 Google Summer of Code.

Approved by:	dwmalone (mentor), re
MFC after:	3 months
2009-06-03 09:44:22 +00:00
attilio
44c490ae17 Handle lock recursion differently by always checking against LO_RECURSABLE
instead of the lock's own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
jamie
a013e0afcb Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
trasz
fb57d2691e Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide
compatibility interfaces in both kernel and libc.

Reviewed by:	rwatson
2009-05-22 15:56:43 +00:00
alc
dc942dabcf Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with:	tegge
2009-05-17 20:26:00 +00:00
attilio
1dcb84131b Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and related parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavily changed by this commit so third-party modules need
to be recompiled.  Bump __FreeBSD_version in order to signal such
a situation.
2009-05-11 15:33:26 +00:00
kan
7b57a857b7 Do not embed struct ucred into larger netcred parent structures.
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by:	John Hickey
PR:		kern/133439
Reviewed by:	dfr, kib
2009-05-09 18:09:17 +00:00
rmacklem
84d9dc09c0 Change the semantics of i_modrev/va_filerev to what is required for
the nfsv4 Change attribute. There are 2 changes:
 	1 - The value now changes on metadata changes as well as data
 	    modifications (incremented for IN_CHANGE instead of IN_UPDATE).
 	2 - It is now saved in spare space in the on-disk i-node so that it
 	    survives a crash.
 	Since va_filerev is not passed out into user space, the only current
 	use of va_filerev is in the nfs server, which uses it as the directory
 	cookie verifier. Since this verifier is only passed back to the server
 	by a client verbatim and then the server doesn't check it, changing the
 	semantics should not break anything currently in FreeBSD.

Reviewed by:	bde
Approved by:	kib (mentor)
2009-04-27 16:46:16 +00:00
kib
24114749aa In ufs_checkpath(), recheck that '..' still points to the inode with
the same inode number after VFS_VGET() and relock of the vp. If '..'
changed, redo the lookup. To reduce code duplication, move the code to
read '..' dirent into the static helper function ufs_dir_dd_ino().

Supply the source inode number as an argument to ufs_checkpath() instead
of the source inode itself. The inode is unlocked, thus it might be
reclaimed, causing accesses to the freed memory.

Use vn_vget_ino() to get the '..' vnode by its inode number, instead of
directly coding VFS_VGET() and relock, to properly busy the mount point
while vp lock is dropped.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 month
2009-04-20 14:36:01 +00:00
kib
7ee6b427ae When verifying '..' after VFS_VGET() in ufs_lookup(), do not return
error if '..' is still there but changed between lookup and check.
Start relookup instead. Rename is supposed to change '..' reference
atomically, so transient failures introduced by r191137 are wrong.

While rearranging the code to allow lookup restart in ufs_lookup(),
remove the comment that only distracts the reader.

Noted and reviewed by:	tegge
Also reported by:	pho
MFC after:	1 month
2009-04-19 05:34:07 +00:00
trasz
858b10f6e2 Use acl_alloc() and acl_free() instead of using uma(9) directly.
This will make switching to malloc(9) easier; also, it would be
necessary to add these routines if/when we implement variable-size
ACLs.
2009-04-18 16:47:33 +00:00