1876 Commits

Author SHA1 Message Date
alc
e9dc33bfce M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.
2010-10-02 17:58:57 +00:00
mckusick
bca797c285 Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
2010-09-29 14:46:57 +00:00
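The declaration pattern described in bca797c285 can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions, not the committed kernel code: the KASSERT macro below is a simplified stand-in built on assert() that takes a plain string, and sum_entries() with its `checked` counter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's KASSERT (an assumption made for
 * this sketch): it expands to nothing unless INVARIANTS is defined.
 */
#ifdef INVARIANTS
#define KASSERT(exp, msg) assert((exp) && (msg))
#else
#define KASSERT(exp, msg) do { } while (0)
#endif

/* Hypothetical helper: sum the n entries of v. */
int
sum_entries(const int *v, size_t n)
{
#ifdef INVARIANTS
	size_t checked = 0;	/* referenced only by the KASSERT below */
#endif
	int sum = 0;

	for (size_t k = 0; k < n; k++) {
		sum += v[k];
#ifdef INVARIANTS
		checked++;
#endif
	}
	/* Without INVARIANTS this line vanishes, and so does 'checked'. */
	KASSERT(checked == n, "sum_entries: not every entry was visited");
	return (sum);
}
```

Guarding both the declaration and the updates keeps a build without INVARIANTS free of a "declared but unused" warning, while a build with INVARIANTS still performs the check.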
kib
bcef52d3bf Fix typo in comment. 2010-09-29 07:40:11 +00:00
obrien
55122bfc2d Correct some non-code typos. 2010-09-17 09:14:40 +00:00
mckusick
dd70ac636a Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by:	Jeff Roberson
2010-09-14 18:04:05 +00:00
jhb
d4890c88b0 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
kib
60a46e5ff9 Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. We should perhaps even do this before bwillwrite() too, but
this does not seem to be needed for now.

The fs might be suspended while processing the queue, and then there is
nobody around to unsuspend.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
2010-08-12 08:35:24 +00:00
jhb
7f218ea7f4 Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with:	attilio
2010-07-16 19:52:03 +00:00
jhb
ea417bf09a When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO.  This occurs after the vnode has been inserted into a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock.  If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered.  As a result
the thread blocked on the vnode lock may never get woken up.  Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after:	3 days
2010-07-16 19:20:20 +00:00
jeff
285c3f355c - Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count.  Previously
   all truncation as a result of unlink happened in the softdep flush
   thread.  This had the effect of being impossible to rate limit properly
   with the journal code.  Now the process issuing unlinks is suspended
   when the journal fills.  This has a side-effect of improving rm
   performance by allowing more concurrent work.
 - Handle two cases in inactive, one for effnlink == 0 and another when
   nlink finally reaches 0.
 - Eliminate the SPACECOUNTED related code since the truncation is no
   longer delayed.

Discussed with:	mckusick
2010-07-06 07:11:04 +00:00
kib
e735ee5c8d Ensure that VOP_ACCESSX is called with exclusively locked vnode for
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since
upgrade drops the shared lock when a non-blocking upgrade fails, a LOR results.

Reported and tested by:	Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by:	pho
PR:	kern/147890
MFC after:	1 week
2010-06-20 13:35:16 +00:00
avg
13985611dd ffs_softdep: change K&R in function definitions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its definition.

Complaint from:	clang
MFC after:	2 weeks
2010-06-11 18:26:53 +00:00
kib
881c9b1a5c Extend the scope of the lock on the quota file vnode in quotaon() to
cover the initial read by dqopen(). Assert that vnode is locked in
dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps
Giant locked if needed around the call.
2010-06-03 10:24:53 +00:00
avg
4e8fc6f387 ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options.  Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR:		kern/141050
Reported by:	many
Reviewed by:	jh, bde
MFC after:	4 days
2010-05-19 09:32:11 +00:00
jeff
ebb7d74dae - Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration.  This can lead to a deadlock when we have
   worklist items that cannot be immediately satisfied.

Reported by:	uqs, Dimitry Andric <dimitry@andric.com>

 - Remove some unnecessary debugging code and place some other under
   SUJ_DEBUG.
 - Examine the journal state in softdep_slowdown().
 - Re-format some comments so I may more easily add flag descriptions.
2010-05-19 06:18:01 +00:00
jeff
8fb90eedbc - Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
 - Don't fsync() vnodes in prealloc if copy on write is in progress.  It
   is not safe to recurse back into the write path here.

Reported by:	Vladimir Grebenschikov <vova@fbsd.ru>
2010-05-07 08:45:21 +00:00
jeff
0b3e023908 - Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not.  This fixes
   a deadlock that can occur with unlinked but referenced files.
   Journal space and inodedeps were not correctly reclaimed because
   the inode block was not left dirty.

Tested/Reported by:	lwindschuh@googlemail.com
2010-05-07 08:20:56 +00:00
mckusick
e95ff34dac Merger of the quota64 project into head.
This joint work of Dag-Erling Smørgrav and myself updates the
FFS quota system to support both traditional 32-bit and new 64-bit
quotas (for those of you who want to put 2+TB quotas on your users).

By default quotas are not compiled into the kernel. To include them
in your kernel configuration you need to specify:

options         QUOTA                   # Enable FFS quotas

If you are already running with the current 32-bit quotas, they
should continue to work just as they have in the past. If you
wish to convert to using 64-bit quotas, use `quotacheck -c 64';
if you wish to revert from 64-bit quotas back to 32-bit quotas,
use `quotacheck -c 32'.

There is a new library of functions to simplify the use of the
quota system, do `man quotafile' for details. If your application
is currently using the quotactl(2), it is highly recommended that
you convert your application to use the quotafile interface.
Note that existing binaries will continue to work.

Special thanks to John Kozubik of rsync.net for getting me
interested in pursuing 64-bit quota support and for funding
part of my development time on this project.
2010-05-07 00:41:12 +00:00
alc
fecc56fac1 Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
mckusick
b25e55dcc5 Final update to current version of head in preparation for reintegration. 2010-05-06 17:37:23 +00:00
alc
5c7ca3ee73 Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
trasz
402e3baade Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
avg
043deeb564 ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after:	1 week
2010-04-29 10:04:00 +00:00
jeff
47b1b89a95 - When canceling jaddrefs they may not yet be in the journal if this is via
a revert call.  In this case don't attempt to remove something that
   has not yet been added.  Otherwise this jaddref must hang around
   to prevent the bitmap write as normal.
2010-04-28 07:57:37 +00:00
jeff
564f436237 - Fix builds without SOFTUPDATES defined in the kernel config. 2010-04-28 07:26:41 +00:00
mckusick
3a0f5972a0 Update to current version of head. 2010-04-28 05:33:59 +00:00
pjd
57ec1f2624 Fix build for UFS without SOFTUPDATES. 2010-04-24 07:36:33 +00:00
jeff
a574495410 - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
kib
d5b92466a9 The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporarily unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by:	pho
Reviewed by:	kan
MFC after:	2 weeks
2010-04-20 10:19:27 +00:00
avg
d488f0b549 ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after:	1 week
2010-04-03 08:25:04 +00:00
mckusick
f63b97928b Debugging nits found while testing the new 64-bit quota code. 2010-03-16 06:12:30 +00:00
des
834fb25a9e IFH@204581 2010-03-04 13:35:57 +00:00
kib
d068222571 When ffs_realloccg() failed to allocate bigger fragment and, because
pending blocks are scheduled for removal, goes to retry the (re)allocation,
clear the bp pointer. It might happen that meantime free space is really
exhausted and we are entering nospace: label without bread()ing buffer,
causing stale bp value to be brelse()d again.

Tested by:	pho
    (Producing a scenario to reliably reproduce the
     race appeared to be much harder than fixing the bug)
MFC after:	1 week
2010-02-13 10:34:50 +00:00
mckusick
e7471d443b One last pass to get all the unsigned comparisons correct. 2010-02-11 18:14:53 +00:00
mckusick
d533f2ac8c This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16TB.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR:          133980
MFC after:   2 weeks
2010-02-10 20:10:35 +00:00
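The class of bug described in d533f2ac8c can be illustrated with a small userland sketch. The function names are hypothetical and this is not the FFS code. The signed variant relies on the common two's-complement behaviour when converting an out-of-range unsigned value to a signed type. If one assumes a density of one inode per 8 KB of space (an assumption here, not stated in the commit), 2^31 inodes correspond to 2^44 bytes, or 16 TiB, which would match the size quoted in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy variant (hypothetical): holding a 32-bit inode number in a signed
 * type makes any value above 2^31 - 1 negative, so it wrongly passes a
 * "less than the limit" test.
 */
bool
ino_in_range_signed(uint32_t ino, uint32_t max)
{
	int32_t sino = (int32_t)ino;	/* wraps negative for large ino */
	int32_t smax = (int32_t)max;

	return (sino < smax);
}

/* Corrected variant: compare the inode numbers as unsigned quantities. */
bool
ino_in_range_unsigned(uint32_t ino, uint32_t max)
{
	return (ino < max);
}
```

With ino = 0x80000005 and max = 1000, the signed check accepts the inode number although it is far above the limit, while the unsigned check rejects it.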
trasz
bf6995c4bb Remove unused variable. 2010-02-10 18:56:49 +00:00
trasz
0383f8d8bb Return proper error code.
Found with:	clang
2010-01-25 16:09:50 +00:00
trasz
f346ef85e4 Move out code that does POSIX.1e ACL inheritance into separate routines.
Reviewed by:	rwatson
2010-01-24 15:12:27 +00:00
mckusick
94b44c0969 Cast 64-bit quantity to intptr_t rather than int so as to work properly
with 64-bit architectures (such as amd64).

Reported by:	bz
2010-01-11 22:42:06 +00:00
mckusick
0cddeb2cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate privilege.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
mbr
7450f52a57 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
mckusick
3d4c810fbe KASSERT that condition raised by Coverity cannot happen.
Found by:	Coverity Prevent (tm)
KASSERT by:	sam
2010-01-07 06:20:07 +00:00
trasz
f04a989f2d Implement NFSv4 ACL support for UFS.
Reviewed by:	rwatson
2009-12-21 19:39:10 +00:00
kib
b79e14054c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
des
bf5117185e Sync with head 2009-09-25 22:45:59 +00:00
des
b79ff8160a Further improve comments. 2009-09-25 18:50:33 +00:00
des
2c6fa42d07 Improve comments, and remove a bogus 0 id check. 2009-09-25 18:44:34 +00:00
rdivacky
f3b70d313a Don't build ufs_gjournal.c at all if UFS_GJOURNAL option is not given
instead of building an almost empty C file.

Approved by:	pjd
Approved by:	ed (mentor, implicit)
2009-09-22 16:22:05 +00:00
des
9ed1a4b5eb Merge from head 2009-09-17 16:16:44 +00:00
des
7ee29ca499 Merge from head up to r188941 (last revision before the USB stack switch) 2009-09-17 13:31:39 +00:00
brooks
e19a3fa312 Allocate space for the group array in a static credential used in
the quota code.  One case was correctly handled in r194498, but
this one was missed.

PR:		kern/138657
Tested by:	PR submitter
MFC after:	3 days
2009-09-17 12:35:13 +00:00
trasz
a7104567d1 Remove useless variable assignment. 2009-09-08 17:23:32 +00:00
kib
30f476628e insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by:	tegge
Approved by:	des (pseudofs part)
MFC after:	3 days
2009-09-07 11:55:34 +00:00
kib
2e1ddcb566 The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 week
2009-09-06 11:46:51 +00:00
kib
3658df033e When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done synchronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by:	pluknet gmail com
Diagnosed and tested by:	pho
Approved by:	re (rwatson)
MFC after:	1 month
2009-08-14 11:00:38 +00:00
trasz
7ce4ab7ff8 Fix fpathconf(3) on fifos, in effect making ls(1) properly
display '+' on them.  Taken from kern/125613, with cosmetic
changes.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Approved by:	re (kib)
2009-07-02 20:05:21 +00:00
kib
350f96b4bf In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drops vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
trasz
dcdba7b2e3 Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-01 22:30:36 +00:00
kib
4cf230ed17 For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:33 +00:00
kib
c424611d79 Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:00 +00:00
snb
5d2850ae03 Fix a bug reported by pho@ where one can induce a panic by decreasing
vfs.ufs.dirhash_maxmem below the current amount of memory used by dirhash. When
ufsdirhash_build() is called with the memory in use greater than dirhash_maxmem,
it attempts to free up memory by calling ufsdirhash_recycle(). If successful in
freeing enough memory, ufsdirhash_recycle() leaves the dirhash list locked. But
at this point in ufsdirhash_build(), the list is not explicitly unlocked after
the call(s) to ufsdirhash_recycle(). When we next attempt to lock the dirhash
list, we will get a "panic: _mtx_lock_sleep: recursed on non-recursive mutex
dirhash list".

Tested by:	pho
Approved by:	dwmalone (mentor)
MFC after:	3 weeks
2009-06-25 20:40:13 +00:00
brooks
f53c1c309d Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care of the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for ensuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
snb
af1efe0490 Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix
a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over
dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a
dirhash is destroyed.

Suggested by:	pjd
Tested by:      pho
Approved by:	dwmalone (mentor)
2009-06-17 18:55:29 +00:00
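The reason af1efe0490 switches to TAILQ_FOREACH_SAFE can be sketched with a small userland example. The `struct entry` type and prune_large() are hypothetical stand-ins, not the dirhash code. The fallback macro is included only because some C libraries ship a <sys/queue.h> without the _SAFE variants; it follows the usual BSD form.

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for C libraries whose <sys/queue.h> lacks the _SAFE variant. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST((head));				\
	    (var) != NULL && ((tvar) = TAILQ_NEXT((var), field), 1);	\
	    (var) = (tvar))
#endif

struct entry {			/* hypothetical stand-in for a dirhash */
	int size;
	TAILQ_ENTRY(entry) link;
};
TAILQ_HEAD(entry_list, entry);

/* Free every entry larger than 'limit'; return how many were freed. */
int
prune_large(struct entry_list *list, int limit)
{
	struct entry *e, *next;
	int freed = 0;

	/* 'next' is saved before the body runs, so freeing 'e' is safe. */
	TAILQ_FOREACH_SAFE(e, list, link, next) {
		if (e->size > limit) {
			TAILQ_REMOVE(list, e, link);
			free(e);
			freed++;
		}
	}
	return (freed);
}
```

A plain TAILQ_FOREACH would read the successor from the element after it had been freed; the safe variant captures the successor first, which is what allows iteration to continue after an element is destroyed.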
kib
b8351fcda2 Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by:	jhb
MFC after:	1 week
2009-06-16 15:13:45 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
snb
7f32f3f2a1 Add vm_lowmem event handler for dirhash. This will cause dirhashes to be
deleted when the system is low on memory. This ought to allow an increase to
vfs.ufs.dirhash_maxmem on machines that have lots of memory, without
degrading performance by having too much memory reserved for dirhash when
other things need it. The default value for dirhash_maxmem is being kept at
2MB for now, though.

This work was mostly done during the 2008 Google Summer of Code.

Approved by:	dwmalone (mentor), re
MFC after:	3 months
2009-06-03 09:44:22 +00:00
attilio
44c490ae17 Handle lock recursion differently by always checking against LO_RECURSABLE
instead of the lock's own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
jamie
a013e0afcb Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
trasz
fb57d2691e Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide
compatibility interfaces in both kernel and libc.

Reviewed by:	rwatson
2009-05-22 15:56:43 +00:00
alc
dc942dabcf Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with:	tegge
2009-05-17 20:26:00 +00:00
attilio
1dcb84131b Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavily changed by this commit so third-party modules need
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
kan
7b57a857b7 Do not embed struct ucred into larger netcred parent structures.
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by:	John Hickey
PR:		kern/133439
Reviewed by:	dfr, kib
2009-05-09 18:09:17 +00:00
rmacklem
84d9dc09c0 Change the semantics of i_modrev/va_filerev to what is required for
the nfsv4 Change attribute. There are 2 changes:
 	1 - The value now changes on metadata changes as well as data
 	    modifications (incremented for IN_CHANGE instead of IN_UPDATE).
 	2 - It is now saved in spare space in the on-disk i-node so that it
 	    survives a crash.
 	Since va_filerev is not passed out into user space, the only current
 	use of va_filerev is in the nfs server, which uses it as the directory
 	cookie verifier. Since this verifier is only passed back to the server
 	by a client verbatim and then the server doesn't check it, changing the
 	semantics should not break anything currently in FreeBSD.

Reviewed by:	bde
Approved by:	kib (mentor)
2009-04-27 16:46:16 +00:00
kib
24114749aa In ufs_checkpath(), recheck that '..' still points to the inode with
the same inode number after VFS_VGET() and relock of the vp. If '..'
changed, redo the lookup. To reduce code duplication, move the code to
read '..' dirent into the static helper function ufs_dir_dd_ino().

Supply the source inode number as an argument to ufs_checkpath() instead
of the source inode itself. The inode is unlocked, thus it might be
reclaimed, causing accesses to the freed memory.

Use vn_vget_ino() to get the '..' vnode by its inode number, instead of
directly coding VFS_VGET() and relock, to properly busy the mount point
while vp lock is dropped.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 month
2009-04-20 14:36:01 +00:00
kib
7ee6b427ae When verifying '..' after VFS_VGET() in ufs_lookup(), do not return
error if '..' is still there but changed between lookup and check.
Start relookup instead. Rename is supposed to change '..' reference
atomically, so transient failures introduced by r191137 are wrong.

While rearranging the code to allow lookup restart in ufs_lookup(),
remove the comment that only distracts the reader.

Noted and reviewed by:	tegge
Also reported by:	pho
MFC after:	1 month
2009-04-19 05:34:07 +00:00
trasz
858b10f6e2 Use acl_alloc() and acl_free() instead of using uma(9) directly.
This will make switching to malloc(9) easier; also, it would be
necessary to add these routines if/when we implement variable-size
ACLs.
2009-04-18 16:47:33 +00:00
kib
915dee68f2 Verify that '..' still exists with the same inode number after
VFS_VGET() has returned in ufs_lookup(). If the '..' lookup started
immediately before the parent directory was removed, we might return
either cleared or unrelated inode otherwise.

Ufs_lookup() is split into new function ufs_lookup_() that either does
lookup, or verifies that directory entry exists and references supplied
inode number.

Reviewed by:	tegge
Tested by:	pho,
	Andreas Tobler <andreast-list fgznet ch> (previous version)
MFC after:	1 month
2009-04-16 09:57:08 +00:00
rwatson
fba90f2e03 Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
kib
ec888841ec When removing or renaming a snapshot, do not delve into request_cleanup().
The latter may need blocks from the underlying device that belong
to normal files, that should not be locked while snap lock is held.

Reported and tested by:	pho
MFC after:	1 month
2009-04-04 12:19:52 +00:00
kib
642970135f Correct typo.
Noted by:	kensmith
2009-03-27 15:46:02 +00:00
kib
7ad4235008 Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf().  The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
kib
9e27cf574d The non-modifying EA VOPs are executed with only shared vnode lock taken.
Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.

Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

Noted by:	YAMAMOTO, Taku <taku tackymt homeip net>
Tested by:	pho (previous version)
MFC after:	2 weeks
2009-03-12 12:43:56 +00:00
kib
de74558fd8 Do not double-free the struct inode when insmntque failed. Default
insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.

Reviewed by:	tegge
MFC after:	3 days
2009-03-11 19:45:52 +00:00
jhb
520acdaf69 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
jhb
80d9458a56 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
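The kind of overflow that commit 80d9458a56 addresses can be illustrated with a small userland sketch. The 16 KiB per-buffer size, the scale factor, and all `ex_` names are assumptions chosen for illustration; they are not taken from the kernel source. The sketch assumes a 32-bit int.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-buffer KVA size, used only for this illustration. */
#define EX_BUF_KVA_SIZE	16384

/*
 * Return true if scale * nbuf * EX_BUF_KVA_SIZE would not fit in an int.
 * The product is computed in 64 bits so the check itself cannot overflow.
 */
static bool
ex_overflows_int(int nbuf, int scale)
{
	int64_t bytes;

	assert(nbuf >= 0 && scale > 0);
	bytes = (int64_t)scale * nbuf * EX_BUF_KVA_SIZE;
	return (bytes > INT_MAX);
}

/* Cap nbuf to the largest value whose scaled size still fits in an int. */
static int
ex_cap_nbuf(int nbuf, int scale)
{
	int64_t max_nbuf;

	assert(nbuf >= 0 && scale > 0);
	max_nbuf = (int64_t)INT_MAX / ((int64_t)scale * EX_BUF_KVA_SIZE);
	return (nbuf > max_nbuf ? (int)max_nbuf : nbuf);
}
```

Under these assumptions an intermediate expression that multiplies the total size by 3 already exceeds INT_MAX at roughly 44000 buffers, which is why widening the variables to long moves the problem far out of reach on 64-bit platforms.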
trasz
644f63d68b Right now, when trying to unmount a device that's already gone,
msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO.
However, dounmount() treats ENXIO as a success and proceeds with
unmounting.  In effect, the filesystem gets unmounted without closing
GEOM provider etc.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Tested by:	dho
Sponsored by:	FreeBSD Foundation
2009-02-23 21:09:28 +00:00
trasz
40c2eac6aa Refactor, moving error checking outside of the
'if (mp->mnt_flag & MNT_SOFTDEP)' conditional.  No functional
changes.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Tested by:	pho
Sponsored by:	FreeBSD Foundation
2009-02-23 20:56:27 +00:00
jhb
6f82ac1ceb - If the g_access() call for the initial root mount fails, then fully
  clean up.  Previously, the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the
  associated devfs vnode is locked.

Reviewed by:	kib
2009-02-11 22:19:54 +00:00
trasz
e0aa6f81f9 When a device containing mounted UFS filesystem disappears, the type
of devvp becomes VBAD, which UFS incorrectly interprets as a snapshot
vnode, which in turn causes a panic.  Fix it by replacing '!= VCHR'
with '== VREG'.

With this fix in place, you should no longer be able to panic the system
by removing a device with a UFS filesystem mounted from it - assuming
you don't use softupdates.

Reviewed by:	kib
Tested by:	pho
Approved by:	rwatson (mentor)
Sponsored by:	FreeBSD Foundation
2009-02-06 17:14:07 +00:00
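The difference between the two tests in commit e0aa6f81f9 can be shown with a minimal sketch. The enum below is an illustrative subset with arbitrary values and `ex_` prefixed names; it is not the kernel's definition of vnode types.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of vnode types; the enumerator values are arbitrary. */
enum ex_vtype { EX_VNON, EX_VREG, EX_VDIR, EX_VCHR, EX_VBAD };

static_assert(EX_VBAD != EX_VCHR, "a dead device must be a distinct type");

/* Old test: anything that is not a character device counts as a snapshot. */
static bool
ex_is_snapshot_old(enum ex_vtype t)
{
	return (t != EX_VCHR);
}

/* New test: only a regular file can be a snapshot. */
static bool
ex_is_snapshot_new(enum ex_vtype t)
{
	return (t == EX_VREG);
}
```

A negative test such as `!= VCHR` silently accepts every type added or reached later, including VBAD; the positive test names the one type that is actually valid.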
des
604db9f61a WIP 2009-01-30 13:54:03 +00:00
trasz
d27cdd2fdd Make sure the cdev doesn't go away while the filesystem is still mounted.
Otherwise dev2udev() could return garbage.

Reviewed by:	kib
Approved by:	rwatson (mentor)
Sponsored by:	FreeBSD Foundation
2009-01-29 16:47:15 +00:00
rwatson
c99f201c9e Following a fair amount of real world experience with ACLs and
extended attributes since FreeBSD 5, make the following semantic
changes:

- Don't update the inode modification time (mtime) when extended
  attributes (and hence also ACLs) are added, modified, or removed.
- Don't update the inode access time (atime) when extended attributes
  (and hence also ACLs) are queried.

This means that rsync (and related tools) won't improperly think
that the data in the file has changed when only the ACL has changed.

Note that ffs_reallocblks() has not been changed to not update on an
IO_EXT transaction, but currently EAs don't use the cluster write
routines so this shouldn't be a problem.  If EAs grow support for
clustering, then VOP_REALLOCBLKS() will need to grow a flag argument
to carry down IO_EXT to UFS.

MFC after:	1 week
PR:             ports/125739
Reported by:    Alexander Zagrebin <alexz@visp.ru>
Tested by:      pluknet <pluknet@gmail.com>,
                Greg Byshenk <freebsd@byshenk.net>
Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>
2009-01-27 21:48:47 +00:00
jhb
2939c2f76f Fix a few style bogons.
Submitted by:	bde
2009-01-21 20:08:17 +00:00
kib
44eef9d9bb Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by:	jhb
Tested by:	pho
MFC after:	1 month
2009-01-21 14:51:38 +00:00
jhb
47455a7b41 Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by:	ups
Tested by:	pho
2009-01-21 14:42:00 +00:00
kib
1a9032d339 The r187467 should remove all pages for V_NORMAL case too, because
indirect block pages are not removed by the mentioned invocation of
the vnode_pager_setsize().

Put a common code into the helper function ffs_pages_remove().

Reported and tested by:	dchagin
Reviewed by:	ups
MFC after:	3 weeks
2009-01-20 22:00:19 +00:00
jhb
19686e364d Add a comment explaining why the "bufwait" / "dirhash" LOR reported by
WITNESS will not actually result in a deadlock.

Discussed with:	kib
MFC after:	1 week
2009-01-20 16:35:34 +00:00
kib
ac57d6e717 When extending inode size, we call vnode_pager_setsize(), to have an
address space where to put vnode pages, and then call UFS_BALLOC(),
to actually allocate new block and map it. When UFS_BALLOC() returns
error, sometimes we forget to revert the vm object size increase,
allowing for the pages that are not backed by the logical disk blocks.

Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for
ffs_truncate() and ffs_write().

PR:	129956
Reviewed by:	ups
MFC after:	3 weeks
2009-01-20 11:30:22 +00:00
kib
cbb8defa10 FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by:	csjp
Reviewed by:	ups
MFC after:	3 weeks
2009-01-20 11:27:45 +00:00
kib
ca2e9cb3a9 Lock the uepm_lock around the autostart of extattrs.
Reported and tested by:	pho
Reviewed by:	rwatson
MFC after:	3 weeks
2009-01-08 12:49:55 +00:00
kib
1f52da301f If unmount of the ffs mp failed, reinitialize the extended attributes
for the mp, and restart them if autostart is enabled.

Reported and tested by:	pho
Reviewed by:	rwatson
MFC after:	3 weeks
2009-01-08 12:48:27 +00:00
kib
068931a989 Do not busy twice the mount point where a quota operation is performed.
Tested by:	pho
MFC after:	1 month
2008-12-18 12:01:53 +00:00
trasz
b2515a861b According to phk@, VOP_STRATEGY should never, _ever_, return
anything other than 0.  Make it so.  This fixes
"panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648",
encountered when writing to an orphaned filesystem.  Reason
for the panic was the following assert:
KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp));
at vfs_bio:bufstrategy().

Reviewed by:	scottl, phk
Approved by:	rwatson (mentor)
Sponsored by:	FreeBSD Foundation
2008-12-16 21:13:11 +00:00
kib
01085e085d The dqrele() function syncs the dq, then acquires the dqh lock, and then
does final drop of the dq reference to put it onto the free list.
There is a possibility that the dq would be found by another thread
after sync and before the dqh lock is acquired. If that other thread
drops the dq before we have taken the dqh lock, the dirty dq is put on
the free list.

Recheck the DQ_MOD after the dqh lock is relocked. Repeat dqsync() if
the dq is dirty. This ensures that up to date dq is written in the quota
file and fixes assertion in dqget().

Reported and tested by:	Frode Nordahl <frode nordahl net>
MFC after:	3 days
2008-12-08 11:04:17 +00:00
kib
868630039f Improve usefulness of the panic by printing the pointer to the problematic
dquot. In-tree gdb is often unable to get the dq value, so supply it in
panic message.

MFC after:	3 days
2008-12-07 13:25:06 +00:00
kib
294fd952b0 Do not lock vnode interlock around reading of v_iflag to check VI_DOOMED.
Read of the pointer is atomic, and flag cannot be set while vnode lock
is held.

Requested by:	jhb
MFC after:	1 month
2008-12-02 11:12:50 +00:00
kib
371d0216fb Busy ufs filesystem around block of code that does ".." lookup. Since
mnt_lock is before lock of any vnode on the mp, it uses LK_NOWAIT. Since
MNTK_UNMOUNT may be transient, pdp lock is dropped when vfs_busy()
failed, and operation is retried after some time. This way, ffs_vget()
is not called on the mp that may be in the process of being destroyed by
unmount.

Check for the VI_DOOMED flag on pdp after its lock is reacquired, to
better detect some situations where directory containing ".."
entry is removed during the lookup.

Reviewed by:	tegge, attilio (previous version)
Tested by:	pho
MFC after:	1 month
2008-11-22 13:11:11 +00:00
jhb
f8948bf5ef Fix typo. 2008-11-19 20:06:59 +00:00
ambrisko
4ee5cbf821 For now, flush the buffer cache every 10 cylinder groups to free
up space.  If the buffer cache fills up then the disk systems can
grind to a halt.  Better tuning can be figured out later.

Tested by:	Tim, others and work
Reviewed by:	Kostik Belousov
PR:		128832
2008-11-13 17:40:21 +00:00
jhb
9f264a6a75 Quiet a WITNESS warning with the dirhash sx locks by setting the DUPOK
flag.  Specifically, if two threads race to create a dirhash for a
directory, then one might already have created a private dirhash
structure (and locked it) when it realizes the directory now has a
structure and tries to lock that one.
2008-11-04 18:56:12 +00:00
trasz
5ff0dc16bf In UFS, when reading EA that contains ACL fails for some reason, include
inode number and filesystem name, so the administrator can fix the problem.

Approved by:	rwatson (mentor)
2008-11-04 12:30:31 +00:00
attilio
e1f493235e Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no longer linked to a lockmgr.
  Change its KPI accordingly by removing the interlock argument and
  defining 2 new flags for it: MBF_NOWAIT, which basically replaces the
  LK_NOWAIT of the old version (which was unlinked from the lockmgr
  already), and MBF_MNTLSTLOCK, which provides the ability to drop the
  mountlist_mtx once the mnt interlock is held (ability still desired by
  most consumers).
- The stub used in vfs_mount_destroy(), which allows overriding the
  mnt_ref if running for more than 3 seconds, makes the refcount totally
  useless.  Remove it, as it was meant to work in older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix it properly rather than trust such a hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if the waiters were actually
  woken up. Just one place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
trasz
0ad8692247 Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which are 16 bits wide.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
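The motivation for commit 0ad8692247 can be sketched as follows. The typedefs and the flag above bit 15 are stand-ins invented for this illustration (`ex_` names are hypothetical); only the idea that a 16-bit type cannot carry additional access bits comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t ex_mode16_t;	/* stands in for a 16-bit mode_t */
typedef int ex_accmode_t;	/* stands in for a wider access-mode type */

static_assert(sizeof(ex_mode16_t) == 2, "the narrow type must be 16 bits");

#define EX_VREAD	0x0100		/* fits in 16 bits */
#define EX_VNEWFLAG	0x00020000	/* hypothetical flag above bit 15 */

/* Storing access bits in the narrow type silently drops the high bits. */
static int
ex_store_narrow(int flags)
{
	ex_mode16_t m = (ex_mode16_t)flags;

	return (m);
}

/* Storing the same bits in the wider type preserves them. */
static int
ex_store_wide(int flags)
{
	ex_accmode_t m = flags;

	return (m);
}
```

The truncation is silent at run time, which is why a dedicated type is safer than relying on every caller to pick a wide enough variable.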
kib
19fb7eaf9f Provide an explanation for getinoquota() call in the ufs_access vop.
MFC after:	3 days
2008-10-28 12:00:28 +00:00
des
a1e1ad22e0 Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
des
66f807ed8b Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
kib
e4785f6af4 Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by:	tegge, attilio
Tested by:	pho
MFC after:	2 weeks
2008-10-20 10:11:33 +00:00
kib
f2993add72 Sync up summary information for cylinder groups while data is already
in memory during snapshot creation. This improves the results of the
background fsck.

Submitted by: tegge
MFC after: 1 week
2008-10-13 14:05:01 +00:00
attilio
b8bf37e585 Remove the useless struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) to be temporary, as it will be removed ASAP.

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
jhb
536c1c47bb Enable shared lookups on UFS. There are some remaining issues with forced
unmounts, but those are in the VFS lookup code and are not UFS specific.

Tested by:	pho, kris
2008-09-24 18:53:04 +00:00
jhb
82d0acb613 Close a race between concurrent calls to ufsdirhash_recycle() and
ufsdirhash_free() introduced in my last commit by removing the dirhash
about to be free'd in ufsdirhash_free() from the global dirhash list
before dropping the sx lock.

Tested by:	kris
2008-09-22 20:53:22 +00:00
kib
81d455e702 Initialize va_flags and va_filerev properly in VOP_GETATTR(). Don't
initialize va_vaflags and va_spare because they are not part of the
VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Reviewed by:	bde
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:46:45 +00:00
jhb
281938f04d Retire the 'i_reclen' field from the in-memory i-node. Previously,
during a DELETE lookup operation, lookup would cache the length of the
directory entry to be deleted in 'i_reclen'.  Later, the actual VOP to
remove the directory entry (ufs_remove, ufs_rename, etc.) would call
ufs_dirremove() which extended the length of the previous directory
entry to "remove" the deleted entry.

However, we always read the entire block containing the directory
entry when doing the removal, so we always have the directory entry to
be deleted in-memory when doing the update to the directory block.
Also, we already have to figure out where the directory entry that is
being removed is in the block so that we can pass the component name
to the dirhash code to update the dirhash.  So, instead of passing
'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read
the 'd_reclen' field directly out of the entry being removed when
updating the length of the previous entry in the block.

This avoids a cosmetic issue of writing to 'i_reclen' while holding a
shared vnode lock.  It also slightly reduces the amount of side-band
data passed from ufs_lookup() to operations updating a directory via
the directory's i-node.

Reviewed by:	jeff
2008-09-16 19:06:44 +00:00
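The removal scheme described in commit 281938f04d, where the previous entry absorbs the record length of the deleted one, can be sketched like this. The structure is a simplified stand-in for a directory entry header and the `ex_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for an on-disk directory entry header. */
struct ex_direct {
	uint32_t d_ino;		/* inode number */
	uint16_t d_reclen;	/* bytes from this entry to the next one */
};

/*
 * "Remove" the entry ep by folding its record length into the previous
 * entry.  The length is read from the entry itself rather than from a
 * value cached in the in-memory inode at lookup time.
 */
static void
ex_dirremove(struct ex_direct *prev, const struct ex_direct *ep)
{
	assert(prev != ep);
	prev->d_reclen = (uint16_t)(prev->d_reclen + ep->d_reclen);
}
```

Since the block holding both entries is already in memory during the update, reading `d_reclen` directly costs nothing and removes one piece of state that had to be written during lookup.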
jhb
020ede1151 Fix a race with shared lookups on UFS. If the dirhash code reached the
cap on memory usage, then shared LOOKUP operations could start free'ing
dirhash structures.  Without these fixes, concurrent free's on the same
directory could result in one of the threads blocked on a lock in a dirhash
structure free'd by the other thread.
- Replace the lockmgr lock in the dirhash structure with an sx lock.
- Use a reference count managed with ufsdirhash_hold()/drop() to determine
  when to free the dirhash structures.  The directory i-node holds a
  reference while the dirhash is attached to an i-node.  Code that wishes
  to lock the dirhash while holding a shared vnode lock must first
  acquire a private reference to the dirhash while holding the vnode
  interlock before acquiring the dirhash sx lock.  After acquiring the sx
  lock, it drops the private reference after checking to see if the
  dirhash is still used by the directory i-node.
2008-09-16 16:23:56 +00:00
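The hold/drop reference counting from commit 020ede1151 follows a common pattern that can be sketched in userland with C11 atomics. The use of atomics and all `ex_` names are assumptions for this sketch; the commit message does not say how the kernel counter is implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct ex_dirhash {
	atomic_int dh_refcount;	/* 1 reference is owned by the inode */
};

/* Allocate a dirhash holding the single reference owned by the inode. */
static struct ex_dirhash *
ex_dirhash_alloc(void)
{
	struct ex_dirhash *dh = malloc(sizeof(*dh));

	if (dh != NULL)
		atomic_init(&dh->dh_refcount, 1);
	return (dh);
}

/* Take a private reference before sleeping on the dirhash lock. */
static void
ex_dirhash_hold(struct ex_dirhash *dh)
{
	atomic_fetch_add(&dh->dh_refcount, 1);
}

/* Drop a reference; free the structure and return true on the last one. */
static bool
ex_dirhash_drop(struct ex_dirhash *dh)
{
	int old = atomic_fetch_sub(&dh->dh_refcount, 1);

	assert(old > 0);
	if (old == 1) {
		free(dh);
		return (true);
	}
	return (false);
}
```

The private reference keeps the structure alive while a thread blocks on its lock, so a concurrent free cannot pull the memory out from under the sleeper.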
jhb
3a8bef861f - Only set i_offset in the parent directory's i-node during a lookup for
  non-LOOKUP operations.
- Relax a VOP assertion for a DELETE lookup.  rename() uses WANTPARENT
  instead of LOCKPARENT when looking up the source pathname.  ufs_rename()
  uses a relookup() to lock the parent directory when it decides to finally
  remove the source path.  Thus, it is ok for a DELETE with WANTPARENT set
  instead of LOCKPARENT to use a shared vnode lock rather than an exclusive
  vnode lock.

Reported by:	kris (2)
Reviewed by:	jeff
2008-09-16 16:18:36 +00:00
jhb
37890de0aa vdropl() drops the vnode interlock. Thus, the code in the QUOTA case that
upgrades the vnode lock if it is share locked was dropping the interlock
before actually checking VI_DOOMED.  Fix this by doing the vdropl() after the
check and relying on it to drop the vnode interlock.

Reported by:	pho
Reviewed by:	kib
MFC after:	1 week
2008-09-16 16:15:38 +00:00
kib
18a3116541 Suspend the write operations on the UFS filesystem being unmounted or
remounted from rw to ro.

Proposed and reviewed by:  tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:55:53 +00:00
kib
f6863a9ef7 When an attempt is made to suspend a filesystem that is already suspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write_resume() when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows bypassing the suspension
point in vn_start_write(). It is intended for use by VFS in
situations where the suspender wants to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 11:51:06 +00:00
kib
0488506405 Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a command to dump the dirty/clean buffer list for vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00
kib
ad287e3d61 When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce the UFS_RDONLY() struct ufsmount method that reports the value
of fs_ronly. The latter is set to 1 only after the remount is
finished.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 10:59:35 +00:00
kib
f67f57a431 The struct inode *ip supplied to softdep_freefile is not necessarily the
inode having number ino. In r170991, the ip was marked IN_MODIFIED, which
is not quite correct.

Mark only the right inode modified by checking inode number.

Reviewed by:	tegge
In collaboration with:	pho
MFC after:	 1 month
2008-09-16 10:52:25 +00:00
trasz
88f6f49133 When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}.
Approved by:	rwatson (mentor)
2008-09-03 12:46:09 +00:00
attilio
e2ca413d09 Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.

Tested by:	Diego Sardina <siarodx at gmail dot com>
2008-08-31 14:26:08 +00:00
attilio
dbf35e279f Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally useless.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
kib
8ccd4bd3a1 In ffs_valloc(), ffs_vget() may fail because insmntque() refused to
insert new vnode into the mount vnode list. Then, for the SU-enabled
mount, ffs_vfree could create freefile dependency. This dependency can
hang around forever since inode is not marked as IN_MODIFIED and
correspondingly inodeblock may be not marked as dirty.

After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as
modified, and vput() it immediately. Take care of the dup alloc.

Tested by:	pho
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:19:50 +00:00
kib
2a5baa49d2 Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by:	pho, kris
Reviewed by:	tegge
MFC after:	1 month
2008-08-28 09:18:20 +00:00
kib
5890c8d4b7 Put the relocked variable from the r182111 into the #ifdef QUOTA braces
to prevent warning about unused var on the !QUOTA kernels.

Reported by:	ed
MFC after:	1 week
2008-08-24 19:06:19 +00:00
kib
1eb6d5f22f Revert the r167541: "Remove unneeded getinoquota() call in the
ufs_access()." The call to getinoquota in ufs_access() serves the
purpose of instantiating inode dquot from the vn_open(). Since quotas
are accounted only for the inodes with already attached dquot, removal
of the call prevented opened inodes from participation in the quota
calculations.

Since ufs_access() may be called with the vnode being only shared
locked, upgrade (and then downgrade) vnode lock if calling
getinoquota().

Reported by:	simon at optinet com
In collaboration with:	pho
MFC after:	1 week
2008-08-24 17:24:22 +00:00
kib
ca3c43733a Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with:	rodrigc
MFC after:	3 days
2008-08-10 12:15:36 +00:00
kib
28272d34d6 User may do "mount -o snapshot ...", that causes new FFS mount to be
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.

Noted by:	Mateusz Guzik <mjguzik gmail com>
MFC after:	3 days
2008-08-06 14:47:19 +00:00
des
f42aa30836 ufsmount.h uses "struct\tfoo *bar;", except where it doesn't.
quota.h uses "struct foo\t*bar;", except where it doesn't.
Try to make them both agree with themselves (though not with each other)
2008-08-05 15:24:07 +00:00
des
c4f7fdb253 Whitespace, prototypes 2008-08-05 10:25:55 +00:00
jhb
69486dc077 Whitespace tweak. 2008-07-30 21:07:56 +00:00
kib
167593c766 The ffs_balloc_ufs{1,2} functions call bdwrite() while having several
vnode buffers locked at once. In particular, there are indirect buffers
among locked ones. The bdwrite() may start the flushing to keep dirty
buffer list at the bounds. If any buffer on the dirty list requires
translation from logical to physical block number, code may end up
trying to lock an indirect buffer already locked in ffs_balloc_ufsX.

Prevent the bdflush() activity when several buffers are locked at once
by setting the TDP_INBDFLUSH flag for the problematic code blocks.

Reported and tested by:	pho, Josef Buchsteiner at Juniper
In collaboration with:	kan
MFC after:	1 month
2008-07-23 14:32:44 +00:00
pjd
5373c91a91 Say hi to svn, by simplifying ffs_vget() function a bit - there is no need for
a variable that is used only once.
2008-07-19 22:29:44 +00:00
rodrigc
4e4b0a667c Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE
was renamed to SBLOCKSIZE in version 1.33

Reviewed by:	mckusick
2008-05-24 20:44:14 +00:00
rodrigc
c1896c3fec After converting the "snapshot" mount option to the MNT_SNAPSHOT flag,
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.

PR:		122833
Reported by:	Leon Kos <leon.kos lecad fs uni-lj si>,
		Jaakko Heinonen <jh saunalahti fi>
MFC after:	1 month
2008-05-24 00:41:32 +00:00
rodrigc
44e2059da2 For the following mount options, do not perform the string to flag conversions
here, because we already do them further up in vfs_donmount() in vfs_mount.c

async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW

MFC after:  1 month
2008-05-24 00:02:12 +00:00
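The string-to-flag conversion that commit 44e2059da2 leaves to the generic mount code is essentially a table lookup. A minimal sketch follows; the `EX_MNT_` bit values are placeholders rather than the real MNT_* constants, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values; the real MNT_* constants differ. */
#define EX_MNT_ASYNC		0x01u
#define EX_MNT_FORCE		0x02u
#define EX_MNT_MULTILABEL	0x04u
#define EX_MNT_NOATIME		0x08u
#define EX_MNT_NOCLUSTERR	0x10u
#define EX_MNT_NOCLUSTERW	0x20u

struct ex_optmap {
	const char *name;
	unsigned int flag;
};

static const struct ex_optmap ex_optmap[] = {
	{ "async",	EX_MNT_ASYNC },
	{ "force",	EX_MNT_FORCE },
	{ "multilabel",	EX_MNT_MULTILABEL },
	{ "noatime",	EX_MNT_NOATIME },
	{ "noclusterr",	EX_MNT_NOCLUSTERR },
	{ "noclusterw",	EX_MNT_NOCLUSTERW },
};

/* Return the flag for a mount option name, or 0 if it is not in the table. */
static unsigned int
ex_opt_to_flag(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(ex_optmap) / sizeof(ex_optmap[0]); i++) {
		if (strcmp(ex_optmap[i].name, name) == 0)
			return (ex_optmap[i].flag);
	}
	return (0);
}
```

Doing the conversion once in a shared layer keeps each filesystem from carrying its own copy of the table.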
ups
fbd329664f Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this  change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo
2008-05-20 19:05:43 +00:00
jeff
057cb45df3 - Use a local variable for i_ino in ufs_lookup. It is only used to
   communicate between two parts of this one function.  This was causing
   problems with shared lookups as each would trash the ino value in the
   inode.
 - Remove the unused i_ino field from the inode structure.
2008-04-22 12:34:16 +00:00
kib
52243403eb Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
jeff
8facfb0569 - Use a lockmgr lock rather than a mtx to protect dirhash. This lock
   may be held for the duration of the various dirhash operations which
   avoids many complex unlock/lock/revalidate sequences.
 - Permit shared locks on lookup.  To protect the ip->i_dirhash pointer we
   use the vnode interlock in the shared case.  Callers holding the
   exclusive vnode lock can run without fear of concurrent modification to
   i_dirhash.
 - Hold an exclusive dirhash lock when creating the dirhash structure for
   the first time or when re-creating a dirhash structure which has been
   recycled.

Tested by:	kris, pho
2008-04-11 09:48:12 +00:00
jeff
fef79bb2db - cache dp->i_offset in the local 'i_offset' variable for use in loop
   indexes so directory lookup becomes shared lock safe.  In the modifying
   cases an exclusive lock is held here so the commit routine may
   rely on the state of i_offset.
 - Similarly handle i_diroff by fetching at the start and setting only once
   the operation is complete.  Without the exclusive lock these are only
   considered hints.
 - Assert that an exclusive lock is held when we're preparing for a commit
   routine.
 - Honor the lock type request from lookup instead of always using exclusive
   locking.

Tested by:	pho, kris
2008-04-11 09:44:25 +00:00
pjd
e83dd39a00 Correct function name in panic().
Reported by:	kensmith
2008-04-07 18:12:37 +00:00
attilio
07441f19e1 Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) calls.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
already do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the corresponding per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw().  These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock.  In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will probably be
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wake up, are in the runqueue they can contend the lock with
the exclusive drainer.  This is hard to fix, but the now committed
code mitigates this issue a lot better than the (past) CVS version.
In addition, assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will no longer be
supported soon.

In order to avoid namespace pollution, stack.h is split into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI.  In this way, newly added _lockmgr.h can
just include _stack.h.

The kernel ABI is heavily changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
the KPI is broken by the lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by:      kris, pho, jeff, danger
Reviewed by:    jeff
Sponsored by:   Google, Summer of Code program 2007
2008-04-06 20:08:51 +00:00
kib
eff8c6d35e Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
	sponsored by Google Summer of Code 2007
Reviewed by:	rwatson, rdivacky
Tested by:	pho
2008-03-31 12:01:21 +00:00
jeff
1968343329 - Since rev 1.142 of ffs_snapshot.c the interlock has not been required
to protect the v_lock pointer.  Removing the interlock acquisition
   here allows vn_lock() to proceed without requiring the interlock
   at all.
 - If the lock mutated while we were sleeping on it the interlock has
   been dropped.  It is conceivable that the upper layer code was
   relying on the interlock and LK_NOWAIT to protect the identity or
   state of the vnode while acquiring the lock.  In this case return
   EBUSY rather than trying the new lock to prevent potential races.

Reviewed by:	tegge
2008-03-31 07:55:45 +00:00
jeff
5e4f326d87 - Don't free snapdata structures when they are no longer in use.
Keeping the lockmgr lock valid allows us to switch the v_lock pointer
   in snapshot vnodes between the embedded lockmgr lock and snapdata
   lock without needing the vnode interlock to protect against races
 - Keep unused snapdata structures in a list.
 - Add a function to lock the devvp and allocate a snapdata to it or
   acquire a new one without races.  The old function was safe from
   creation races because we set the mount flag when creating snapshots
   and thus serializing them.  However, it might have been subject to
   destroying races.

Reviewed by:	tegge
2008-03-31 07:47:08 +00:00
jhb
20cadd93f0 Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo'
(such as 'atime' vs 'noatime').  The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'.  As such, their list of valid
mount options should include 'nofoo' instead of 'foo'.  With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error.  You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).

MFC after:	1 week
Reviewed by:	crodig
2008-03-26 20:48:07 +00:00
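The matching problem described in this commit can be illustrated with a minimal sketch. It assumes that option names are compared with at most one leading "no" treated as a negation prefix; the helper names opt_names_match() and opt_is_valid() are hypothetical and are not the kernel's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: two option names match if they are identical or
 * if one of them is the other with a single leading "no" prefix.
 */
static bool
opt_names_match(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    if (strcmp(a, b) == 0)
        return (true);
    if (strncmp(a, "no", 2) == 0 && strcmp(a + 2, b) == 0)
        return (true);
    if (strncmp(b, "no", 2) == 0 && strcmp(b + 2, a) == 0)
        return (true);
    return (false);
}

/* Check a name against a NULL-terminated list of valid option names. */
static bool
opt_is_valid(const char *name, const char *const *valid)
{
    size_t i;

    for (i = 0; valid[i] != NULL; i++)
        if (opt_names_match(name, valid[i]))
            return (true);
    return (false);
}
```

Under these assumptions, a list containing "atime" rejects "nonoatime" because only one "no" is stripped, while a list containing "noatime" accepts both "noatime" and "nonoatime".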
dfr
79d2dfdaa6 Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
kib
5ddf5664cc Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
jeff
a9d123c3ab - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
kib
04661caa35 Reduce the acquisition of the vnode interlock in the ffs_read() and
ffs_extread() when setting the IN_ACCESS flag by checking whether the
IN_ACCESS is already set. The possible race there is admissible.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:33:00 +00:00
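A minimal sketch of the check-before-lock pattern this commit describes. The struct, the flag value, and the lock stubs are all hypothetical stand-ins: the real code uses the vnode interlock and the inode's flag word, and the stubs here only count acquisitions so the effect is visible.

```c
#include <assert.h>
#include <stddef.h>

#define IN_ACCESS 0x0001    /* illustrative value, not the kernel's */

struct fake_inode {
    int i_flag;         /* flag word normally guarded by the interlock */
    int lock_count;     /* number of simulated interlock acquisitions */
};

/* Stand-ins for taking and dropping the vnode interlock. */
static void
fake_interlock(struct fake_inode *ip)
{
    ip->lock_count++;
}

static void
fake_interunlock(struct fake_inode *ip)
{
    (void)ip;
}

static void
mark_accessed(struct fake_inode *ip)
{
    assert(ip != NULL);
    /*
     * Unlocked read: if the flag is already set there is nothing to
     * do.  A racing setter at worst makes us take the lock once more.
     */
    if ((ip->i_flag & IN_ACCESS) != 0)
        return;
    fake_interlock(ip);
    ip->i_flag |= IN_ACCESS;
    fake_interunlock(ip);
}
```

Calling mark_accessed() twice on the same object takes the simulated lock only on the first call, which is the saving the commit is after.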
jeff
46f09d5bc3 - Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
 - Reflect these changes in the proc.h documentation and consumers throughout
   the kernel.  This is a substantial reduction in locking cost for these
   fields and was made possible by recent changes to threading support.
2008-03-19 06:19:01 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
cokane
1dbd762cae Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE
callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous
items on the http://wiki.freebsd.org/SMPTODO list.

Reviewed by:	imp, obrien, jhb
MFC after:	1 week
2008-03-13 20:15:48 +00:00
emaste
747c71e58b Remove include of opt_quota.h; as of revision 1.205 there is no longer
any #ifdef QUOTA conditional code.
2008-03-10 18:44:07 +00:00
kib
e378ea9934 Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs.
It is normally initialized by ffs_statfs() after ffs_mount finished.

The extattr autostart code calls ufs_lookup(), which uses the value above
to iterate over the directory blocks; see the bmask initialization in
ufs_lookup() and ufsdirhash. A filesystem whose root directory spans
more than one block would result in reading random kernel memory.

PR:	kern/120781
Test case provided by:	rwatson
MFC after:	1 week
2008-03-05 16:34:03 +00:00
rwatson
48aef5a3f0 Continue on-going campaign to replace lockmgr locks with sx locks where
the specific semantics of lockmgr aren't required: update UFS1 extended
attributes to protect its data structures using an sx lock.

While here, update comments on lock granularity.

MFC after:	2 weeks
2008-03-04 12:50:11 +00:00
rwatson
96767d9090 Move setting of MNTK_MPSAFE flag before UFS1 extended attribute
auto-start so that the flag is set before we start performing I/O
in the auto-start routine.

MFC after:	2 weeks
Suggested by:	kib
2008-03-04 12:10:03 +00:00
rwatson
ba77105862 Don't auto-start or allow extattrctl for UFS2 file systems, as UFS2 has
native extended attributes.  This didn't interfere with the operation of
UFS2 extended attributes, but the code shouldn't be running for UFS2.

MFC after:	2 weeks
2008-03-02 22:52:14 +00:00
keramida
41eb7744b1 Minor typo nit. 2008-02-25 19:31:44 +00:00
attilio
4014b55830 Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-25 18:45:57 +00:00
attilio
0d54671a48 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
attilio
265cb5fb91 - Introduce lockmgr_args() in the lockmgr space. This function performs
the same operation as lockmgr() but accepts a custom wmesg, prio and
  timo for the particular lock instance, overriding default values
  lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is no longer used in the lockmgr namespace

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-15 21:04:36 +00:00
attilio
7213f4c32b Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
  always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is really bogus and not currently used.
  Hopefully this will soon be replaced by something suitable.
- Remove the prototype for dumplockinfo() as the function is no longer
  present

Additionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

The KPI is heavily broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo
2008-01-24 12:34:30 +00:00
attilio
caa2ca048b - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
  locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
  VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe BUF_REFCNT() and leaves the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well, as part of the lockmgr() cleanup.

The KPI is, obviously, broken, so further commits will update manpages
and the FreeBSD version.

Tested by: kris (on UFS and NFS)
2008-01-19 17:36:23 +00:00
attilio
71b7824213 VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument that is always curthread.
Remove the unneeded extra argument and pass curthread explicitly to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will be committed later.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
attilio
18d0a0dd51 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI is, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, it is worth saying that the next commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
kib
149cd5b092 The ffs_balloc_ufsX() routines, when recovering from a failed
allocation, free the indirect blocks before clearing the disk pointers,
which could lead to soft updates inconsistencies if the machine or
disk crashes at the wrong time.

Rearrange the recovery code to do the ffs_blkfree() after the second
ffs_syncvnode(), which clears the pointer chain.

Proposed and reviewed by:	tegge
Tested by:	Peter Holm
MFC after:	3 weeks
2008-01-03 12:28:57 +00:00
obrien
51625bf288 style(9) 2008-01-02 01:19:17 +00:00
kib
c9fb56ffc8 The ffs_balloc() routines, when allocating the indirect blocks for
the inode, do the rollback in case the allocation failed (due to
insufficient free space or quota limits). But the code leaves the
buffers corresponding to the indirect blocks on the vnode bufobj list.
This causes several assertions (for instance, "ffs_truncate3"
in ffs_truncate()) to fail, and could result in an indirect block
aliasing problem, such as writing the content of such blocks to a random
disk location.

Remove the buffers from the bufobj properly.

Reported and tested by:	Peter Holm
Reviewed by:	tegge
MFC after:	3 weeks
2007-12-29 13:31:27 +00:00
kensmith
66cb6fd44f Fix a broken check that recently became more annoying because it now
gets enabled when INVARIANTS is on instead of DIAGNOSTIC (which apparently
nobody uses).  From Tor's description:

  This happens when the block range spans two block maps, the first in the
  inode (mapping up to NDADDR direct blocks) and the second being the first
  indirect block.  The current check assumes that both block maps are
  indirect blocks.

Work done by:	tegge
Tested by:	kris, kensmith
2007-12-01 13:12:43 +00:00
ru
1f9b2f54c8 Fix build without INVARIANTS and update a comment to match
a change made in previous revision.
2007-11-09 11:04:36 +00:00
obrien
1ae16b4e64 Turn most ffs 'DIAGNOSTIC's into INVARIANTS. 2007-11-08 17:21:51 +00:00
rwatson
60570a92bf Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
julian
51d643caa6 Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0  so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
alfred
3a60df401c Get rid of qaddr_t.
Requested by: bde
2007-10-16 10:54:55 +00:00
bz
621c3a5b99 Fix a DIV0 in case a large value for fs_avgfilesize or fs_avgfpdir
is given (with newfs or tunefs) and dirsize overflows.

In case dirsize is <= 0 because of an overflow set maxcontigdirs
to 0 so it will be 1 later. This is what would happen for large
fs_avgfilesize. [1]

Identified with help from:	roberto, pjd
Submitted by:			pjd [1]
Approved by:			re (rwatson)
MFC after:			8 days
2007-09-10 14:12:29 +00:00
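The guard described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: the function name max_contig_dirs() is hypothetical, the product is computed in 64 bits so the overflow can be detected without relying on signed wraparound (a different technique from the sign check the commit describes), and the upper clamp of 255 is an illustrative choice.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: derive a "maximum contiguous directories" value
 * from the average file size, the average files per directory and the
 * free space, never dividing by a zero or overflowed directory size.
 */
static int
max_contig_dirs(int32_t avgfilesize, int32_t avgfpdir, int64_t free_bytes)
{
    int64_t dirsize, n;

    assert(free_bytes >= 0);
    dirsize = (int64_t)avgfilesize * (int64_t)avgfpdir;
    if (dirsize <= 0 || dirsize > INT32_MAX) {
        /* Would not fit a 32-bit dirsize: skip the division. */
        n = 0;
    } else {
        n = free_bytes / dirsize;
        if (n > 255)
            n = 255;    /* illustrative upper clamp */
    }
    if (n == 0)
        n = 1;          /* "so it will be 1 later" */
    return ((int)n);
}
```

With very large tunables the function returns 1 instead of dividing by a bogus size, which mirrors the behavior the commit message describes for a large fs_avgfilesize.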
rodrigc
85690ea861 Perform range check before allocating memory when reading
extended attributes.

Reviewed by:	kib
Approved by:	re (hrs)
PR:		114389
2007-07-13 18:51:08 +00:00
peter
5c6fefddf1 Fix an annoying pointer/int cast warning that shows up on 64 bit systems.
Approved by:  re
2007-07-02 01:31:43 +00:00
kib
2f486f25b6 Fix livelock that could occur when snapshotting UFS with quotas, where
some quota limit was exceeded. The sequence of UFS_VALLOC()/UFS_VFREE()
calls there could cause an inodeblock to have both freefile and inodedep
dependencies without any inode in the block being marked for write.
Then, softdep_check_suspend() would return EAGAIN forever.

Force write of inodeblock with allocated freefile softdependency by
setting IN_MODIFIED flag in softdep_freefile and unconditionally calling
UFS_UPDATE() in ufs_reclaim.

Reported by:	kris
Debug help and tested by: 	Peter Holm
Approved by:	re (kensmith)
MFC after:	3 weeks
2007-06-22 13:22:37 +00:00
rwatson
00b02345d4 Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
jeff
91d1501790 Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
   synchronization.
 - Use the per-process spinlock rather than the sched_lock for per-process
   scheduling synchronization.

Tested by:      kris, current@
Tested on:      i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
2007-06-05 00:00:57 +00:00
kib
17260ba6f1 Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.

Noted by:	rwatson
Pointy hat to:	kib
2007-06-01 14:33:11 +00:00
jeff
a7a8bac81f - Move rusage from being per-process in struct pstats to per-thread in
td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  Reads proceed without locks.
 - Aggregate exiting threads rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch() to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to its exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
kib
f13486a222 Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by:	jhb
Reviewed by:	daichi (unionfs)
Approved by:	re (kensmith)
2007-05-31 11:51:53 +00:00
pjd
ee9e45ce7d - Remove unnecessary vnode internal locking - v_vflag is protected by vnode's
lock (not vnode's interlock).
- Simplify code a bit.
2007-05-28 00:28:15 +00:00
pjd
1e693d6e52 Eliminate VI_LOCK()/VI_UNLOCK() pair from getattr and close code paths.
It's hard to measure performance improvement on my test machine, but the
change won't degrade performance for sure. I can measure slight improvement
for debugging kernel and it can also be a win for machines where atomic
operation is more expensive.

Reviewed by:	kib
2007-05-23 11:06:09 +00:00
kib
162fa8dc6d Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no longer generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
2007-05-18 13:02:13 +00:00
thompsa
25f570bc48 Add a newline to the printf message. 2007-05-03 22:39:52 +00:00
kib
0f0ec54312 Fix the NAMEI zone leak when snapshot was successfully created.
Reported and tested by:	Peter Holm
MFC after:		2 weeks
2007-04-10 09:31:42 +00:00
kib
859db1e740 Recalculate the NEWBLOCK flag for pagedep structure after the softdep
lock is dropped, since pagedep may be already processed and deallocated.

Found and tested by:	kris
MFC after:		2 weeks
2007-04-10 09:30:41 +00:00
kib
61f6e27c62 When LK_NOWAIT is passed as argument to process_worklist_item(), this
does not prevent handle_workitem_remove() from recursing into a blocking
version. Add the dirrem to worklist instead of processing it now if this
is the case.

Reported and tested by:	kris
Submitted by:		tegge
MFC after:		2 weeks
2007-04-10 09:28:17 +00:00
delphij
5a4b35079b Use *_EMPTY macros when appropriate. 2007-04-04 07:29:53 +00:00
kib
951894c128 Revert rev. 1.205. Replace unconditional acquisition of Giant when QUOTAS are
defined with VFS_LOCK_GIANT(NULL) call.
This shall fix softdep operation when mpsafe_vfs = 0.

Reported and tested by:	kris
Submitted by:	tegge
MFC after:	1 week
2007-03-29 08:26:04 +00:00
kib
7f02b9589e Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no longer
necessary Giant acquisitions in softdep processing code.

Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)
2007-03-20 10:51:45 +00:00
brian
c235d1165c When we write extended attributes, assert that the inode hasn't
already been deleted.  The assertion is important to show that
we won't end up accounting for extended attribute blocks (using
fs_pendingblocks) in our subsequent call to fs_alloc().

Agreed verbally by: mckusick

MFC after:	3 weeks
2007-03-19 18:51:02 +00:00
kib
3f7d43feef Implement fine-grained locking for UFS quotas.
Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.

i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.

Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)

This work would not have been possible without the enormous amount of help
given by Tor Egge and Peter Holm. Tor reviewed each version of the patch,
pointed out numerous errors and provided invaluable suggestions. Peter did
tireless testing of the patch as it was developed.
2007-03-14 08:54:08 +00:00
kib
104c10948a Call getinoquota() before allocating new block for the directory to properly
account for block allocation.

Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)
2007-03-14 08:50:27 +00:00
kib
5db0b75a18 Remove unneeded getinoquota() call in the ufs_access().
Tested by:	Peter Holm
Reviewed by:	tegge
Approved by:	re (kensmith)
2007-03-14 08:48:57 +00:00
tegge
214bc5723c Make insmntque() externally visible and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnodes belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
mckusick
e5953785d0 Move macros describing extended attributes in UFS from
<sys/extattr.h> to <ufs/ufs/extattr.h>. Move description
of extended attributes in UFS from man9/extattr.9 to
man5/fs.5.

Note that restore will not compile until <sys/extattr.h>
and <ufs/ufs/extattr.h> have been updated.

Suggested by:	Robert Watson
2007-03-06 08:13:21 +00:00
pjd
d38fdb51a0 Fix build breakage. 2007-03-01 23:14:46 +00:00
pjd
23ac3fc28a Change:
"... try to use VADMIN in preference to VADMIN ..."
To:
"... try to use VADMIN in preference to VWRITE ..."
2007-03-01 21:44:08 +00:00
pjd
e544923453 Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better
describe the privilege.

OK'ed by:	rwatson
2007-03-01 20:47:42 +00:00
pjd
9558665f1e Avoid checking for privileges if there is no need to.
Discussed with:	rwatson
2007-03-01 20:38:24 +00:00
brian
c3843b2cca Account for di_blocks allocations when IN_SPACECOUNTED is set in an
inode's i_flag.

It's possible that after ufs_inactive() calls softdep_releasefile(),
i_nlink stays >0 for a considerable amount of time (> 60 seconds here).
During this period, any ffs allocation routines that alter di_blocks
must also account for the blocks in the filesystem's fs_pendingblocks
value.

This change fixes an eventual df/du discrepancy that will happen as
the result of fs_pendingblocks being reduced to <0.

The only manifestation of this that people may recognise is the
following message on boot:

    /somefs: update error: blocks -N files M

at which point the negative pending block count is adjusted to zero.

Reviewed by:	tegge
MFC after:	3 weeks
2007-02-23 20:23:35 +00:00
mckusick
96503737f7 The functions that set and delete external attributes must check
that the filesystem is not mounted read-only before proceeding.

Reported by: Ryan Beasley <ryanb@FreeBSD.org>
MFC after: 1 week
2007-02-21 08:50:06 +00:00
rwatson
d298e8c0c2 Rename three quota privileges from the UFS privilege namespace to the
VFS privilege namespace: exceedquota, getquota, and setquota.  Leave
UFS-specific quota configuration privileges in the UFS name space.

This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.
2007-02-19 13:33:10 +00:00
rwatson
58e926bc94 Limit quota privileges in jail to PRIV_UFS_GETQUOTA and
PRIV_UFS_SETQUOTA.
2007-02-19 13:26:39 +00:00
mckusick
368de56c4b This README file is obsolete. The cited problems were fixed long ago
and the code is installed by default so no longer requires action by
the administrator to be included.
2007-02-17 08:25:43 +00:00
pjd
cb2d7c85a8 Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by:	mckusick
Discussed with:	many (on IRC)
Tested with:	ufs, msdosfs, cd9660, nullfs and zfs
2007-02-15 22:08:35 +00:00
kib
80c0e5fb98 Style(9). 2007-02-15 09:24:58 +00:00
kib
08a6b49351 Remove unneeded acquisition of the mount interlock around reading of
mnt_kern_flags in ufs_itimes().

Suggested by:	ssouhlal
Confirmed by:	tegge
MFC after:	2 weeks
2007-02-08 09:47:19 +00:00
tegge
06da132002 Call pbgetvp() and pbrelvp() instead of setting b_vp directly.
PR:		kern/108151
2007-02-04 23:42:02 +00:00
mpp
2ba513e47f If quotacheck or edquota reset the block or inode grace time for
a user or group, when the kernel first sees this, it will update
the grace time value.  However, it never flags the quota as modified
and the updated value never makes it to the quota data file unless
the user actually makes some other change that would write the
data out.

Fixed to flag the quota as modified if the soft limit has actually
been reached and should be now enforced.
2007-02-04 06:46:57 +00:00
mpp
261dbc8078 Prevent quotactl calls that pass in an id of -1 from incorrectly
using the caller's UID instead of the GID when performing group
operations.  This could allow users to determine group quota
information for groups they are not a member of in some cases.

Rename the "uid" parameter in ufs_quotactl to "id" to better show
that it is used for more than just the uid, and to be more in line
with the naming conventions in the other quota routines.

PR:	kern/33940
2007-02-01 02:13:53 +00:00
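A small sketch of the id resolution this fix is about: an id of -1 means "the caller", and which credential that maps to must depend on the quota type. The struct, the QTYPE_* constants and resolve_quota_id() are hypothetical names for illustration only. The rejection of other negative ids is modeled on the neighboring "Disallow negative UIDs" commit and is likewise an assumption about the details.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define QTYPE_USER  0   /* hypothetical stand-in for the user quota type */
#define QTYPE_GROUP 1   /* hypothetical stand-in for the group quota type */

struct caller_cred {
    uint32_t uid;
    uint32_t gid;
};

/*
 * Resolve the id passed to a quotactl-like call.  Returns 0 and stores
 * the effective id in *out, or an errno value on failure.
 */
static int
resolve_quota_id(int64_t id, int type, const struct caller_cred *cr,
    uint32_t *out)
{
    assert(cr != NULL && out != NULL);
    if (id == -1) {
        switch (type) {
        case QTYPE_USER:
            *out = cr->uid;
            return (0);
        case QTYPE_GROUP:
            *out = cr->gid;  /* the bug was using the uid here */
            return (0);
        default:
            return (EINVAL);
        }
    }
    if (id < 0 || id > UINT32_MAX)
        return (EINVAL);
    *out = (uint32_t)id;
    return (0);
}
```

For a caller with uid 1001 and gid 20, a group query with id -1 resolves to 20 rather than 1001.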
mpp
3cdb06d461 Disallow negative UIDs when processing quotactl options. 2007-02-01 01:01:56 +00:00
kib
fdd50404d1 Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquisition.

Avoid dealing with buffers in bdwrite() that are on the other side of
the snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
delphij
49f7e5db02 Fix build. chkdquot() should not return anything. 2007-01-20 13:54:28 +00:00
mpp
0f6ed07b89 Quota system cleanup.
1) Do not do quota accounting for the actual quota data files
   or for file system snapshot files ("system" files).  This
   prevents a deadlock descibed in PR kern/30958 if the kernel
   ever has to grow the quota file.  Snapshot files were already
   exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly
   write the quota information to the data file at a truncated
   value for a uint_t32 id value.  The incorrect cast caused quota
   files in this case to be around 4GB in size, with the correct cast
   they can now be 131GB in size.  Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account
   for them.  This prevents the quota files from becoming 131GB in
   size and causing quotacheck to run forever at bootup.  This could
   also cause the kernel to try and expand the quota file, which might
   deadlock due to the issue in #1.  kern/30958 and kern/38156
   (and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the
   size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an
   int, like it used to be before the common routine was split up
   into 2 different routines to increase / decrease the i-node in-use
   count.  Prevents an underflow on the i-node count.  Related
   to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is
   full and the write was denied due to that fact.  PR kern/89247.

Some of these changes require an updated quotacheck to prevent
the creation of huge (131GB) quota data files (item #3).

#1/#4 probably fixes a lot of the random hangs when quotas are enabled,
possibly some of the jail hangs.
2007-01-20 11:58:32 +00:00
mpp
be61542c54 Fix a spelling error. heirarchy -> hierarchy.
Obtained from:	OpenBSD
2007-01-16 19:40:25 +00:00
mpp
5c20abfae9 Fix a spelling error in some comments. heirarchy -> hierarchy.
Obtained from: OpenBSD
2007-01-16 19:35:43 +00:00
rwatson
7a85fd85df Canonicalize copyright: use a date range rather than comma-delimited
list.

MFC after:	3 days
2007-01-08 17:55:32 +00:00
kmacy
0c00ea16db Change vop_lock handling to allow tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)
2006-11-13 05:51:22 +00:00
rwatson
10d0d9cf47 Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
kib
1776bc8845 Acquire Giant in softdep_flush() for clear_remove() and clear_inodedeps()
processing when QUOTA is set.

Reported and tested by:	Peter Holm
Reviewed by:	tegge
MFC after:	3 days
2006-11-01 13:48:44 +00:00
pjd
036e929548 Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
  number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
  total number of unreferenced (orphaned) inodes.
- When a file or directory is orphaned (last reference is removed, but
  the object is still open), increase the fs_unrefs and cg_unrefs fields,
  which is a hint for fsck as to which cylinder groups to search for such
  (orphaned) objects.
- When the file is last closed, decrease the {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by:	home.pl
2006-10-31 21:48:54 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
kib
dc1147adb0 Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.

Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.

Tested by:	Peter Holm
Reviewed by:	tegge, bde
Approved by:	pjd (mentor)
MFC after:	3 weeks
2006-10-10 09:20:54 +00:00
tegge
4877fa61ee Correct the check for when IO_SYNC should be set for a filesystem
not using softupdates when truncating a directory to zero length.

Discussed with:	bde
2006-10-02 02:08:31 +00:00
tegge
688c3982c5 Protect change to bo_flag by holding the bufobj mutex. 2006-09-26 04:21:20 +00:00
tegge
89ea8a9b1b Reduce fluctuations of mnt_flag to allow unlocked readers to get a
slightly more consistent view.
2006-09-26 04:20:09 +00:00
tegge
34ea634be7 Don't restore MNT_QUOTA bit in mnt_flag after snapshot creation,
closing a race between nmount() and quotactl().
2006-09-26 04:19:11 +00:00
tegge
431fd40aef Increase mnt_noasync once in softdep_mount() to disallow async io,
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount().  Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.
2006-09-26 04:17:17 +00:00
tegge
f42473d76b Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC.  Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
2006-09-26 04:15:59 +00:00
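
The counter-plus-derived-flag scheme described in f42473d76b can be sketched
roughly as follows. This is a minimal user-space illustration, not the kernel
code: the struct, the SK_ constants and the sk_ helpers are hypothetical names
invented here, the flag values are arbitrary, and the locking that the real
code needs is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real flags; the values are arbitrary. */
#define SK_MNT_ASYNC   0x1u  /* the user asked for async I/O */
#define SK_MNTK_ASYNC  0x2u  /* derived: async I/O is allowed right now */

struct sk_mount {
    unsigned flag;       /* user-visible flags (models mnt_flag) */
    unsigned kern_flag;  /* kernel-internal flags (models mnt_kern_flag) */
    int noasync;         /* number of active "no async" requests */
};

/* Recompute the derived flag from the user's request and the counter. */
static void sk_update_async(struct sk_mount *mp)
{
    if ((mp->flag & SK_MNT_ASYNC) != 0 && mp->noasync == 0)
        mp->kern_flag |= SK_MNTK_ASYNC;
    else
        mp->kern_flag &= ~SK_MNTK_ASYNC;
}

/* A caller that needs synchronous behaviour brackets its work with these. */
static void sk_noasync_enter(struct sk_mount *mp)
{
    mp->noasync++;
    sk_update_async(mp);
}

static void sk_noasync_exit(struct sk_mount *mp)
{
    assert(mp->noasync > 0);
    mp->noasync--;
    sk_update_async(mp);
}

/* I/O paths test the derived flag, never SK_MNT_ASYNC directly. */
static bool sk_may_async(const struct sk_mount *mp)
{
    return (mp->kern_flag & SK_MNTK_ASYNC) != 0;
}
```

Because the user's SK_MNT_ASYNC bit is never cleared, interleaved enter/exit
pairs cannot lose it; async I/O simply resumes once the counter drops back to
zero.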
tegge
83154f853d Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
kib
edd4f4618e Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(),
the switch on worklist type contains two for() loops, for D_INDIRDEP and
D_PAGEDEP. On error, these loops are exited by break, where the switch
itself should actually be left. Use goto instead of break to reach the error
handling code.

Reported by:	Peter Holm
Reviewed by:	tegge
Approved by:	pjd (mentor)
MFC after:	2 weeks
2006-09-20 07:49:28 +00:00
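
The control-flow pitfall behind edd4f4618e can be shown with a small
stand-alone sketch. It is not the softdep code: the item layout, sk_flush()
and sk_sync() are hypothetical, and the sketch assumes the inner loop is
followed by code that carries on with the next work item.

```c
#include <stddef.h>

enum sk_type { SK_INDIRDEP, SK_OTHER };

struct sk_item {
    enum sk_type type;
    int blocks[2];
};

/* Hypothetical flush step: a negative block number counts as an I/O error. */
static int sk_flush(int block)
{
    return block < 0 ? -1 : 0;
}

/*
 * Walk the items. With use_goto == 0 an error only breaks out of the inner
 * for loop, so processing carries on with the next item and the error can be
 * overwritten. With use_goto != 0 the error goes straight to the exit path.
 * *visited reports how many items were looked at.
 */
static int sk_sync(const struct sk_item *items, size_t n, int use_goto,
    size_t *visited)
{
    int error = 0;

    *visited = 0;
    for (size_t i = 0; i < n; i++) {
        (*visited)++;
        switch (items[i].type) {
        case SK_INDIRDEP:
            for (size_t j = 0; j < 2; j++) {
                error = sk_flush(items[i].blocks[j]);
                if (error != 0) {
                    if (use_goto)
                        goto out;  /* leaves the for, the switch and the outer loop */
                    break;         /* leaves only the inner for loop */
                }
            }
            continue;  /* also reached after the plain break above */
        case SK_OTHER:
            break;
        }
    }
out:
    return error;
}
```

In the break variant a later successful item resets error to zero, so the
caller never learns about the failure; the goto variant stops at the first
failing item and returns its error.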
rwatson
8b3f7ca1ce Declare security and security.bsd sysctl hierarchies in sysctl.h along
with other commonly used sysctl name spaces, rather than declaring them
all over the place.

MFC after:	1 month
Sponsored by:	nCircle Network Security, Inc.
2006-09-17 20:00:36 +00:00