Commit Graph

691 Commits

Author SHA1 Message Date
Poul-Henning Kamp
5d9d81e7ea Put the I/O block size in bufobj->bo_bsize.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open().  The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.
2004-10-26 07:39:12 +00:00
Poul-Henning Kamp
156cb26583 Loose the v_dirty* and v_clean* alias macros.
Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().
2004-10-25 09:14:03 +00:00
Poul-Henning Kamp
ee1d0eb330 Remove vnode->v_bsize. This was a dead-end. 2004-10-25 07:50:59 +00:00
Poul-Henning Kamp
4dcd0ac4cf Collapse vnode->v_object and buf->b_object into bufobj->bo_object. 2004-10-25 06:02:57 +00:00
Poul-Henning Kamp
b792bebeea Move the buffer method vector (buf->b_op) to the bufobj.
Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
2004-10-24 20:03:41 +00:00
Robert Watson
b0e86f6ac2 When MAC is enabled, warn if getnewvnode() is asked to produce a vnode
without a mountpoint.  In this scenario, there's no useful source for
a label on the vnode, since we can't query the mountpoint for the
labeling strategy or default label.
2004-10-22 11:04:58 +00:00
Poul-Henning Kamp
ff7c5a4880 Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it.  Here were those hacks that I have curs'd I know not how
oft.  Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.
2004-10-22 09:59:37 +00:00
Poul-Henning Kamp
494eb176e7 Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.
Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.
2004-10-22 08:47:20 +00:00
Poul-Henning Kamp
a76d8f4ec9 Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj.  Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
2004-10-21 15:53:54 +00:00
Poul-Henning Kamp
1bca607b9f Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.
Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.

Move v_numoutput from vnode to bufobj.  Add renaming macro to
postpone code sweep.
2004-10-21 14:42:31 +00:00
Poul-Henning Kamp
67647b2312 Polish vtruncbuf() to improve readability and style a bit. 2004-10-21 14:13:54 +00:00
Poul-Henning Kamp
e163395619 Simplify buf_vlist_remove().
Now that we have encapsulated the splaytree related information
into a structure we can eliminate the half of this function.
2004-10-21 13:48:50 +00:00
Greg Lehey
57259f2864 vtryrecycle: Don't rely on type VBAD alone to mean that we don't need
to clean the vnode.  If v_data is set, we still need to
	     clean it.  This code change should catch all incidents of
	     the previous commit (INVARIANTS only).
2004-10-06 02:09:59 +00:00
Greg Lehey
f2154b33d2 getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning.
Discussion: this panic (or waning) only occurs when the kernel is
  compiled with INVARIANTS.  Otherwise the problem (which means that
  the vp->v_data field isn't NULL, and represents a coding error and
  possibly a memory leak) is silently ignored by setting it to NULL
  later on.

  Panicking here isn't very helpful: by this time, we can only find
  the symptoms.  The panic occurs long after the reason for "not
  cleaning" has been forgotten; in the case in point, it was the
  result of severe file system corruption which left the v_type field
  set to VBAD.  That issue will be addressed by a separate commit.
2004-10-06 02:06:11 +00:00
Poul-Henning Kamp
ba2851254f Fix a LOR relating to freeing cdevs. 2004-10-01 06:33:39 +00:00
Poul-Henning Kamp
70526ca6a5 Hold dev_lock and check for NULL devsw pointer when we determine
if a vnode is a disk.
2004-09-24 06:16:08 +00:00
Poul-Henning Kamp
a0e78d2eb0 Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.

Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.

Replace spechash_mtx use with dev_lock()/dev_unlock().
2004-09-23 07:17:41 +00:00
Poul-Henning Kamp
08dbd671ff Remove unused B_WRITEINPROG flag 2004-09-15 21:49:22 +00:00
Poul-Henning Kamp
1affa3adc8 Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.
2004-09-07 09:17:05 +00:00
Don Lewis
8ded654028 Don't attempt to trigger the syncer thread final sync code in the
shutdown_pre_sync state if the RB_NOSYNC flag is set.  This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.

This is a MFC candidate for RELENG_5.

MFC after:	3 days
2004-08-20 19:21:47 +00:00
David E. O'Brien
78c37b0de8 s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g 2004-08-16 08:33:37 +00:00
John-Mark Gurney
ad3b9257c2 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
Robert Watson
87e83e7d4c In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However,
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL.  Otherwise,
accept that another thread got there first, and release the local storage.

Discussed with:	jmg
2004-08-11 01:27:53 +00:00
Nate Lawson
c8c216d558 Skip the syncing disks loop if there are no dirty buffers. Remove a
variable used to flag the initial printf.

Submitted by:	truckman (earlier version)
2004-08-10 01:32:05 +00:00
David E. O'Brien
64298d52cc Put a cap on the auto-tuning of kern.maxvnodes.
Cap value chosen by:	scottl
2004-08-02 21:52:43 +00:00
Nate Lawson
b1c8139147 Minor message cleanup. 2004-07-30 01:30:05 +00:00
Poul-Henning Kamp
3dfe213e61 Convert the vfsconf list to a TAILQ.
Introduce vfs_byname() function to find things on it.

Staticize vfs_nmount() function under the name vfs_donmount().

Various cleanups.
2004-07-27 22:32:01 +00:00
Colin Percival
56f21b9d74 Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with:	rwatson, scottl
Requested by:	jhb
2004-07-26 07:24:04 +00:00
Poul-Henning Kamp
cf95b5c381 Eliminate unused second argument to reassignbuf() and simplify it
accordingly.
2004-07-25 21:24:23 +00:00
Alfred Perlstein
05656b6e2b put several of the options for DEBUG_VFS_LOCKS under control of sysctls. 2004-07-21 07:13:14 +00:00
Alfred Perlstein
bb5faea34f Cleanup shutdown output. 2004-07-15 08:01:00 +00:00
Alfred Perlstein
da6303bacc Tidy up system shutdown. 2004-07-15 04:29:48 +00:00
Alfred Perlstein
f257b7a54b Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
2004-07-12 08:14:09 +00:00
Alfred Perlstein
7ae8ce5df1 Dump the actual bad values when this assertion is tripped. 2004-07-12 04:13:38 +00:00
Marcel Moolenaar
32240d082c Update for the KDB framework:
o  Call kdb_enter() instead of Debugger().
2004-07-10 21:47:53 +00:00
Alfred Perlstein
057589c485 fixup sysctl by fsid node 2004-07-08 06:11:36 +00:00
Alfred Perlstein
ea0104b032 Introduce vfs_suser(), used to test if a user should have special privs
for a mount.
2004-07-06 09:37:43 +00:00
Alfred Perlstein
c713aaaeca NFS mobility PHASE I, II & III (phase VI, and V pending):
Rebind the client socket when we experience a timeout.  This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
2004-07-06 09:12:03 +00:00
Don Lewis
27875d9c88 Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by:	phk
2004-07-05 21:32:01 +00:00
Don Lewis
faf1b66d1d Rework syncer termination code:
Speed up the syncer when shutting down by sleeping for a shorter
    period of time instead of cranking up rushjob and using the
    normal one second sleep.

    Skip empty worklist slots when shutting down to avoid lengthy
    intervals of inactivity.

    Give I/O more time to complete between steps by not speeding the
    syncer quite as much.

    Terminate the syncer after one full pass through the worklist
    plus one second with the worklist containing nothing but syncer
    vnodes.

    Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
2004-07-05 01:07:33 +00:00
Poul-Henning Kamp
c555963fd1 Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE. 2004-07-04 22:33:22 +00:00
Alfred Perlstein
2d1dca73ee Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.
2004-07-04 20:21:58 +00:00
Poul-Henning Kamp
7f6599fec6 Make the last commit handle non-phk root devices better. 2004-07-04 19:42:25 +00:00
Poul-Henning Kamp
1cbb1e02c4 Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.
2004-07-04 12:49:04 +00:00
Alfred Perlstein
94ed9c8af5 Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace.  Currently only mount and unmount
of filesystems are signalled.  Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
2004-07-04 10:52:54 +00:00
Alfred Perlstein
903ac7c219 Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
 insmntqueue(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.
2004-07-04 10:19:15 +00:00
Poul-Henning Kamp
e3c5a7a4dd When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint.  If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

		MNT_ILOCK(mp);
	loop:
		for (vp = TAILQ_FIRST(&mp...);
		    (vp = nvp) != NULL;
		    nvp = TAILQ_NEXT(vp,...)) {
			if (vp->v_mount != mp)
				goto loop;
			MNT_IUNLOCK(mp);
			...
			MNT_ILOCK(mp);
		}
		MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

	MNT_ILOCK(vp->v_mount);
	...
	TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
	...
	MNT_IUNLOCK(vp->v_mount);
	...
	vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go.  This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
2004-07-04 08:52:35 +00:00
Don Lewis
e06500dde5 When shutting down the syncer kernel thread, first tell it to run
faster and iterate to over its work list a few times in an attempt
to empty the work list before the syncer terminates.  This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time.  The downside is that the syncer takes noticeably longer to
terminate.

Tested by:	"Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by:	mckusick
2004-07-01 23:59:19 +00:00
Poul-Henning Kamp
f3732fd15b Second half of the dev_t cleanup.
The big lines are:
	NODEV -> NULL
	NOUDEV -> NODEV
	udev_t -> dev_t
	udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
2004-06-17 17:16:53 +00:00
Poul-Henning Kamp
89c9c53da0 Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.
2004-06-16 09:47:26 +00:00
Poul-Henning Kamp
170593a9b5 Remove a left over from userland buffer-cache access to disks. 2004-06-14 14:25:03 +00:00
Robert Watson
9e6127fe3b Assert Giant in vrele(). 2004-05-31 19:06:01 +00:00
Maxime Henrion
a0b5a67929 Put deprecated sysctl code inside BURN_BRIDGES. 2004-04-11 21:09:22 +00:00
Warner Losh
7f8a436ff2 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-05 21:03:37 +00:00
Peter Wemm
39d3505a30 Kill some XXXKSE's. vnlru/syncer are single threaded. 2004-03-29 22:45:33 +00:00
Poul-Henning Kamp
4d453ef101 Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
2004-03-11 18:02:36 +00:00
Poul-Henning Kamp
651b11eaf2 Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.
2004-03-11 16:33:11 +00:00
Alexander Kabaev
ff85a3f0e1 Always call vn_finished_write after vn_start_write was called. All
occurences of 'goto done' after vn_start_write invocation were cleaning
up incompletely.
2004-03-06 04:09:54 +00:00
John Baldwin
44f3b09204 Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
  and condition variables.  Having sleep qeueus in a hash table avoids
  having to allocate a queue head for each wait channel.  Thus, struct cv
  has shrunk down to just a single char * pointer now.  However, the
  hash table does not hold threads directly, but queue heads.  This means
  that once you have located a queue in the hash bucket, you no longer have
  to walk the rest of the hash chain looking for threads.  Instead, you have
  a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
  differentiates between cv's and sleep/wakeup.  For example, calls to
  abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
  Thus, the TDF_CVWAITQ flag is removed.  Also, calls to unsleep() and
  cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
  sleep's no longer inherently bump the priority.  Instead, this is soley
  a propery of msleep() which explicitly calls sched_prio() before
  blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used.  The
  associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
  dropped and replaced with a single explicit clearing of td_wchan.
  TD_SET_ONSLEEPQ() would really have only made sense if it had taken
  the wait channel and message as arguments anyway.  Now that that only
  happens in one place, a macro would be overkill.
2004-02-27 18:52:44 +00:00
Don Lewis
47934cef8f Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
Poul-Henning Kamp
ded67d0f77 Check for NODEV return from udev2dev() 2004-02-21 23:52:03 +00:00
Poul-Henning Kamp
cd690b60de Device megapatch 6/6:
This is what we came here for:  Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list.  When the refcount drops, free it.

Check that cdevsw->d_version is correct.  If not, set all methods
to the dead_*() methods to prevent entrance into driver.  Print
warning on console to this effect.  The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
2004-02-21 21:57:26 +00:00
Poul-Henning Kamp
816d62bbb9 Device megapatch 5/6:
Remove the unused second argument from udev2dev().

Convert all remaining users of makedev() to use udev2dev().  The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.

Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
2004-02-21 21:32:15 +00:00
Alexander Kabaev
580ddfa64b More style fixes.
Obtained from:	bde
2004-01-05 23:40:46 +00:00
Alexander Kabaev
b0fdf71656 style(9):
Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer that 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from:	bde (partly)
2004-01-05 19:04:29 +00:00
Alexander Kabaev
3ff1b7c23f Cosmetics: strip '\n' from a string passed to Debugger(). 2004-01-04 03:42:20 +00:00
Bruce Evans
9efe7d9d83 v_vxproc was a bogus name for a thread (pointer). 2003-12-28 09:12:56 +00:00
Jeff Roberson
958557e9c7 - In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.
Submitted by:	Stephan Uphoff <ups@stups.com>
2003-12-16 17:08:27 +00:00
Jeff Roberson
d85213669b - When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
   However, this is not sufficient.  If we vclean() the vnode the pages owned
   by the vnode are lost, potentially while buffers reference them.  Implement
   parts of vclean() seperately in vgonechrl() so that the pages and bufs
   associated with a device vnode are not destroyed while in use.
2003-12-16 17:05:05 +00:00
Jeff Roberson
a6c6a93c89 - Don't forget to unlock the vnode interlock in the LK_NOWAIT case.
Submitted by:	Stephan Uphoff <ups@stups.com>
Approved by:	re (rwatson)
2003-11-30 22:09:58 +00:00
Seigo Tanimura
512824f8f7 - Implement selwakeuppri() which allows raising the priority of a
thread being waken up.  The thread waken up can run at a priority as
  high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
  priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
  threads.  Used by selwakeuppri() if collision occurs.

Not objected in:	-arch, -current
2003-11-09 09:17:26 +00:00
Alexander Kabaev
ca430f2e92 Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff
2003-11-05 04:30:08 +00:00
Garrett Wollman
06cb76bde3 Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by:	the evil and rude OpenAFS cache manager code
2003-10-23 18:17:36 +00:00
Alan Cox
f2b1200d08 Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This
facilitates synchronization of the vm page's valid field using the
vm object's lock.)

Suggested by:	tegge
2003-10-20 18:24:38 +00:00
Poul-Henning Kamp
3da2d6a453 Simplify count_dev() 2003-10-17 11:56:48 +00:00
Poul-Henning Kamp
5108cd3652 Simplify vn_isdisk() a bit. 2003-10-12 14:04:39 +00:00
Jeff Roberson
7dd1328c13 - Fix a typo, I meant & and not |. This was causing lockups from the syncer
looping forever due to list corruption.

Solved by:	tegge
2003-10-11 21:50:45 +00:00
Jeff Roberson
bdcfcdecea - Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify
LK_RETRY either, we don't want this vnode if it turns into another.
 - Remove the code that checks the mount point after acquiring the lock
   we are guaranteed to either fail or get the vnode that we wanted.
2003-10-05 07:12:38 +00:00
Jeff Roberson
45503a37dd - Rename vcanrecycle() to vtryrecycle() to reflect its new role.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
   passed.  We won't vgonel if someone has either acquired a hold or usecount
   or started the vgone process elsewhere.  This is because we may have been
   removed from the free list while we were inspecting the vnode for
   recycling.
 - The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
   the same vnode.  To further reduce the likelyhood of this event, requeue
   the vnode on the tail of the list prior to calling vtryrecycle().  We can
   not actually remove the vnode from the list until we know that it's
   going to be recycled because other interlock holders may see the VI_FREE
   flag and try to remove it from the free list.
 - Kill a bogus XXX comment.  If XLOCK is set we shouldn't wait for it
   regardless of MNT_WAIT because the vnode does not actually belong to
   this filesystem.
2003-10-05 05:35:41 +00:00
Jeff Roberson
85311d4b59 - Don't cache_purge() in getnewvnode. It's done in vclean(). With this
purge, the purge in vclean, and the filesystems purge, we had 3 purges
   per vnode.
 - Move the insmntque(vp, 0) to vclean() so that we may remove it from the
   two vgone() functions and reduce the number of lock operations required.
2003-10-05 02:48:04 +00:00
Jeff Roberson
ce13b187e7 - Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine
whether or not the sync failed.  This could potentially get set between
   the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
   lead to the sync being delayed by an extra 30 seconds.  If we do not move
   the vnode it could cause an endless loop if it continues to fail to sync.
 - Use vhold and vdrop to stop the vnode from changing identities while we
   have it unlocked.  Other internal vfs lists are likely to follow this
   scheme.
2003-10-05 00:35:41 +00:00
Jeff Roberson
894fbf9769 - Move the xlock 'locking' code into vx_lock() and vx_unlock().
- Create a new function, vgonechrl(), which performs vgone for an in-use
   character device.  Move the code from vflush() that did this into
   vgonechrl().
 - Hold the xlock across the entirety of vgonel() and vgonechrl() so that
   at no point will an invalid vnode exist on any list without XLOCK set.
 - Move the xlock code out of vclean() now that it is in the vgone*()
   functions.
2003-10-05 00:02:41 +00:00
Jeff Roberson
6f4b0863e0 - In sched_sync() test our preconditions prior to dropping the sync_mtx.
This is so that we may grab the interlock while still holding the
   sync_mtx.  We have to VI_TRYLOCK() because in all other cases the lock
   order runs the other way.
 - If we don't meet any of the preconditions, reinsert the vp into the
   list for the next second.
 - We don't need to panic if we fail to sync here because each FSYNC
   function handles this case.  Removing this redundant code also
   simplifies locking.
2003-10-04 18:03:53 +00:00
Jeff Roberson
e4c49d2b50 - In a Giantless world, the vn_lock() in vcanrecycle() could legitimately
fail.  Remove the panic from that case and document why it might fail.
 - Document the reason for calling cache_purge() on a newly created vnode.
 - In insmntque() order the operations so that we can call mtx_unlock()
   one fewer times.  This makes the code somewhat clearer as well.
 - Add XXX comments in sched_sync() and vflush().
 - In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
   set.
 - In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
   call.  It's ok if we race here, the vinvalbuf will just do nothing.
 - Increase the scope of the lock in vgonel() to reduce the number of lock
   operations that are performed.
2003-10-04 15:10:40 +00:00
Jeff Roberson
51b575490c - In reassignbuf() don't unlock vp and lock newvp if they are the same.
Doing so creates a race where the buf is on neither list.
 - Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
   should.
 - Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
   really should not happen and is fast to check.
2003-09-20 00:21:48 +00:00
Jeff Roberson
6b6c163a37 - Remove spls(). The locking that has replaced them is in place and they
no longer serve as guidelines for future work.
2003-09-19 23:52:06 +00:00
Alexander Kabaev
aebbeee812 Eliminate one case of VI_UNLOCK followed by an immediate
VI_LOCK.
2003-09-19 19:13:54 +00:00
John Baldwin
8b149b5131 Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort.  In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by:	bde (kern_ktrace.c)
2003-08-07 15:04:27 +00:00
Poul-Henning Kamp
68f2d20b70 Revert stuff which accidentally ended up in the previous commit. 2003-07-22 10:36:36 +00:00
Poul-Henning Kamp
55d1d7034f Don't attempt to inline large functions mb_alloc() and mb_free(),
it more than doubles the text size of this file.

GCC has wisely ignored us on this previously
2003-07-22 10:24:41 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Poul-Henning Kamp
a62f80f8e0 Remove unused variable and now unbalanced call to splbio();
Found by:       FlexeLint
2003-05-31 20:09:01 +00:00
Alan Cox
2e05d89828 Make the maximum number of vnodes a function of both the physical memory
size and the kernel's heap size, specifically, vm_kmem_size.  This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects.  This is a conservative bound based upon recent
problem reports.  (In other words, a slight increase in this percentage
may be safe.)

Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same.  If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.

Desired by:	scottl
Tested by:	jake
Approved by:	re (rwatson,scottl)
2003-05-23 19:54:02 +00:00
Don Lewis
1e9bc9f889 Detect that a vnode has been reclaimed while vflush() was waiting to lock
the vnode and restart the loop.  Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock.  The vnode will no longer be on the list when the
loop is restarted.

Approved by:	re (rwatson)
2003-05-16 19:46:51 +00:00
Alan Cox
099e981aa1 Optimize the use of splay in gbincore(). During a "make buildworld" the
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.

Approved by:	re (jhb)
2003-05-13 04:36:02 +00:00
Robert Watson
1964fb9ba2 Remove bogus locking from DDB's "show lockedvnods" command: using
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console.  Replace the locking with a note that a lack of locking
means that DDB may get see inconsistent views of the mount and vnode
lists, which could also result in a panic.  More frequently,
though, this avoids a panic than causes it.

Discussed with ages ago:	bde
Approved by:			re (scottl)
2003-05-12 14:37:47 +00:00
Alan Cox
bff99f0d12 - Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
 - Restore the old behavior of vm_object_page_remove() when the end
   of the given range is zero.  Add a comment to vm_object_page_remove()
   regarding this behavior.

Reported by:	iedowse
2003-05-03 08:09:24 +00:00
Alan Cox
ebba1b25f9 Lock accesses to the vm_object's ref_count and resident_page_count. 2003-05-01 03:10:38 +00:00
Alan Cox
ecde4b3218 Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
   if start == end == 0 then all pages are removed.  Only one caller
   used this feature and that caller can trivially pass the object's
   size.
 - Assert that the vm_object is locked on entry; don't bother testing
   for a NULL vm_object.
 - Style: Fix lines that are longer than 80 characters.
2003-04-26 23:41:30 +00:00
Alan Cox
1ca5895341 - Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
 - Lock the vm_object when performing vm_object_pip_wait().
2003-04-26 18:33:18 +00:00
Alan Cox
b6e48e0372 - Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
   whether its caller has locked the vm_object.  (This is a temporary
   measure to bootstrap vm_object locking.)
2003-04-24 04:31:25 +00:00
Alan Cox
49281fbf68 Update locking around vm_object_page_remove() to use the new macros. 2003-04-18 16:39:03 +00:00
Alan Cox
e96c181d16 Use vm_object_pip_wait() rather than reimplementing it. 2003-04-13 05:10:44 +00:00
Tor Egge
6b08046175 Adjust the number of vnodes scanned by vlrureclaim() according to the
size of the vnode list.
2003-03-26 22:15:58 +00:00
Yaroslav Tykhiy
17ce5b94d6 We shouldn't assert that a vode is locked in vop_lock_post()
if VOP_LOCK() has failed.

Reviewed by:	jeff
2003-03-22 13:21:54 +00:00
Jeff Roberson
e99215a614 - Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been
possible for some time.
 - Lock the buf before accessing fields.  This should very rarely be locked.
 - Assert that B_DELWRI is set after we acquire the buf.  This should always
   be the case now.
2003-03-13 07:22:53 +00:00
Jeff Roberson
09f11da5a3 - Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite().  Previously the buf could
   have been written out by fsync before we acquired the buf lock if it
   weren't for giant.  The cluster_wbuild() handles this race properly but
   the single write at the end of vfs_bio_awrite() would not.
 - Modify flushbufqueues() so there is only one copy of the loop.  Pass a
   parameter in that says whether or not we should sync bufs with deps.
 - Call flushbufqueues() a second time and then break if we couldn't find
   any bufs without deps.
2003-03-13 07:19:23 +00:00
Alan Cox
09c80124a3 Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.
Discussed on:	arch@
2003-03-06 03:41:02 +00:00
Nate Lawson
99648386d3 Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
2003-03-03 19:15:40 +00:00
Jeff Roberson
491081fabf - Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally.  This is required to stop multiple bufs from being associated
   with a single lblkno.
2003-03-02 06:05:23 +00:00
Jeff Roberson
bff5362bf2 - gc USE_BUFHASH. The smp locking of the buf cache renders this useless. 2003-03-01 05:55:03 +00:00
Kirk McKusick
3a7053cb60 Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by:	Attila Nagy <bra@fsn.hu>
Sponsored by:   DARPA & NAI Labs.
2003-02-25 06:44:42 +00:00
Jeff Roberson
17661e5ac4 - Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
   these fields instead.
 - Hold the vnode interlock while locking bufs on the clean/dirty queues.
   This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
   BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by:	arch, mckusick
2003-02-25 03:37:48 +00:00
Poul-Henning Kamp
acb18acfec Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
2003-02-23 18:09:05 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Ian Dowse
6a1b2a22ef Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by:	mckusick (an older version of the patch)
2002-12-29 18:30:49 +00:00
Poul-Henning Kamp
371400cf2e Use a timeout of one second while we wait for the vnode washer,
this prevents a potential race and makes the system a little bit
less jerky under extreme loads.
2002-12-29 11:18:25 +00:00
Poul-Henning Kamp
851a87ea1a Vnodes pull in 800-900 bytes these days, all things counted, so we need
to treat desiredvnodes much more like a limit than as a vague concept.

On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.

If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode.  If we don't sleep
here we'll just race ahead and allocate yet a vnode which will never
get freed.

In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.
2002-12-29 10:39:05 +00:00
Poul-Henning Kamp
9f16282798 KASSERT that vop_revoke() gets a VCHR. 2002-12-28 22:27:14 +00:00
Alan Cox
475e8011ab Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().
2002-12-15 05:41:56 +00:00
Alan Cox
2e29a1f21f To avoid lock order reversals in getnewvnode(), the call to uma_zfree()
must be delayed until the vnode interlock is released.

Reported by:	kris@
Approved by:	re (jhb)
2002-12-08 05:06:50 +00:00
Robert Drehmel
f85a961930 Do not set a variable (vp->p_pollinfo) to NULL if we know
it already has that value.

Approved by:	re
2002-11-27 16:45:54 +00:00
Robert Watson
763bbd2f4f Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception.  For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system.  With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance.  This
also corrects sematics for shared vnode locks, which were not
previously present in the system.  This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form.  With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception.  We'll introduce a work around for this shortly.

Approved by:	re
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-26 14:38:24 +00:00
Poul-Henning Kamp
0d6dc414b4 In vrele() we can actually have a VCHR with v_rdev == NULL if we
came from the bottom of addaliasu().  Don't panic.
2002-10-25 07:58:25 +00:00
Kirk McKusick
9ab73fd11a Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned.  This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by:   DARPA & NAI Labs.
2002-10-25 00:20:37 +00:00
Poul-Henning Kamp
a2fb4feded Fix the spechash lock order reversal by keeping an updated sum
of v_usecount in the dev_t which vcount() can return without
locking any vnodes.

Seen by:	jhb
2002-10-24 19:38:56 +00:00
Kirk McKusick
a6b9f47b31 When scanning the freelist looking for candidate vnodes to recycle,
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".

Sponsored by:	DARPA & NAI Labs.
2002-10-14 19:54:39 +00:00
Kirk McKusick
e04a020067 Unconditionally reset vp->v_vnlock back to the default in the
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.

Submitted by:	Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by:	DARPA & NAI Labs.
2002-10-14 19:44:51 +00:00
Kirk McKusick
a5b65058d5 Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by:	DARPA & NAI Labs.
2002-10-14 03:20:36 +00:00
Kirk McKusick
192e439ed4 When considering a vnode for reuse in getnewvnode, we call
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.

Sponsored by:	DARPA & NAI Labs.
2002-10-11 01:04:14 +00:00
Maxim Sobolev
790a8088d0 Fix problem introduced in rev.1.406, which can cause already unlocked
mutex being unlocked again causing system panic.
2002-10-05 12:56:10 +00:00
Poul-Henning Kamp
8d3574c7a4 Fix some harmless mis-indents.
Spotted by:	FlexeLint
2002-10-01 15:48:31 +00:00
Robert Watson
0626774f08 Move vnode MAC label initialization to after the release of the vnode
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex.  Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-09-30 20:51:48 +00:00
Poul-Henning Kamp
37c841831f Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by:    FlexeLint warning #512
2002-09-28 17:15:38 +00:00
Jeff Roberson
6423c9433c - Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
   new, common vfs_badlock() function and adding a 'str' arg.
 - Adjust generated ASSERTS to match the new prototype.
 - Adjust explicit ASSERTS to match the new prototype.
2002-09-26 04:48:44 +00:00
Jeff Roberson
6cb8bf2027 - Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
 - Assert that the vp is locked in VOP_UNLOCK.
 - Use standard interlock macros in remaining code.
 - Correct a race in getnewvnode().
 - Lock access to v_numoutput with interlock.
 - Lock access to buf lists and splay tree with interlock.
 - Add VOP and VI asserts.
 - Lock b_vnbufs with the vnode interlock.
 - Add vrefcnt() for callers who want to retreive the vnode ref without
   holding a lock.  Add a comment that describes when this is safe.
 - Add vholdl() and vdropl() so that callers who already own the interlock
   can avoid race conditions and unnecessary unlocking.
 - Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
 - Hold the interlock before droping the mntlist_mtx in vflush() to avoid
   a race.
 - Fix locking in vfs_msync().
2002-09-25 02:22:21 +00:00
Nate Lawson
86ed6d45ac Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde
2002-09-18 20:42:04 +00:00
Nate Lawson
06be2aaa83 Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by:   phk
Reviewed by:    bde, rwatson (earlier version)
2002-09-14 09:02:28 +00:00
Julian Elischer
85e40eaf26 Indentation does not make a block.. need curly braces too.
Submitted by: Eagle-eyes evans <bde@freebsd.org>
2002-09-11 18:15:26 +00:00
Julian Elischer
71fad9fdee Completely redo thread states.
Reviewed by:	davidxu@freebsd.org
2002-09-11 08:13:56 +00:00
Poul-Henning Kamp
f8b663614d Fix an inherited style bug: compare with NOCRED instead of NULL.
Sponsored by:	DARPA & NAI Labs.
2002-09-05 20:46:19 +00:00
Poul-Henning Kamp
c1a925a637 Introduce new extattr_check_cred() function which implements the canonical
crential washing for extended attributes.

Sponsored by:	DARPA & NAI Labs.
2002-09-05 20:38:57 +00:00
Philippe Charnier
93b0017f88 Replace various spelling with FALLTHROUGH which is lint()able 2002-08-25 13:23:09 +00:00
Jeff Roberson
ad32f726db - Fix a mistake in my last few commits. The PDROP flag stops msleep from
re-acquiring the mutex.

Pointy hat to:	me
Noticed by:	tegge
2002-08-23 00:32:03 +00:00
Jeff Roberson
9abf54f032 - Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT
LK_INTERLOCK.  The interlock will never be held on return from these
   functions even when there is an error.  Errors typically only occur when
   the XLOCK is held which means this isn't the vnode we want anyway.  Almost
   all users of these interfaces expected this behavior even though it was
   not provided before.
2002-08-22 07:44:45 +00:00
Jeff Roberson
183158485a - Fix interlock handling in vn_lock(). Previously, vn_lock() could return
with interlock held in error conditions when the caller did not specify
   LK_INTERLOCK.
 - Add several comments to vn_lock() describing the rational behind the code
   flow since it was not immediately obvious.
2002-08-22 06:51:06 +00:00
Jeff Roberson
0b600db425 - Document two cases, one in vget and the other in vn_lock, where the state
of interlock on exit is not consistent.  There are probably several bugs
   relating to this.
2002-08-21 08:34:48 +00:00
Jeff Roberson
88cf6b94bd - If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be
released.  vcanrecycle() failed to unlock interlock under this condition.
 - Remove an extra VOP_UNLOCK from a failure case in vcanrecycle().

Pointed out by:	rwatson
2002-08-21 06:40:34 +00:00
Jeff Roberson
71ea4ba57c - Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
 - Add the VI asserts to the automated lock checking in the VOP calls.  The
   interlock should not be held across vops with a few exceptions.
 - Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
   when LK_INTERLOCK is set.
2002-08-21 06:19:29 +00:00
Jeff Roberson
055c012332 - Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This
was done only some of the time before, and now it is uniformly applied.
2002-08-13 05:29:48 +00:00
Maxime Henrion
5965373e69 - Introduce a new struct xvfsconf, the userland version of struct vfsconf.
- Make getvfsbyname() take a struct xvfsconf *.
- Convert several consumers of getvfsbyname() to use struct xvfsconf.
- Correct the getvfsbyname.3 manpage.
- Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the
  kernel, and rewrite getvfsbyname() to use this instead of the weird
  existing API.
- Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist
  sysctl.
- Convert a vfsload() call in nfsiod.c to kldload() and remove the useless
  vfsisloadable() and endvfsent() calls.
- Add a warning printf() in vfs_sysctl() to tell people they are using
  an old userland.

After these changes, it's possible to modify struct vfsconf without
breaking the binary compatibility.  Please note that these changes don't
break this compatibility either.

When bp will have updated mount_smbfs(8) with the patch I sent him, there
will be no more consumers of the {set,get,end}vfsent(), vfsisloadable()
and vfsload() API, and I will promptly delete it.
2002-08-10 20:19:04 +00:00
Jeff Roberson
8947be9ba0 - Move some logic from getnewvnode() to a new function vcanrecycle()
- Unlock the free list mutex around vcanrecycle to prevent a lock order
   reversal.
2002-08-05 10:15:56 +00:00
Jeff Roberson
e6e370a7fe - Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
   with VOP calls is needed.
 - v_iflag is protected by interlock and is used for dealing with vnode
   management issues.  These flags include X/O LOCK, FREE, DOOMED, etc.
 - All accesses to v_iflag and v_vflag have either been locked or marked with
   mp_fixme's.
 - Many ASSERT_VOP_LOCKED calls have been added where the locking was not
   clear.
 - Many functions in vfs_subr.c were restructured to provide for stronger
   locking.

Idea stolen from:	BSD/OS
2002-08-04 10:29:36 +00:00
Robert Watson
f9d0d52459 Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by:	bde
2002-08-01 17:47:56 +00:00
Dag-Erling Smørgrav
3072197229 Nit in previous commit: the correct sysctl type is "S,xvnode" 2002-07-31 12:25:28 +00:00
Dag-Erling Smørgrav
217b2a0b61 Initialize v_cachedid to -1 in getnewvnode().
Reintroduce the kern.vnode sysctl and make it export xvnodes rather than
vnodes.

Sponsored by:	DARPA, NAI Labs
2002-07-31 12:24:35 +00:00
Robert Watson
07bdba7e2d Note that the privilege indicating flag to vaccess() originally used
by the process accounting system is now deprecated.
2002-07-31 02:05:12 +00:00
Robert Watson
a0ee6ed1c0 Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on vnodes.
In particular, initialize the label when the vnode is allocated or
reused, and destroy the label when the vnode is going to be released,
or reused.  Wow, an object where there really is exactly one place
where it's allocated, and one other where it's freed.  Amazing.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 02:03:46 +00:00
Jeff Roberson
a562685f65 - Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here
were hiding the real problem of the missing unlock in sync_inactive.
 - Add the missing unlock in sync_inactive.

Submitted by:	iedowse
2002-07-29 06:26:55 +00:00
Don Lewis
5c38b6dbce Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held.  This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
2002-07-28 19:59:31 +00:00
Robert Watson
b02aac465d Teach discretionary access control methods for files about VAPPEND
and VALLPERM.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-22 03:57:07 +00:00
Kirk McKusick
7aca6291e3 Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

        if (ioflags & IO_SYNC)
                flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by:	DARPA & NAI Labs.
2002-07-19 07:29:39 +00:00
Kirk McKusick
fb36a3d847 Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by:	DARPA & NAI Labs.
2002-07-17 02:03:19 +00:00
Matthew Dillon
d331c5d43f Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on.  Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
    Dirties more cache lines during lookups.

    Not as fast as a hash table lookup (but still N log N and optimal
    when there is locality of reference).

Advantages
    vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
    syncer operate more efficiently.

    I get to rip out all the old hacks (some of which were mine) that tried
    to keep the v_dirtyblkhd tailq sorted.

    The per-vnode splay tree should be easier to lock / SMPng pushdown on
    vnodes will be easier.

    This commit along with another that Alan is working on for the VM page
    global hash table will allow me to implement ranged fsync(), optimize
    server-side nfs commit rpcs, and implement partial syncs by the
    filesystem syncer (aka filesystem syncer would detect that someone is
    trying to get the vnode lock, remembers its place, and skip to the
    next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc
2002-07-10 17:02:32 +00:00
Jeff Roberson
25b286d6db - Use standard locking functions in syncer's opv
- vput instead of vrele syncer vnodes in vfs_mount
 - Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP
2002-07-09 19:54:20 +00:00
Jeff Roberson
18c48f437f - Don't hold the vn lock while calling VOP_CLOSE in vclean(). 2002-07-07 06:38:22 +00:00
Jeff Roberson
bed75d4627 - BUF_REFCNT() seems to be the preferred method for verifying a locked buf.
Tell vop_strategy_pre() to use this instead.
 - Ignore B_CLUSTER bufs.  Their components are locked but they don't really
   exist so they don't have to be.  This isn't ideal but it is safe.
2002-07-07 05:29:45 +00:00
Jeff Roberson
c031d11bb4 Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.
2002-07-06 21:27:20 +00:00
Jeff Roberson
9a236af3ad Fixup uses of GETVOBJECT.
- Cache a pointer to the vnode's object in the buf.
 - Hold a reference to that object in addition to the vnode's reference just
   to be consistent.
 - Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.
2002-07-06 08:59:52 +00:00
Jeff Roberson
302c7aaab9 - Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
 - Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated.  There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.
2002-07-06 05:21:12 +00:00
Jeff Roberson
cc8662b0f9 Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
with DEBUG_VFS_LOCKS.
2002-07-06 04:39:48 +00:00
Maxime Henrion
d7f9ecc86b Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot()
which was #if 0'd and is not likely to be used now.
2002-07-03 09:27:24 +00:00
Maxime Henrion
2b4edb69f1 Move every code related to mount(2) in a new file, vfs_mount.c.
The file vfs_conf.c which was dealing with root mounting has
been repo-copied into vfs_mount.c to preserve history.
This makes nmount related development easier, and help reducing
the size of vfs_syscalls.c, which is still an enormous file.

Reviewed by:	rwatson
Repo-copy by:	peter
2002-07-02 17:09:22 +00:00
Ian Dowse
6bd521df93 Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by:	mckusick
2002-07-01 17:59:40 +00:00
David E. O'Brien
87e1503e2c Rename the db command lockedvnodes to lockedvnods so that it fits on the
help screen and one doens't think we have a lockedvnodesmap command.
2002-06-29 04:45:09 +00:00
Alfred Perlstein
210a5a7169 nuke caddr_t. 2002-06-28 23:17:36 +00:00
Jeff Roberson
90769c9ed0 Improve the VOP locking asserts
- Add vfs_badlock_print to control whether or not we print lock violations
 - Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.
2002-06-28 20:58:14 +00:00
Brian Feldman
aac12bcfbc Fix a case where a vnode got explicitly unlocked after the pointer to it
got set to NULL.

Revision 1.355: in the box
2002-06-28 16:17:47 +00:00
Maxime Henrion
7d2d440991 Change the way we internally store the mount options to
a linked list.  This is to allow the merging of the mount
options in the MNT_UPDATE case, as the current data structure
is unsuitable for this.

There are no functional differences in this commit.

Reviewed by:	phk
2002-06-20 20:03:42 +00:00
Maxime Henrion
fe93750656 Change vfs_copyopt() so that the length argument passed to it
must be the exact same size as the mount option.  This makes
vfs_copyopt() much more useful.
2002-06-14 20:04:21 +00:00
Dag-Erling Smørgrav
edad3af28d Move some sysctls from the debug tree to the vfs tree. 2002-06-06 15:50:22 +00:00
Dag-Erling Smørgrav
4a357a32e0 Gratuitous whitespace cleanup. 2002-06-06 15:46:38 +00:00
Tom Rhodes
d394511de3 More s/file system/filesystem/g 2002-05-16 21:28:32 +00:00
Maxime Henrion
34e53231d0 o Fix vfs_copyopt(), the first argument to bcopy() is the source,
not the destination.
o Remove some code from vfs_getopt() which was making the interface
  more complicated to use for a very slight gain.
2002-05-16 17:09:41 +00:00
Jeff Roberson
f0d73b3e5f Switch from just holding the interlock to holding the standard lock throughout
getnewvnode().  This is safer.  In the future, we should investigate requiring
only the interlock to get the vnode object.
2002-05-07 02:44:06 +00:00
Jeff Roberson
6953f5da1a Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT.
Don't try to create a vm object before the file system has a chance to finish
initializing it.  This is incorrect for a number of reasons.  Firstly, that
VOP requires a lock which the file system may not have initialized yet. Also,
open and others will create a vm object if it is necessary later.
2002-05-06 04:47:43 +00:00
Poul-Henning Kamp
81e017430a Expand the one-line function pbreassignbuf() the only place it is or could
be used.
2002-05-05 20:37:08 +00:00
Matthew Dillon
9f9435545b Remove obsolete code (that was already #if 0'd out).
Requested by: Hiten Pandya <hitmaster2k@yahoo.com>
2002-05-04 17:10:15 +00:00
John Baldwin
6008862bc2 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
John Baldwin
44731cab3b Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
Maxime Henrion
17594b936b As discussed in -arch, add the new nmount(2) system call and the
new vfs_getopt()/vfs_copyopt() API.  This is intended to be used
later, when there will be filesystems implementing the VFS_NMOUNT
operation.  The mount(2) system call will disappear when all
filesystems will be converted to the new API.  Documentation will
be committed in a while.

Reviewed by:	phk
2002-03-26 15:33:44 +00:00
Jeff Roberson
c897b81311 Remove references to vm_zone.h and switch over to the new uma API.
Also, remove maxsockets.  If you look carefully you'll notice that the old
zone allocator never honored this anyway.
2002-03-20 04:09:59 +00:00
Alfred Perlstein
4d77a549fe Remove __P. 2002-03-19 21:25:46 +00:00
Robert Watson
89e1164ee2 Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to
be safe.
2002-03-05 19:45:45 +00:00
John Baldwin
a854ed9893 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
Poul-Henning Kamp
68edc1b939 Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by:	Andrew Gallatin <gallatin@cs.duke.edu>
2002-02-18 16:18:02 +00:00
Poul-Henning Kamp
90737495aa Remove yet a redundant VN_KNOTE() macro. 2002-02-18 08:24:48 +00:00
Poul-Henning Kamp
4b55dbe36b Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.
2002-02-17 21:15:36 +00:00
Peter Wemm
2b8a08af6b Fix a couple of style bugs introduced (or touched by) previous commit. 2002-02-07 23:06:26 +00:00