Commit Graph

478 Commits

Author SHA1 Message Date
rwatson
25f3ce6010 Merge from POSIX.1e Capabilities development tree:
o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
  on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE.  Add
  appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.

Obtained from:	TrustedBSD Project
2001-11-02 15:16:59 +00:00
dillon
b37309f764 syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.
2001-10-27 19:58:56 +00:00
dillon
f883ef447a Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  Improves looping case by 500%.

Optimize ffs_sync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held.  Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after:	1 week
2001-10-26 00:08:05 +00:00
dillon
306854a569 Add missing TAILQ_INSERT_TAIL's which somehow didn't get comitted with
the recent vnode cleanup.
2001-10-25 23:13:56 +00:00
dillon
45a6fabe87 Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after:	3 days
2001-10-23 01:21:29 +00:00
dillon
74303ee776 fix minor bug in kern.minvnodes sysctl. Use OID_AUTO. 2001-10-16 23:08:09 +00:00
dillon
414efe2875 WS Cleanup 2001-10-08 19:51:13 +00:00
dillon
34563ec4a5 vinvalbuf() was only waiting for write-I/O to complete. It really has to
wait for both read AND write I/O to complete.  Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only effects NFS clients.

MFC after:	3 days
2001-10-05 20:10:32 +00:00
dillon
5a5b9f79f4 After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed.  This is especially true
when vmiodirenable is turned on, which it is by default now.  ( vmiodirenable
makes a huge difference in directory caching ).  The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed.  The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded.  But tests
with several million small files show that the bigger problem is the VM Page
cache.  This will have to be addressed by a future commit.

MFC after:	3 days
2001-10-01 04:33:35 +00:00
julian
5596676e6c KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
peter
4b437abe78 If a file has been completely unlinked, stop automatically syncing the
file.  ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them.  This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
2001-08-27 06:09:56 +00:00
peter
6ca5d5c5c5 Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).
2001-07-27 15:57:17 +00:00
peter
18bc463cb6 Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by:	bde
2001-07-27 15:52:49 +00:00
dillon
e028603b7e With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
jhb
34fab2d86c - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
2001-06-28 04:05:54 +00:00
alfred
a3f0842419 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
iedowse
dafd513732 Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by:	phk, bp
2001-05-16 18:04:37 +00:00
iedowse
33f5635b77 In vrele() and vput(), avoid triggering the confusing "missed vn_close"
KASSERT when vp->v_usecount is zero or negative. In this case, the
"v*: negative ref cnt" panic that follows is much more appropriate.

Reviewed by:	mckusick
2001-05-11 20:42:41 +00:00
phk
161a28e738 vfs_subr.c is getting rather fat. The underlying repocopy and this
commit moves the filesystem export handling code to vfs_export.c
2001-04-26 20:47:14 +00:00
phk
cdc83afc7f Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
2001-04-25 07:07:52 +00:00
grog
1f5de30718 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
tanimura
546a3cb874 Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by:	tegge
Approved by:	phk
2001-04-18 11:19:50 +00:00
phk
378e561228 This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
jlemon
58f9dcd6ce Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206
2001-02-23 20:06:01 +00:00
green
18d474781f Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel.  This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there.  As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size.  This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by:	bde
2001-02-18 13:30:20 +00:00
bmilekic
f364d4ac36 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
phk
e87f7a15ad Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
bp
85976e8349 Properly lock new vnode.
Reminded by:	tegge
2001-01-31 04:54:23 +00:00
jasone
8d2ec1ebc4 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
rwatson
a7fc696a51 o The move to using VADMIN under vaccess() resulted in some system
calls returning EACCES instead of EPERM.  This patch modifies vaccess()
  to return EPERM instead of EACCES if VADMIN is among the requested
  rights.  This affects functions normally limited to the owners of
  a file, such as chmod(), as EPERM is the error indicating that
  privilege would allow the operation, rather than a chance in mandatory
  or discretionary rights.

Reported by:	bde
2001-01-23 04:15:19 +00:00
jhb
cdfe59aac6 Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by:	-arch
2000-12-15 20:08:20 +00:00
mckusick
8fb19aa301 Use proper mutex locking when calling setrunnable from speedup_syncer().
Submitted by:	Tor.Egge@fast.no
2000-12-13 01:06:53 +00:00
dwmalone
dd75d1d73b Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
peter
eb5dd3d06e Untangle vfsinit() a bit. Use seperate sysinit functions rather than
having a super-function calling bits all over the place.
2000-12-06 07:09:08 +00:00
gallatin
397a29f117 Correct int/long type mismatch in the proper place this time. freevnodes
and numvnodes are longs in the kernel.  They should remain longs in systat,
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT.   I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.

I wish there was a way to find all of these automatically..

Pointed out by: bde
2000-12-02 20:08:33 +00:00
jhb
c91f8bd1fb Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and
go to sleep as an "atomic" operation.
2000-12-01 03:43:33 +00:00
mckusick
4a644ba189 Get rid of a bogus mtx_exit (it was attempting to release an
already released mutex).

Submitted by:	"Chris Knight" <chris@aims.com.au>
2000-11-30 19:09:29 +00:00
dillon
2ace352085 Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
tegge
8e9f33e1ce Clear the VFREE flag when the vnode is removed from the free list in
getnewvnode().  Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR:		18012
2000-11-02 21:42:54 +00:00
phk
4e063f5534 Take VBLK devices further out of their missery.
This should fix the panic I introduced in my previous commit on this topic.
2000-11-02 21:14:13 +00:00
jhb
d944886e4d Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h
2000-10-20 07:58:15 +00:00
rwatson
9c993b44d0 o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks.  In most cases, the VADMIN test
  checks to make sure the credential effective uid is the same as the file
  owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
  appropriate.
o Modify references to uid-based access control operations such that they
  now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
  ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
  vnode: I believe that new invocations of VOP_ACCESS() are always called
  with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
  and SUIDDIR code.

Reviewed by:	eivind
Obtained from:	TrustedBSD Project
2000-10-19 07:53:59 +00:00
eivind
4a39f454a0 Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)
2000-10-09 17:31:39 +00:00
jasone
b82f0ccba2 Do not call lockdestroy() for v_vnlock, which may point to a lock in a
deeper vfs stacking layer.

Submitted by:	bp
2000-10-06 08:04:48 +00:00
eivind
8dd2b8de06 Style fixes based on comments by bde 2000-10-05 18:22:46 +00:00
jasone
4e290e67b7 Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
bp
a9effa9b45 Move KASSERTs which checks value of v_usecount after vnode locking, so
it will not produce wrong alarms.
2000-10-02 09:57:06 +00:00
mckusick
781d4acf4f Do the right thing if bdevvp is called twice for the same device.
Obtained from:	Poul-Henning Kamp <phk@freebsd.org>
2000-09-27 18:03:17 +00:00
bp
6110b03d24 Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by:	mckusick
2000-09-25 15:24:04 +00:00
eivind
8e9b2cd3f4 Style fixes:
* Add lots of comments
* Convert a couple of assertions to KASSERT()
* Minimal whitespace & misapplied {} fixes
* Convert #if 0 to #if COMPILING_LINT for code we presently do not
  support, but want to keep available.

Reviewed by:	adrian, markm
2000-09-22 12:22:36 +00:00
eivind
a4293ea4b0 Staticize addalias() 2000-09-22 11:54:48 +00:00
alfred
3ddda6b562 comment vfs_export functions, requested by: eivind 2000-09-21 15:55:55 +00:00
rwatson
8a56e23d58 o Add additional comment describing vaccess() behavior.
Requested by:	eivind
Reviewed by:	eivind, adrian
2000-09-20 17:18:12 +00:00
phk
6023f97970 Rename lminor() to dev2unit(). This function gives a linear unit number
which hides the 'hole' in the minor bits.

Introduce unit2minor() to do the reverse operation.

Fix some some make_dev() calls which didn't use UID_* or GID_* macros.

Kill the v_hashchain alias macro, it hides the real relationship.

Introduce experimental SI_CHEAPCLONE flag set it on cloned bpfs.
2000-09-19 10:28:44 +00:00
bp
a7bc78c86d Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by:	mckusick, dillon
2000-09-12 09:49:08 +00:00
jasone
769e0f974d Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
rwatson
6eea0a7165 o Synchronize vaccess() capability access control checks with TrustedBSD
tree.

Obtained from:	TrustedBSD Project
2000-09-06 12:18:24 +00:00
phk
677da8cb2f Move extern declaration of dead_vnodeop_p to a .h file.
Remove race condition in vn_isdisk().
2000-09-05 21:09:56 +00:00
rwatson
e54ea574fa o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege.  Make vaccess() accept an
  additional optional argument, privused, to determine whether
  privilege was required for vaccess() to return 0.  Add commented
  out capability checks for reference.  Rename some variables to make
  it more clear which modes/uids/etc are associated with the object,
  and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
  privused argument.  Once additional patches are applied, suser()
  will no longer set ASU, so privused will permit passing of
  privilege information up the stack to the caller.

Reviewed by:	bde, green, phk, -security, others
Obtained from:	TrustedBSD Project
2000-08-29 14:45:49 +00:00
phk
30c0d10f82 Fix typo in last commit. 2000-08-20 11:46:39 +00:00
phk
3d2aecdc81 Centralize the canonical vop_access user/group/other check in vaccess().
Discussed with: bde
2000-08-20 08:36:26 +00:00
mckusick
acc66855bf This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
2000-07-24 05:28:33 +00:00
mckusick
a3d0c189ea Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
bp
8a86977869 Fix support for more than 256 simultaneous mounts. Theoretical limit
is 2^16 mounts per fs type.

Reported by:	Troy Arie Cobb <tcobb@staff.circle.net> via phk
Reviewed by:	bde
2000-07-07 14:01:08 +00:00
phk
e5de271d47 Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.
Pointed out by:	bde
2000-07-04 11:25:35 +00:00
mckusick
2f0e9591fa Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).
2000-07-04 04:32:40 +00:00
mckusick
806786489f If a buffer flush fails when trying to reclaim a vnode, it is too
late to save the vnode, so just toss any remaining unwritten buffers
rather than leaving them lying around to make trouble in the future.
2000-07-04 03:23:29 +00:00
phk
2a91a9dd04 Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with:	mckusick
2000-07-03 13:26:54 +00:00
phk
61ff05be25 Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:
Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

        -sysctl_vm_zone SYSCTL_HANDLER_ARGS
        +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
2000-07-03 09:35:31 +00:00
phk
0535bee2fb Move prtactive to vfs from ufs. It is used all over the place. 2000-06-27 07:46:22 +00:00
phk
4ec91666fa Virtualizes & untangles the bioops operations vector.
Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
2000-06-16 08:48:51 +00:00
jake
961b97d434 Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
jake
d93fbc9916 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
asmodai
0d05e48123 Fix the rootmount code for now.
This function will probably rewritten/renamed to devpp.

Submitted by:	Assar Westerlund <assar@sics.se> on -current
Confirmed to work:	Steinar Haug <sthaug@nethelp.no>,
			Manfred Antar <mantar@pacbell.net>
Reviewed by:	phk
2000-05-14 07:43:12 +00:00
phk
36c3965ff9 Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
phk
5df766a0f8 Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.
2000-03-20 11:29:10 +00:00
chris
dd595b28f0 In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, then
return ENXIO (Device not configured).  Without this, vn_isdisk()
could (and did in the case of lstat() under fdesc) pass a NULL pointer
to devsw(), which caused a page fault.

Reviewed by:	alfred
2000-03-18 01:27:44 +00:00
phk
6b3385b773 Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.
2000-03-16 08:51:55 +00:00
bde
33e9ba0226 Don't try so hard to make the lower 16 bits of fsids unique. It tended
to recycle full fsids after only 16 mount/unmount's.  This is probably
too often for exported fsids.  Now we recycle the full fsids only
after 2^16 mount/ umount's and only ensure uniqueness in the lower 16
bits if there have been <= 256 calls to vfs_getnewfsid() since the
system started.
2000-03-14 14:19:49 +00:00
bde
2af0bce593 Try harder to make the lower 16 bits of fsids unique. The vfs type
number was packed very wastefully, giving perfect non-uniqeness in
the lower 16 bits of fsids for filesystems with the same vfs type.
This made linux_stat() return perfectly non-unique (broken) 16-bit
st_dev's for nfs mount points, and effectively reduced mntid_base to
8 bits so that the vfs_getnewfsid() looped endlessly when there are
already 256 mounted filesystems with the required vfs type.

Approved by:	jkh
2000-03-12 14:23:21 +00:00
sos
dc230127da Do refcounting of open devices (more) correctly.
count_dev funtion by phk.
2000-02-07 23:05:40 +00:00
rwatson
7cc7ecf5e1 Remove static qualifier from vgonel, as it is needed by the Arla folk
outside of vfs_subr.c.

Submitted by:	Assar Westerlund <assar@sics.se>
Reviewed by:	rwatson
Approved by:	jkh
2000-02-02 07:07:17 +00:00
rwatson
2aada2e694 This patch fixes a locking bug that can result in deadlock if
the codepath is followed.

From the PR:

  vclean calls vrele leading to deadlock (if usecount > 0)

  vclean() calls vrele() if v_usecount of the node was higher than one.
  But before calling it, it sets the VXLOCK flag, which will make
  vn_lock called from vrele dead-lock.

PR:		kern/15117
Submitted by:	Assar Westerlund <assar@stacken.kth.se>
Reviewed by:	rwatson
Obtained from:	NetBSD
2000-01-29 15:22:58 +00:00
phk
ae0c1ec8f7 Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by:	bde
2000-01-10 12:04:27 +00:00
mckusick
a44e140976 Remove the P_BUFEXHAUST flag from the syncer process (leaving
it only on the buf_daemon process). The problem is that when the
syncer process starts running the worklist, it wants to delete
lots of files. It does this by VFS_VGET'ing the vnodes, clearing
the blocks in them and bdwrite'ing the buffer. It can process close
to a thousand files per second which generates a large number of
dirty buffers. So, giving it special priviledge at the buffer trough
leads to trouble as the buf_daemon does occationally need a free
buffer to proceed and if the syncer has used every last one up,
we are toast.
2000-01-10 00:07:24 +00:00
eivind
767bad2cc1 Change NDFREE() from a macro to a function for the time being; the macro
version caused intolerable bloat (30k).  I'm likely to revisit this with an
attempt at a smarter macro.

Bloat noticed by:       bde
2000-01-08 16:20:06 +00:00
luoqi
e100d44d55 Introduce a mechanism to suspend/resume system processes. Suspend syncer
and bufdaemon prior to disk sync during system shutdown.
2000-01-07 08:36:44 +00:00
dillon
c6689c797d Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
    the list in order to maintain a 'forward' locality of reference, which
    is arguably better then 'reverse'.  The original algorithm did things this
    way to but at a huge time cost.

    Enhance the append interlock for NFS writes to handle intr/soft mounts
    better.

    Fix the hysteresis for NFS async daemon I/O requests to reduce the
    number of unnecessary context switches.

    Modify handling of NFS mount options.  Any given user option that is
    too high now defaults to the kernel maximum for that option rather then
    the kernel default for that option.

Reviewed by:	 Alfred Perlstein <bright@wintelcom.net>
2000-01-05 05:11:37 +00:00
mckusick
0529f59c27 Prettyness police: Identify flags in b_xflags with BX_ to distinguish
them from flags in b_flags which are prefixed with B_
1999-12-22 03:11:04 +00:00
dillon
b66fb2c648 Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

    This feature prevents the update daemon from gratuitously flushing
    dirty pages associated with a mapped file-backed region of memory.  The
    system pager will still page the memory as necessary and the VM system
    will still be fully coherent with the filesystem.  Modifications made
    by other means to the same area of memory, for example by write(), are
    unaffected.  The feature works on a page-granularity basis.

    MAP_NOSYNC allows one to use mmap() to share memory between processes
    without incuring any significant filesystem overhead, putting it in
    the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg
1999-12-12 03:19:33 +00:00
eivind
287836faea Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
  return value: LK_EXCLOTHER, when the lock is held exclusively by another
  process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
  locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with:	grog, mch, peter, phk
Reviewed by:	peter
1999-12-11 16:13:02 +00:00
dillon
881e17e778 Remove vfs_getrootfsid() function (a temporary hack added a few months
ago to make BOOTP work again).  It is no longer required by BOOTP and
    no longer used.
1999-11-29 22:25:36 +00:00
phk
1848d96439 Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by:    sos
1999-11-22 10:33:55 +00:00
phk
1adcecffd9 struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ.  Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly  mp != (void*)&mountlist  comparisons.

Requested by:   phk
Submitted by:   Jake Burkholder jake@checker.org
PR:             14967
1999-11-20 10:00:46 +00:00
phk
ec4e24bd52 Commit the remaining part of PR14914:
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
   structures for list operations.  This patch makes all list operations
   in sys/kern use the queue(3) macros, rather than directly accessing the
   *Q_{HEAD,ENTRY} structures.

Reviewed by:    phk
Submitted by:   Jake Burkholder <jake@checker.org>
PR:     14914
1999-11-16 16:28:58 +00:00
phk
8c9bc6b146 Next step in the device cleanup process.
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.
1999-11-09 14:15:33 +00:00
phk
8e3c3eafed useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
peter
787140aa42 Trim unused options (or #ifdef for undoc options).
Submitted by:	phk
1999-10-11 15:19:12 +00:00
phk
8b06d6a2fb Move the buffered read/write code out of spec_{read|write} and into
two new functions spec_buf{read|write}.

Add sysctl vfs.bdev_buffered which defaults to 1 == true.  This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off.  I should not be changed while any blockdevices are
open.  Remove the misplaced sysctl vfs.enable_userblk_io.

No other changes in behaviour.
1999-10-04 11:23:10 +00:00
phk
073b941095 Remove v_maxio from struct vnode.
Replace it with mnt_iosize_max in struct mount.

Nits from:	bde
1999-09-29 20:05:33 +00:00
dillon
493548f6b6 Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
    ufs/ufs/ufs_readwrite.c).  vm_fault also now uses the new VM page counter
    inlines.

    This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
    for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
    tree for a while).  Determination of the I/O strategy (sequential, random,
    and so forth) is now handled on a descriptor-by-descriptor basis for
    base I/O calls, and on a memory-region-by-memory-region and
    process-by-process basis for VM faults.

Reviewed by:	David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
1999-09-21 00:36:16 +00:00
phk
c0818069f7 Initialize vp->v_maxio to its default in getnetvnode() rather than
four different places in vfs_cluster.c
1999-09-20 19:53:23 +00:00
dillon
beba2c930c Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapse
addaliasu() into addalias() (no operational change) and clarify comments
    relating to a trick that vclean() uses.

    The fix to BOOTP is yet another hack.  Actually, rootfsid handling
    is already a major hack.  The whole thing needs to be cleaned up.

Reviewed by:	David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
1999-09-19 06:24:21 +00:00
dillon
2c39c80e00 Add vfs.enable_userblk_io sysctl to control whether user reads and writes
to buffered block devices are allowed.  The default is to be backwards
    compatible, i.e. reads and writes are allowed.

    The idea is for a larger crowd to start running with this disabled and
    see what problems, if any, crop up, and then to change the default to
    off and see if any problems crop up in the next 6 months prior to
    potentially removing support entirely.  There are still a few people,
    Julian and myself included, who believe the buffered block device
    access from usermode to be useful.

    Remove use of vnode->v_lastr from buffered block device I/O in
    preparation for removal of vnode->v_lastr field, replacing it with
    the already existing seqcount metric to detect sequential operation.

Reviewed by:	Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
1999-09-17 06:10:27 +00:00
phk
244faf2e1b Add dev_t freeing code. Controlled by sysctl debug.free_devt, default
is off.
1999-08-29 09:09:12 +00:00
phk
d311a0563b remove unused variables. 1999-08-28 19:21:03 +00:00
peter
3b842d34e8 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
phk
591c94d4c6 Simplify the handling of VCHR and VBLK vnodes using the new dev_t:
Make the alias list a SLIST.

        Drop the "fast recycling" optimization of vnodes (including
        the returning of a prexisting but stale vnode from checkalias).
        It doesn't buy us anything now that we don't hardlimit
        vnodes anymore.

        Rename checkalias2() and checkalias() to addalias() and
        addaliasu() - which takes dev_t and udev_t arg respectively.

        Make the revoke syscalls use vcount() instead of VALIASED.

        Remove VALIASED flag, we don't need it now and it is faster
        to traverse the much shorter lists than to maintain the
        flag.

        vfs_mountedon() can check the dev_t directly, all the vnodes
        point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;
1999-08-26 14:53:31 +00:00
phk
ea55d63475 Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness. 1999-08-25 12:24:39 +00:00
julian
7ceff14f18 Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.
In lookup() however it's the other way around as we need to supply the
dev_t for the vnode, so devfs still has a copy of it stashed away.

Sourcing it from the vnode in the vnops however is useful as it makes
a lot of the code almost the same as that in specfs.
1999-08-25 04:55:20 +00:00
jdp
9f71d680aa Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default.  A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

      0 = seconds only; nanoseconds zeroed (default).
      1 = seconds and nanoseconds, accurate within 1/HZ.
      2 = seconds and nanoseconds, truncated to microseconds.
    >=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ.  Level 2 uses microtime().  It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs.  Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu".  There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level.  Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems.  But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce.  He's not to
blame for any breakage I've introduced since then.

Reviewed by:	bde (an earlier version of the code)
1999-08-22 00:15:16 +00:00
phk
7b7ae40370 The bdevsw() and cdevsw() are now identical, so kill the former. 1999-08-13 10:29:38 +00:00
phk
683c2698ff s/v_specinfo/v_rdev/ 1999-08-13 10:10:12 +00:00
phk
e938d317d5 Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.
1999-08-08 18:43:05 +00:00
alc
743eeab638 Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by:	dillon
1999-07-26 06:25:53 +00:00
phk
cacc73aa18 Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
        cdevsw->d_parms has been removed, the specinfo is available
        now (== dev_t) and the driver should modify it directly
        when applicable, and the only driver doing so, does so:
        vn.c.  I am not sure the logic in checking for "<" was right
        before, and it looks even less so now.

        An intial pool of 50 struct specinfo are depleted during
        early boot, after that malloc had better work.  It is
        likely that fewer than 50 would do.

        Hashing is done from udev_t to dev_t with a prime number
        remainder hash, experiments show no better hash available
        for decent cost (MD5 is only marginally better)  The prime
        number used should not be close to a power of two, we use
        83 for now.

        Add new checkalias2() to get around the loss of info from
        dev2udev() in bdevvp();

        The aliased vnodes are hung on a list straight of the dev_t,
        and speclisth[SPECSZ] is unused.  The sharing of struct
        specinfo means that the v_specnext moves into the vnode
        which grows by 4 bytes.

        Don't use a VBLK dev_t which doesn't make sense in MFS, now
        we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

	Storage overhead from all of this is O(50k).

        Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo
1999-07-20 09:47:55 +00:00
phk
f3c07181e3 [click] Now all dev_t's in the kernel have their char device major.
Only know casualy of this is swapinfo/pstat which should be fixes
the right way:  Store the actual pathname in the kernel like mount
does.  [Volounteers sought for this task]

The road map from here is roughly:  expand struct specinfo into struct
based dev_t.  Add dev_t registration facilities for device drivers and
start to use them.
1999-07-19 09:37:59 +00:00
phk
63c9fe9157 Introduce the vn_todev(struct vnode*) function, which returns the dev_t
corresponding to a VBLK or VCHR node, or NODEV.
1999-07-18 14:30:37 +00:00
phk
fb0144f561 Fix 2nd arg to udev2dev(). 1999-07-17 19:38:00 +00:00
phk
6c373ff516 I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name.  s/umakedev/makeudev/g
1999-07-17 18:43:50 +00:00
kris
d5babdb93c Correct a couple of spelling errors in comments. 1999-07-12 15:02:51 +00:00
mckusick
52ea4270f3 These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations.  I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

    * buffer cache hash table now dynamically allocated.  This will
      have no effect on memory consumption for smaller systems and
      will help scale the buffer cache for larger systems.

    * minor enhancement to pmap_clearbit().  I noticed that
      all the calls to it used constant arguments.  Making
      it an inline allows the constants to propogate to
      deeper inlines and should produce better code.

    * removal of inherent vfs_ioopt support through the emplacement
      of appropriate #ifdef's, with John's permission.  If we do not
      find a use for it by the end of the year we will remove it entirely.

    * removal of getnewbufloops* counters & sysctl's - no longer
      necessary for debugging, getnewbuf() is now optimal.

    * buffer hash table functions removed from sys/buf.h and localized
      to vfs_bio.c

    * VFS_BIO_NEED_DIRTYFLUSH flag and support code added
      ( bwillwrite() ), allowing processes to block when too many dirty
      buffers are present in the system.

    * removal of a softdep test in bdwrite() that is no longer necessary
      now that bdwrite() no longer attempts to flush dirty buffers.

    * slight optimization added to bqrelse() - there is no reason
      to test for available buffer space on B_DELWRI buffers.

    * addition of reverse-scanning code to vfs_bio_awrite().
      vfs_bio_awrite() will attempt to locate clusterable areas
      in both the forward and reverse direction relative to the
      offset of the buffer passed to it.  This will probably not
      make much of a difference now, but I believe we will start
      to rely on it heavily in the future if we decide to shift
      some of the burden of the clustering closer to the actual
      I/O initiation.

    * Removal of the newbufcnt and lastnewbuf counters that Kirk
      added.  They do not fix any race conditions that haven't already
      been fixed by the gbincore() test done after the only call
      to getnewbuf().  getnewbuf() is a static, so there is no chance
      of it being misused by other modules.  ( Unless Kirk can think
      of a specific thing that this code fixes.  I went through it
      very carefully and didn't see anything ).

    * removal of VOP_ISLOCKED() check in flushbufqueues().  I do not
      think this check is necessary, the buffer should flush properly
      whether the vnode is locked or not. ( yes? ).

    * removal of extra arguments passed to getnewbuf() that are not
      necessary.

    * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
      vfs_cluster.c

    * vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
      which should greatly aid flushing operations in heavy load
      situations - both the pageout and update daemons will be able
      to operate more efficiently.

    * removal of b_usecount.  We may add it back in later but for now
      it is useless.  Prior implementations of the buffer cache never
      had enough buffers for it to be useful, and current implementations
      which make more buffers available might not benefit relative to
      the amount of sophistication required to implement a b_usecount.
      Straight LRU should work just as well, especially when most things
      are VMIO backed.  I expect that (even though John will not like
      this assumption) directories will become VMIO backed some point soon.

Submitted by:	Matthew Dillon <dillon@backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-08 06:06:00 +00:00
mckusick
9d4f0d78fa The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA.  With this patch clean
and dirty buffers have been separated.  Empty buffers with KVM
assignments have been separated from truely empty buffers.  getnewbuf()
has been rewritten and now operates in a 100% optimal fashion.  That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized.  Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur.  This resulted in processes blocking on conditions
unrelated to what they were doing.  This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits.  We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat.  The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed.  This
algorithm has been changed.  reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list.  The new algorithm is deterministic but
not perfect.  The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit.  This bit allows processes working with filesystem
buffers to use available emergency reserves.  Normal processes do not set
this bit and are not allowed to dig into emergency reserves.  The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
phk
b6d067fe70 Make sure that stat(2) and friends always return a valid st_dev field.
Pseudo-FS need not fill in the va_fsid anymore, the syscall code
will use the first half of the fsid, which now looks like a udev_t
with major 255.
1999-07-02 16:29:47 +00:00
peter
6cb5fe6c6b Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface.  kproc_start is still available
as a SYSINIT() hook.  This allowed simplification of chunks of the
sysinit code in the process.  This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process.  It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit.  This means that nfsiod
doesn't need to be in /sbin and is always "available".  This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
1999-07-01 13:21:46 +00:00
mckusick
5b58f2f951 Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
1999-06-26 02:47:16 +00:00
mckusick
88e39a63db Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.
1999-06-16 23:27:55 +00:00
mckusick
02e5fe8035 Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.
1999-06-15 23:37:29 +00:00
phk
6a5dc97620 Simplify cdevsw registration.
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it.  cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables.  Most places they were used
bogusly.  Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
        72 bogus makedev() calls
        26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed.  Patches emailed to authors.  LINT
probably broken until they catch up.
1999-05-31 11:29:30 +00:00
jb
5dc5dd6c49 Remove the test for bdevsw(dev) == NULL from bdevvp() because it fails
if there is no character device associated with the block device. In this
case that doesn't matter because bdevvp() doesn't use the character
device structure.

I can use the pointy bit of the axe too.
1999-05-24 00:34:10 +00:00
luoqi
6f6fbfa99e Legally acquire a major number for mfs. 1999-05-14 20:40:23 +00:00
mckusick
48fc2a0eac Previously directories were sync'ed every 10 seconds while bitmaps &
inodes were synced every 15 seconds. This is now reversed as during
directory create, we cannot commit the directory entry until its
inode has been written. With this switch, the inodes will be more
likely to be written by the time that the directory is written thus
reducing the number of directory rollbacks that are needed.
1999-05-14 01:29:21 +00:00
peter
64dd9c44eb Fix (?) SPECHASH dev_t/major/minor/etc args 1999-05-12 19:06:40 +00:00
phk
23c70ba4d7 Don't peek into dev_t 1999-05-12 11:06:07 +00:00
phk
7e26ca1d1a Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
        major()         umajor()
        minor()         uminor()
        makedev()       umakedev()
        dev2udev()      udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits.  This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference.  If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
1999-05-11 19:55:07 +00:00
phk
bac74fbd54 Fix some of the places where too much inside knowledge about major/minor
layout and dev_t structure is being (ab)used.
1999-05-08 07:02:41 +00:00
phk
500e41bd71 I got tired of seeing all the cdevsw[major(foo)] all over the place.
Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.
1999-05-08 06:40:31 +00:00
phk
693dd58bb3 Continue where Julian left off in July 1998:
Virtualize bdevsw[] from cdevsw.  bdevsw() is now an (inline)
        function.

        Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
        to the order of the cmaj/bmaj arguments!)

        Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
        (ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
1999-05-07 10:11:40 +00:00
billf
dd35516544 Add sysctl descriptions to many SYSCTL_XXXs
PR:		kern/11197
Submitted by:	Adrian Chadd <adrian@FreeBSD.org>
Reviewed by:	billf(spelling/style/minor nits)
Looked at by:	bde(style)
1999-05-03 23:57:32 +00:00
julian
f27c95753f Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by:	Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)
1999-03-12 02:24:58 +00:00
dillon
9b8aaaa737 Reviewed by: Julian Elischer <julian@whistle.com>
Add d_parms() to {c,b}devsw[].  If non-NULL this function points to
    a device routine that will properly fill in the specinfo structure.
    vfs_subr.c's checkalias() supplies appropriate defaults.  This change
    should be fully backwards compatible with existing devices.
1999-02-25 05:22:30 +00:00
dillon
17b6679e7e Protect vn worklist and vn->v_{clean,dirty}blkhd at splbio().
Get rid of extra LIST_REMOVE()

Reviewed by:	hsu@FreeBSD.ORG (Jeffrey Hsu), mckusick@McKusick.COM
Submitted by:	hsu@FreeBSD.ORG (Jeffrey Hsu), dillon@backplane.com ( Matthew Dillon )
1999-02-19 17:36:58 +00:00
dillon
ecfbb44ba1 vp->v_object must be valid after normal flow of vfs_object_create()
completes, change if() to KASSERT().  This is not a bug, we are
    simplify clarifying and optimizing the code.

    In if/else in vfs_object_create(), the failure of both conditionals
    will lead to a NULL object.  Exit gracefully if this case occurs.
    ( this case does not normally occur, but needed to be handled ).

Obtained from: Eivind Eklund <eivind@FreeBSD.org>
1999-02-04 18:25:39 +00:00
dillon
a784c6e33d More const fixes for -Wall, -Wcast-qual 1999-01-29 23:18:50 +00:00
dillon
975fba8a24 Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile
1999-01-28 00:57:57 +00:00
dillon
df24433bbe This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
    fixes, several VM optimizations, and some additional revamping of the
    VM code.  The specific bug fixes will be documented with additional
    forced commits.  This commit is somewhat rough in regards to code
    cleanup issues.

Reviewed by:	"John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
1999-01-21 08:29:12 +00:00
eivind
89e1199534 KNFize, by bde. 1999-01-10 01:58:29 +00:00
eivind
a8dc66f457 Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by:	msmith
1999-01-08 17:31:30 +00:00
eivind
ffaaca5874 Remove the 'waslocked' parameter to vfs_object_create(). 1999-01-05 18:50:03 +00:00
eivind
2b3e3223c1 Finish staticization. 1999-01-05 18:12:29 +00:00
bde
734d13314e Ifdefed conditionally used simplock variables. 1999-01-02 11:34:57 +00:00
bde
85c558753b Restored rev.1.31 which was clobbered by rev.1.69 (the big Lite2
merge).  This fixes at least hanging in revoke(2) when a somewhat
active slave pty is revoked.  The hang made the window for the
null pointer bug in ufsspec_{read,write} much larger.

There are many other bugs in this area (revoke of an active fifo
at best leaks memory...).
1998-12-24 12:07:16 +00:00
eivind
a4213663c9 Check return value of tsleep(). I've checked of all call points -
there does not seem to be a problem with this.

PR:		kern/8732
Analysis by:	David G Andersen <danderse@cs.utah.edu>
Tested by:	Alfred Perlstein <bright@hotjobs.com>
1998-12-22 00:44:11 +00:00
eivind
a0317115f8 Staticize. 1998-12-21 23:38:33 +00:00
archie
982e80577d Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by:	Bruce Evans <bde@zeta.org.au>
Reviewed by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	Mike Spengler <mks@networkcs.com>
1998-12-04 22:54:57 +00:00
peter
ca67637f92 Convert lists for bufs attached to vnodes from a LIST to a TAILQ.
- Use TAILQ_* macros extensively instead of internal names
- use b_xflags instead of the NOLIST magic number hack in the next pointer
- clean bufs are inserted at the tail rather than the head.
- redo dirty buffer insert so that metadata (negative lbn) goes to the
  tail directly rather than at the HEAD.  This makes a difference when
  inserting dirty data blocks in lbn sorted order since data block
  insertion will not have to bypass all the metadata cruft.  data is
  lbn sorted since it makes sense for clustering and writeback ordering,
  while metadata sorting doesn't help much since the lbn's are
  meaningless when walking the list for writebacks.

Small systems will not notice much (if any) benefit from this, but really
busy systems with large dirty block lists should get a lot more.

I've tested this with softdep, and it doesn't seem to mind the change of
queueing of metadata.

Reviewed (in princible) by: dg
Obtained from: partly from John Dyson's work-in-progress patches in June.
1998-10-31 14:20:39 +00:00
peter
c9ebff45d0 The last argument to vm_object_page_clean() are now bit flags, rather than
the old true/false.

While here, have vfs_msync() only call vm_object_page_clean() with
OBJPC_SYNC if called with MNT_WAIT flags.  vfs_msync() is called at unmount
time (with MNT_WAIT) and from the syncer process (formerly update).
This should make dirty mmap writebacks a little less nasty.

I have tested this a little with SOFTUPDATES enabled, but I don't normally
use it since I've been badly burned too many times.
1998-10-31 07:42:04 +00:00
bde
9a84781068 Oops, rev.1.167 made the device number checking in bdevvp() too strict
for mfs root mounts.  Don't require major 255 to be in bdevsw[].
1998-10-29 11:50:32 +00:00
peter
fe9b0651f9 Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and
ext2fs call vtruncbuf() directly.  This simplifies and cleans up
vinvalbuf() a little.
1998-10-29 09:51:28 +00:00
bde
285a46e7e4 Updated the major number check in vfs_object_create(). It's not
clear if the check is necessary, but vfs_object_create() is called
for all vnodes and it was silly to create objects for VBLK vnodes
that don't even have a driver.
1998-10-26 08:07:00 +00:00
phk
13c66194f4 Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.
1998-10-25 17:44:59 +00:00
bde
b17bde009a Fixed device number checking in bdevvp():
- dev != NODEV was checked for, but 0 was returned on failure.  This was
  fixed in Lite2 (except the return code was still slightly wrong (ENODEV
  instead of ENXIO)) but the changes were not merged.  This case probably
  doesn't actually occur under FreeBSD.
- major(dev) was not checked to have a valid non-NULL bdevsw entry.  This
  caused panics when the driver for the root device didn't exist.

Fixed minor misformattings in bdevvp().  Rev.1.14 consisted mainly of
gratuitous reformattings that seem to have caused many Lite2 merge
errors.

PR:			8417
1998-10-25 16:11:49 +00:00
dt
1f69c742ab Backed out rev. 1.164. It caused problems on SMP.
PR:		8309
1998-10-14 15:05:52 +00:00
dg
3defb6d13f Fixed two potentially serious classes of bugs:
1) The vnode pager wasn't properly tracking the file size due to
   "size" being page rounded in some cases and not in others.
   This sometimes resulted in corrupted files. First noticed by
   Terry Lambert.
   Fixed by changing the "size" pager_alloc parameter to be a 64bit
   byte value (as opposed to a 32bit page index) and changing the
   pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
   caused some 64bit offsets and sizes to be scrambled. Removing
   the cast required adding casts at a few dozen callers.
   There may be problems with other bogus casts in close-by
   macros. A quick check seemed to indicate that those were okay,
   however.
1998-10-13 08:24:45 +00:00
dt
016509e247 UnVMIO vnodes of block devices when they are no longer in use. (Some
things, like msdosfs, do not work (panic) on devices with VMIO enabled.
FFS enable VMIO on mounted devices, and nothing previously disabled it, so,
after you mounted FFS floppy, you could not mount msdosfs floppy anymore...)

This is mostly a quick before-release fix.

Reviewed by:	bde
1998-10-12 20:14:09 +00:00
sos
8397655514 Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.
1998-09-14 19:56:42 +00:00
bde
a84a2dedfc Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM.  Declare it in a header file.  Don't forget to use
it in non-Lite2 code.  Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.
1998-09-05 15:17:34 +00:00
bde
d3f8ad1875 Oops, the previous revision unconfigured too much pre-Lite2 compatibilty
cruft.  At least lsvfs(1) was broken.
1998-08-29 13:13:10 +00:00
bde
92b68e1a4b Don't configure compatibility code for pre-Lite2 mount() calls by
default.  This code should go away soon.
1998-08-12 20:17:42 +00:00
dfr
bdb4d90548 Initialise all the fields separately in vattr_null since on the alpha
they are not all the same width.
1998-07-12 16:45:39 +00:00
bde
f0b863f4b5 Fixed printf format errors. 1998-07-11 07:46:16 +00:00
bde
9e868cbb1a Removed unused includes. 1998-06-21 18:02:50 +00:00
julian
da325152ee Replace 'sleep()' with 'tsleep()'
Accidentally imported from Kirk's codebase.

Pointed out by: various.
1998-06-10 22:02:14 +00:00
julian
acd840b15c Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Fix for potential hang when trying to reboot the system or
to forcibly unmount a soft update enabled filesystem.
FreeBSD already handled the reboot case differently, this is however a better
fix.
1998-06-10 18:13:19 +00:00
dfr
1d5f38ac22 This commit fixes various 64bit portability problems required for
FreeBSD/alpha.  The most significant item is to change the command
argument to ioctl functions from int to u_long.  This change brings us
inline with various other BSD versions.  Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
1998-06-07 17:13:14 +00:00
tegge
93d053207e Supply the correct process argument to dounmount when possible. 1998-05-17 19:38:55 +00:00
julian
0796a5c56e Add changes and code to implement a functional DEVFS.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.

/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.

Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others

This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.

When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.
1998-04-19 23:32:49 +00:00
peter
bbf574dac9 In vfs_msync(), test to see if the vnode being examined is "interesting"
(ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before
attempting to lock it.  This should reduce the cpu hit that is incurred
when doing a sync(2) and when the syncer process is doing the 30-second
writeback of dirty mmap() data to disk.  Skip this speedup if we are
doing an unmount() to be sure to get everything - we can afford to
occasionally miss a msync while the system is running, but not at unmount.

I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip
doing a page_clean at unmount time just because a vnode is VXLOCKed, but
that's what was being done before...
1998-04-18 06:26:16 +00:00
peter
a1bb161dc4 When the softdep conversion took place, the periodic vfs_msync() from
update got lost.  This is responsible for ensuring that dirty mmap() pages
get periodically written to disk.  Without it, long time mmap's might not
have their dirty pages written out at all of the system crashes or isn't
cleanly shut down.  This could be nasty if you've got a long-running
writing via mmap(), dirty pages used to get written to disk within 30
seconds or so.
1998-04-16 03:31:26 +00:00
tegge
eb926b31ca Unlock mountlist_slock if the mount point was busy (unmount in progress)
during the attempt at lazy fsync.
1998-04-15 18:37:49 +00:00
phk
9b703b1455 Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time.  Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by:	bde
1998-03-30 09:56:58 +00:00
bde
b5ae2c779b Removed unused #includes. 1998-03-28 13:25:01 +00:00
bde
a1015f7749 Don't depend on <sys/mount.h> including <sys/socket.h>. 1998-03-28 12:04:40 +00:00
dyson
aad85d5a04 In kern_physio.c fix tsleep priority messup.
In vfs_bio.c, remove b_generation count usage,
	remove redundant reassignbuf,
	remove redundant spl(s),
	manage page PG_ZERO flags more correctly,
	utilize in invalid value for b_offset until it
		is properly initialized.  Add asserts
		for #ifdef DIAGNOSTIC, when b_offset is
		improperly used.
	when a process is not performing I/O, and just waiting
		on a buffer generally, make the sleep priority
		low.
	only check page validity in getblk for B_VMIO buffers.

In vfs_cluster, add b_offset asserts, correct pointer calculation
	for clustered reads.  Improve readability of certain parts of
	the code.  Remove redundant spl(s).

In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
	<gallatin@cs.duke.edu>).  More vtruncbuf problems fixed.
1998-03-19 22:48:16 +00:00
dyson
dc6d7f19f8 Fix an embarassing problem in vtruncbuf. 1998-03-19 18:46:58 +00:00
dyson
ca4225334c Correct a severely evil bug in the vtruncbuf code. It didn't cause
me any problems until after the previous commit.  This problem then
caused a severe case of creeping crud on my diskdrive, and hosed
my system so bad, that I needed to do a complete reinstall.  Sorry!!!

I assume that others have manifest this bug.
1998-03-17 06:30:52 +00:00
dyson
9bd499d7fa Allow vfs_ioopt to be enabled with a (temporary) config option. 1998-03-16 02:13:03 +00:00
dyson
6e92f5716b Some VM improvements, including elimination of alot of Sig-11
problems.  Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1)	Create an object for kernel page table allocations.  This
	fixes a bogus allocation method previously used for such, by
	grabbing pages from the kernel object, using bogus pindexes.
	(This was a code cleanup, and perhaps a minor system stability
	 issue.)

pmap.c:
2)	Pre-set the modify and accessed bits when prudent.  This will
	decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3)	Rather than calculating the beginning virtual byte offset
	multiple times, stick the offset into the buffer header, so
	that the calculated offset can be reused.  (Long long multiplies
	are often expensive, and this is a probably unmeasurable performance
	improvement, and code cleanup.)

vfs_bio.c:
4)	Handle write recursion more intelligently (but not perfectly) so
	that it is less likely to cause a system panic, and is also
	much more robust.

vfs_bio.c:
5)	getblk incorrectly wrote out blocks that are incorrectly sized.
	The problem is fixed, and writes blocks out ONLY when B_DELWRI
	is true.

vfs_bio.c:
6)	Check that already constituted buffers have fully valid pages.  If
	not, then make sure that the B_CACHE bit is not set. (This was
	a major source of Sig-11 type problems.)

vfs_bio.c:
7)	Fix a potential system deadlock due to an incorrectly specified
	sleep priority while waiting for a buffer write operation.  The
	change that I made opens the system up to serious problems, and
	we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8)	Make clustered reads work more correctly (and more completely)
	when buffers are already constituted, but not fully valid.
	(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9)	Create a vtruncbuf function, which is used by filesystems that
	can truncate files.  The vinvalbuf forced a file sync type operation,
	while vtruncbuf only invalidates the buffers past the new end of file,
	and also invalidates the appropriate pages.  (This was a system reliabiliy
	and performance issue.)

10)	Modify FFS to use vtruncbuf.

vm_object.c:
11)	Make the object rundown mechanism for OBJT_VNODE type objects work
	more correctly.  Included in that fix, create pager entries for
	the OBJT_DEAD pager type, so that paging requests that might slip
	in during race conditions are properly handled.  (This was a system
	reliability issue.)

vm_page.c:
12)	Make some of the page validation routines be a little less picky
	about arguments passed to them.  Also, support page invalidation
	change the object generation count so that we handle generation
	counts a little more robustly.

vm_pageout.c:
13)	Further reduce pageout daemon activity when the system doesn't
	need help from it.  There should be no additional performance
	decrease even when the pageout daemon is running.  (This was
	a significant performance issue.)

vnode_pager.c:
14)	Teach the vnode pager to handle race conditions during vnode
	deallocations.
1998-03-16 01:56:03 +00:00
dyson
7c9ff06841 Disable the vfs.ioopt option for now, so that we don't get gratuitious
bugreports.  I might not be able to fix the problems before 3.0, due
to other, more important things.
1998-03-14 19:50:36 +00:00
tegge
fca0f92630 Don't misuse vnode interlocks in routines that can be called from interrupts.
PR:		5893
1998-03-14 02:55:01 +00:00
julian
10c5ccc30a Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by:	Kirk McKusick (mcKusick@mckusick.com)
Obtained from:  WHistle development tree
1998-03-08 09:59:44 +00:00
dyson
8ceb6160f4 This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code.  These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances.  Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code.  This code might have been committed seperately, but
almost everything is interrelated.

1)	Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
	are fully valid.
2)	Rather than deactivating erroneously read initial (header) pages in
	kern_exec, we now free them.
3)	Fix the rundown of non-VMIO buffers that are in an inconsistent
	(missing vp) state.
4)	Fix the disassociation of pages from buffers in brelse.  The previous
	code had rotted and was faulty in a couple of important circumstances.
5)	Remove a gratuitious buffer wakeup in vfs_vmio_release.
6)	Remove a crufty and currently unused cluster mechanism for VBLK
	files in vfs_bio_awrite.  When the code is functional, I'll add back
	a cleaner version.
7)	The page busy count wakeups assocated with the buffer cache usage were
	incorrectly cleaned up in a previous commit by me.  Revert to the
	original, correct version, but with a cleaner implementation.
8)	The cluster read code now tries to keep data associated with buffers
	more aggressively (without breaking the heuristics) when it is presumed
	that the read data (buffers) will be soon needed.
9)	Change to filesystem lockmgr locks so that they use LK_NOPAUSE.  The
	delay loop waiting is not useful for filesystem locks, due to the
	length of the time intervals.
10)	Correct and clean-up spec_getpages.
11)	Implement a fully functional nfs_getpages, nfs_putpages.
12)	Fix nfs_write so that modifications are coherent with the NFS data on
	the server disk (at least as well as NFS seems to allow.)
13)	Properly support MS_INVALIDATE on NFS.
14)	Properly pass down MS_INVALIDATE to lower levels of the VM code from
	vm_map_clean.
15)	Better support the notion of pages being busy but valid, so that
	fewer in-transit waits occur.  (use p->busy more for pageouts instead
	of PG_BUSY.)  Since the page is fully valid, it is still usable for
	reads.
16)	It is possible (in error) for cached pages to be busy.  Make the
	page allocation code handle that case correctly.  (It should probably
	be a printf or panic, but I want the system to handle coding errors
	robustly.  I'll probably add a printf.)
17)	Correct the design and usage of vm_page_sleep.  It didn't handle
	consistancy problems very well, so make the design a little less
	lofty.  After vm_page_sleep, if it ever blocked, it is still important
	to relookup the page (if the object generation count changed), and
	verify it's status (always.)
18)	In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19)	Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20)	Fix vm_pager_put_pages and it's descendents to support an int flag
	instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
dyson
1928b68b1a Change vfs.ioopt default back to '0'. 1998-03-01 23:07:45 +00:00
dyson
69e5a1e9f5 1) Use a more consistent page wait methodology.
2)	Do not unnecessarily force page blocking when paging
	pages out.
3)	Further improve swap pager performance and correctness,
	including fixing the paging in progress deadlock (except
	in severe I/O error conditions.)
4)	Enable vfs_ioopt=1 as a default.
5)	Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."
1998-03-01 04:18:54 +00:00
dyson
32e0a3673a Clean-up the vget mechanism by permanently attaching VM objects to
vnodes, therefore vget doesn't need to do so anymore.  Other minor
improvements include the temp free vnode queue obeying the VAGE
flag and a printf that warns of to-be-removed code being executed.
1998-02-23 06:59:52 +00:00
kato
8b0f1ac87b Fixed vnode interlock handling.
Reviewed by:	Bruce Evans <bde@zeta.org.au>
            	Tor Egge <Tor.Egge@idi.ntnu.no>
1998-02-10 02:54:24 +00:00
eivind
d7a6ab2803 Staticize. 1998-02-09 06:11:36 +00:00
kato
95aa7532e2 When the vp is lcoked, vget() calls vfs_object_create() with
waslocked = TRUE.  This change may fix lockmgr panic in umapfs/nullfs.

PR:		5634
Reviewed by:	"John S. Dyson" <toor@dyson.iquest.net>
Suggested by:	Bruce Evans <bde@zeta.org.au>
1998-02-07 08:44:31 +00:00
eivind
4547a09753 Back out DIAGNOSTIC changes. 1998-02-06 12:14:30 +00:00
dyson
ebccbfc1ff 1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2)	When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3)	When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
	processor errata, and to minimize redundant processor updating of page
	tables.
4)	Modify pmap_protect so that it can only remove permissions (as it
	originally supported.)  The additional capability is not needed.
5)	Streamline read-only to read-write page mappings.
6)	For pmap_copy_page, don't enable write mapping for source page.
7)	Correct and clean-up pmap_incore.
8)	Cluster initial kern_exec pagin.
9)	Removal of some minor lint from kern_malloc.
10)	Correct some ioopt code.
11)	Remove some dead code from the MI swapout routine.
12)	Correct vm_object_deallocate (to remove backing_object ref.)
13)	Fix dead object handling, that had problems under heavy memory load.
14)	Add minor vm_page_lookup improvements.
15)	Some pages are not in objects, and make sure that the vm_page.c can
	properly support such pages.
16)	Add some more page deficit handling.
17)	Some minor code readability improvements.
1998-02-05 03:32:49 +00:00