Commit Graph

238 Commits

Author SHA1 Message Date
jeff
e6d26e8880 This patch adds the "LOCKSHARED" option to namei which causes it to only acquire shared locks on leafs.
The stat() and open() calls have been changed to make use of this new functionality.  Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes.  Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.

A new kernel option "LOOKUP_SHARED" has been added.  It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable.  I have personally been running this
patch for over a year now, so it is believed to be fully stable.

Reviewed by:	jake, obrien
Approved by:	jake
2002-03-12 04:00:11 +00:00
tanimura
22c75bf1c9 Stop abusing the pgrpsess_lock. 2002-03-11 07:53:13 +00:00
eivind
05c20bbea2 Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by:	phk (but all errors are mine)
2002-03-05 15:38:49 +00:00
jhb
3706cd3509 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
tanimura
a09da29859 Lock struct pgrp, session and sigio.
New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
  session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by:	jhb, alfred
Tested on:	cvsup.jp.FreeBSD.org
		(which is a quad-CPU machine running -current)
2002-02-23 11:12:57 +00:00
rwatson
5587a218c1 More cleanups relating to vm object allocation failure: make sure we
call VOP_CLOSE() with vp unlocked; clean up the return path a little,
in as much as our namei/vnode operation return paths can be cleared
up.  For a return case that was apparently never taken, this sure
is ugly.

Reviewed by:	jeffr
2002-02-20 00:11:57 +00:00
iedowse
269a0d55a1 Add the braces missed by revision 1.131.
Pointy hat to:	rwatson
2002-02-18 12:46:18 +00:00
rwatson
43f97fe4ae When vn_open() is failing because it cannot allocate a vm object, call
VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls
are symmetric in all failure cases.  This prevents an 'open' reference
from being leaked in that unlikely failure scenario.
2002-02-18 00:26:10 +00:00
rwatson
2bbae54d18 Make sure to hold vnode lock when calling into VOP_GETATTR().
Discussed with:	mckusick, phk
2002-02-10 21:44:30 +00:00
rwatson
6ca91a055c Part I: Update extended attribute API and ABI:
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
  as not to use the scatter gather API (which appeared not to be used
  by any consumers, and be less portable), rather, accepts 'data'
  and 'nbytes' in the style of other simple read/write interfaces.
  This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
  a size_t.  When performing a read, the number of bytes read will
  be returned, unless the data pointer is NULL, in which case the
  number of bytes of data are returned.  This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
  argument so as to return the size, if desirable.  If set to NULL,
  the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable.  More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-02-10 04:43:22 +00:00
phk
f8c5229a89 Make st_blksize default to PAGE_SIZE instead of zero. 2002-01-25 16:39:57 +00:00
dillon
f51ea914df Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal
operation.  The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.

Reviewed by:	peter
Approved by:	re (for MFC)
MFC after:	1 day
2002-01-19 02:14:45 +00:00
alfred
844237b396 SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
dillon
1750942f6f This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient.  The original code had serious balancing
problems and could also deadlock easily.  This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes.  This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by:	re@freebsd.org
Obtained from:	ps, peter, dillon
MFS Assuming:	Assuming no problems crop up in Yahoo testing
MFC after:	7 days
2001-12-18 20:48:54 +00:00
alfred
015f13094a turn vn_open() into a wrapper around vn_open_cred() which allows
one to perform a vn_open using temporary/other/fake credentials.

Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open().  This should close the race hopefully.
2001-11-11 22:39:07 +00:00
rwatson
83a8501a7c o vn_open() fails to call VOP_CLOSE() if vfs_object_create fails. Ideally
all successful calls to VOP_OPEN() might be reflected in a call to
  VOP_CLOSE().  For now, simply add a comment reflecting this problem;
  this should be fixed at some point.
2001-10-23 19:09:01 +00:00
jhb
4c62ba7c58 Add missing includes of sys/lock.h. 2001-10-11 17:52:20 +00:00
dillon
9ab85c5929 Make uio_yield() a global. Call uio_yield() between chunks
in vn_rdwr_inchunks(), allowing other processes to gain an exclusive
lock on the vnode.  Specifically: directory scanning, to avoid a race to the
root directory, and multiple child processes coring simultaniously so they
can figure out that some other core'ing child has an exclusive adv lock and
just exit instead.

This completely fixes performance problems when large programs core.  You
can have hundreds of copies (forked children) of the same binary core all
at once and not notice.

MFC after:	3 days
2001-09-26 06:54:32 +00:00
julian
5596676e6c KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
dillon
d73b3c59f0 This brings in a Yahoo coredump patch from Paul, with additional mods by
me (addition of vn_rdwr_inchunks).  The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout.  This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.

This patch solves the problem in two ways.  First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file.  Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.

Submitted by:	ps
Reviewed by:	dillon
MFC after:	2 weeks
2001-09-08 20:02:33 +00:00
ache
a480a6737d vn_stat(): if va_size (u_quad_t) > OFF_MAX, return EOVERFLOW, don't copy it
blindly to st_size
2001-08-23 17:56:48 +00:00
dillon
a179ee09ab This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
grog
4b9d9cbaac Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
mckusick
f863141979 When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
2001-04-25 08:11:18 +00:00
grog
1f5de30718 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
bp
d38c3464d1 Previous commit broke interlock locking for !LK_RETRY case. 2001-03-26 12:45:35 +00:00
bp
9aefd18cc8 Prevent race condition by using msleep() instead of mtx_unlock()/tsleep().
Reviewed by:	alfred
2001-03-26 03:10:07 +00:00
rwatson
0012887962 o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by:	jkh
Obtained from:	TrustedBSD Project
2001-03-19 05:44:15 +00:00
rwatson
f773ff5a87 o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
  character namespace indicator.  This is in line with more recent
  thinking on EA interfaces on various mailing lists, including the
  posix1e, Linux acl-devel, and trustedbsd-discuss forums.  Two namespaces
  are defined by default, EXTATTR_NAMESPACE_SYSTEM and
  EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
  access control model: user EAs are accessible based on the normal
  MAC and DAC file/directory protections, and system attributes are
  limited to kernel-originated or appropriately privileged userland
  requests.

o These API changes occur at several levels: the namespace argument is
  introduced in the extattr_{get,set}_file() system call interfaces,
  at the vnode operation level in the vop_{get,set}extattr() interfaces,
  and in the UFS extended attribute implementation.  Changes are also
  introduced in the VFS extattrctl() interface (system call, VFS,
  and UFS implementation), where the arguments are modified to include
  a namespace field, as well as modified to advoid direct access to
  userspace variables from below the VFS layer (in the style of recent
  changes to mount by adrian@FreeBSD.org).  This required some cleanup
  and bug fixing regarding VFS locks and the VFS interface, as a vnode
  pointer may now be optionally submitted to the VFS_EXTATTRCTL()
  call.  Updated documentation for the VFS interface will be committed
  shortly.

o In the near future, the auto-starting feature will be updated to
  search two sub-directories to the ".attribute" directory in appropriate
  file systems: "user" and "system" to locate attributes intended for
  those namespaces, as the single filename is no longer sufficient
  to indicate what namespace the attribute is intended for.  Until this
  is committed, all attributes auto-started by UFS will be placed in
  the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
  been updated to no longer include the '$' in their filename.  As such,
  if you're using these features, you'll need to rename the attribute
  backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
  be committed shortly.  These include modifications to the extended
  attribute utilities, as well as to libutil for new namespace
  string conversion routines.  Once the matching userland changes are
  committed, a buildworld is recommended to update all the necessary
  include files and verify that the kernel and userland environments
  are in sync.  Note: If you do not use extended attributes (most people
  won't), upgrading is not imperative although since the system call
  API has changed, the new userland extended attribute code will no longer
  compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
  conditional on FFS_EXTATTR, which should recover a bit of space on
  kernels running without EA's, as well as update copyright dates.

Obtained from:	TrustedBSD Project
2001-03-15 02:54:29 +00:00
jlemon
11781a7431 Extend kqueue down to the device layer.
Backwards compatible approach suggested by: peter
2001-02-15 16:34:11 +00:00
bmilekic
f364d4ac36 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
jasone
8d2ec1ebc4 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
dillon
2ace352085 Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
phk
4e063f5534 Take VBLK devices further out of their missery.
This should fix the panic I introduced in my previous commit on this topic.
2000-11-02 21:14:13 +00:00
jhb
d944886e4d Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h
2000-10-20 07:58:15 +00:00
jasone
4e290e67b7 Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
rwatson
9b419172b9 o Introduce vn_extattr_rm(), a helper function in the style of
vn_extattr_get() and vn_extattr_set().  vn_extattr_rm() removes the
  specified extended attribute from a vnode, authorizing the change as
  the kernel (NULL cred).

Obtained from:	TrustedBSD Project
2000-09-22 22:33:13 +00:00
rwatson
81eb6ee3cf o vn_extattr_set() will now call appropriate vn_start_write() and
vn_finished_write() if IO_NODELOCKED is not set.

Obtained from:	TrustedBSD Project
2000-09-05 03:15:02 +00:00
rwatson
b0f6f79479 o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
  such as capability code.  Both accept a vnode (optionally locked,
  with ioflg to indicate that), attribute name, and a buffer + buffer
  length in UIO_SYSSPACE.  Both authorize the call as a kernel request,
  with cred set to NULL for the actual VOP_ calls.

Obtained from:	TrustedBSD Project
2000-08-08 17:15:32 +00:00
mckusick
acc66855bf This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
2000-07-24 05:28:33 +00:00
mckusick
a3d0c189ea Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
mckusick
040e64cd97 Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from:	 BSD/OS
2000-07-04 03:34:11 +00:00
jlemon
f077cb82c8 Fix stupid braino in last commit, initialize `vp' before we test vp->v_tag.
Spotted by: dillon
2000-06-25 18:10:45 +00:00
jlemon
a99f398c52 Add a hack to fail registration of kq events on a non-ufs filesystem, as
support for those is non-existent at the moment.
2000-06-22 18:41:07 +00:00
jake
961b97d434 Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
jake
d93fbc9916 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
asmodai
904f94ecca Fix comment typo.
Submitted by:	nrahlstr
2000-05-12 16:06:49 +00:00
phk
36c3965ff9 Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
phk
10914aa708 Remove unneeded #include <vm/vm_zone.h>
Generated by:	src/tools/tools/kerninclude
2000-04-30 18:52:11 +00:00
jlemon
c41c876463 Introduce kqueue() and kevent(), a kernel event notification facility. 2000-04-16 18:53:38 +00:00
dillon
057e33d02c Change the write-behind code to take more care when starting
async I/O's.  The sequential read heuristic has been extended to
    cover writes as well.  We continue to call cluster_write() normally,
    thus blocks in the file will still be reallocated for large (but still
    random) I/O's, but I/O will only be initiated for truely sequential
    writes.

    This solves a number of annoying situations, especially with DBM (hash
    method) writes, and also has the side effect of fixing a number of
    (stupid) benchmarks.

Reviewed-by: mckusick
2000-04-02 00:55:28 +00:00
phk
ae0c1ec8f7 Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by:	bde
2000-01-10 12:04:27 +00:00
mckusick
2f9951ffbd Add bwillwrite to all system calls that create things in the filesystem.
Benchmarks that create huge trees of empty files overwhelm the buffer cache.
2000-01-10 00:08:53 +00:00
eivind
87724eb673 Introduce NDFREE (and remove VOP_ABORTOP) 1999-12-15 23:02:35 +00:00
dillon
13bf2dbc72 Ensure that garbage from the kernel stack does not wind up being
returned to user mode in the spare fields of the stat structure.

PR:		kern/14966
Reviewed by:	dillon@freebsd.org
Submitted by:	Kelly Yancey kbyanc@posi.net
1999-11-18 08:14:20 +00:00
peter
2b764c6970 Add a vnode fo_stat() entry point. 1999-11-08 03:32:15 +00:00
green
140cb4ff83 This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes.  The biggest change is that now, you don't use
	fp->f_ops->fo_foo(fp, bar)
but instead
	fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided.  Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by:	peter
1999-09-19 17:00:25 +00:00
julian
5c78e7345a Changes to centralise the default blocksize behaviour.
More likely to follow.

Submitted by: phk@freebsd.org
1999-09-09 19:08:44 +00:00
julian
fd9cb11e53 Revert a bunch of contraversial changes by PHK. After
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.

The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
1999-09-03 05:16:59 +00:00
phk
be3ce843d2 Improve the returned values in st_blksize a little bit, avoid
accessing union fields not valid for dev_t type.
1999-09-01 05:36:55 +00:00
phk
216936ca6d Make bdev userland access work like cdev userland access unless
the highly non-recommended option ALLOW_BDEV_ACCESS is used.

(bdev access is evil because you don't get write errors reported.)

Kill si_bsize_best before it kills Matt :-)

Use the specfs routines rather having cloned copies in devfs.
1999-08-30 07:56:23 +00:00
peter
3b842d34e8 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
green
080e369bf2 Add FIODTYPE ioctl for getting d_flags (type) info on a device.
Okayed by:	phk
1999-08-27 16:35:37 +00:00
phk
8bfe025139 Add a couple of missing but unimportant break; statements. 1999-08-25 11:44:11 +00:00
phk
681535f3fe oops: Add missing include. 1999-08-13 11:22:48 +00:00
phk
a45a44a2bd Move the special-casing of stat(2)->st_blksize for device files
from UFS to the generic level.  For chr/blk devices we don't care
about the blocksize of the filesystem, we want what the device
asked for.
1999-08-13 10:56:07 +00:00
green
c03366a55d Fix fd race conditions (during shared fd table usage.) Badfileops is
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.

This does not fix other problems mentioned in this PR than the first
one.

PR:		11629
Reviewed by:	peter
1999-08-04 18:53:50 +00:00
alc
743eeab638 Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by:	dillon
1999-07-26 06:25:53 +00:00
mckusick
52ea4270f3 These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations.  I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

    * buffer cache hash table now dynamically allocated.  This will
      have no effect on memory consumption for smaller systems and
      will help scale the buffer cache for larger systems.

    * minor enhancement to pmap_clearbit().  I noticed that
      all the calls to it used constant arguments.  Making
      it an inline allows the constants to propogate to
      deeper inlines and should produce better code.

    * removal of inherent vfs_ioopt support through the emplacement
      of appropriate #ifdef's, with John's permission.  If we do not
      find a use for it by the end of the year we will remove it entirely.

    * removal of getnewbufloops* counters & sysctl's - no longer
      necessary for debugging, getnewbuf() is now optimal.

    * buffer hash table functions removed from sys/buf.h and localized
      to vfs_bio.c

    * VFS_BIO_NEED_DIRTYFLUSH flag and support code added
      ( bwillwrite() ), allowing processes to block when too many dirty
      buffers are present in the system.

    * removal of a softdep test in bdwrite() that is no longer necessary
      now that bdwrite() no longer attempts to flush dirty buffers.

    * slight optimization added to bqrelse() - there is no reason
      to test for available buffer space on B_DELWRI buffers.

    * addition of reverse-scanning code to vfs_bio_awrite().
      vfs_bio_awrite() will attempt to locate clusterable areas
      in both the forward and reverse direction relative to the
      offset of the buffer passed to it.  This will probably not
      make much of a difference now, but I believe we will start
      to rely on it heavily in the future if we decide to shift
      some of the burden of the clustering closer to the actual
      I/O initiation.

    * Removal of the newbufcnt and lastnewbuf counters that Kirk
      added.  They do not fix any race conditions that haven't already
      been fixed by the gbincore() test done after the only call
      to getnewbuf().  getnewbuf() is a static, so there is no chance
      of it being misused by other modules.  ( Unless Kirk can think
      of a specific thing that this code fixes.  I went through it
      very carefully and didn't see anything ).

    * removal of VOP_ISLOCKED() check in flushbufqueues().  I do not
      think this check is necessary, the buffer should flush properly
      whether the vnode is locked or not. ( yes? ).

    * removal of extra arguments passed to getnewbuf() that are not
      necessary.

    * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
      vfs_cluster.c

    * vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
      which should greatly aid flushing operations in heavy load
      situations - both the pageout and update daemons will be able
      to operate more efficiently.

    * removal of b_usecount.  We may add it back in later but for now
      it is useless.  Prior implementations of the buffer cache never
      had enough buffers for it to be useful, and current implementations
      which make more buffers available might not benefit relative to
      the amount of sophistication required to implement a b_usecount.
      Straight LRU should work just as well, especially when most things
      are VMIO backed.  I expect that (even though John will not like
      this assumption) directories will become VMIO backed some point soon.

Submitted by:	Matthew Dillon <dillon@backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-08 06:06:00 +00:00
phk
b6d067fe70 Make sure that stat(2) and friends always return a valid st_dev field.
Pseudo-FS need not fill in the va_fsid anymore, the syscall code
will use the first half of the fsid, which now looks like a udev_t
with major 255.
1999-07-02 16:29:47 +00:00
phk
ca21a25f17 This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing.  The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact:  "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

   I have no scripts for setting up a jail, don't ask me for them.

   The IP number should be an alias on one of the interfaces.

   mount a /proc in each jail, it will make ps more useable.

   /proc/<pid>/status tells the hostname of the prison for
   jailed processes.

   Quotas are only sensible if you have a mountpoint per prison.

   There are no privisions for stopping resource-hogging.

   Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by:   http://www.rndassociates.com/
Run for almost a year by:       http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
phk
16e3fbd2c1 Suser() simplification:
1:
  s/suser/suser_xxx/

2:
  Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
  s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.
1999-04-27 11:18:52 +00:00
alc
a58e3203d4 Address several problems in vn_read and vn_write:
1. Make read-ahead work for pread and aio_read.

2. Fix one place where a comparison of uio_offset with -1
   wasn't updated to use FOF_OFFSET.

3. Honor O_APPEND in the FOF_OFFSET case.

In addition, use the variable name "ioflag" in both vn_read and
vn_write to avoid possible confusion between the variable "flag"
and the parameter "flags".

Submitted by:	Bruce Evans <bde@zeta.org.au> and me
1999-04-21 05:56:45 +00:00
dt
f13dd5fa6d Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.
1999-04-04 21:41:28 +00:00
alc
1e32f0021e Changed vn_read/write such that fp->f_offset isn't touched
if uio->uio_offset != -1.  This fixes a problem with aio_read/write
and permits a straightforward implementation of pread/pwrite.

PR:		kern/8669
Submitted by:	John Plevyak <jplevyak@inktomi.com>
Reviewed by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-03-26 20:25:21 +00:00
phk
3d7d9296c0 Use suser() to determine super-user-ness, don't examine cr_uid directly. 1999-01-30 12:21:49 +00:00
eivind
836035a4ed Add 'options DEBUG_LOCKS', which stores extra information in struct
lock, and add some macros and function parameters to make sure that
the information get to the point where it can be put in the lock
structure.

While I'm here, add DEBUG_VFS_LOCKS to LINT.
1999-01-20 14:49:12 +00:00
eivind
ffaaca5874 Remove the 'waslocked' parameter to vfs_object_create(). 1999-01-05 18:50:03 +00:00
peter
e268ff150f Only do one VOP_ACCESS() per open() instead of two. This should reduce
the NFSv3 ACCESS RPC problems a little for busy clients that do a lot of
open/close.  The nfs code could probably cache the results, but I'm not
sure whether this would be legal or useful.  The problem is that with
a CPU farm, on each open there would be a lookup, getattr then access RPC
then the read/write RPC activity.  Caching the access results probably
isn't going to help much if the clients access lots of files.  Having the
nfs_access() routine interpret the getattr results is a bit of a hack, but
it's how NFSv2 is done and it might be OK for a mount attribute for v3.
1998-11-02 02:36:16 +00:00
phk
4cb7827a87 Report the mode as the result of the VOP_GETATTR rather than the
vnodes type, they may not correspond.
1998-06-27 06:43:09 +00:00
dfr
1d5f38ac22 This commit fixes various 64bit portability problems required for
FreeBSD/alpha.  The most significant item is to change the command
argument to ioctl functions from int to u_long.  This change brings us
inline with various other BSD versions.  Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
1998-06-07 17:13:14 +00:00
msmith
964ce778b1 In the words of the submitter:
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it.  This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.

Quality testing was done with testvn, and lat_fs from the lmbench suite.

Some NFS client testing courtesy of Patrik Kudo.

vop_mknod and vop_symlink still release the returned vpp.  vop_rename
still releases 4 vnode arguments before it returns.  These remaining cases
will be corrected in the next set of patches.
---------

Submitted by:	Michael Hancock <michaelh@cet.co.jp>
1998-05-07 04:58:58 +00:00
alex
dd07a29831 Grammar police. 1998-04-10 00:09:04 +00:00
wosch
e3aed30232 New mount option nosymfollow. If enabled, the kernel lookup()
function will not follow symbolic links on the mounted
file system and return EACCES (Permission denied).
1998-04-08 18:31:59 +00:00
peter
33b5f6b63d Today is not my lucky day. Fix missing brace and I got a request
to use EMLINK instead.
1998-04-06 19:32:37 +00:00
peter
db031c687b Use a different errno (ELOOP (as sef mentioned) since the text that goes
with the error sounds ok for the condition) if O_NOFOLLOW gets a link.
1998-04-06 18:43:28 +00:00
peter
b0a513624f Rather than let users get fd's to symlink files, make O_NOFOLLOW cause
an error if it gets a link (like it does if it gets a socket).  The
implications of letting users try and do file operations on symlinks
themselves were too worrying.
1998-04-06 18:25:21 +00:00
peter
d5ab1c3759 Implement a new open(2) flag: O_NOFOLLOW. This will instruct open
to not follow symlinks, but to open a handle on the link itself(!).
As strange as this might sound, it has several useful applications
safe race-free ways of opening files in hostile areas (eg: /tmp, a mode
1777 /var/mail, etc).  It also would allow things like fchown() to work
on the link rather than having to implement a new syscall specifically for
that task.

Reviewed by: phk
1998-04-06 17:38:43 +00:00
bde
d185d33240 Removed unused #includes. 1998-02-25 05:58:50 +00:00
eivind
4547a09753 Back out DIAGNOSTIC changes. 1998-02-06 12:14:30 +00:00
eivind
c552a9a1c3 Turn DIAGNOSTIC into a new-style option. 1998-02-04 22:34:03 +00:00
dyson
d9d8bf6d30 Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
1998-01-12 01:46:33 +00:00
dyson
cb2800cd94 Make our v_usecount vnode reference count work identically to the
original BSD code.  The association between the vnode and the vm_object
no longer includes reference counts.  The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also.  The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached.  The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code.  There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
1998-01-06 05:26:17 +00:00
dyson
8ab3ac77d2 Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.
1997-12-29 01:03:55 +00:00
dyson
cd67bb82fe Lots of improvements, including restructring the caching and management
of vnodes and objects.  There are some metadata performance improvements
that come along with this.  There are also a few prototypes added when
the need is noticed.  Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.
1997-12-29 00:25:11 +00:00
sef
c7d273eccb Changes to allow event-based process monitoring and control. 1997-12-06 04:11:14 +00:00
dyson
6781e003d5 Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional.  Docs will be forthcoming.
1997-11-29 01:33:10 +00:00
phk
4d26888936 Remove a bunch of variables which were unused both in GENERIC and LINT.
Found by:	-Wunused
1997-11-07 08:53:44 +00:00
bde
bd7b557dd6 Use 127 instead of CHAR_MAX for the limit on the sequence count. The
limit doesn't have anything to do with characters.  The count mainly
needs to fit in the VOP_READ() ioflag after being left shifted by 16.

Moved vn_lock() before vn_closefile().  vn_lock() was mismerged from
Lite2.

Removed some gratuitous braces.
1997-10-27 15:26:23 +00:00
dyson
df56983676 Relax the vnode locking for read only operations. 1997-10-06 02:38:30 +00:00
peter
fe8263de9d vn_select -> vn_poll 1997-09-14 02:51:16 +00:00
bde
6ffb8bf9af Removed unused #includes. 1997-09-02 20:06:59 +00:00
dfr
5974d18a75 [Previous comment was incorrect for these files]
Added calls to VFS lock debugging macros to make fixing filesystems' locking
easier.
1997-04-04 17:47:43 +00:00
dfr
60008c7902 Add a function vop_sharedlock which a copy of vop_nolock without the
implementation #ifdef out.  This can be used for now by NFS.  As soon
as all the other filesystems' locking is fixed, this can go away.

Print the vnode address in vprint for easier debugging.
1997-04-04 17:46:21 +00:00
bde
a0d9474a34 Don't include <sys/ioctl.h> in the kernel. Stage 4: include
<sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h>
in miscellaneous files.  Most of these files have nothing to do
with ttys but need to include <sys/ttycom.h> to get the definitions
of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into
ioctls.
1997-03-24 11:52:29 +00:00
bde
0d3591bdbd Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place.  Most things don't depend on file.h stuff at all.
1997-03-23 03:37:54 +00:00
guido
d3df6f9beb Fix style bugs and other bugs in the NFS fix. 1997-03-08 15:14:30 +00:00
gpalmer
7293d8e7d6 Fix (I hope) the NFS hole. This is only compile tested.
Submitted by:	(partly) davids@SECNET.COM via BUGTRAQ
1997-03-07 07:42:41 +00:00
peter
94b6d72794 Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
1997-02-22 09:48:43 +00:00
dyson
10f666af84 This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
		Mount_std mounts will not work until the getfsent
		library routine is changed.

Reviewed by:	various people
Submitted by:	Jeffery Hsu <hsu@freebsd.org>
1997-02-10 02:22:35 +00:00
jkh
808a36ef65 Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore.  This update would have been
insane otherwise.
1997-01-14 07:20:47 +00:00
dyson
04a0ee9a2a This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis.  This will allow multiple processes reading the
same file to take advantage of read-ahead clustering.  Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm.  Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation.   The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE:  THAT LKMS MUST BE REBUILT!!!
1996-12-29 02:45:28 +00:00
dyson
966cbc5d29 Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust.  Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls.  The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg
1996-08-21 21:56:23 +00:00
dyson
77f053059a Remove a now unnecessary function prototype. 1996-03-09 06:42:15 +00:00
dyson
b7119381dd Enable VMIO for non-VDIR metadata and block device. 1996-03-02 03:45:12 +00:00
dyson
8fc8a772af Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
	overhead for merged cache.
Efficiency improvement for vfs_cluster.  It used to do alot of redundant
	calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files.  Additionally,
	fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources.  The pageout code
	will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
	page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
	thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
	that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
	case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
	buffers.
1996-01-19 04:00:31 +00:00
phk
02cbd66c6d Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols:  3647
unref symbols:   463
undef symbols:     4
1 ref symbols:  1751
2 ref symbols:   485
Approaching the pain threshold now.
1995-12-17 21:23:44 +00:00
dyson
601ed1a4c0 Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
1995-12-11 04:58:34 +00:00
dg
c30f46c534 Untangled the vm.h include file spaghetti. 1995-12-07 12:48:31 +00:00
dg
b5341559e2 Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2
1995-10-22 09:32:48 +00:00
phk
d78eeda10d A little hack to avoid a 64bit divide. Can go away if Gcc ever learns to
optimise 64bit stuff...
1995-10-06 09:43:32 +00:00
dg
b8e783516f vnode_pager_alloc() never returns NULL, so don't check for it. 1995-07-20 09:43:12 +00:00
bde
81e1e32f6c Don't include <sys/tty.h> in drivers that aren't tty drivers or in general
files that don't depend on the internals of <sys/tty.h>
1995-07-16 10:13:08 +00:00
dg
c8b0a7332c NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
   haspage, and sync operations are supported. The haspage interface now
   provides information about clusterability. All pager routines now take
   struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
   confusion caused by pagers being both a data structure ("allocate a
   pager") and a collection of routines. The idea of a pager structure has
   escentially been eliminated. Objects now have types, and this type is
   used to index the appropriate pager. In most cases, items in the pager
   structure were duplicated in the object data structure and thus were
   unnecessary. In the few cases that remained, a un_pager structure union
   was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
   be removed. For instance, vm_object_enter(), vm_object_lookup(),
   vm_object_remove(), and the associated object hash list were some of the
   things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
   SMP locking primitives used in the VM system aren't likely the mechanism
   that we'll be adopting. Even if it were, the locking that was in the code
   was very inadequate and would have to be mostly re-done anyway. The
   locking in a uni-processor kernel was a no-op but went a long way toward
   making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
   thread support have been fixed to reflect the reality that we are really
   dealing with processes, not threads. The VM system didn't have complete
   thread support, so the comments and mis-named routines were just wrong.
   We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
   pager_alloc routines. Most of the pager_allocs have been rewritten and
   are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
   now tries harder to output an even number of pages before and after
   the requested page. This is sort of the reverse of the ideal pagein
   algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
   have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
    The fact that the vm_object data structure escentially had this
    backwards really confused things. The use of "shadow" and "backing
    object" throughout the code is now internally consistent and correct
    in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
    0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
    of objects to the "swap" type. The previous checks throughout the code
    for swp->pg_data != NULL were really ugly. This change also provides
    the rudiments for future backing of "anonymous" memory by something
    other than the swap pager (via the vnode pager, for example), and it
    allows the decision about which of these pagers to use to be made
    dynamically (although will need some additional decision code to do
    this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
    object" code has been removed. MAP_COPY was undocumented and non-
    standard. It was furthermore broken in several ways which caused its
    behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
    continue to work correctly, but via the slightly different semantics
    of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
    threads design can be worked around in other ways. Both #12 and #13
    were done to simplify the code and improve readability and maintain-
    ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
   this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
   information provided by the new haspage pager interface. This will
   substantially reduce the overhead by eliminating a large number of
   VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
   improved to provide both a "behind" and "ahead" indication of
   contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
   It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
   via a much more general mechanism that could also be used for disk
   striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
   fact that it makes calls into the swap pager and knows too much about
   how the swap pager operates really bothers me. It also doesn't allow
   for collapsing of non-swap pager objects ("unnamed" objects backed by
   other pagers).
1995-07-13 08:48:48 +00:00
dg
24d64ff84a Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.
1995-07-09 06:58:03 +00:00
dg
c6d1e93313 Removed extra semicolon. 1995-06-28 12:32:47 +00:00
dg
3c7c1dd62f 1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
   after vnode_pager_alloc() calls - the object is already guaranteed to be
   persistent.
3) Removed some gratuitous casts.
1995-06-28 12:01:13 +00:00
rgrimes
c86f0c7a71 Remove trailing whitespace. 1995-05-30 08:16:23 +00:00
dg
6d9a56d4ef Unlock the vnode before sleeping on an OBJ_DEAD object. Should fix Bruce's
hang. Fixed some formatting anomolies and removed some unneeded casts.
1995-05-10 18:59:11 +00:00
dg
fa7269fdef Removed unnecessary call to vnode_pager_uncache(). We automatically clear
the VTEXT flag after all mappers have finished with the object.
1995-03-19 12:08:03 +00:00
phk
eb7832a7dc YFfix 1995-02-14 06:31:13 +00:00
dg
1707d41102 These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme.  The scheme is almost fully compatible with the old filesystem
interface.  Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache.  Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code.  Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now.  Somehow in 2.0, some "enhancements"
broke the code.  This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code.  No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency.  Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache.  Add "bypass" support for sneaking in on
busy buffers.

Submitted by:	John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
dg
52b4dc9c30 Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by:	John Dyson
1994-10-05 09:48:45 +00:00
phk
c3e4945541 All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.
1994-10-02 17:35:40 +00:00
dg
c87abe4536 Moved over my fix for vnode lossage when multiple TIOCSCTTY ioctls are
done. This patch was extended to also include a suggested change by
Kirk McKusick which allows the control tty to be reasigned to a different
tty without losing a vnode.
1994-08-18 03:53:38 +00:00
dg
8d205697aa Added $Id$ 1994-08-02 07:55:43 +00:00
rgrimes
2469c867a1 The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by:	Rodney W. Grimes
Submitted by:	John Dyson and David Greenman
1994-05-25 09:21:21 +00:00
rgrimes
8fb65ce818 BSD 4.4 Lite Kernel Sources 1994-05-24 10:09:53 +00:00