Commit Graph

159 Commits

Author SHA1 Message Date
Matthew Dillon
d331c5d43f Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on.  Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
    Dirties more cache lines during lookups.

    Not as fast as a hash table lookup (but still N log N and optimal
    when there is locality of reference).

Advantages
    vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
    syncer operate more efficiently.

    I get to rip out all the old hacks (some of which were mine) that tried
    to keep the v_dirtyblkhd tailq sorted.

    The per-vnode splay tree should be easier to lock / SMPng pushdown on
    vnodes will be easier.

    This commit along with another that Alan is working on for the VM page
    global hash table will allow me to implement ranged fsync(), optimize
    server-side nfs commit rpcs, and implement partial syncs by the
    filesystem syncer (aka filesystem syncer would detect that someone is
    trying to get the vnode lock, remembers its place, and skip to the
    next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc
2002-07-10 17:02:32 +00:00
John Baldwin
56e9ce41a5 In namei(), we use a NULL thread for uio_td when doing a VOP_READLINK().
nfs_readlink() calls nfs_bioread() which passes in uio_td as the thread
argument to nfs_getcacheblk().  In nfs_getcacheblk() we dereference the
thread pointer to get a process pointer to pass to nfs_sigintr().  This
obviously results in a panic. :)

Rather than change nfs_getcacheblk() to check if the thread pointer is
NULL when calling nfs_sigintr() like other callers do, change
nfs_sigintr() to take a thread as the last argument instead of a
process so none of the callers have to care if the thread is NULL or not.
2002-06-28 21:53:08 +00:00
John Baldwin
a854ed9893 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
Peter Wemm
1bde568682 Revise the nfsiod auto tuning code. Now both the upper and lower limits
are specifyable by sysctl and are respected.

Submitted by:	Maxime Henrion <mux@sneakerz.org>
2002-01-15 20:57:21 +00:00
Peter Wemm
117f61374c Implement vfs.nfs.iodmin (minimum number of nfsiod's) and
vfs.nfs.iodmaxidle (idle time before nfsiod's exit).  Make it adaptive
so that we create nfsiod's on demand and they go away after not being
used for a while.  The upper limit is NFS_MAXASYNCDAEMON (currently 20).
More will be done here, but this is a useful checkpoint.

Submitted by:	Maxime Henrion <mux@qualys.com>
2002-01-14 02:13:46 +00:00
Matthew Dillon
3ebeaf5984 This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

	* An edge case with shared R+W mmap()'s and truncate whereby
	  the system would inappropriately clear the dirty bits on
	  still-dirty data.  (applicable to all filesystems)

	  THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
	  see vm/vm_page.c line 1641

	* The straddle case for VM pages and buffer cache buffers when
	  truncating.  (applicable to NFS client side)

	* Possible SMP database corruption due to vm_pager_unmap_page()
	  not clearing the TLB for the other cpu's.  (applicable to NFS
	  client side but could effect all filesystems).  Note: not
	  considered serious since the corruption occurs beyond the file
	  EOF.

	* When flusing a dirty buffer due to B_CACHE getting cleared,
	  we were accidently setting B_CACHE again (that is, bwrite() sets
	  B_CACHE), when we really want it to stay clear after the write
	  is complete.  This resulted in a corrupt buffer.  (applicable
	  to all filesystems but probably only triggered by NFS)

	* We have to call vtruncbuf() when ftruncate()ing to remove
	  any buffer cache buffers.  This is still tentitive, I may
	  be able to remove it due to the second bug fix.  (applicable
	  to NFS client side)

	* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
	  to set n_size before calling nfs_vinvalbuf or the NFS code
	  may recursively vnode_pager_setsize() to the original value
	  before the truncate.  This is what was causing the user mmap
	  bus faults in the nfs tester program.  (applicable to NFS
	  client side)

	* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
	  by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after:	1 week
2001-12-14 01:16:57 +00:00
Matthew Dillon
7e76bb562e Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write.  This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use.  The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after:	1 week
2001-11-05 18:48:54 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
Peter Wemm
891a092764 Sigh, Last minute pre-merge typo. (missing quotes) 2001-09-18 23:49:33 +00:00
Peter Wemm
eb25edbda3 Cleanup and split of nfs client and server code.
This builds on the top of several repo-copies.
2001-09-18 23:32:09 +00:00
Warner Losh
976a26437e nfs_strategy calls nfs_asyncio with td as NULL. So add a bandaid that
will pass NULL as the struct proc when td is NULL.  This has stopped
crashing on my machine.

Note: The passing of NULL may be bogus, but I'll let others fix that
problem.

Reviewed by: jhb
2001-09-18 18:37:52 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
John Baldwin
617e358cdf - Sort includes.
- Update vmmeter statistics for vnode pagein/pageouts in getpages/putpages.
2001-07-04 20:14:59 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
John Baldwin
ce70e0a964 Assert Giant is held by the caller rather than getting it and releasing
it in getpages/putpages.
2001-05-23 22:26:05 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
Greg Lehey
60fb0ce365 Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
Greg Lehey
d98dc34f52 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
Alfred Perlstein
d8d5fa8805 vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()
2001-04-19 06:18:23 +00:00
Poul-Henning Kamp
f84e29a06c This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
John Baldwin
19eb87d22a Grab the process lock while calling psignal and before calling psignal. 2001-03-07 03:37:06 +00:00
Poul-Henning Kamp
9626b608de Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
Poul-Henning Kamp
8177437d85 Complete the bio/buf divorce for all code below devfs::strategy
Exceptions:
        Vinum untouched.  This means that it cannot be compiled.
        Greg Lehey is on the case.

        CCD not converted yet, casts to struct buf (still safe)

        atapi-cd casts to struct buf to examine B_PHYS
2000-04-15 05:54:02 +00:00
Poul-Henning Kamp
c244d2de43 Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.
2000-04-02 15:24:56 +00:00
Poul-Henning Kamp
b99c307a21 Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.
2000-03-20 11:29:10 +00:00
Poul-Henning Kamp
21144e3bf1 Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd.  The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue.  It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users:  Greg has not had time to test this yet, be careful.
2000-03-20 10:44:49 +00:00
Matthew Dillon
c37c9620cd Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
    the list in order to maintain a 'forward' locality of reference, which
    is arguably better then 'reverse'.  The original algorithm did things this
    way to but at a huge time cost.

    Enhance the append interlock for NFS writes to handle intr/soft mounts
    better.

    Fix the hysteresis for NFS async daemon I/O requests to reduce the
    number of unnecessary context switches.

    Modify handling of NFS mount options.  Any given user option that is
    too high now defaults to the kernel maximum for that option rather then
    the kernel default for that option.

Reviewed by:	 Alfred Perlstein <bright@wintelcom.net>
2000-01-05 05:11:37 +00:00
Matthew Dillon
b7303db36e Fix two problems: First, fix the append seek position race that can
occur due to np->n_size potentially changing if nfs_getcacheblk()
    blocks in nfs_write().

    Second, under -current we must supply the proper bufsize when obtaining
    buffers that straddle the EOF, but due to the fact that np->n_size can
    change out from under us it is possible that we may specify the wrong
    buffer size and wind up truncating dirty data written by another
    process.

    Both problems are solved by implementing nfs_rslock(), which allows us
    to lock around sensitive buffer cache operations such as those that
    occur when appending to a file.

    It is believed that this race is responsible for causing dirtyoff/dirtyend
    and (in stable) validoff/validend to exceed the buffer size.  Therefore
    we have now added a warning printf for the dirtyoff/end case in current.

    However, we have introduced a new problem which we need to fix at some
    point, and that is that soft or intr NFS mounts may become
    uninterruptable from the point of view of process A which is stuck waiting
    on rslock while process B is stuck doing the rpc.  To unstick process A,
    process B would have to be interrupted first.

Reviewed by:	Alfred Perlstein <bright@wintelcom.net>
1999-12-14 19:07:54 +00:00
Matthew Dillon
ea94c7b968 Synopsis of problem being fixed: Dan Nelson originally reported that
blocks of zeros could wind up in a file written to over NFS by a client.
    The problem only occurs a few times per several gigabytes of data.   This
    problem turned out to be bug #3 below.

    bug #1:

        B_CLUSTEROK must be cleared when an NFS buffer is reverted from
        stage 2 (ready for commit rpc) to stage 1 (ready for write).
        Reversions can occur when a dirty NFS buffer is redirtied with new
        data.

        Otherwise the VFS/BIO system may end up thinking that a stage 1
        NFS buffer is clusterable.  Stage 1 NFS buffers are not clusterable.

    bug #2:

        B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
        buffers only occur near the EOF of the file).  Change to only set
        when the buffer is a full biosize (usually 8K).  This bug has no
        effect but should be fixed in -current anyway.  It need not be
        backported.

    bug #3:

        B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
	typically only called by the update daemon).  nfs_flush()
        does a multi-pass loop but due to the lack of vnode locking it
        is possible for new buffers to be added to the dirtyblkhd list
        while a flush operation is going on.  This may result in nfs_flush()
        setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
        stage 1 write, causing only the commit rpc to be made and thus
        causing the contents of the buffer to be thrown away (never sent to
        the server).

    The patch also contains some cleanup, which only applies to the commit
    into -current.

Reviewed by:	dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>
1999-12-12 06:09:57 +00:00
Poul-Henning Kamp
923502ff91 useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
Matthew Dillon
8fdd2461b3 Add comment to clarify a commit rpc optimization already being performed. 1999-09-20 19:10:28 +00:00
Matthew Dillon
b5acbc8b9c Asynchronized client-side nfs_commit. NFS commit operations were
previously issued synchronously even if async daemons (nfsiod's) were
    available.  The commit has been moved from the strategy code to the doio
    code in order to asynchronize it.

    Removed use of lastr in preparation for removal of vnode->v_lastr.  It
    has been replaced with seqcount, which is already supported by the system
    and, in fact, gives us a better heuristic for sequential detection then
    lastr ever did.

    Made major performance improvements to the server side commit.  The
    server previously fsync'd the entire file for each commit rpc.  The
    server now bawrite()s only those buffers related to the offset/size
    specified in the commit rpc.

    Note that we do not commit the meta-data yet.  This works still needs
    to be done.

    Note that a further optimization can be done (and has not yet been done)
    on the client: we can merge multiple potential commit rpc's into a
    single rpc with a greater file offset/size range and greatly reduce
    rpc traffic.

Reviewed by:	Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
1999-09-17 05:57:57 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Alan Cox
2c28a10540 Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by:	dillon
1999-08-17 04:02:34 +00:00
Dmitrij Tejblum
e868365294 nfs_getcacheblk() can return 0 if the mount is interruptible. It need to be
checked by the caller.

Broken in: rev. 1.70 (1999/05/02)
1999-08-12 18:04:39 +00:00
Kirk McKusick
67812eacd7 Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
1999-06-26 02:47:16 +00:00
Kirk McKusick
f9c8cab591 Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.
1999-06-16 23:27:55 +00:00
Peter Wemm
adbde675ee Don't mistake a non-async block that needs to be committed for an
interrupted write.

Obtained from: fvdl@NetBSD.org via OpenBSD.
1999-06-05 05:25:37 +00:00
Poul-Henning Kamp
b0eeea2042 remove b_proc from struct buf, it's (now) unused.
Reviewed by:	dillon, bde
1999-05-06 20:00:34 +00:00
Alan Cox
4221e284a3 The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS.  These hacks have caused no
end of trouble, especially when combined with mmap().  I've removed
them.  Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write.  NFS does, however,
optimize piecemeal appends to files.  For most common file operations,
you will not notice the difference.  The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations.  NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write.  There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault.  This
is not correct operation.  The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid.  A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid.  This operation is
necessary to properly support mmap().  The zeroing occurs most often
when dealing with file-EOF situations.  Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten.  B_CACHE operation is now
formally defined in comments and more straightforward in
implementation.  B_CACHE for VMIO buffers is based on the validity of
the backing store.  B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa).  biodone() is now responsible for setting B_CACHE
when a successful read completes.  B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated.  VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE.  This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount.  These have been fixed.  getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made.  A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain.  The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
Peter Wemm
8a0d8193f2 Hold nfsd's upages in-core with PHOLD rather than P_NOSWAP. 1999-04-06 03:07:54 +00:00
Julian Elischer
8d17e69460 Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
    vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
    ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
1999-04-05 19:38:30 +00:00
Julian Elischer
4ef2094e45 Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by:	Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)
1999-03-12 02:24:58 +00:00
Matthew Dillon
1c7c3c6a86 This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
    fixes, several VM optimizations, and some additional revamping of the
    VM code.  The specific bug fixes will be documented with additional
    forced commits.  This commit is somewhat rough in regards to code
    cleanup issues.

Reviewed by:	"John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
1999-01-21 08:29:12 +00:00
Dmitrij Tejblum
10db74e96d (Hopefully) fix support for "large" files. Mostly cast block numbers to off_t
before they multiplied to block sizes.
1998-12-14 17:51:30 +00:00
Archie Cobbs
f1d19042b0 The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.
1998-12-07 21:58:50 +00:00
Peter Wemm
dad00f4e9c Remove [apparently] bogus casts to u_long for the vnode_pager_setsize()
second argument.  np_size is a 64 bit int, so is the second arg.  This
might have caused needless 2G/4G file size problems.

I believe it was Bruce who queried this.
1998-11-09 07:00:14 +00:00
Kirk McKusick
1cda241131 Mark directory buffers that have no valid data with B_INVAL
so that they are not put in the cache.
1998-09-29 22:01:10 +00:00
Kirk McKusick
113b88d241 When adding data to a buffer, we need to clear the B_NEEDCOMMIT flag
which says that the data is on server but not committed.
1998-09-29 21:46:54 +00:00
Doug Rabson
e69763a315 Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.
1998-09-04 08:06:57 +00:00
Bruce Evans
4c4918c9e4 Avoid an egcs pessimization for 64-bit signed division on i386's.
Pre-2.8 versions of gcc generate a call to __divdi3() for all 64-bit
signed divisions, but egcs optimizes them to a shift and fixup when
the divisor is a constant power of 2.  Unfortunately, it generates
a call to __cmpdi2() for the fixup, although all except possibly
ancient versions of gcc and egcs do ordinary 64-bit comparisons
inline.
1998-06-14 15:52:00 +00:00
Peter Wemm
55b41976c1 Make sure we go a nfs_fsinfo() in get/putpages before calling
readrpc/writerpc, since they assume it's already been done.  This could
break if the first read/write access to a nfs filesystem was an exec() or
mmap() instead of a read(), write() syscall.  (or statfs()).
nfs_getpages() could return an errno (EOPNOTSUPP) instead of a VM_PAGER_*
return code.  Some layout tweaks for the get/putpages code.
1998-06-01 11:32:53 +00:00
Peter Wemm
7c1c33a7dd When using NFSv3, use the remote server's idea of the maximum file size
rather than assuming 2^64.  It may not like files that big. :-)
On the nfs server, calculate and report the max file size as the point
that the block numbers in the cache would turn negative.
(ie: 1099511627775 bytes (1TB)).

One of the things I'm worried about however, is that directory offsets
are really cookies on a NFSv3 server and can be rather large, especially
when/if the server generates the opaque directory cookies by using a local
filesystem offset in what comes out as the upper 32 bits of the 64 bit
cookie.  (a server is free to do this, it could save byte swapping
depending on the native 64 bit byte order)

Obtained from:	NetBSD
1998-05-30 16:33:58 +00:00
Peter Wemm
dfae73fd2e A cleaner fix for PR#5102, clear nonsense flags at mount time rather than
in the core of nfs_bio.c at the 11th hour.

PR:		5102
1998-05-20 08:02:24 +00:00
Peter Wemm
fe6c0d4599 Allow control of the attribute cache timeouts at mount time.
We had run out of bits in the nfs mount flags, I have moved the internal
state flags into a seperate variable.  These are no longer visible via
statfs(), but I don't know of anything that looks at them.
1998-05-19 07:11:27 +00:00
Steve Price
b9921401f1 Don't allow the readdirplus routine to be used in NFS V2.
PR:		5102
Reviewed by:	msmith
Submitted by:	Dmitry Kohmanyuk <dk@farm.org>
1998-03-28 16:05:05 +00:00
Julian Elischer
b1897c197c Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by:	Kirk McKusick (mcKusick@mckusick.com)
Obtained from:  WHistle development tree
1998-03-08 09:59:44 +00:00
John Dyson
8f9110f6a1 This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code.  These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances.  Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code.  This code might have been committed seperately, but
almost everything is interrelated.

1)	Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
	are fully valid.
2)	Rather than deactivating erroneously read initial (header) pages in
	kern_exec, we now free them.
3)	Fix the rundown of non-VMIO buffers that are in an inconsistent
	(missing vp) state.
4)	Fix the disassociation of pages from buffers in brelse.  The previous
	code had rotted and was faulty in a couple of important circumstances.
5)	Remove a gratuitious buffer wakeup in vfs_vmio_release.
6)	Remove a crufty and currently unused cluster mechanism for VBLK
	files in vfs_bio_awrite.  When the code is functional, I'll add back
	a cleaner version.
7)	The page busy count wakeups assocated with the buffer cache usage were
	incorrectly cleaned up in a previous commit by me.  Revert to the
	original, correct version, but with a cleaner implementation.
8)	The cluster read code now tries to keep data associated with buffers
	more aggressively (without breaking the heuristics) when it is presumed
	that the read data (buffers) will be soon needed.
9)	Change to filesystem lockmgr locks so that they use LK_NOPAUSE.  The
	delay loop waiting is not useful for filesystem locks, due to the
	length of the time intervals.
10)	Correct and clean-up spec_getpages.
11)	Implement a fully functional nfs_getpages, nfs_putpages.
12)	Fix nfs_write so that modifications are coherent with the NFS data on
	the server disk (at least as well as NFS seems to allow.)
13)	Properly support MS_INVALIDATE on NFS.
14)	Properly pass down MS_INVALIDATE to lower levels of the VM code from
	vm_map_clean.
15)	Better support the notion of pages being busy but valid, so that
	fewer in-transit waits occur.  (use p->busy more for pageouts instead
	of PG_BUSY.)  Since the page is fully valid, it is still usable for
	reads.
16)	It is possible (in error) for cached pages to be busy.  Make the
	page allocation code handle that case correctly.  (It should probably
	be a printf or panic, but I want the system to handle coding errors
	robustly.  I'll probably add a printf.)
17)	Correct the design and usage of vm_page_sleep.  It didn't handle
	consistancy problems very well, so make the design a little less
	lofty.  After vm_page_sleep, if it ever blocked, it is still important
	to relookup the page (if the object generation count changed), and
	verify it's status (always.)
18)	In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19)	Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20)	Fix vm_pager_put_pages and it's descendents to support an int flag
	instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
Mike Smith
651ae11e2f Trivial filesystem getpages/putpages implementations, set the second.
These should be considered the first steps in a work-in-progress.
Submitted by:	Terry Lambert <terry@freebsd.org>
1998-03-06 09:46:52 +00:00
Eivind Eklund
0b08f5f737 Back out DIAGNOSTIC changes. 1998-02-06 12:14:30 +00:00
Eivind Eklund
47cfdb166d Turn DIAGNOSTIC into a new-style option. 1998-02-04 22:34:03 +00:00
Tor Egge
f5160d1e06 Release the buffer when an error occurs while reading directory entries. 1998-01-31 01:27:18 +00:00
John Dyson
33b90a70cd Various NFS fixes:
Make vfs_bio buffer mgmt work better.
	Buffers were being used after brelse.
	Make nfs_getpages work independently of other NFS
		interfaces.  This eliminates some difficult
		recursion problems and decreases pagefault
		overhead.
	Remove an erroneous vfs_unbusy_pages.
	Fix a reentrancy problem, with nfs_vinvalbuf when
		vnode is already being rundown.
	Reassignbuf wasn't being called when needed under
		certain circumstances.

	(Thanks to Bill Paul for help.)
1998-01-25 06:24:09 +00:00
John Dyson
95e5e988e0 Make our v_usecount vnode reference count work identically to the
original BSD code.  The association between the vnode and the vm_object
no longer includes reference counts.  The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also.  The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached.  The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code.  There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
1998-01-06 05:26:17 +00:00
John Dyson
2f29e93460 Various of the ISP users have commented that the 1.41 version of the
nfs_bio.c code worked better than the 1.44.  This commit reverts
the important parts of 1.44 to 1.41, and we will fix it when we
can get a handle on the problem.
1997-12-08 00:59:08 +00:00
Poul-Henning Kamp
07b2d0aaa3 unifdef -U__NetBSD__ -D__FreeBSD__ 1997-09-10 19:52:27 +00:00
Bruce Evans
1fd0b0588f Removed unused #includes. 1997-08-02 14:33:27 +00:00
Doug Rabson
3d0f68fc7d Avoid small synchronous writes when an application does lots of random-access
short writes within a block (e.g. ld).
1997-06-25 08:35:41 +00:00
John Dyson
91487cf4bf Upgrade NFS to support the new vfs_bio resource/buffer management. 1997-06-16 00:23:40 +00:00
Doug Rabson
2d1500e4de Fix a problem caused by removing large numbers of files from a directory
which could cause a bad size to be given to uiomove, causing a page fault.
1997-06-06 08:12:17 +00:00
Doug Rabson
501338ca4f Fix some performance problems with the NFS mmap fixes. 1997-06-03 09:42:43 +00:00
Doug Rabson
32ad9cb531 Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend.  The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR:		kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by:	dyson
1997-05-19 14:36:56 +00:00
Doug Rabson
5c28711af7 Check the B_CLUSTER flag when choosing whether to use unstable or filesync
writes.

PR:		kern/3438
Submitted by:	Tor Egge <Tor.Egge@idi.ntnu.no>
1997-05-13 19:41:32 +00:00
Doug Rabson
baaf1d96f0 Fix a bug where a program which appended many small records to a file could
wind up writing zeros instead of real data when the file is on an NFSv2
mounted directory.

While tracking this bug down, I noticed that nfs_asyncio was waking *all*
the iods when a block was written instead of just one per block.  Fixing this
gives a 25% performance improvment for writes on v2 (less for v3).

Both are 2.2 candidates.

PR:		kern/2774
1997-04-19 14:28:36 +00:00
Doug Rabson
18cab10cb3 Don't allow partial buffers to be cluster-comitted.
Zero the b_dirty{off,end} after cluster-comitting a group of buffers.

With these fixes, I was able to complete a 'make world' with remote src
and obj directories.
1997-04-18 14:12:17 +00:00
Doug Rabson
363880128c The code which recovered from a modified directory situation did not check
for eof when re-caching the directory.  This could cause it to loop forever
if a directory was truncated.
1997-04-03 07:52:00 +00:00
Bruce Evans
7eff94279a YAMInTheWrongDirectionF22 (part of rev.1.28.2.3: set B_CLUSTEROK for
commits).
1997-03-09 10:21:26 +00:00
Peter Wemm
6875d25465 Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
1997-02-22 09:48:43 +00:00
John Dyson
996c772f58 This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
		Mount_std mounts will not work until the getfsent
		library routine is changed.

Reviewed by:	various people
Submitted by:	Jeffery Hsu <hsu@freebsd.org>
1997-02-10 02:22:35 +00:00
Jordan K. Hubbard
1130b656e5 Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore.  This update would have been
insane otherwise.
1997-01-14 07:20:47 +00:00
Doug Rabson
f438ae02f5 Improve the queuing algorithms used by NFS' asynchronous i/o. The
existing mechanism uses a global queue for some buffers and the
vp->b_dirtyblkhd queue for others.  This turns sequential writes into
randomly ordered writes to the server, affecting both read and write
performance.  The existing mechanism also copes badly with hung
servers, tending to block accesses to other servers when all the iods
are waiting for a hung server.

The new mechanism uses a queue for each mount point.  All asynchronous
i/o goes through this queue which preserves the ordering of requests.
A simple mechanism ensures that the iods are shared out fairly between
active mount points.  This removes the sysctl variable vfs.nfs.dwrite
since the new queueing mechanism removes the old delayed write code
completely.

This should go into the 2.2 branch.
1996-11-06 10:53:16 +00:00
Doug Rabson
425b5191a4 If a large (>4096 bytes) directory was modified, the old directory
contents are discarded, including the cached seek cookies.
Unfortunately, if the directory was larger than NFS_DIRBLKSIZ, then
this confused nfs_readdirrpc(), making it appear as if the directory
was truncated.

Reviewed by:	Karl Denninger <karl@Mcs.Net>
1996-10-21 10:07:52 +00:00
Bruce Evans
e07fd62c16 Staticized `nfs_dwrite'. 1996-10-12 17:39:39 +00:00
Doug Rabson
f31dba4c5d This fixes a problem with the nfs socket handling code which happens
if a single process is performing a large number of requests (in this
case writing a large file).  The writing process could monopolise the
recieve lock and prevent any other processes from recieving their
replies.

It also adds a new sysctl variable 'vfs.nfs.dwrite' which controls the
behaviour which originally pointed out the problem.  When a process
writes to a file over NFS, it usually arranges for another process
(the 'iod') to perform the request.  If no iods are available, then it
turns the write into a 'delayed write' which is later picked up by the
next iod to do a write request for that file.  This can cause that
particular iod to do a disproportionate number of requests from a
single process which can harm performance on some NFS servers.  The
alternative is to perform the write synchronously in the context of
the original writing process if no iod is avaiable for asynchronous
writing.

The 'delayed write' behaviour is selected when vfs.nfs.dwrite=1 and
the non-delayed behaviour is selected when vfs.nfs.dwrite=0.  The
default is vfs.nfs.dwrite=1; if many people tell me that performance
is better if vfs.nfs.dwrite=0 then I will change the default.

Submitted by:	Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
1996-10-11 10:15:33 +00:00
Nate Williams
030e2e9ebb In sys/time.h, struct timespec is defined as:
/*
         * Structure defined by POSIX.4 to be like a timeval.
         */
        struct timespec {
                time_t  ts_sec;         /* seconds */
                long    ts_nsec;        /* and nanoseconds */
        };

        The correct names of the fields are tv_sec and tv_nsec.

Reminded by:	James Drobina <jdrobina@infinet.com>
1996-09-19 18:21:32 +00:00
Doug Rabson
09c6884729 Various fixes from frank@fwi.uva.nl (Frank van der Linden) via
rick@snowhite.cis.uoguelph.ca:

1. Clear B_NEEDCOMMIT in nfs_write to make sure that dirty data is
correctly send to the server.  If a buffer was dirtied when it was in
the B_DELWRI+B_NEEDCOMMIT state, the state of the buffer was left
unchanged and when the buffer was later cleaned, just a commit rpc was
made to the server to complete the previous write.  Clearing
B_NEEDCOMMIT ensures that another write is made to the server.

2. If a server returned a server (for whatever reason) returned an
answer to a write RPC that implied that fewer bytes than requested
were written, bad things would happen.

3. The setattr operation passed on the atime in stead of the mtime to
the server. The fix is trivial.

4. XIDs always started at 0, but this caused some servers (older DEC
OSF/1 3.0 so I've been told) who had very long-lasting XID caches to
get confused if, after a reboot of a BSD client, RPCs came in with a
XID that had in the past been used before from that client. Patch is
to use the current time in seconds as a starting point for XIDs. The
patch below is not perfect, because it requires the root fs to be
mounted first. This is because of the check BSD systems do, comparing
FS time to system time.

Reviewed by:	Bruce Evans, Terry Lambert.
Obtained from:  frank@fwi.uva.nl (Frank van der Linden) via rick@snowhite.cis.uoguelph.ca
1996-07-16 10:19:45 +00:00
Paul Traina
75985daf3b Clear flags before using an inactive buffer. This is a kludge, but
matches the code in bread().

Reviewed by:	bde
1996-06-08 05:59:04 +00:00
Mike Pritchard
97f1b9871e Add a check to prevent a computation from underflowing and causing
a panic due to an attaempt to allocate a buffer for a terabyte or
so of data when an attempt is made to create sparse data (e.g.
a holey file) more than 1 block past the end of the file.

Note:  some other areas of this code need to be looked at,
since they might cause problems when the file size exceeds 2GB,
due to storing results in ints when the computations are being
done with quad sized variables.

Reviewed by:	bde
1996-01-24 18:52:18 +00:00
Poul-Henning Kamp
b8dce649f1 Staticize. 1995-12-17 21:14:36 +00:00
David Greenman
efeaf95a41 Untangled the vm.h include file spaghetti. 1995-12-07 12:48:31 +00:00
Bruce Evans
dee6b0ab68 Completed function declarations and/or added prototypes and/or moved
prototypes to the right place.
1995-12-03 10:03:12 +00:00
Poul-Henning Kamp
a98ca4699e Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.
1995-10-29 15:33:36 +00:00
Doug Rabson
27df97742b Add support for amd direct maps.
Reviewed by:	Thomas Graichen <graichen@sirius.physik.fu-berlin.de>
1995-08-24 10:17:39 +00:00
Doug Rabson
05085e65f6 Use a consistent blocksize for sizing bufs to avoid panicing the bio system. 1995-07-07 11:01:31 +00:00
Doug Rabson
a62dc40654 Changes to support version 3 of the NFS protocol.
The version 2 support has been tested (client+server) against FreeBSD-2.0,
IRIX 5.3 and FreeBSD-current (using a loopback mount).  The version 2 support
is stable AFAIK.
The version 3 support has been tested with a loopback mount and minimally
against an IRIX 5.3 server.  It needs more testing and may have problems.
I have patched amd to support the new variable length filehandles although
it will still only use version 2 of the protocol.

Before booting a kernel with these changes, nfs clients will need to at least
build and install /usr/sbin/mount_nfs.  Servers will need to build and
install /usr/sbin/mountd.

NFS diskless support is untested.

Obtained from: Rick Macklem <rick@snowhite.cis.uoguelph.ca>
1995-06-27 11:07:30 +00:00
Rodney W. Grimes
9b2e535452 Remove trailing whitespace. 1995-05-30 08:16:23 +00:00
David Greenman
61f5d51062 Changes to fix the following bugs:
1) Files weren't properly synced on filesystems other than UFS. In some
   cases, this lead to lost data. Most likely would be noticed on NFS.
   The fix is to make the VM page sync/object_clean general rather than
   in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
   chunks of files to end up as zeroes rather than the intended contents.
   The fix was to fix several race conditions and to kludge up the
   "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
   to page modifications that occurred via the mmapping.

Reviewed by:	David Greenman
Submitted by:	John Dyson
1995-05-21 21:39:31 +00:00
David Greenman
8cbf8eafa6 Various fixes from John Dyson:
1) Rewrote screwy code that uses an incore buffer without making it busy.
2) Use B_CACHE instead of B_DONE in cases where it is appropriate.
3) Minor code optimization.

This *might* fix kern/345 submitted by Heikki Suonsivu.
1995-04-16 05:05:25 +00:00
David Greenman
403ef252fa Removed obsolete vtrace() remnants. 1995-03-04 03:24:45 +00:00
David Greenman
ef762baa88 Removed a pile of vfs_unbusy_pages()...both unnecessary and wrong - resulted
in serious system instability. Changed a B_INVAL to a B_NOCACHE so that
buffer data is properly disposed of.

Submitted by:	John Dyson, Rick Macklin, and ohki@gssm.otsuka.tsukuba.ac.jp
1995-02-03 03:40:08 +00:00