Commit Graph

608 Commits

Author SHA1 Message Date
Alan Cox
6ec2fca505 Eliminate some unnecessary uses of the vm page queues lock around the
vm page's valid field.  This field is being synchronized using the
containing vm object's lock.
2003-10-04 22:47:20 +00:00
Alan Cox
bf0da100d6 - Extend the scope the vm object lock to cover calls to
vm_page_is_valid().
 - Assert that the lock on the containing vm object is held in
   vm_page_is_valid().
2003-10-04 19:23:29 +00:00
Alan Cox
c76789caa6 - vm_hold_free_pages() should lock the kernel object. (The pages being
freed belong to the kernel object.)
 - Increase the granularity of the vm object locking in vm_hold_load_pages()
   in order to reduce the number of times that we acquire and release the
   same lock.
2003-09-22 04:58:09 +00:00
Alan Cox
35b86dc8de Correct a typo in the previous revision. 2003-09-15 02:56:48 +00:00
Alan Cox
58abfe0051 Convert vmapbuf() from using pmap_extract() to using
pmap_extract_and_hold().  Note, however, that GIANT_REQUIRED should not be
removed until all platforms fully implement the "prot" parameter to
pmap_extract_and_hold().

Reviewed by:	tegge
2003-09-13 04:29:55 +00:00
Jeff Roberson
d919a11d06 - Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
 - The buffer returned by incore() is not locked and should not be sent to
   brelse().  Use getblk() with the new GB_NOCREAT flag to preserve the
   desired semantics.
2003-08-31 08:50:11 +00:00
Jeff Roberson
a7db559087 - If there is no vp assume that BKGRDINPROG is not set and set RELPBUF in
brelse().
2003-08-31 01:07:45 +00:00
Jeff Roberson
b5c61abd82 - In some cases bp->b_vp can be NULL in brelse, don't try to lock the
interlock in that case.

Found by:	alc
2003-08-31 00:06:07 +00:00
Marcel Moolenaar
9e8147f3af In bufdone(), change the format specifier for m->valid and m->dirty to
a long type and explicitly cast m->valid and m->dirty to unsigned long.
When PAGE_SIZE is 32K, these fields are in fact unsigned long.
2003-08-28 19:58:11 +00:00
Alexander Kabaev
772a9659d9 Do not return with vnode interlock held.
Reviewed by:	rwatson
2003-08-28 15:48:15 +00:00
Jeff Roberson
9dbfeb0ae6 - Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
   interlock.
 - Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
   buffers.  Check for the BKGRDINPROG flag before recycling or throwing
   away a buffer.  We do this instead because it is not safe for us to move
   the original buffer to a new queue from the callback on the background
   write buffer.
 - Remove the B_LOCKED flag and the locked buffer queue.  They are no longer
   used.
 - The vnode interlock is used around checks for BKGRDINPROG where it may
   not be strictly necessary.  If we hold the buf lock the a back-ground
   write will not be started without our knowledge, one may only be
   completed while we're not looking.  Rather than remove the code, Document
   two of the places where this extra locking is done.  A pass should be
   done to verify and minimize the locking later.
2003-08-28 06:55:18 +00:00
Alan Cox
b7ad744dc5 Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().
2003-08-23 18:11:53 +00:00
Poul-Henning Kamp
4bfd22f25e Grab Giant in bufdonebio() since drivers may not hold it.
This only protects the "struct buf" consumers (ie: DEV_STRATEGY()),
but does not protect BIO_STRATEGY() users.
2003-08-02 09:45:10 +00:00
Alan Cox
105660e8ba Eliminate an abuse of kmem_alloc_pageable() in bufinit()
by using VM_ALLOC_NOOBJ to allocate the bogus page.

Reviewed by:	tegge
2003-08-02 05:05:34 +00:00
Poul-Henning Kamp
568733688b Initialize b_saveaddr when we hand out buffers 2003-06-20 08:26:38 +00:00
Alan Cox
f717a9d063 Lock the vm object when removing a page. 2003-06-11 16:37:33 +00:00
David E. O'Brien
677b542ea2 Use __FBSDID(). 2003-06-11 00:56:59 +00:00
Poul-Henning Kamp
17a1391990 The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.
2003-05-31 16:42:45 +00:00
Alan Cox
01dfc1deae Finish the vm_object locking for this file, including holding the vm_object
lock when accessing the vm_object's flags or calling vm_page_lookup().
2003-04-28 05:40:45 +00:00
Alan Cox
af3e0bb202 - Lock the vm_object when performing vm_page_alloc() in allocbuf(). 2003-04-26 07:42:24 +00:00
Alan Cox
097d4338db Lock the vm_object in vfs_busy_pages(). 2003-04-20 00:17:05 +00:00
Alan Cox
0fa05eae77 - Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().
2003-04-19 22:11:41 +00:00
Alan Cox
0d420ad3e6 - Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
 - Add a new macro VM_OBJECT_LOCK_ASSERT().
2003-04-19 21:15:44 +00:00
Alan Cox
de5ef10142 Update locking on the kernel_object to use the new macros. 2003-04-14 00:36:53 +00:00
Alan Cox
0b556837a9 Remove an unnecessary trunc_page() from vmapbuf().
Reviewed by:	tegge
2003-04-06 00:40:54 +00:00
Alan Cox
08468b6ad7 o Check the b_bufsize passed to vmapbuf() returning an error
if it is invalid.
 o Remove a debugging printf() from vmapbuf().

Suggested by:   tegge
2003-04-04 06:14:54 +00:00
Poul-Henning Kamp
d086f85ac4 Preparation commit before I start on the bioqueue lockdown:
Collect all the bits of bioqueue handing in subr_disk.c, vfs_bio.c is big
enough as it is and disksort already lives in subr_disk.c.
2003-03-30 08:51:23 +00:00
Tor Egge
5bbb806004 Add support for reading directly from file to userland buffer when the
O_DIRECT descriptor status flag is set and both offset and length is a
multiple of the physical media sector size.
2003-03-26 23:40:42 +00:00
Jake Burkholder
227f9a1c58 - Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses larger than virtual addresses, such as i386s
  with PAE.
- Use this to represent physical addresses in the MI vm system and in the
  i386 pmap code.  This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
  detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by:	DARPA, Network Associates Laboratories
Discussed with:	re, phk (cdevsw change)
2003-03-25 00:07:06 +00:00
Poul-Henning Kamp
b4b138c27f Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.
2003-03-18 08:45:25 +00:00
Jeff Roberson
749ffa4ecd - Add a lock for protecting against msleep(bp, ...) wakeup(bp) races.
- Create a new function bdone() which sets B_DONE and calls wakup(bp). This
   is suitable for use as b_iodone for buf consumers who are not going
   through the buf cache.
 - Create a new function bwait() which waits for the buf to be done at a set
   priority and with a specific wmesg.
 - Replace several cases where the above functionality was implemented
   without locking with the new functions.
2003-03-13 07:31:45 +00:00
Jeff Roberson
09f11da5a3 - Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite().  Previously the buf could
   have been written out by fsync before we acquired the buf lock if it
   weren't for giant.  The cluster_wbuild() handles this race properly but
   the single write at the end of vfs_bio_awrite() would not.
 - Modify flushbufqueues() so there is only one copy of the loop.  Pass a
   parameter in that says whether or not we should sync bufs with deps.
 - Call flushbufqueues() a second time and then break if we couldn't find
   any bufs without deps.
2003-03-13 07:19:23 +00:00
Jeff Roberson
7261f5f68e - Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
   flag to the initial BUF_LOCK().  This will eventually be used in cases
   were we want to use a buffer only if it is not currently in use.
 - Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by:	arch
Not objected to by:	mckusick
2003-03-04 00:04:44 +00:00
Jeff Roberson
491081fabf - Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally.  This is required to stop multiple bufs from being associated
   with a single lblkno.
2003-03-02 06:05:23 +00:00
Jeff Roberson
bff5362bf2 - gc USE_BUFHASH. The smp locking of the buf cache renders this useless. 2003-03-01 05:55:03 +00:00
Kirk McKusick
7e734c4149 When doing cleanup of excessive buffers in bdwrite (see kern/vfs_bio.c
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.

Reported by:	Attila Nagy <bra@fsn.hu>
Sponsored by:   DARPA & NAI Labs.
2003-02-25 23:59:09 +00:00
Jeff Roberson
2e3981a70c - Add the missing NULL interlock argument to a recently added BUF_LOCK. 2003-02-25 08:23:11 +00:00
Kirk McKusick
3a7053cb60 Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by:	Attila Nagy <bra@fsn.hu>
Sponsored by:   DARPA & NAI Labs.
2003-02-25 06:44:42 +00:00
Jeff Roberson
17661e5ac4 - Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
   these fields instead.
 - Hold the vnode interlock while locking bufs on the clean/dirty queues.
   This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
   BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by:	arch, mckusick
2003-02-25 03:37:48 +00:00
Warner Losh
a163d034fa Back out M_* changes, per decision of the TRB.
Approved by: trb
2003-02-19 05:47:46 +00:00
Jeff Roberson
71146186a1 - Introduce a new function bremfreel() that does a bremfree with the buf
queue lock already held.
 - In getblk() and flushbufqueues() use bremfreel() while we still have the
   buf queue lock held to keep the lists consistent.
 - Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
   are not locked while acquiring the locks.  This will make sure that we get
   the appropriate panic() and not another one for sleeping with a lock held.
2003-02-16 10:43:06 +00:00
Jeff Roberson
25c4325446 - Add a comment about a race that will happen without Giant. 2003-02-10 22:47:34 +00:00
Jeff Roberson
c7b716cc2a - Unlock the nblock after the loop in bwillwrite(). 2003-02-10 22:33:59 +00:00
Jeff Roberson
7137d635ac - In getnewbuf() unlock the bq lock prior to sleeping when we're out of
buffers.

Submitted by:	tegge
2003-02-10 06:02:51 +00:00
Jeff Roberson
3306adcfcf - Correct another atomic op.
Spotted by:	alc
2003-02-09 22:39:51 +00:00
Jeff Roberson
69953c8435 - Move some code out from #ifdef INVARIANTS. 2003-02-09 12:11:37 +00:00
Jeff Roberson
767b9a529d - Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
 - Move B_SCANNED into b_vflags and call it BV_SCANNED.
 - Create a vop_stdfsync() modeled after spec's sync.
 - Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
   fs specific processing.  This gives all of these filesystems proper
   behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
 - Annotate the locking in buf.h
2003-02-09 11:28:35 +00:00
Jeff Roberson
15553af710 - spell add 'add' and not 'subtract' in an atomic op.
Spotted by:	alc
Pointy hat to:	jeff
2003-02-09 11:21:40 +00:00
Jeff Roberson
d85be48243 - Lock down the buffer cache's infrastructure code. This includes locks on
buf lists, synchronization variables, and atomic ops for the counters.
   This change does not remove giant from any code although some pushdown
   may be possible.
 - In vfs_bio_awrite() don't access buf fields without the buf lock.
2003-02-09 09:47:31 +00:00
Alfred Perlstein
44956c9863 Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
2003-01-21 08:56:16 +00:00
Matthew Dillon
2d5c7e4506 Close the remaining user address mapping races for physical
I/O, CAM, and AIO.  Still TODO: streamline useracc() checks.

Reviewed by:	alc, tegge
MFC after:	7 days
2003-01-20 17:46:48 +00:00
Alan Cox
28ec30cd9f - Hold the page queues lock around vm_page_hold().
- Assert that the page queues lock rather than Giant is held in
   vm_page_hold().
2003-01-20 09:24:03 +00:00
Alan Cox
6eb07b4ac2 Fix two long-standing, but likely harmless, errors in the use of
vm_pageout_deficit:
1. Update vm_pageout_deficit before VM_WAIT.  There is no sense in
   delaying the update; the sooner the pageout daemon receives this
   information the better.  Reviewed by: tegge
2. Update vm_pageout_deficit according to the number of pages still
   needed to complete the allocation, not the original size of the
   allocation.  Submitted by: tegge

(These errors have existed since the introduction of vm_pageout_deficit
in revision 1.144.)
2003-01-16 08:14:56 +00:00
Matthew Dillon
f597900329 Merge all the various copies of vmapbuf() and vunmapbuf() into a single
portable copy.  Note that pmap_extract() must be used instead of
pmap_kextract().

This is precursor work to a reorganization of vmapbuf() to close remaining
user/kernel races (which can lead to a panic).
2003-01-15 23:54:35 +00:00
Alan Cox
b0ef8c5fe4 - Update vm_pageout_deficit using atomic operations. It's a simple
counter outside the scope of existing locks.
 - Eliminate a redundant clearing of vm_pageout_deficit.
2003-01-14 06:57:03 +00:00
Alan Cox
8febaa4df0 vm_hold_load_pages() needn't clear PG_ZERO because it didn't pass
VM_ALLOC_ZERO to vm_page_alloc(). (PG_ZERO is clear by default.)
2003-01-12 06:30:15 +00:00
Alan Cox
1f17965656 Make bogus_offset local to bufinit(). 2003-01-07 19:55:08 +00:00
Poul-Henning Kamp
ea4804130a Fix cut&paste bug which would result in a panic because buffer was
being biodone'ed multiple times.
2003-01-05 22:01:08 +00:00
Alan Cox
9ce904432a Allocate bogus_page with VM_ALLOC_WIRED. (Previously, bogus_page's
allocation incremented the global count of wired pages, but not the
page's own wire count.  This inconsistency was introduced in
revision 1.230.)
2003-01-05 18:46:13 +00:00
Poul-Henning Kamp
f5b11b6e2d Temporarily introduce a new VOP_SPECSTRATEGY operation while I try
to sort out disk-io from file-io in the vm/buffer/filesystem space.

The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes.  For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.

Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY.  First time it is called it will print debugging
information.  This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.

Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.

Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened.  Handle
the request just like we always did, but first time called print
debugging information.

Apart up to two instances of console messages per boot, this amounts
to a glorified no-op commit.

If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org
2003-01-04 22:10:36 +00:00
Poul-Henning Kamp
7b330b22b6 Don't call VOP_BMAP on VCHR vnodes when the logical and physical block
numbers are identical: it cannot even hope to accomplish anything.
2003-01-04 09:37:42 +00:00
Poul-Henning Kamp
862702306b Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.
2003-01-03 06:32:15 +00:00
Jens Schweikhardt
9d5abbddbf Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.
2003-01-01 18:49:04 +00:00
Alan Cox
d746789347 Hold the page queues lock when calling vm_page_flag_clear(). 2002-12-27 06:52:32 +00:00
Alan Cox
0cb6c00463 - Hold the kernel_object's lock around vm_page_alloc(kernel_object,...).
- Hold the page queues lock around vm_page_wakeup().
2002-12-23 20:10:47 +00:00
Kirk McKusick
0f5f789c0d The buffer daemon cannot skip over buffers owned by locked inodes as
they may be the only viable ones to flush. Thus it will now wait for
an inode lock if the other alternatives will result in rollbacks (and
immediate redirtying of the buffer). If only buffers with rollbacks
are available, one will be flushed, but then the buffer daemon will
wait briefly before proceeding. Failing to wait briefly effectively
deadlocks a uniprocessor since every other process writing to that
filesystem will wait for the buffer daemon to clean up which takes
close enough to forever to feel like a deadlock.

Reported by:	Archie Cobbs <archie@dellroad.org>
Sponsored by:   DARPA & NAI Labs.
Approved by:	re
2002-12-14 01:35:30 +00:00
Alan Cox
178949e021 Hold the page queues/flags lock when calling vm_page_set_validclean().
Approved by:	re
2002-11-23 19:10:31 +00:00
Alan Cox
4fec79bef8 Now that pmap_remove_all() is exported by our pmap implementations
use it directly.
2002-11-16 07:44:25 +00:00
Alan Cox
d154fb4fe6 When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect().  The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
2002-11-10 07:12:04 +00:00
Kirk McKusick
bc7bdd50c1 When the number of dirty buffers rises too high, the buf_daemon runs
to help clean up. After selecting a potential buffer to write, this
patch has it acquire a lock on the vnode that owns the buffer before
trying to write it. The vnode lock is necessary to avoid a race with
some other process holding the vnode locked and trying to flush its
dirty buffers. In particular, if the vnode in question is a snapshot
file, then the race can lead to a deadlock. To avoid slowing down the
buf_daemon, it does a non-blocking lock request when trying to lock
the vnode. If it fails to get the lock it skips over the buffer and
continues down its queue looking for buffers to flush.

Sponsored by:	DARPA & NAI Labs.
2002-10-18 01:29:59 +00:00
Poul-Henning Kamp
53cc479393 Remove unused includes.
Clarify the intention of a while();
Move a local variable to avoid potential name-confusion.
2002-09-28 17:46:30 +00:00
Poul-Henning Kamp
37c841831f Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by:    FlexeLint warning #512
2002-09-28 17:15:38 +00:00
Poul-Henning Kamp
54286a04c5 Correctly order VI_UNLOCK(), local variables and block comment. 2002-09-28 12:15:44 +00:00
Poul-Henning Kamp
089cf428da Make biowait() check bio_error before the BIO_ERROR flag, to propery
catch internal GEOM use of bio_error.

Sponsored by:	DARPA & NAI Labs.
2002-09-26 16:32:14 +00:00
Jeff Roberson
b7227b7712 - Lock accesses to v_numoutput.
- Lock calls to gbincore.
2002-09-25 02:11:37 +00:00
Poul-Henning Kamp
f986355c0e s/Danglish/English/
Some style issues.
Change the timeout to be hz/10 instead of hz.

Brucification by:	bde.
2002-09-15 17:52:35 +00:00
Poul-Henning Kamp
028e9e5902 Un-inline the non-trivial "trivial" bio* functions.
Untangle devstat_end_transaction_bio()
2002-09-14 19:34:11 +00:00
Nate Lawson
06be2aaa83 Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by:   phk
Reviewed by:    bde, rwatson (earlier version)
2002-09-14 09:02:28 +00:00
Poul-Henning Kamp
c7143e7150 Oops, broke the build there. Uninline biodone() now that it is non-trivial.
Introduce biowait() function.  Currently there is a race condition and the
mitigation is a timeout/retry.  It is not obvious what kind of locking (if any)
is suitable for BIO_DONE, since the majority of users take are of this
themselves, and only a few places actually rely on the wakeup.

Sponsored by: DARPA & NAI Labs.
2002-09-13 11:28:31 +00:00
Peter Wemm
447b3772dc Change hw.physmem and hw.usermem to unsigned long like they used to be
in the original hardwired sysctl implementation.

The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int.  Use
'long' there.

Change Maxmem and physmem and related variables to 'long', mostly for
completeness.  Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody.  This
comes for free on 32 bit machines, so why not?
2002-08-30 04:04:37 +00:00
Philippe Charnier
93b0017f88 Replace various spelling with FALLTHROUGH which is lint()able 2002-08-25 13:23:09 +00:00
Jeff Roberson
e6e370a7fe - Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
   with VOP calls is needed.
 - v_iflag is protected by interlock and is used for dealing with vnode
   management issues.  These flags include X/O LOCK, FREE, DOOMED, etc.
 - All accesses to v_iflag and v_vflag have either been locked or marked with
   mp_fixme's.
 - Many ASSERT_VOP_LOCKED calls have been added where the locking was not
   clear.
 - Many functions in vfs_subr.c were restructured to provide for stronger
   locking.

Idea stolen from:	BSD/OS
2002-08-04 10:29:36 +00:00
Alan Cox
3327872297 o Convert two instances of vm_page_sleep_busy() to vm_page_sleep_if_busy()
with appropriate page queue locking.
2002-08-03 18:59:19 +00:00
Alan Cox
46086ddf91 o Acquire the page queues lock before calling vm_page_io_finish().
o Assert that the page queues lock is held in vm_page_io_finish().
2002-08-01 17:57:42 +00:00
Alan Cox
1812190d09 o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy()
in vfs_busy_pages().
2002-07-30 20:41:10 +00:00
Alan Cox
4aca0b1510 o Use vm_page_alloc(... | VM_ALLOC_WIRED) in place of vm_page_wire(). 2002-07-19 19:35:06 +00:00
Kirk McKusick
7aca6291e3 Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

        if (ioflags & IO_SYNC)
                flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by:	DARPA & NAI Labs.
2002-07-19 07:29:39 +00:00
Alan Cox
9175709532 o Lock page queue accesses by vm_page_wire(). 2002-07-14 19:45:46 +00:00
Alan Cox
5123aaef42 o Lock some page queue accesses, in particular, those by vm_page_unwire(). 2002-07-13 20:13:34 +00:00
Matthew Dillon
d331c5d43f Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on.  Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
    Dirties more cache lines during lookups.

    Not as fast as a hash table lookup (but still N log N and optimal
    when there is locality of reference).

Advantages
    vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
    syncer operate more efficiently.

    I get to rip out all the old hacks (some of which were mine) that tried
    to keep the v_dirtyblkhd tailq sorted.

    The per-vnode splay tree should be easier to lock / SMPng pushdown on
    vnodes will be easier.

    This commit along with another that Alan is working on for the VM page
    global hash table will allow me to implement ranged fsync(), optimize
    server-side nfs commit rpcs, and implement partial syncs by the
    filesystem syncer (aka filesystem syncer would detect that someone is
    trying to get the vnode lock, remembers its place, and skip to the
    next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc
2002-07-10 17:02:32 +00:00
Bruce Evans
f20c8dde15 Fixed some printf format errors (one new one reported by gcc and 3 nearby
old ones not reported by gcc).  This helps unbreak LINT.
2002-07-08 12:21:11 +00:00
Jeff Roberson
49244e35ff Add two asserts that prove & document getblk and geteblk's behavior of
returning locked bufs.
2002-07-07 05:27:08 +00:00
Jeff Roberson
c031d11bb4 Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.
2002-07-06 21:27:20 +00:00
Jeff Roberson
9a236af3ad Fixup uses of GETVOBJECT.
- Cache a pointer to the vnode's object in the buf.
 - Hold a reference to that object in addition to the vnode's reference just
   to be consistent.
 - Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.
2002-07-06 08:59:52 +00:00
Maxime Henrion
0a84e7c8f7 More 64 bits platforms warning fixes.
Reviewed by:	rwatson
2002-06-23 18:32:39 +00:00
Matthew Dillon
ed22d6e948 Fix a bug in vfs_bio_clrbuf(). The single-page-clrbuf optimization was
improperly clearing more then just the invalid portions of the page.  (This
bug is not known to have been triggered by anything).

Submitted by:	tegge
MFC after:	7 days
2002-06-22 19:09:35 +00:00
Kirk McKusick
1c85e6a35d This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by:	Poul-Henning Kamp <phk@freebsd.org>
2002-06-21 06:18:05 +00:00
Poul-Henning Kamp
53e645d90e Use "bwrbg" as description when we sleep for background writing,
"biord" was misleading in every possible way.
2002-06-06 08:56:10 +00:00
Poul-Henning Kamp
e31c615c60 Remove a six year old undocumented #ifdef : NO_B_MALLOC. 2002-05-04 19:24:55 +00:00
John Baldwin
6008862bc2 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
Matthew Dillon
d1b534dfc6 brelse() was improperly clearing B_DELWRI in the B_DELWRI|B_INVAL case
without removing the buffer from the vnode's dirty buffer list, which
can result in a panic in NFS.  Replaced the code with a call to bundirty()
which deals with it properly.

PR:		kern/36108, kern/36174
Submitted by:	various people
Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down.
MFC after:	1 day
2002-04-03 00:17:36 +00:00
Alfred Perlstein
4d77a549fe Remove __P. 2002-03-19 21:25:46 +00:00
Bruce Evans
367b50a28f Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those).  Reformat
the printfs if necessary to avoid new long lones or old format printf
errors.
2002-03-19 04:09:21 +00:00
Jake Burkholder
ac59490b5e Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/
pmap_qremove.  pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus.  If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb.  This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings.  This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem.  This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.

Reviewed by:	peter
2002-03-17 00:56:41 +00:00
Kirk McKusick
0d2af52141 Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
2002-03-15 18:49:47 +00:00
Alfred Perlstein
628abf6c69 Giant pushdown for read/write/pread/pwrite syscalls.
kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
   read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.
2002-03-15 08:03:46 +00:00
Eivind Eklund
f52bd684f3 * Move bswlist declaration and initialization from kern/vfs_bio.c to
vm/vm_pager.c, which is the only place it is used.
* Make the QUEUE_* definitions and bufqueues local to vfs_bio.c.
* constify buf_wmesg.
2002-03-05 18:20:58 +00:00
Eivind Eklund
eb8e6d5276 Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by:	phk (but all errors are mine)
2002-03-05 15:38:49 +00:00
Peter Wemm
d1693e1701 Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels.  Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely.  Userland programs still crashed.
2002-02-27 09:51:33 +00:00
Peter Wemm
bd1e3a0f89 Jake further reduced IPI shootdowns on sparc64 in loops by using ranged
shootdowns in a couple of key places.  Do the same for i386.  This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead.  This will help for PAE down the road.

Obtained from:	jake (MI code, suggestions for MD part)
2002-02-27 02:14:58 +00:00
Poul-Henning Kamp
57c10583aa GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED. 2002-02-22 09:26:35 +00:00
Poul-Henning Kamp
986066d065 Replace bowrite() with BUF_WRITE in ufs.
Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by:	mckusick
2002-02-22 09:03:00 +00:00
Matthew Dillon
027df6bdd7 GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after:		0 days
2002-01-31 18:39:44 +00:00
Matthew Dillon
3ebeaf5984 This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

	* An edge case with shared R+W mmap()'s and truncate whereby
	  the system would inappropriately clear the dirty bits on
	  still-dirty data.  (applicable to all filesystems)

	  THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
	  see vm/vm_page.c line 1641

	* The straddle case for VM pages and buffer cache buffers when
	  truncating.  (applicable to NFS client side)

	* Possible SMP database corruption due to vm_pager_unmap_page()
	  not clearing the TLB for the other cpu's.  (applicable to NFS
	  client side but could effect all filesystems).  Note: not
	  considered serious since the corruption occurs beyond the file
	  EOF.

	* When flusing a dirty buffer due to B_CACHE getting cleared,
	  we were accidently setting B_CACHE again (that is, bwrite() sets
	  B_CACHE), when we really want it to stay clear after the write
	  is complete.  This resulted in a corrupt buffer.  (applicable
	  to all filesystems but probably only triggered by NFS)

	* We have to call vtruncbuf() when ftruncate()ing to remove
	  any buffer cache buffers.  This is still tentitive, I may
	  be able to remove it due to the second bug fix.  (applicable
	  to NFS client side)

	* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
	  to set n_size before calling nfs_vinvalbuf or the NFS code
	  may recursively vnode_pager_setsize() to the original value
	  before the truncate.  This is what was causing the user mmap
	  bus faults in the nfs tester program.  (applicable to NFS
	  client side)

	* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
	  by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after:	1 week
2001-12-14 01:16:57 +00:00
Matthew Dillon
a4233d5dc3 The nbuf calculation was assuming that PAGE_SIZE = 4096 bytes, which is
bogus.  The calculation has been adjusted to use units of kilobytes.

Noticed by: Chad David <davidc@acns.ab.ca>
MFC after:	1 week
2001-12-08 20:37:08 +00:00
Matthew Dillon
8ba1f55b49 Placemark an interrupt race in -current which is currently protected by
Giant.  -stable will get spl*() fixes for the race.

Reported by: Rob Anderson <rob@isilon.com>
MFC after:	0 days
2001-11-08 18:09:18 +00:00
Matthew Dillon
7e76bb562e Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write.  This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use.  The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after:	1 week
2001-11-05 18:48:54 +00:00
Matthew Dillon
5eb13f768c Documentation
MFC after:	1 day
2001-10-21 06:26:55 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
Matthew Dillon
46cad5761c Enable vmiodirenable by default. Remove incorrect comment from sysctl.conf.
MFC after:	1 week
2001-09-26 19:35:04 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
Matthew Dillon
0cf5e0ebd6 Remove the code that limited the buffer_map to 1/2 the size of the
kernel_map.  maxbcache takes care of this now and the 1/2 limit can
interfere with testing.

Suggested by: bde
2001-08-22 18:10:37 +00:00
Matthew Dillon
219d632c15 Move most of the kernel submap initialization code, including the
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independant area.  i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.

Reviewed by:    freebsd-smp, jake
2001-08-22 04:07:27 +00:00
Peter Wemm
b219758f94 Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).
2001-07-27 15:57:17 +00:00
Peter Wemm
24a590a074 Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by:	bde
2001-07-27 15:52:49 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
Matthew Dillon
ac8f990bde This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
John Baldwin
8aa66068ed - Always call bfreekva() w/o vm_mtx held.
- Always call vfs_setdirty() with vm_mtx held.
- Fix an old comment: vm_hold_unload_pages is called vm_hold_free_pages()
  nowadays.
- Always call vm_hold_free_pages() w/o vm_mtx held.
2001-05-23 22:24:49 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
Greg Lehey
60fb0ce365 Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
Greg Lehey
d98dc34f52 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
Poul-Henning Kamp
793d6d5d57 bread() is a special case of breadn(), so don't replicate code. 2001-04-18 07:16:07 +00:00
Poul-Henning Kamp
0dfba3cef1 Write a switch statement as less obscure if statements. 2001-04-17 20:22:07 +00:00
Poul-Henning Kamp
f84e29a06c This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
Kirk McKusick
5819ab3f12 Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

	ffs_clusteralloc: map mismatch
	ffs_nodealloccg: map corrupted
	ffs_nodealloccg: block not in map
	ffs_alloccg: map corrupted
	ffs_alloccg: block not in map
	ffs_alloccgblk: cyl groups corrupted
	ffs_alloccgblk: can't find blk in cyl
	ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

	ffs_valloc: dup alloc
	ffs_blkfree: freeing free block
	ffs_blkfree: freeing free frag
	ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
2001-04-17 05:37:51 +00:00
Matthew Dillon
63692125a9 Fix lockup for loopback NFS mounts. The pipelined I/O limitations could be
hit on the client side and prevent the server side from retiring writes.
Pipeline operations turned off for all READs (no big loss since reads are
usually synchronous) and for NFS writes, and left on for the default bwrite().
(MFC expected prior to 4.3 freeze)

Testing by: mjacob, dillon
2001-02-28 04:13:11 +00:00
Bosko Milekic
9ed346bab0 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
Matthew Dillon
4e71e795a1 This commit represents work mainly submitted by Tor and slightly modified
by myself.  It solves a serious vm_map corruption problem that can occur
with the buffer cache when block sizes > 64K are used.  This code has been
heavily tested in -stable but only tested somewhat on -current.  An MFC
will occur in a few days.  My additions include the vm_map_simplify_entry()
and minor buffer cache boundry case fix.

Make the buffer cache use a system map for buffer cache KVM rather then a
normal map.

Ensure that VM objects are not allocated for system maps.  There were cases
where a buffer map could wind up with a backing VM object -- normally
harmless, but this could also result in the buffer cache blocking in places
where it assumes no blocking will occur, possibly resulting in corrupted
maps.

Fix a minor boundry case in the buffer cache size limit is reached that
could result in non-optimal code.

Add vm_map_simplify_entry() calls to prevent 'creeping proliferation'
of vm_map_entry's in the buffer cache's vm_map.  Previously only a simple
linear optimization was made.  (The buffer vm_map typically has only a
handful of vm_map_entry's.  This stabilizes it at that level permanently).

PR: 20609
Submitted by: (Tor Egge) tegge
2001-02-04 06:19:28 +00:00
Jake Burkholder
ef73ae4b0c Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.
2001-01-10 04:43:51 +00:00
Matthew Dillon
2b6b0df712 This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution.  However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box).  The new
algorithm limits the number of pages laundered in the first pageout daemon
pass.  If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace.  This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads.  It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis.  At the moment it is globalized.
2000-12-26 19:41:38 +00:00
John Baldwin
ffc831da27 Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by:	-arch
2000-12-15 20:08:20 +00:00
Matthew Dillon
936524aa02 Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
Poul-Henning Kamp
1d7e3e42e7 Take VBLK devices further out of their missery.
This should fix the panic I introduced in my previous commit on this topic.
2000-11-02 21:14:13 +00:00
John Baldwin
35e0e5b311 Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h
2000-10-20 07:58:15 +00:00
Jason Evans
a18b1f1d4d Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
Boris Popov
9ff5ce6baf Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by:	mckusick, dillon
2000-09-12 09:49:08 +00:00
Jason Evans
0384fff8c5 Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
Kirk McKusick
54e53ebda7 Now that buffer locks can be recursive, we need to delete the panics
that complain about them.

Obtained from:	Brian Fundakowski Feldman <green@FreeBSD.org>
2000-07-25 18:28:46 +00:00
Kirk McKusick
f2a2857bb3 Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
Poul-Henning Kamp
a2e7a027a7 Virtualizes & untangles the bioops operations vector.
Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
2000-06-16 08:48:51 +00:00
Jake Burkholder
e39756439c Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
Jake Burkholder
740a1973a6 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
Poul-Henning Kamp
9626b608de Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
Poul-Henning Kamp
017ef345bc Give struct bio it's own call back mechanism. 2000-05-01 13:36:25 +00:00
Poul-Henning Kamp
95bdaa0ee8 Hmm, diff/patch still doesn't like me.
Missed one s/biowait/bufwait/g
2000-04-30 06:16:03 +00:00
Poul-Henning Kamp
87150cb06d s/biowait/bufwait/g
Prodded by: several.
2000-04-29 16:25:22 +00:00
Poul-Henning Kamp
c1462ad325 Remove a leftover dysonism. 2000-04-29 16:14:10 +00:00
Poul-Henning Kamp
3389ae9350 Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>
2000-04-19 14:58:28 +00:00
Poul-Henning Kamp
19583a8007 Don't declare common variables in include files:
move buftimelock til vfs_bio.c where it is initialized.
2000-04-18 11:21:28 +00:00
Poul-Henning Kamp
8177437d85 Complete the bio/buf divorce for all code below devfs::strategy
Exceptions:
        Vinum untouched.  This means that it cannot be compiled.
        Greg Lehey is on the case.

        CCD not converted yet, casts to struct buf (still safe)

        atapi-cd casts to struct buf to examine B_PHYS
2000-04-15 05:54:02 +00:00
Poul-Henning Kamp
c244d2de43 Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.
2000-04-02 15:24:56 +00:00
Poul-Henning Kamp
8c125869a9 Draw the outline of "struct bio".
Struct bio is the future carrier of I/O requests for "struct buf".
2000-04-02 09:26:51 +00:00
Matthew Dillon
7c58e473f5 Commit the buffer cache cleanup patch to 4.x and 5.x. This patch fixes a
fragmentation problem due to geteblk() reserving too much space for the
    buffer and imposes a larger granularity (16K) on KVA reservations for
    the buffer cache to avoid fragmentation issues.  The buffer cache size
    calculations have been redone to simplify them (fewer defines, better
    comments, less chance of running out of KVA).

    The geteblk() fix solves a performance problem that DG was able reproduce.

    This patch does not completely fix the KVA fragmentation problems, but
    it goes a long way

Mostly Reviewed by: bde and others
Approved by: jkh
2000-03-27 21:29:33 +00:00
Poul-Henning Kamp
b99c307a21 Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.
2000-03-20 11:29:10 +00:00
Poul-Henning Kamp
21144e3bf1 Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd.  The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue.  It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users:  Greg has not had time to test this yet, be careful.
2000-03-20 10:44:49 +00:00
Poul-Henning Kamp
db5f635acc Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.
2000-03-16 08:51:55 +00:00
Kirk McKusick
71c87cfd7e Need to reset the buffer pointer to avoid reconsidering the same buffer
again (without this the rollback analysis was being lost). Should reduce
the write count for most workloads.

Submitted by:	Craig A Soules <soules+@andrew.cmu.edu>
2000-01-18 02:13:26 +00:00
Poul-Henning Kamp
ba4ad1fcea Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by:	bde
2000-01-10 12:04:27 +00:00
Kirk McKusick
cf60e8e4bf Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
   was so recently created that its inode has not yet been written to
   disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
   bitmap is being written to disk. To avoid these stalls, the bitmap is
   copied to another buffer which is written thus leaving the original
   available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
   i_nlink so that inodes that have had no change other than i_effnlink
   need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
   daemon can choose to skip over them.
2000-01-10 00:24:24 +00:00
Luoqi Chen
5e95083920 Introduce a mechanism to suspend/resume system processes. Suspend syncer
and bufdaemon prior to disk sync during system shutdown.
2000-01-07 08:36:44 +00:00
Matthew Dillon
8f95b97072 Reimplement buf_daemon / getnewbuf() interaction for dealing with
stressful situations.  buf_daemon now makes a distinction between
    being woken up and its sleep timing out, and as a consequence is now
    much better able to dynamically tune itself to its environment.

Reviewed by:	Alfred Perlstein <bright@wintelcom.net>
1999-12-20 20:28:40 +00:00
Kirk McKusick
e9cc475851 Collect read and write counts for filesystems. This new code
drops the counting in bwrite and puts it all in spec_strategy.
I did some tests and verified that the counts collected for writes
in spec_strategy is identical to the counts that we previously
collected in bwrite. We now also get read counts (async reads
come from requests for read-ahead blocks). Note that you need
to compile a new version of mount to get the read counts printed
out. The old mount binary is completely compatible, the only
reason to install a new mount is to get the read counts printed.

Submitted by:	Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-12-01 02:09:30 +00:00
Poul-Henning Kamp
38224dcd59 Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by:    sos
1999-11-22 10:33:55 +00:00
Poul-Henning Kamp
2e3c8fcbd0 This is a partial commit of the patch from PR 14914:
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
   structures for list operations.  This patch makes all list operations
   in sys/kern use the queue(3) macros, rather than directly accessing the
   *Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by:    phk
Submitted by:   Jake Burkholder <jake@checker.org>
PR:     14914
1999-11-16 10:56:05 +00:00
Poul-Henning Kamp
3f5bbb089a Remove a #define which doesn't do miracles anymore. 1999-10-30 09:31:52 +00:00
Poul-Henning Kamp
923502ff91 useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
Matthew Dillon
9782fb6209 Adjust the buffer cache to better handle small-memory machines. A
slightly older version of this code was tested by BDE and I.

    Also fixes a lockup situation when kva gets too fragmented.

    Remove the maxvmiobufspace variable and sysctl, they are no longer
    used.  Also cleanup (remove) #if 0 sections from prior commits.

    This code is more of a hack, but presumably the whole buffer cache
    implementation is going to be rewritten in the next year so it's no
    big deal.
1999-10-24 03:27:28 +00:00
Peter Wemm
d1f088dab5 Trim unused options (or #ifdef for undoc options).
Submitted by:	phk
1999-10-11 15:19:12 +00:00
Dmitrij Tejblum
4c6fc7285e Count bogus_page as wired. 1999-09-30 07:39:20 +00:00
Matthew Dillon
d909b5635d Fix bug in brelse() regarding redirtying buffers on B_ERROR. brelse()
improperly ignored the B_INVAL flag when acting on the B_ERROR.
    If both B_INVAL and B_ERROR are set the buffer is typically out of the
    underlying device's block range and must be destroyed.  If only B_ERROR
    is set (for a write), a write error occured and operation remains as it
    was before:  the buffer must be redirtied to avoid corrupting the
    filesystem state.

Reviewed by:	David Greenman <dg@root.com>
Submitted by:	Tor.Egge@fast.no
1999-09-20 16:19:24 +00:00
Luoqi Chen
d8a31f810f Allow getblk() to be called from an idle context (by panic() inside
an interrupt handler).

Reviewed by:	dillon
1999-09-03 17:49:25 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Bruce Evans
d009ccfaf5 Cast pointers to uintptr_t instead of casting them to u_long, and/or vice
versa.  Cosmetic.
1999-08-24 00:56:50 +00:00
Poul-Henning Kamp
0ef1c82630 Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.
1999-08-08 18:43:05 +00:00
Alan Cox
6745299365 Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by:	dillon
1999-07-26 06:25:53 +00:00
Peter Wemm
29a751bf4e bufhashinit() is called with a caddr_t and is expected to return the
same in both the alpha and i386 ports.
1999-07-09 16:41:19 +00:00
Kirk McKusick
025037833c Condition in KASSERT was reversed. 1999-07-08 17:58:55 +00:00
Kirk McKusick
ad8ac923fa These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations.  I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

    * buffer cache hash table now dynamically allocated.  This will
      have no effect on memory consumption for smaller systems and
      will help scale the buffer cache for larger systems.

    * minor enhancement to pmap_clearbit().  I noticed that
      all the calls to it used constant arguments.  Making
      it an inline allows the constants to propogate to
      deeper inlines and should produce better code.

    * removal of inherent vfs_ioopt support through the emplacement
      of appropriate #ifdef's, with John's permission.  If we do not
      find a use for it by the end of the year we will remove it entirely.

    * removal of getnewbufloops* counters & sysctl's - no longer
      necessary for debugging, getnewbuf() is now optimal.

    * buffer hash table functions removed from sys/buf.h and localized
      to vfs_bio.c

    * VFS_BIO_NEED_DIRTYFLUSH flag and support code added
      ( bwillwrite() ), allowing processes to block when too many dirty
      buffers are present in the system.

    * removal of a softdep test in bdwrite() that is no longer necessary
      now that bdwrite() no longer attempts to flush dirty buffers.

    * slight optimization added to bqrelse() - there is no reason
      to test for available buffer space on B_DELWRI buffers.

    * addition of reverse-scanning code to vfs_bio_awrite().
      vfs_bio_awrite() will attempt to locate clusterable areas
      in both the forward and reverse direction relative to the
      offset of the buffer passed to it.  This will probably not
      make much of a difference now, but I believe we will start
      to rely on it heavily in the future if we decide to shift
      some of the burden of the clustering closer to the actual
      I/O initiation.

    * Removal of the newbufcnt and lastnewbuf counters that Kirk
      added.  They do not fix any race conditions that haven't already
      been fixed by the gbincore() test done after the only call
      to getnewbuf().  getnewbuf() is a static, so there is no chance
      of it being misused by other modules.  ( Unless Kirk can think
      of a specific thing that this code fixes.  I went through it
      very carefully and didn't see anything ).

    * removal of VOP_ISLOCKED() check in flushbufqueues().  I do not
      think this check is necessary, the buffer should flush properly
      whether the vnode is locked or not. ( yes? ).

    * removal of extra arguments passed to getnewbuf() that are not
      necessary.

    * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
      vfs_cluster.c

    * vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
      which should greatly aid flushing operations in heavy load
      situations - both the pageout and update daemons will be able
      to operate more efficiently.

    * removal of b_usecount.  We may add it back in later but for now
      it is useless.  Prior implementations of the buffer cache never
      had enough buffers for it to be useful, and current implementations
      which make more buffers available might not benefit relative to
      the amount of sophistication required to implement a b_usecount.
      Straight LRU should work just as well, especially when most things
      are VMIO backed.  I expect that (even though John will not like
      this assumption) directories will become VMIO backed some point soon.

Submitted by:	Matthew Dillon <dillon@backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-08 06:06:00 +00:00
Kirk McKusick
e929c00d23 The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA.  With this patch clean
and dirty buffers have been separated.  Empty buffers with KVM
assignments have been separated from truely empty buffers.  getnewbuf()
has been rewritten and now operates in a 100% optimal fashion.  That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized.  Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur.  This resulted in processes blocking on conditions
unrelated to what they were doing.  This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits.  We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat.  The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed.  This
algorithm has been changed.  reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list.  The new algorithm is deterministic but
not perfect.  The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit.  This bit allows processes working with filesystem
buffers to use available emergency reserves.  Normal processes do not set
this bit and are not allowed to dig into emergency reserves.  The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
Peter Wemm
ddebd8794d Hopefully fix the remaining glitches with the BUF_*() changes. This should
(really this time) fix pageout to swap and a couple of clustering cases.

This simplifies BUF_KERNPROC() so that it unconditionally reassigns the
lock owner rather than testing B_ASYNC and having the caller decide when
to do the reassign.  At present this is required because some places use
B_CALL/b_iodone to free the buffers without B_ASYNC being set.  Also,
vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers
rather than the parent walking the cluster_head tailq.

Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-06-29 05:59:47 +00:00
Peter Wemm
72283ee95d Fix a bug that was almost certainly making breadn() fail. BUF_KERNPROC()
was being called on the wrong bp - it should be called on the one that's
just about to be fed to VOP_STRATEGY().
1999-06-28 15:32:10 +00:00
Peter Wemm
e6257a9a09 GC the remnants of the old pre-softupdates update daemon. It's been
#if 0'd for a fair while now.
1999-06-26 14:46:35 +00:00
Kirk McKusick
67812eacd7 Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
1999-06-26 02:47:16 +00:00
Kirk McKusick
45623f31bc When allocating new buffers in getnewbuf, there are several points
at which we may sleep. So, after completing our buffer allocation
we must ensure that another process has not come along and allocated
a different buffer with the same identity. We do this by keeping a
global counter of the number of buffers that getnewbuf has allocated.
We save this count when we enter getnewbuf and check it when we are
about to return. If it has changed, then other buffers were allocated
while we were in getnewbuf, so we must return NULL to let our parent
know that it must recheck to see if it still needs the new buffer.
Hopefully this fix will eliminate the creation of duplicate buffers
with the same identity and the obscure corruptions that they cause.
1999-06-22 01:39:53 +00:00
Kirk McKusick
f9c8cab591 Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.
1999-06-16 23:27:55 +00:00
Tor Egge
01cf8ad024 If we still haven't got a sufficient number of free buffers after the
call to flushdirtybuffers() then sleep in waitfreebuffers().
PR:		11697
Reviewed by:	David Greenman, Matt Dillon
1999-06-16 03:19:04 +00:00
Kirk McKusick
e4ab40bcb6 Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.
1999-06-15 23:37:29 +00:00
Peter Wemm
ccb84588dd Try an fix a couple of dev_t/major/minor etc nits. 1999-05-12 22:30:50 +00:00
Poul-Henning Kamp
b0eeea2042 remove b_proc from struct buf, it's (now) unused.
Reviewed by:	dillon, bde
1999-05-06 20:00:34 +00:00
Poul-Henning Kamp
84c55b38e4 Remove unused fields from struct buf:
b_savekva
	b_validoff
	b_validend

Reviewed by:	dillon, bde
1999-05-06 17:06:41 +00:00
Alan Cox
4221e284a3 The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS.  These hacks have caused no
end of trouble, especially when combined with mmap().  I've removed
them.  Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write.  NFS does, however,
optimize piecemeal appends to files.  For most common file operations,
you will not notice the difference.  The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations.  NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write.  There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault.  This
is not correct operation.  The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid.  A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid.  This operation is
necessary to properly support mmap().  The zeroing occurs most often
when dealing with file-EOF situations.  Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten.  B_CACHE operation is now
formally defined in comments and more straightforward in
implementation.  B_CACHE for VMIO buffers is based on the validity of
the backing store.  B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa).  biodone() is now responsible for setting B_CACHE
when a successful read completes.  B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated.  VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE.  This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount.  These have been fixed.  getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made.  A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain.  The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
Alan Cox
0043b4376a Address a performance problem in getnewbuf:
In heavy-writing situations, QUEUE_LRU can contain a large number
	of DELWRI buffers at its head.  These buffers must be moved
	to the tail if they cannot be written async in order to reduce
	the scanning time required to skip past these buffers in later
	getnewbuf() calls.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-04-29 18:15:25 +00:00
Dmitrij Tejblum
35871a15c5 getnewbuf(): check return value from tsleep(). Interruptible NFS may pass
PCATCH to slpflag.
1999-04-14 18:51:52 +00:00
Alan Cox
b2e2337ba1 Fix a performance problem with the new getnewbuf() code: in an outofspace
condition ( bufspace > hibufspace ), an inappropriate scan of the empty
queue was performed looking for buffer space to free up.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-04-07 02:41:54 +00:00
Julian Elischer
8d17e69460 Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
    vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
    ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
1999-04-05 19:38:30 +00:00
Bruce Evans
96ebc5810b Fixed a serious bug in rev.1.202. getnewbuf() sometimes didn't
initialise bp->b_data.  This tended to cause panics for file
systems whose block size is smaller than one page.
1999-03-19 10:17:44 +00:00
Julian Elischer
4ef2094e45 Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by:	Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)
1999-03-12 02:24:58 +00:00
Julian Elischer
850c9afd03 Make comment match code. 1999-03-02 21:23:38 +00:00
Julian Elischer
8b3bd42341 Remove inapropriate use of VOP_ISLOCKED()
This produced races resulting in panics and filesystem corruptions
under some circumstances.

Reviewed by: luoqi chen <luoqi@freebsd.org>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
Submitted by: Matt Dillon <dillon@freebsd.org>
1999-03-02 20:26:39 +00:00
Matthew Dillon
d254af07a1 Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile
1999-01-27 21:50:00 +00:00
Matthew Dillon
377f9b28a6 Don't try to calculate B_CACHE for an NFS related bp that has a
> 0 b_validend.  This will screw up small-writes, causing lots of
    little writes out the network.

    We will assume that NFS handles B_CACHE properly.
1999-01-24 00:51:11 +00:00
Matthew Dillon
fae1f2e045 Fix an expression parenthesization typo in a conditional. It should not
have any operational effects other then to make the code in question
     a little faster.  Also added a more involved comment.
1999-01-23 06:36:15 +00:00
David Greenman
33ce4218c6 Don't throw away the buffer contents on a fatal write error; just mark
the buffer as still being dirty. This isn't a perfect solution, but
throwing away the buffer contents will often result in filesystem
corruption and this solution will at least correctly deal with transient
errors.
Submitted by:	Kirk McKusick <mckusick@mckusick.com>
1999-01-22 08:59:05 +00:00
Matthew Dillon
8618c644e4 The main operational changes are in getblk()'s handling of the
B_DELWRI and B_CACHE flags, fixing a bug that showed up with NFS.
    Also, a number of cases where manually inserted code has been removed
    and replaced with an inline function call giving us better functional
    isolation in the source.
1999-01-21 09:19:33 +00:00
Matthew Dillon
1c7c3c6a86 This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
    fixes, several VM optimizations, and some additional revamping of the
    VM code.  The specific bug fixes will be documented with additional
    forced commits.  This commit is somewhat rough in regards to code
    cleanup issues.

Reviewed by:	"John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
1999-01-21 08:29:12 +00:00
Matthew Dillon
ba2871b74c Obtained from: Luoqi
Fix NFS file corruption problem introduced in 1.188.  The valid range
    was not being set properly, causing a later reference to the buffer
    to clear the B_CACHE bit.
1999-01-19 08:00:51 +00:00
Eivind Eklund
a32c99f35e Silence warnings. 1999-01-12 11:59:34 +00:00
Eivind Eklund
219cbf59f2 KNFize, by bde. 1999-01-10 01:58:29 +00:00
Eivind Eklund
5526d2d920 Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by:	msmith
1999-01-08 17:31:30 +00:00
Matthew Dillon
e6ee8e16e0 Adjust some comments to prevent future confusion on the implementation.
Also add a reference to the buf(9) manual page.
1998-12-22 18:57:30 +00:00
Luoqi Chen
1a551e9808 Correctly handle misaligned VMIO buffer (whose start or end offset in the VM
object are not page aligned). This should fix the mount_msdos panic after a
failed attemp to mount as ffs.

Reviewed By:	Matthew Dillon	<dillon@apollo.backplane.com>
		Archie Cobbs	<archie@whistle.com>
		Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
1998-12-22 14:43:58 +00:00
Matthew Dillon
fe523aa107 fix intermediate overflow in 'quad = int * int' situation by casting
the arguments to the multiply to a quad equivalent.  In this case,
    vm_ooffset_t.

Reviewed by:	Archie Cobbs <archie@whistle.com>
1998-12-14 21:17:37 +00:00
Eivind Eklund
4979978b8d Fix grouping of statements. This remove a potential panic in the soft
updates code.  While I'm here, remove an unintended trigraph.

Reviewed by:	Kirk McKusick <kirk@freebsd.org>
1998-12-07 17:23:45 +00:00
David Greenman
4f699173cb Closed a very narrow and rare race condition that involved net interrupts,
bio interrupts, and a truncated file that along with the precise alignment
of the planets could result in a page being freed multiple times or a
just-freed page being put onto the inactive queue.
1998-11-18 09:00:47 +00:00
Peter Wemm
40c8cfe552 Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.
1998-10-31 15:31:29 +00:00
David Greenman
2a78b8d1f8 Unwire everything to the inactive queue in order to preserve LRU ordering. 1998-10-30 14:53:54 +00:00
David Greenman
0d5a725446 Fixed editing error. Pointed out by bde. 1998-10-29 11:04:22 +00:00
David Greenman
730075613a Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.
1998-10-28 13:37:02 +00:00
Poul-Henning Kamp
f5ef029e92 Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.
1998-10-25 17:44:59 +00:00
David Greenman
6cde7a165f Fixed two potentially serious classes of bugs:
1) The vnode pager wasn't properly tracking the file size due to
   "size" being page rounded in some cases and not in others.
   This sometimes resulted in corrupted files. First noticed by
   Terry Lambert.
   Fixed by changing the "size" pager_alloc parameter to be a 64bit
   byte value (as opposed to a 32bit page index) and changing the
   pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
   caused some 64bit offsets and sizes to be scrambled. Removing
   the cast required adding casts at a few dozen callers.
   There may be problems with other bogus casts in close-by
   macros. A quick check seemed to indicate that those were okay,
   however.
1998-10-13 08:24:45 +00:00
Matthew Dillon
ff8fae607b PR: kern/7418
Reviewed by:	Luoqi Chen <luoqi@watermarkgroup.com>

    Fixed problem where write()s can get lost due to buffers flagged B_DELWRI
    being improperly released in brelse().
1998-09-26 00:12:35 +00:00
Peter Wemm
10baba4b95 Goodbye BOUNCE_BUFFERS, for a hack it has served us well.
The last consumer of this code (the old SCSI system) has left us and
the CAM code does it's own bouncing.  The isa dma system has been
doing it's own bouncing for a while too.

Reviewed by:	core
1998-09-25 17:34:49 +00:00
Justin T. Gibbs
7ea97031d1 kern_clock.c:
Remove old disk statistics variables.

vfs_bio.c:
	Enable bowrite now that B_ORDERED works for all buffer devices.
1998-09-15 10:05:18 +00:00
Poul-Henning Kamp
0375c9f2b8 Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by:	Kirk Mckusick, bde
Sponsored by:	M-Systems (www.m-sys.com)
1998-09-05 14:13:12 +00:00
Doug Rabson
e69763a315 Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.
1998-09-04 08:06:57 +00:00
Luoqi Chen
ddae3cb9a0 Close a race window for getnewbuf() between shared lock holders of the vnode.
Reviewed by:	Mike Smith
1998-08-28 20:07:13 +00:00
Poul-Henning Kamp
12e14047a4 Fix DDBs printing of buf-flags after I changed them yesterday. 1998-08-25 14:41:42 +00:00
Poul-Henning Kamp
1d9b3ba13d Remove the last remaining evidence of B_TAPE.
Reclaim 3 unused bits in b_flags
1998-08-24 17:47:25 +00:00
Doug Rabson
069e9bc1b4 Change various syscalls to use size_t arguments instead of u_int.
Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>
1998-08-24 08:39:39 +00:00
Doug Rabson
7032ad107e Protect all modifications to v_numoutput with splbio(). 1998-08-13 08:09:08 +00:00
Doug Rabson
d474eaaa5f Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
1998-08-06 08:33:19 +00:00
Bruce Evans
9f14a215f4 Fixed printf format errors. 1998-07-13 07:05:55 +00:00
Julian Elischer
6deaf84b1f Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to
confuse Soft updates..
Should solve several "dangling deps" panics.
1998-07-08 01:04:33 +00:00
Julian Elischer
fd5d1124e2 VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
1998-07-04 20:45:42 +00:00
Peter Wemm
b1951f4028 vm_page_is_valid() wasn't expecting a large offset argument, it's
expecting a sub-page offset.  We were passing the file position,
and vm_page_bits() could do some interesting things when base was
larger PAGE_SIZE.
if (size > PAGE_SIZE - base)
	size = PAGE_SIZE - base;
is interesting when (PAGE_SIZE - base) is negative.  I could imagine that
this could have interesting consequences for memory page -> device block
bit validation.
1998-05-01 15:10:59 +00:00
Peter Wemm
f806d5a257 Fix one problem with NFSv3 > 2GB file support.
Submitted by: bde
1998-05-01 15:04:35 +00:00
Dag-Erling Smørgrav
dc73342347 Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108. 1998-04-17 22:37:19 +00:00
Bruce Evans
c1087c1324 Support compiling with `gcc -ansi'. 1998-04-15 17:47:40 +00:00
John Dyson
f9be84912c Correct a problem where buffers might not be zeroed when needed. The
B_MALLOC buffers might not have been properly zeroed.
1998-03-27 06:48:24 +00:00
John Dyson
52c64c95c5 In kern_physio.c fix tsleep priority messup.
In vfs_bio.c, remove b_generation count usage,
	remove redundant reassignbuf,
	remove redundant spl(s),
	manage page PG_ZERO flags more correctly,
	utilize in invalid value for b_offset until it
		is properly initialized.  Add asserts
		for #ifdef DIAGNOSTIC, when b_offset is
		improperly used.
	when a process is not performing I/O, and just waiting
		on a buffer generally, make the sleep priority
		low.
	only check page validity in getblk for B_VMIO buffers.

In vfs_cluster, add b_offset asserts, correct pointer calculation
	for clustered reads.  Improve readability of certain parts of
	the code.  Remove redundant spl(s).

In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
	<gallatin@cs.duke.edu>).  More vtruncbuf problems fixed.
1998-03-19 22:48:16 +00:00