Commit Graph

506 Commits

Author SHA1 Message Date
jeff
6862688995 - Properly check against B_DELWRI and B_NEEDSGIANT. This check was
incorrectly written and caused some !NEEDSGIANT buffers to be put in
   the NEEDSGIANT queue.

Sponsored by:	Isilon Systems, Inc.
2006-04-04 06:44:21 +00:00
jeff
2086f279cf - Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf
requires Giant.  It is set in bgetvp and cleared in brelvp.
 - Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
 - In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
   only if we think there are buffers in that queue.

Sponsored by:	Isilon Systems, Inc.
2006-03-31 02:56:30 +00:00
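The queue selection touched by commits 2086f279cf and 6862688995 can be sketched as follows. This is a minimal illustration, not the kernel code: the flag values, the XQ_* queue names, and the "buggy" variant are assumptions chosen to show one plausible way a combined flag test can misroute a dirty buffer that does not need Giant.

```c
#include <assert.h>

/* Hypothetical flag values; the real B_* constants differ. */
#define XB_DELWRI     0x01u	/* buffer is dirty (delayed write) */
#define XB_NEEDSGIANT 0x02u	/* owning vnode requires Giant */

enum xq { XQ_CLEAN, XQ_DIRTY, XQ_DIRTY_GIANT };

/*
 * Assumed form of the mistake: testing whether ANY of the two bits is
 * set, which sends a dirty !NEEDSGIANT buffer to the Giant queue.
 */
static enum xq
pick_queue_buggy(unsigned flags)
{
	if (flags & (XB_DELWRI | XB_NEEDSGIANT))
		return (XQ_DIRTY_GIANT);
	return (XQ_CLEAN);
}

/* Require BOTH bits for the Giant queue, then fall back to plain dirty. */
static enum xq
pick_queue(unsigned flags)
{
	const unsigned both = XB_DELWRI | XB_NEEDSGIANT;

	if ((flags & both) == both)
		return (XQ_DIRTY_GIANT);
	if (flags & XB_DELWRI)
		return (XQ_DIRTY);
	return (XQ_CLEAN);
}
```

With this split, the buf daemon only has to take Giant while it drains the XQ_DIRTY_GIANT queue.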
pjd
b746cfb8d4 Destroy "bip" bio in error case.
Found by:	Coverity Prevent analysis tool
Coverity ID:	795
MFC after:	3 days
2006-03-22 00:42:41 +00:00
tegge
78439a3a90 For low memory situations, non-VMIO buffers didn't release pages back to
the system when brelse() was called with B_RELBUF set on the buffer.  This
could be a problem when the system was low on memory, had many buffers on
QUEUE_EMPTYKVA and started to traverse directories.  For each getnewbuf(),
pages were allocated from the system, driving the free reserve downwards.
For each brelse(), the system put the buffer on QUEUE_CLEAN, with B_INVAL
set.

This commit changes the semantics of B_RELBUF to also free pages from
non-VMIO buffers.

Reviewed by:	alc
2006-02-02 21:37:39 +00:00
alc
bd4e907d2a Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.

Reviewed by: tegge
2006-01-23 00:00:45 +00:00
tegge
98fde94067 Set flag in needsbuffer while still holding bqlock to avoid lost wakeup.
2006-01-16 22:09:47 +00:00
netchild
507a9b3e93 MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
   this allows trying different coloring algorithms without the need to
   touch every file [1]
 - make the page queue tuning values readable: sysctl vm.stats.pagequeue
 - autotuning of the page coloring values based upon the cache size instead
   of options in the kernel config (disabling of the page coloring as a
   kernel option is still possible)

MD changes:
 - detection of the cache size: only IA32 and AMD64 (untested) contain
   cache size detection code, every other arch just comes with a dummy
   function (this results in the use of default values like it was the
   case without the autotuning of the page coloring)
 - print some more info on Intel CPUs (like we do on AMD and Transmeta
   CPUs)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by:	Chad David <davidc@acns.ab.ca> [1]
Reviewed by:		alc, arch (in 2004)
Discussed with:		alc, Chad David, arch (in 2004)
2005-12-31 14:39:20 +00:00
rodrigc
5a03a98174 Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
    - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
    - b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by:	kan
Reviewed by:		phk
Silence on:		arch@
2005-12-07 03:39:08 +00:00
rwatson
be4f357149 Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
tegge
9797c80fe8 Release clean buffer with wrong size and no dependencies also for non-VMIO
case.
2005-10-09 22:41:25 +00:00
truckman
7dfe92499b Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the thread's previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().
2005-09-30 18:07:41 +00:00
truckman
414043e88d Un-staticize runningbufwakeup() and staticize updateproc.
Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than
checking the proc pointer vs. the known proc pointers for these two
threads.  A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk.  This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk.  The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>
2005-09-30 01:30:01 +00:00
peadar
05494531ef Close a race in biodone(), whereby the bio_done field of the passed
bio may have been freed and reassigned by the wakeup before being
tested after releasing the bdonelock.

There's a non-zero chance this is the cause of a few of the crashes
knocking around with biodone() sitting in the stack backtrace.

Reviewed By: phk@
2005-09-29 10:37:20 +00:00
jeff
8076188a59 - Use lockmgr_printinfo rather than rolling our own. This introduces a
slight problem by using printf instead of db_printf however
   'show lockedvnods' does the same so I believe it is ok for now.
2005-08-03 05:02:08 +00:00
alc
38bf328ab8 Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function.  Consequently, in a race, the top-half function could
conclude that the operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
2005-07-20 19:06:06 +00:00
jeff
78308b0fd3 - Add and enhance asserts related to the wrong bufobj panic.
Sponsored by:	Isilon Systems, Inc.
Approved by:	re (blanket vfs)
2005-06-14 20:32:27 +00:00
jeff
c92b8a6f78 - Split one KASSERT in bremfree() into two to aid in debugging.
Sponsored by:	Isilon Systems, Inc.
2005-06-13 00:45:05 +00:00
green
ff904ffb64 Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries.  However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction.  This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously.  If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system.  It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by:	gad
MFC after:	2 weeks
More testing:	wes
PR:		kern/79208
2005-06-10 23:50:41 +00:00
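The limit described in commit ff904ffb64 amounts to a simple threshold test. The sketch below is an assumption-laden illustration: the divisor and the function names are hypothetical, since the message only says that hibufspace is "scaled to a reasonable fraction".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical divisor; the commit does not state the fraction used. */
#define XWRITE_LIMIT_DIVISOR 10

/* Largest atomic write that may still be handled asynchronously. */
static long
async_write_limit(long hibufspace)
{
	assert(hibufspace >= 0);
	return (hibufspace / XWRITE_LIMIT_DIVISOR);
}

/*
 * A request above the limit is issued synchronously so that it cannot
 * pin the whole buffer cache while holding the vnode lock.
 */
static bool
atomic_write_must_be_sync(long request_bytes, long hibufspace)
{
	return (request_bytes > async_write_limit(hibufspace));
}
```

The trade-off is the one the message names: the synchronous path is slower, but it cannot wait forever for buffer space that the request itself is holding.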
jeff
f637381b78 - My sub-par public school education has been exposed. s/sentinal/sentinel/
Noticed by:	Emil Mikulic
2005-06-09 04:40:20 +00:00
jeff
b53b83993c - Under heavy IO load the buf daemon can run for many hundreds of
milliseconds due to what is essentially n^2 algorithmic complexity.  This
   change makes the algorithm N*2 instead.  This heavy processing manifested
   itself as skipping in audio and video playback due to the long scheduling
   latencies and contention on giant by pcm.
 - flushbufqueues() is now responsible for flushing multiple buffers
   rather than one at a time.  This allows us to save our progress in the
   list by using a sentinel.  We must do the numdirtywakeup() and
   waitrunningbufspace() here now rather than in buf_daemon().
 - Also add a uio_yield() after we have processed the list once for bufs
   without deps and again for bufs with deps.  This is to release Giant
   and allow any other giant locked code to proceed.

Tested by:	Many users on current@
Revealed by:	schedgraph traces sent by Emil Mikulic & Anthony Ginepro
2005-06-08 20:26:05 +00:00
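The sentinel technique from commit b53b83993c can be sketched in user space as follows. This is not the flushbufqueues() implementation: struct xbuf, struct xqueue, and the helper names are hypothetical, and locking is only indicated in comments. The point is that a marker node remembers the scan position, so each buffer is visited once instead of restarting from the head of the list after every write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buf; not the kernel definition. */
struct xbuf {
	struct xbuf *prev, *next;
	int dirty;	/* needs to be written */
	int sentinel;	/* marker node, never flushed */
};

/* Circular doubly linked list with a dummy head node. */
struct xqueue { struct xbuf head; };

static void
q_init(struct xqueue *q)
{
	q->head.prev = q->head.next = &q->head;
}

static void
q_insert_after(struct xbuf *pos, struct xbuf *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

static void
q_remove(struct xbuf *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void
q_insert_tail(struct xqueue *q, struct xbuf *n)
{
	q_insert_after(q->head.prev, n);
}

/* Flush every dirty buffer in one pass; returns the number flushed. */
static int
flush_queue(struct xqueue *q, int *visits)
{
	struct xbuf sentinel = { .sentinel = 1 };
	struct xbuf *bp;
	int flushed = 0;

	q_insert_after(&q->head, &sentinel);
	while ((bp = sentinel.next) != &q->head) {
		/* Advance the marker past bp before touching bp. */
		q_remove(&sentinel);
		q_insert_after(bp, &sentinel);
		(*visits)++;
		if (bp->sentinel || !bp->dirty)
			continue;
		/*
		 * Real code would drop the queue lock here, write the
		 * buffer, and retake the lock.  The marker keeps our place.
		 */
		bp->dirty = 0;
		flushed++;
	}
	q_remove(&sentinel);
	return (flushed);
}
```

Because the scan resumes from the marker rather than from the list head, the number of visits equals the number of queued buffers.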
jeff
33b78c31e9 - Add bufobj_wrefl() to add a write ref to a bufobj that is already locked.
2005-05-30 07:01:18 +00:00
jeff
cb9dfadd87 - Remove long dead splbio() calls and comments relating to the old
synchronization mechanism.
2005-04-30 12:18:50 +00:00
jeff
116d72569a - Don't acquire Giant before calling b_biodone, individual consumers are
now required to do so themselves.

Sponsored by:	Isilon Systems, Inc.
2005-04-30 11:44:22 +00:00
jeff
d8b31a35ea - Add two KASSERTs to prevent us from recycling a buf that is still on a
bufobj list.

Sponsored by:	Isilon Systems, Inc.
2005-04-22 00:53:20 +00:00
jeff
8e533783f3 - Add information about the buf lock to db_show_buffer.
- Add a 'show lockedbufs' command that is similar to show lockedvnods.

Sponsored by:	Isilon Systems, Inc.
2005-03-25 00:20:37 +00:00
jeff
893b010525 - Lock access to the buffer_map with the vm_map lock. In 4.x this was
done with splbio, in 5.x this was done with Giant.

Discussed with:		alc
Reported by:		julian, pho
2005-03-08 09:34:54 +00:00
phk
5dd8d30575 Make various vnode related functions static
2005-02-10 12:28:58 +00:00
jeff
480b60be3c - Add more information to the getnewbuf() recycling KTR.
Sponsored by:	Isilon Systems, Inc.
2005-02-10 02:22:56 +00:00
jeff
ede81ae242 - Remove an invalid KASSERT added in recent background write reshuffling.
Sponsored by:	Isilon Systems, Inc.
2005-02-08 23:25:08 +00:00
phk
af5ef3f262 Background writes are entirely an FFS/Softupdates thing.
Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance).  I don't think we really gain anything
but complexity from doing this with a buf.
2005-02-08 20:29:10 +00:00
jeff
0a084a15e2 - Don't release BKGRDINPROG until after we've bufdone'd the copy.
Sponsored by:	Isilon Systems, Inc.
2005-02-05 01:26:14 +00:00
jeff
da8e6b049d - Don't drop the wref on the bufobj until after bufdone() has completed.
Without this, threads waiting in bufobj_wwait() may wakeup prior to
   bufdone() completing.

Sponsored by:	Isilon Systems, Inc.
2005-01-28 17:48:58 +00:00
phk
796d435574 Don't use VOP_GETVOBJECT, use vp->v_object directly.
2005-01-25 00:40:01 +00:00
phk
d5c135375c Kill the VV_OBJBUF and test the v_object for NULL instead.
2005-01-24 13:13:57 +00:00
jeff
39bf4e6e67 - Add CTR calls to trace the lifecycle of a buffer.
- Remove some KASSERTs which are invalid if the appropriate lock is
   not held.
 - Slightly restructure bremfree() so that it is more sane.
 - Change the flush code in bdwrite() to avoid acquiring a mutex
   whenever possible.
 - Change the flush code in bdwrite() to avoid holding the bufobj mutex
   while calling buf_countdeps().  This introduces a lock-order
   relationship with the softdep lock that can not otherwise be resolved.
 - Don't set B_DONE until bufdone() is complete, otherwise another
   processor may believe the buf is done before it is.
 - Only acquire Giant if the caller has set b_iodone.  Don't grab giant
   around normal bufdone() calls.

Sponsored By:	Isilon Systems, Inc.
2005-01-24 10:47:04 +00:00
phk
5a497775d6 Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.
2005-01-11 10:43:08 +00:00
phk
da2718f1af Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

	The credentials for syncing a file (ability to write to the
	file) should be checked at the system call level.

	Credentials for syncing one or more filesystems ("none")
	should be checked at the system call level as well.

	If the filesystem implementation needs a particular credential
	to carry out the syncing it would logically have to be the
	cached mount credential, or a credential cached along with
	any delayed write data.

Discussed with:	rwatson
2005-01-11 07:36:22 +00:00
jeff
9caab2e843 - Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf.  This is done to prevent lock order
   reversals with code that must call bremfree() with a local lock held.
   This also reduces overhead by removing two lock operations per buf for
   fsync() and similar.
 - Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
   has been acquired so that we may remove ourself from the free-list.
 - Provide a bremfreef() function to immediately remove a buf from a
   free-list for use only by NFS.  This is done because the nfsclient code
   overloads the b_freelist queue for its own async. io queue.
 - Simplify the numfreebuffers accounting by removing a switch statement
   that executed the same code in every possible case.
 - getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
   Remove a panic associated with this condition and delay asserts that
   inspect the buf until after it is locked.

Reviewed by:	phk
Sponsored by:	Isilon Systems, Inc.
2004-11-18 08:44:09 +00:00
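The deferred removal introduced by commit 9caab2e843 can be sketched as follows. This is a simplified model under stated assumptions: the x-prefixed types and functions are hypothetical, the flag value is invented, and the "lock" is a counter used only to show that marking a buffer costs no lock operation.

```c
#include <assert.h>
#include <stddef.h>

#define XB_REMFREE 0x1u		/* hypothetical flag value */

struct xbuf {
	struct xbuf *next;
	unsigned flags;
};

struct xfreelist {
	struct xbuf *head;
	int locked;
	int acquisitions;	/* how many times the lock was taken */
};

static void
xlock(struct xfreelist *fl)
{
	assert(!fl->locked);
	fl->locked = 1;
	fl->acquisitions++;
}

static void
xunlock(struct xfreelist *fl)
{
	assert(fl->locked);
	fl->locked = 0;
}

/* Lazy removal: only mark the buffer, take no queue lock. */
static void
xbremfree(struct xbuf *bp)
{
	bp->flags |= XB_REMFREE;
}

/* Actual unlink; the caller must hold the free-list lock. */
static void
xbremfreel(struct xfreelist *fl, struct xbuf *bp)
{
	struct xbuf **pp;

	assert(fl->locked);
	for (pp = &fl->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == bp) {
			*pp = bp->next;
			break;
		}
	}
	bp->next = NULL;
	bp->flags &= ~XB_REMFREE;
}

/* Release: honor a pending removal once the lock is held anyway. */
static void
xbrelse(struct xfreelist *fl, struct xbuf *bp)
{
	xlock(fl);
	if (bp->flags & XB_REMFREE)
		xbremfreel(fl, bp);
	bp->next = fl->head;
	fl->head = bp;
	xunlock(fl);
}

static int
xfreelist_len(const struct xfreelist *fl)
{
	const struct xbuf *bp;
	int n = 0;

	for (bp = fl->head; bp != NULL; bp = bp->next)
		n++;
	return (n);
}
```

Deferring the unlink to the release path is what removes the two lock operations per buffer and avoids taking the queue lock while a caller holds a local lock.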
phk
e5715b2cc1 Retire b_magic now, we have the bufobj containing the same hint.
2004-11-04 09:48:18 +00:00
phk
e9aa533e84 Change buf->b_object to buf->b_bufobj->bo_object
some whitespace fixes.
2004-11-04 09:06:54 +00:00
phk
bb0cfa35bf whitespace
2004-11-04 08:25:52 +00:00
phk
1e4caea88c Remove buf->b_dev field.
2004-11-04 07:59:57 +00:00
alc
25b80a64b9 The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy().  Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set.  Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition.  Typically,
this is when it is undergoing I/O.
2004-11-03 20:17:31 +00:00
phk
546ea57ed3 Remove the last call in the system to VOP_SPECSTRATEGY(): We can no
longer come through the VNODE layer to the disks since all the filesystems
now go via geom_vfs to GEOM.
2004-10-29 10:52:31 +00:00
phk
86cc21c765 Give dev_strategy() an explicit cdev argument in preparation for removing
buf->b_dev.

Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.

Assert copyright on vfs_bio.c and update copyright message to canonical
text.  There is no legal difference between John Dyson's two-clause
abbreviated BSD license and the canonical text.
2004-10-29 07:16:37 +00:00
phk
08ed0626b7 Lock bp->b_bufobj->b_object instead of bp->b_object
2004-10-28 08:38:46 +00:00
phk
fd2239c999 The island council met and voted buf_prewrite() home.
Give ffs its own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.
2004-10-26 10:44:10 +00:00
phk
c66aa10c8e Put the I/O block size in bufobj->bo_bsize.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open().  The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot
when filesystems sit on GEOM, so don't bother for now.
2004-10-26 07:39:12 +00:00
alc
343104d2b1 Hold the lock on the containing vm object when calling
vm_page_sleep_if_busy().
2004-10-26 06:58:26 +00:00
phk
3a8a530155 Remove vnode->v_bsize. This was a dead-end.
2004-10-25 07:50:59 +00:00