568 Commits

kib
9defbad772 When a buffer write fails, it is wrong for brelse() to invalidate the
portion of the page that was written. Among other problems, such a
page might be picked up by the pagedaemon, triggering a failed assertion in
vm_pageout_flush() about the validity of the page.

Reported and tested by:	pho
Approved by:	re (kensmith)
MFC after:	3 weeks
2009-07-19 20:25:59 +00:00
alc
f482e4a9e0 Eliminate an unused variable from allocbuf().
Eliminate the unnecessary setting of page valid bits from a non-VMIO buffer
in vm_hold_load_pages().
2009-06-07 18:19:04 +00:00
alc
41ce4c9579 Eliminate a comment describing code that was deleted over eight years ago.
Move another comment to its proper place.  Fix a typo in a third comment.
2009-06-01 06:12:08 +00:00
alc
5e539a33da nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean().  Nothing uses it any longer.
2009-05-31 20:18:02 +00:00
alc
40062c9280 Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ rather
than using the kernel object.  This allows the elimination of page queues
locking from vm_hold_free_pages().
2009-05-29 18:35:51 +00:00
zml
b5e46da5a4 fail(9) support:
Add support for kernel fault injection using the KFAIL_POINT_* macros and
the fail_point_* infrastructure. Add an example fail point in vfs_bio.c to
simulate VM buf pressure.

Approved by:        dfr (mentor)
2009-05-27 16:36:54 +00:00
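
A minimal sketch of how a fail point of this kind might look, assuming the standard KFAIL_POINT_RETURN macro and the DEBUG_FP parent from sys/fail.h; the point name "buf_pressure" and the surrounding function are illustrative, not the exact fail point added to vfs_bio.c.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/fail.h>

/*
 * Hypothetical fail point in a buffer-allocation path.  It is off by
 * default; enabling it at run time, e.g.
 *     sysctl debug.fail_point.buf_pressure='return(12)'
 * makes this function return the configured value (12 == ENOMEM),
 * simulating buffer pressure without any real shortage.
 */
static int
buf_alloc_example(void)
{
	KFAIL_POINT_RETURN(DEBUG_FP, buf_pressure);

	/* ... normal allocation work would go here ... */
	return (0);
}
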
jhb
35ba8bce41 Only use the ABI compat shim for vfs.bufspace if the old buffer is smaller
than a long.

PR:		amd64/134786
Submitted by:	Emil Mikulic  emikulic| gmail
MFC after:	3 days
2009-05-21 16:18:45 +00:00
alc
c8b00d493e Several changes to vfs_bio_clrbuf():
Provide a more descriptive comment.

Eliminate dead code.  The page cannot possibly have PG_ZERO set.

Eliminate unnecessary blank lines.

Reviewed by:	tegge
2009-05-17 23:25:53 +00:00
alc
dc942dabcf Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with:	tegge
2009-05-17 20:26:00 +00:00
alc
82da6bfdea Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does.  Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies.  Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by:	tegge
2009-05-13 05:39:39 +00:00
alc
123b385c44 Revert CVS revision 1.94 (svn r16840). Current pmap implementations don't
suffer from the race condition that motivated revision 1.94.  Consequently,
the work-around that was implemented by revision 1.94 is no longer needed.
Moreover, reverting this work-around eliminates the need for
vfs_busy_pages() to acquire the page queues lock when preparing a buffer
for read.

Reviewed by:	tegge
2009-05-11 05:16:57 +00:00
kan
06ed41c4fa Undo private changes that should never have been committed. 2009-04-17 18:34:11 +00:00
kan
febc8407e1 More fallout from negative dotdot caching. Negative entries should
be removed from and reinserted into the proper ncneg list.

Reported by:  pho
Submitted by: kib
2009-04-17 18:11:11 +00:00
kib
c6f7194967 In flushbufqueues(), do not allocate the sentinel buffer on the stack,
since struct buf is large. Use a sleeping malloc(9) call, and zero the allocated
buf as a debugging feature.
2009-04-16 09:37:48 +00:00
kib
0f76607a29 Export the number of times bufdaemon got help from the normal threads. 2009-04-16 09:33:52 +00:00
jhb
ffcd13a80f Improve the description of a few sysctls.
Submitted by:	bde (partially)
MFC after:	3 days
2009-03-23 20:18:06 +00:00
attilio
9c2a3fc781 Fix a long-standing bug that crept in over several revisions:
B_DELWRI cleanup and vnode disassociation should happen just before the
buffer is assigned to a queue.

Reported by:	miwi, Volker <volker at vwsoft dot com>,
		Ben Kaduk <minimarmot at gmail dot com>,
		Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by:	lulf, miwi
2009-03-17 16:30:49 +00:00
kib
7ad4235008 Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, an ffs background cg block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting that buffer deadlocks the bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request that
getblk() not block while allocating the buffer and return failure
instead. Add a flags argument to geteblk() to allow passing the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk() if buffer
allocation failed and either GB_NOWAIT_BD is specified or geteblk()
is called from the bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to a synchronous cg block write if shadow
block allocation failed.

Since r107847, a buffer write assumes that the vnode owning the buffer
is locked. The second problem is that the buffer cache may accumulate many
buffers belonging to a limited number of vnodes. Under such a workload,
threads that own the locks on those vnodes quite often try to
read another block from the vnodes and, due to buffer cache
exhaustion, ask the bufdaemon for help. The bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow threads owning vnode locks to help the bufdaemon by doing
a flush pass over the buffer cache before getnewbuf() goes into an
uninterruptible sleep. Move the flushing code from buf_daemon() into a new
helper function, buf_do_flush(), which is called from getnewbuf().  The
number of buffers flushed by a single call to buf_do_flush() from
getnewbuf() is limited by the new sysctl vfs.flushbufqtarget.  Prevent
recursive calls to buf_do_flush() by marking the bufdaemon, and the threads
that temporarily help it, with the TDP_BUFNEED flag.

In collaboration with:	pho
Reviewed by:	 tegge (previous version)
Tested by:	 glebius, yandex ...
MFC after:	 3 weeks
2009-03-16 15:39:46 +00:00
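
A minimal sketch of the recursion guard described above, assuming the td_pflags field of struct thread and the TDP_BUFNEED flag; the helper name and its exact placement relative to getnewbuf() are illustrative rather than the committed code.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/vnode.h>

int	buf_do_flush(struct vnode *vp);	/* the flusher named above; static in vfs_bio.c */

/*
 * Before getnewbuf() commits to an uninterruptible sleep, let the
 * calling thread flush some dirty buffers itself.  TDP_BUFNEED marks
 * the bufdaemon and its temporary helpers so the flush cannot recurse.
 */
static int
getnewbuf_flush_helper(struct vnode *vp)
{
	struct thread *td;
	int flushed;

	td = curthread;
	flushed = 0;
	if ((td->td_pflags & TDP_BUFNEED) == 0) {
		td->td_pflags |= TDP_BUFNEED;
		flushed = buf_do_flush(vp);	/* bounded by vfs.flushbufqtarget */
		td->td_pflags &= ~TDP_BUFNEED;
	}
	return (flushed);
}
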
jhb
c8dd604fc2 In the ABI shim for vfs.bufspace, rather than truncating values larger than
INT_MAX to INT_MAX, just go ahead and write out the full long to give an
error of ENOMEM to the user process.

Requested by:	bde
2009-03-10 21:27:15 +00:00
jhb
0372a61a28 Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that
try to fetch it as an int rather than a long.  If the current value is
greater than INT_MAX it reports a value of INT_MAX.
2009-03-10 15:26:50 +00:00
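
A hedged sketch of what such a compat handler can look like, using the standard sysctl(9) handler interface; the handler and sysctl names are illustrative, and this variant clamps to INT_MAX as described here (the later entry above changes the behaviour to write out the full long instead).

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/limits.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs);

extern long bufspace;			/* the real counter, now a long */

/* Serve an int to old binaries and a long to new ones, keyed off the buffer size. */
static int
sysctl_bufspace_compat(SYSCTL_HANDLER_ARGS)
{
	long lvalue;
	int ivalue;

	lvalue = bufspace;
	if (req->oldlen == sizeof(ivalue)) {
		/* Old ABI: clamp so the int cannot wrap around. */
		ivalue = (lvalue > INT_MAX) ? INT_MAX : (int)lvalue;
		return (sysctl_handle_int(oidp, &ivalue, 0, req));
	}
	return (sysctl_handle_long(oidp, &lvalue, 0, req));
}

SYSCTL_PROC(_vfs, OID_AUTO, bufspace_example, CTLTYPE_LONG | CTLFLAG_RD,
    NULL, 0, sysctl_bufspace_compat, "L", "Buffer space in use (example)");
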
jhb
80d9458a56 Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints.  Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva.  Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers.  Now one has to overflow a long to see
such problems.  There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf.  I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.

MFC after:	1 month
2009-03-09 19:35:20 +00:00
jhb
f856c6d618 Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
  in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
  from vn_printf().
2009-02-06 20:06:48 +00:00
attilio
b8bf37e585 Remove the unneeded struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and the BO_SYNC() "virtual method" of the buffer object set.
The main consumers of the bufobj functions are affected by this change too;
in particular, the functions whose KPI changed are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) as only temporary; it will be axed out ASAP.

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
kib
0488506405 Add ffs structure introspection functions for ddb.
Show the b_dep value for the buffer in the "show buffer" command.
Add a command to dump the dirty/clean buffer lists for a vnode.

Reviewed by:	tegge
Tested and used by:   pho
MFC after:   1 month
2008-09-16 11:19:38 +00:00
kib
ee27d03e64 In brelse, put the B_NEEDSGIANT buffer on the QUEUE_DIRTY_GIANT queue,
instead of QUEUE_DIRTY.

Tested by:	pho
Reviewed by:	attilio
MFC after:	3 days
2008-08-19 11:31:49 +00:00
alc
08181df483 Eliminate dead code. (The commit message for revision 1.287 explains why
this code is dead.)
2008-07-20 04:13:51 +00:00
attilio
7e107a0c8c b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr(), so just revert this approach, using
something similar to the previous one:
BUF_LOCKWAITERS() just checks whether there are waiters (not their actual
number) and is based on the newly introduced lockmgr_waiters(), which
returns whether the lockmgr has waiters or not. The name was chosen to
differ from the old lockwaiters() in order not to confuse the two.

The KPI is enriched by this commit, so a __FreeBSD_version bump and a
manpage update will happen soon.
'struct buf' also changes, so the kernel ABI is disturbed.

Bug found by:	jeff
Approved by:	jeff, kib
2008-03-28 12:30:12 +00:00
jeff
a9d123c3ab - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
kib
bc4bc893dd Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:38:44 +00:00
jeff
72142b2fae - Reduce contention on the global bdonelock and bpinlock by using
a pool mutex to protect these sleep/wakeup/counter races.  This
   still is preferable to bloating each bio with a mtx.
2008-03-21 10:00:05 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
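
For illustration, the convention referred to above simply means writing a ';' after the macro call, as in this hedged example (the names here are made up):

#include <sys/param.h>
#include <sys/kernel.h>

static void
example_init(void *arg __unused)
{
	/* one-time initialization would go here */
}

/* Note the terminating ';' after the SYSINIT() invocation. */
SYSINIT(example_init, SI_SUB_KLD, SI_ORDER_ANY, example_init, NULL);
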
attilio
0d87334131 - Handle the buffer lock waiters count directly in the buffer cache rather
than relying on the lockmgr support [1]:
  * bump the waiters only if the interlock is held
  * let brelvp() return the waiters count
  * rely on brelvp() rather than BUF_LOCKWAITERS() in order to check
    the number of waiters
- Remove a namespace pollution introduced recently with lockmgr.h
  including lock.h by including lock.h directly in the consumers and
  making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
  * introduce LK_NOPROFILE which disables lock profiling for the
    specified lockmgr
  * introduce LK_QUIET which disables ktr tracing for the specified
    lockmgr [2]
  * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
    can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
  used

This patch breaks the KPI, so __FreeBSD_version will be bumped and manpages
updated in further commits. Additionally, the 'struct buf' change results in
a disturbed ABI as well.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by:	kib
Tested by:	pho, Andrea Barberio <insomniac at slackware dot it>
2008-03-01 19:47:50 +00:00
attilio
456bfb1f0f - Add real assertions to lockmgr locking primitives.
A couple of notes for this:
  * WITNESS support, when enabled, is only used for shared locks in order
    to avoid problems with the "disowned" locks
  * KA_HELD and KA_UNHELD only exist in the lockmgr namespace in order
    to assert that a generic thread (not curthread) does or does not own
    the lock.  Really, this kind of check is bogus, but it seems very
    widespread in consumer code.  So, for the moment, we cater to this
    untrusted behaviour until the consumers are fixed and the
    options can be removed (hopefully during the 8.0-CURRENT lifecycle)
  * Implementing KA_HELD and KA_UNHELD (not supported natively by
    WITNESS) made it necessary to introduce LA_MASKASSERT, which
    specifies the range of default lock assertion flags
  * In other respects, lockmgr_assert() follows exactly what other
    locking primitives offer for this operation.

- Build real assertions for buffer cache locks on top of
  lockmgr_assert().  They can be used with the BUF_ASSERT_*(bp)
  paradigm.

- Add checks at lock destruction time and use a cookie for verifying
  lock integrity at any operation.

- Redefine BUF_LOCKFREE() so that it does not use a direct assert but
  instead relies on the aforementioned destruction-time check.

The KPI is evidently broken, so a __FreeBSD_version bump and a
manpage update are necessary and will be committed soon.

Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace, replacing the legacy and still
bogus "VOP_ISLOCKED()" approach.

Tested by:      kris (earlier version)
Reviewed by:    jhb
2008-02-13 20:44:19 +00:00
attilio
caa2ca048b - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for buffer
  locks, built on top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like its counterpart
  VOP_ISLOCKED(9), showing the state of the lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the bogus
BUF_REFCNT() in a more explicit and SMP-compliant way.
This allows us to axe BUF_REFCNT(), leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well, as part of the lockmgr() cleanup.

The KPI is, obviously, broken, so further commits will update the manpages
and __FreeBSD_version.

Tested by: kris (on UFS and NFS)
2008-01-19 17:36:23 +00:00
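
A small hedged sketch of the intended usage shift: assertions that used to infer lock ownership from a reference count now query the lockmgr state directly. The function and messages are illustrative; only BUF_ISLOCKED(), BUF_RECURSED() and the retired BUF_REFCNT() come from the entry above.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>

static void
example_check_buf(struct buf *bp)
{
	/* Old, bogus style: KASSERT(BUF_REFCNT(bp) > 0, ("not locked")); */

	/* New style: ask the buffer's lockmgr lock directly. */
	KASSERT(BUF_ISLOCKED(bp), ("buf %p is not locked", bp));
	if (BUF_RECURSED(bp))
		printf("buf %p: lock is recursed\n", bp);
}
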
attilio
71b7824213 VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument that is always curthread.
Remove the unneeded extra argument and pass curthread explicitly to
lower-layer functions when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage updates will be committed later.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
attilio
18d0a0dd51 vn_lock() is currently only used with 'curthread' passed as the argument.
Remove this argument and pass curthread directly to the underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and, in
particular, removes an annoying dependency, helping the upcoming lockmgr() cleanup.
The KPI is, obviously, changed.

The manpage and __FreeBSD_version will be updated in further commits.

As a side note, it is worth noting that upcoming commits will address
a similar cleanup of the VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
imp
95c2febd5a Rather than not redirtying the bp when we get ENXIO, only redirty it
when the error is EIO.  This catches a much larger class of errors
that are unlikely to succeed if retried.

Submitted by: bde
2007-12-30 05:53:45 +00:00
imp
430e46303c A partial solution to some of the 'pull the umass device with a
mounted FS' problems.  These are more along the lines of 'avoiding an
avoidable panic' than a complete solution to removable devices.  We
now close the barn door after the horse has gotten loose and has been
hit by a truck, as it were.  The barn no longer catches fire in this
case, but the horse is still dead :-).

The vfs_bio.c fix causes us not to put a failed write back into the
dirty pool if the error returned was ENXIO.  In that case, the buffer
is treated like any other clean buffer that's being returned.  ENXIO
means the device isn't there anymore and will never be there again in
the future, so retrying is futile.

The vfs_mount.c fix treats 'ENXIO' as success for unmounting a file
system.  If the device is gone, retrying later won't help and we'll
never be able to unmount the device.

These two are part of a larger patch set submitted by the author.  The
other patches will be forthcoming.  I added comments to these two
patches.

Submitted by: Henrik Gulbrandsen
Reviewed by: phk@
PR: usb/46176 (partial)
2007-12-27 16:38:28 +00:00
alc
ed1f1ac93c Eliminate vfs_page_set_valid()'s unused argument. 2007-12-02 01:28:35 +00:00
julian
51d643caa6 Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add a REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it is added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
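
A hedged before/after sketch of what the rename means for callers, assuming the kproc_create() interface that replaced the process-creating kthread_create(); the daemon and its names are made up for illustration.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kthread.h>
#include <sys/proc.h>

static struct proc *exampledaemonproc;

static void
example_daemon(void *arg __unused)
{
	/* daemon main loop would go here */
	kproc_exit(0);
}

static void
example_daemon_start(void *arg __unused)
{
	/*
	 * Formerly spelled kthread_create(example_daemon, ...); the renamed
	 * call below is the same operation, now honestly named: it creates
	 * a whole kernel process, not a thread.
	 */
	kproc_create(example_daemon, NULL, &exampledaemonproc, 0, 0,
	    "exampledaemon");
}
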
ru
78adc698af Fix the description of the formula used to autosize the number of
buffers in the buffer cache.

Approved by:	re (kensmith)
2007-09-26 11:22:23 +00:00
alc
d1bce06c64 Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq.  Instead, they are kept in a separate per-object
splay tree of cached pages.  However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock.  Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE).  The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held.  Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case.  Cached pages
are reclaimed far, far more often than they are reactivated.  Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated.  Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page.  Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
   Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
marcel
faf9b9cc38 Work around an integer overflow in the expression `3 * maxbufspace / 4',
when maxbufspace is larger than INT_MAX / 3. The overflow causes a
hard hang on ia64 when physical memory is sufficiently large (8GB).
2007-06-09 23:41:14 +00:00
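
A brief illustration of the arithmetic: with a 32-bit int, multiplying first overflows once maxbufspace exceeds INT_MAX / 3, while dividing first (or widening the arithmetic) does not. This is a hedged user-space demonstration, not the exact fix that was committed; dividing first changes the result by at most a couple of units due to truncation.

#include <limits.h>
#include <stdio.h>

int
main(void)
{
	int maxbufspace = INT_MAX / 3 + 1000;	/* big enough that 3 * x overflows an int */

	/*
	 * "3 * maxbufspace / 4" multiplies first, overflowing the int
	 * (signed overflow is undefined behaviour).  Two safe variants:
	 */
	int divide_first = maxbufspace / 4 * 3;		/* divide before multiplying */
	long long widened = 3LL * maxbufspace / 4;	/* or widen the arithmetic */

	printf("%d %lld\n", divide_first, widened);
	return (0);
}
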
delphij
0b0bfc4c75 In getblk(), before gbincore(), use BO_LOCK directly when locking
the bufobj, rather than using VI_LOCK, like what was done with
revision 1.453.
2007-06-08 07:05:08 +00:00
jeff
a7a8bac81f - Move rusage from being per-process in struct pstats to per-thread in
td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  Reads proceed without locks.
 - Aggregate an exiting thread's rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch() to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to its exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
attilio
7dd8ed88a9 Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
jeff
e1996cb960 - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  These can be used to abstract away pcpu details, and all counters
   now use atomics.  This means the sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
kib
543a201e4b Disable nesting of BOP_BDFLUSH(). The VOP_FSYNC() call in bdwrite() could
result in bdwrite() being reentered, thus causing infinite recursion.

Reported and tested by:	Peter Holm
Reviewed by:	tegge
MFC after:	2 weeks
2007-04-24 10:59:21 +00:00
wkoszek
78bc9b6aa1 vm_map_delete should be used only internally, by the VM subsystem. Replace
it with vm_map_remove, which not only embeds an additional check, but also
takes care of locking.

Reviewed by:	alc
Approved by:	alc, cognet (mentor)
2007-03-29 13:26:13 +00:00
kris
56c63dc234 Correct a comment typo 2007-03-25 10:07:23 +00:00