2004-10-29 07:16:37 +00:00
|
|
|
/*-
|
2017-11-27 15:20:12 +00:00
|
|
|
* SPDX-License-Identifier: BSD-2-Clause-FreeBSD
|
|
|
|
*
|
2004-10-29 07:16:37 +00:00
|
|
|
* Copyright (c) 2004 Poul-Henning Kamp
|
1997-12-22 11:54:00 +00:00
|
|
|
* Copyright (c) 1994,1997 John S. Dyson
|
Implement the concept of unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the consumer
with the GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation is
performed at all. With the GB_KVAALLOC flag, the consumer may request
an unmapped buffer which does have a KVA reserve, to manually map it
without recursing into the buffer cache and blocking.
When a mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, an offset and a
length instead of the data pointer. The provider which processes the bio
should explicitly declare its readiness to accept unmapped bios;
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it can be manually tuned with the kern.bio_transient_maxcnt
tunable, in units of transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off with the vfs.unmapped_buf_allowed
tunable; disabling it makes buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
* Copyright (c) 2013 The FreeBSD Foundation
|
1994-05-25 09:21:21 +00:00
|
|
|
* All rights reserved.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
2013-03-19 14:13:12 +00:00
|
|
|
 * Portions of this software were developed by Konstantin Belousov
 * under sponsorship from the FreeBSD Foundation.
 *
|
1994-05-24 10:09:53 +00:00
|
|
|
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
|
2004-10-29 07:16:37 +00:00
|
|
|
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
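The unmapped-buffer commit message quoted in the license header above says that an unmapped bio carries a page array, an offset and a length instead of a data pointer, and that a transient mapping is created only for providers that cannot accept it. The following is a minimal userspace sketch of that idea, not the kernel code. `struct sketch_bio`, `SKETCH_PAGE_SIZE` and `sketch_transient_map()` are hypothetical names invented for illustration, and the "mapping" is simulated by copying the pages into one contiguous buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* hypothetical page size for this model */

/* Hypothetical stand-in for an unmapped bio: pages + offset + length. */
struct sketch_bio {
	unsigned char **pages;	/* page-sized blocks (models a vm_page_t array) */
	size_t npages;		/* number of entries in pages[] */
	size_t offset;		/* byte offset of the data within the first page */
	size_t length;		/* number of valid data bytes */
};

/*
 * Model of a "transient upgrade": build a contiguous view of the data for
 * a consumer that needs a plain pointer.  Here the pages are copied into a
 * malloc'ed buffer.  Returns NULL if the range does not fit in the pages
 * or if allocation fails; the caller frees the result.
 */
static unsigned char *
sketch_transient_map(const struct sketch_bio *bio)
{
	unsigned char *flat;
	size_t done, pg, off, chunk;

	if (bio->offset >= SKETCH_PAGE_SIZE ||
	    bio->offset + bio->length > bio->npages * SKETCH_PAGE_SIZE)
		return (NULL);
	flat = malloc(bio->length != 0 ? bio->length : 1);
	if (flat == NULL)
		return (NULL);
	pg = 0;
	off = bio->offset;
	for (done = 0; done < bio->length; done += chunk) {
		/* Copy up to the end of the current page, then move on. */
		chunk = SKETCH_PAGE_SIZE - off;
		if (chunk > bio->length - done)
			chunk = bio->length - done;
		memcpy(flat + done, bio->pages[pg] + off, chunk);
		pg++;
		off = 0;
	}
	return (flat);
}
```

A consumer that can work directly on the page array would skip `sketch_transient_map()` entirely, which is the saving the commit message describes.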
|
|
|
|
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/*
 * this file contains a new buffer I/O scheme implementing a coherent
 * VM object and buffer cache scheme.  Pains have been taken to make
 * sure that the performance degradation associated with schemes such
 * as this is not realized.
 *
 * Author:  John S. Dyson
 * Significant help during the development and debugging phases
 * had been provided by David Greenman, also of the FreeBSD core team.
|
1998-12-22 18:57:30 +00:00
|
|
|
*
|
|
|
|
* see man buf(9) for more info.
|
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
|
|
|
|
2003-06-11 00:56:59 +00:00
|
|
|
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2021-04-13 20:30:05 +00:00
|
|
|
#include <sys/asan.h>
|
2000-05-05 09:59:14 +00:00
|
|
|
#include <sys/bio.h>
|
2018-03-17 18:14:49 +00:00
|
|
|
#include <sys/bitset.h>
|
2003-10-18 09:03:15 +00:00
|
|
|
#include <sys/conf.h>
|
2018-02-20 00:06:07 +00:00
|
|
|
#include <sys/counter.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/buf.h>
|
2002-09-14 19:34:11 +00:00
|
|
|
#include <sys/devicestat.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/eventhandler.h>
|
2009-05-27 16:36:54 +00:00
|
|
|
#include <sys/fail.h>
|
2019-05-21 20:38:48 +00:00
|
|
|
#include <sys/ktr.h>
|
2007-06-09 23:41:14 +00:00
|
|
|
#include <sys/limits.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mount.h>
|
2000-10-20 07:58:15 +00:00
|
|
|
#include <sys/mutex.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/kernel.h>
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
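The reassignbuf paragraph of the commit message above describes keeping the dirty list locally sorted when that is cheap and otherwise inserting at the head. One way to read "cheap" is an O(1) check against a single neighbour. The sketch below assumes exactly that; it is not taken from the kernel source. `struct sketch_buf`, `sketch_reassign()` and the caller-supplied `hint` are hypothetical, and the hint must already be on the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dirty-list element; only the block number matters here. */
struct sketch_buf {
	long lblkno;			/* logical block number */
	struct sketch_buf *next;	/* next buffer on the dirty list */
};

/*
 * Insert bp into the list at *headp.  If bp fits directly after the hint
 * (hint < bp < hint's successor), link it there in O(1) so the local
 * neighbourhood stays sorted.  Otherwise give up and insert at the head,
 * accepting that the list as a whole may become unsorted.
 */
static void
sketch_reassign(struct sketch_buf **headp, struct sketch_buf *hint,
    struct sketch_buf *bp)
{
	if (hint != NULL && hint->lblkno < bp->lblkno &&
	    (hint->next == NULL || hint->next->lblkno > bp->lblkno)) {
		bp->next = hint->next;
		hint->next = bp;
		return;
	}
	bp->next = *headp;
	*headp = bp;
}
```

Every call does a bounded amount of work regardless of list length, which matches the "deterministic but not perfect" trade-off stated in the message.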
|
|
|
#include <sys/kthread.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/proc.h>
|
2016-04-07 04:23:25 +00:00
|
|
|
#include <sys/racct.h>
|
2019-09-12 16:26:59 +00:00
|
|
|
#include <sys/refcount.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/resourcevar.h>
|
2013-03-09 02:32:23 +00:00
|
|
|
#include <sys/rwlock.h>
|
2015-10-14 02:10:07 +00:00
|
|
|
#include <sys/smp.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/sysctl.h>
|
2019-12-12 18:45:31 +00:00
|
|
|
#include <sys/syscallsubr.h>
|
2013-06-28 03:51:20 +00:00
|
|
|
#include <sys/vmem.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <sys/vmmeter.h>
|
2000-01-07 08:36:44 +00:00
|
|
|
#include <sys/vnode.h>
|
2015-07-29 02:26:57 +00:00
|
|
|
#include <sys/watchdog.h>
|
2004-10-29 07:16:37 +00:00
|
|
|
#include <geom/geom.h>
|
1995-01-09 16:06:02 +00:00
|
|
|
#include <vm/vm.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_param.h>
|
1995-03-16 18:17:34 +00:00
|
|
|
#include <vm/vm_kern.h>
|
1995-01-09 16:06:02 +00:00
|
|
|
#include <vm/vm_object.h>
|
2016-10-28 11:43:59 +00:00
|
|
|
#include <vm/vm_page.h>
#include <vm/vm_pageout.h>
#include <vm/vm_pager.h>
|
1995-12-07 12:48:31 +00:00
|
|
|
#include <vm/vm_extern.h>
|
1996-11-30 22:41:49 +00:00
|
|
|
#include <vm/vm_map.h>
|
2015-07-29 02:26:57 +00:00
|
|
|
#include <vm/swap_pager.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2005-10-31 15:41:29 +00:00
|
|
|
static MALLOC_DEFINE(M_BIOBUF, "biobuf", "BIO buffer");
|
1997-10-11 18:31:40 +00:00
|
|
|
|
1998-03-08 09:59:44 +00:00
|
|
|
struct bio_ops bioops; /* I/O operation notification */
|
|
|
|
|
2001-04-17 08:56:39 +00:00
|
|
|
struct buf_ops buf_ops_bio = {
|
2004-10-24 20:03:41 +00:00
|
|
|
.bop_name = "buf_ops_bio",
|
|
|
|
.bop_write = bufwrite,
|
|
|
|
.bop_strategy = bufstrategy,
|
2005-01-11 10:43:08 +00:00
|
|
|
.bop_sync = bufsync,
|
Cylinder group bitmaps and blocks containing the inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in the global lock order. By itself, this could cause a deadlock
when bdwrite() tries to flush dirty buffers on a snapshotted ffs. If,
during the flush, COW activity for the snapshot needs to allocate a block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then the kernel would panic due to recursive buffer lock
acquisition.
Avoid dealing with buffers in bdwrite() that are on the other side of
the snaplock divisor in the lock order than the buffer being written. Add
a new BOP, bop_bdwrite(), to do dirty buffer flushing for the same vnode in
bdwrite(). The default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, a specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
|
|
|
.bop_bdflush = bufbdflush,
|
2001-04-17 08:56:39 +00:00
|
|
|
};
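The `buf_ops_bio` initializer above is an operations vector: each buffer can carry a table of function pointers, so a caller such as a delayed-write path dispatches to a default flush routine or to a specialized one without knowing which. The sketch below illustrates only that dispatch pattern in userspace. All `sketch_` names and the `flushed_by` field are hypothetical and do not reproduce the kernel's `struct buf_ops`.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_buf;

/* Hypothetical, reduced analogue of an operations vector. */
struct sketch_buf_ops {
	const char *name;
	int (*write)(struct sketch_buf *);
	void (*bdflush)(struct sketch_buf *);
};

struct sketch_buf {
	const struct sketch_buf_ops *ops;
	int flushed_by;	/* records which bdflush ran: 1 default, 2 special */
};

static int
sketch_write(struct sketch_buf *bp)
{
	(void)bp;
	return (0);
}

/* Default flush policy, shared by most buffers. */
static void
sketch_default_bdflush(struct sketch_buf *bp)
{
	bp->flushed_by = 1;
}

/* Specialized flush policy for buffers that need different handling. */
static void
sketch_special_bdflush(struct sketch_buf *bp)
{
	bp->flushed_by = 2;
}

static const struct sketch_buf_ops sketch_ops_default = {
	.name = "sketch_ops_default",
	.write = sketch_write,
	.bdflush = sketch_default_bdflush,
};

static const struct sketch_buf_ops sketch_ops_special = {
	.name = "sketch_ops_special",
	.write = sketch_write,
	.bdflush = sketch_special_bdflush,
};

/* Delayed-write path: run whichever flush policy the buffer carries. */
static void
sketch_bdwrite(struct sketch_buf *bp)
{
	bp->ops->bdflush(bp);
}
```

Designated initializers keep each table readable and let a specialized table override a single member while reusing the rest.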
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
struct bufqueue {
	struct mtx_padalign	bq_lock;
	TAILQ_HEAD(, buf)	bq_queue;
	uint8_t			bq_index;
	uint16_t		bq_subqueue;
	int			bq_len;
} __aligned(CACHE_LINE_SIZE);

#define	BQ_LOCKPTR(bq)		(&(bq)->bq_lock)
#define	BQ_LOCK(bq)		mtx_lock(BQ_LOCKPTR((bq)))
#define	BQ_UNLOCK(bq)		mtx_unlock(BQ_LOCKPTR((bq)))
#define	BQ_ASSERT_LOCKED(bq)	mtx_assert(BQ_LOCKPTR((bq)), MA_OWNED)
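The `struct bufqueue` definition and its `BQ_*` macros pair a queue with the lock that protects it and hide the lock behind short wrappers. A userspace sketch of the same pattern follows, assuming POSIX threads and the BSD `<sys/queue.h>` macros are available. The `sketch_` and `SQ_` names are hypothetical, and `pthread_mutex_t` stands in for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct sketch_buf {
	int id;
	TAILQ_ENTRY(sketch_buf) link;
};

/* A queue bundled with its own lock and a length counter. */
struct sketch_queue {
	pthread_mutex_t sq_lock;
	TAILQ_HEAD(, sketch_buf) sq_queue;
	int sq_len;
};

#define SQ_LOCKPTR(sq)	(&(sq)->sq_lock)
#define SQ_LOCK(sq)	pthread_mutex_lock(SQ_LOCKPTR((sq)))
#define SQ_UNLOCK(sq)	pthread_mutex_unlock(SQ_LOCKPTR((sq)))

static void
sketch_queue_init(struct sketch_queue *sq)
{
	pthread_mutex_init(&sq->sq_lock, NULL);
	TAILQ_INIT(&sq->sq_queue);
	sq->sq_len = 0;
}

/* Append bp to the tail; the length is updated under the same lock. */
static void
sketch_queue_insert(struct sketch_queue *sq, struct sketch_buf *bp)
{
	SQ_LOCK(sq);
	TAILQ_INSERT_TAIL(&sq->sq_queue, bp, link);
	sq->sq_len++;
	SQ_UNLOCK(sq);
}

/* Detach and return the head element, or NULL if the queue is empty. */
static struct sketch_buf *
sketch_queue_remove_head(struct sketch_queue *sq)
{
	struct sketch_buf *bp;

	SQ_LOCK(sq);
	bp = TAILQ_FIRST(&sq->sq_queue);
	if (bp != NULL) {
		TAILQ_REMOVE(&sq->sq_queue, bp, link);
		sq->sq_len--;
	}
	SQ_UNLOCK(sq);
	return (bp);
}
```

Routing every access through the macros means the lock's location can change later by editing one definition, which is presumably why the original defines a `LOCKPTR` macro rather than naming the member at each call site.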
|
|
|
|
|
|
|
|
struct bufdomain {
	struct bufqueue	bd_subq[MAXCPU + 1]; /* Per-cpu sub queues + global */
	struct bufqueue bd_dirtyq;
	struct bufqueue	*bd_cleanq;
	struct mtx_padalign bd_run_lock;
	/* Constants */
	long		bd_maxbufspace;
	long		bd_hibufspace;
	long 		bd_lobufspace;
	long 		bd_bufspacethresh;
	int		bd_hifreebuffers;
	int		bd_lofreebuffers;
	int		bd_hidirtybuffers;
	int		bd_lodirtybuffers;
	int		bd_dirtybufthresh;
	int		bd_lim;
	/* atomics */
	int		bd_wanted;
|
2022-01-19 00:26:16 +00:00
|
|
|
bool bd_shutdown;
|
2018-03-17 18:14:49 +00:00
|
|
|
int __aligned(CACHE_LINE_SIZE) bd_numdirtybuffers;
|
|
|
|
int __aligned(CACHE_LINE_SIZE) bd_running;
|
|
|
|
long __aligned(CACHE_LINE_SIZE) bd_bufspace;
|
|
|
|
int __aligned(CACHE_LINE_SIZE) bd_freebuffers;
|
|
|
|
} __aligned(CACHE_LINE_SIZE);
|
|
|
|
|
|
|
|
#define	BD_LOCKPTR(bd)		(&(bd)->bd_cleanq->bq_lock)
#define	BD_LOCK(bd)		mtx_lock(BD_LOCKPTR((bd)))
#define	BD_UNLOCK(bd)		mtx_unlock(BD_LOCKPTR((bd)))
#define	BD_ASSERT_LOCKED(bd)	mtx_assert(BD_LOCKPTR((bd)), MA_OWNED)
#define	BD_RUN_LOCKPTR(bd)	(&(bd)->bd_run_lock)
#define	BD_RUN_LOCK(bd)		mtx_lock(BD_RUN_LOCKPTR((bd)))
#define	BD_RUN_UNLOCK(bd)	mtx_unlock(BD_RUN_LOCKPTR((bd)))
#define	BD_DOMAIN(bd)		((bd) - bdomain)
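`BD_DOMAIN()` turns a pointer to a domain into that domain's index by subtracting the base of the domain array. A small userspace illustration of this pointer-difference idiom follows. `sketch_domains`, `SKETCH_DOMAIN_INDEX()` and the modulo selection policy are hypothetical; the macro argument is parenthesized so that expressions such as a conditional can be passed safely.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NDOMAINS 4

struct sketch_domain {
	long sd_maxspace;
};

static struct sketch_domain sketch_domains[SKETCH_NDOMAINS];

/* Index of a domain within sketch_domains[], by pointer subtraction. */
#define SKETCH_DOMAIN_INDEX(sd)	((sd) - sketch_domains)

/* Pick a domain for a key; the modulo policy is purely illustrative. */
static struct sketch_domain *
sketch_domain_for(unsigned key)
{
	return (&sketch_domains[key % SKETCH_NDOMAINS]);
}
```

Pointer subtraction yields a count of elements, not bytes, so the result can be used directly as an array index.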
|
|
|
|
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
static char *buf;		/* buffer header pool */
static struct buf *
nbufp(unsigned i)
{
	return ((struct buf *)(buf + (sizeof(struct buf) +
	    sizeof(vm_page_t) * atop(maxbcachebuf)) * i));
}
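`nbufp()` indexes a pool whose elements end in a flexible array, so the element stride is only known at run time and plain `pool[i]` indexing cannot be used. The sketch below shows the same stride arithmetic in userspace. `struct sketch_hdr`, `sketch_pool_init()` and `sketch_hdrp()` are hypothetical, and `calloc()` stands in for however the kernel actually obtains the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header with a trailing flexible array, sized at run time. */
struct sketch_hdr {
	int id;
	void *pages[];
};

static char *sketch_pool;	/* raw storage for all headers */
static size_t sketch_npages;	/* page slots per header, set at init */

/* Bytes occupied by one header including its page slots. */
static size_t
sketch_stride(void)
{
	return (sizeof(struct sketch_hdr) + sizeof(void *) * sketch_npages);
}

/* Allocate zeroed storage for nhdrs headers; returns 0 on success. */
static int
sketch_pool_init(size_t nhdrs, size_t npages)
{
	sketch_npages = npages;
	sketch_pool = calloc(nhdrs, sketch_stride());
	return (sketch_pool != NULL ? 0 : -1);
}

/* Address of the i-th header: base + stride * i, computed in bytes. */
static struct sketch_hdr *
sketch_hdrp(unsigned i)
{
	return ((struct sketch_hdr *)(sketch_pool + sketch_stride() * i));
}
```

Keeping the pool as `char *` forces the byte arithmetic to be explicit, which is the reason the base pointer is not declared with the element type.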
|
|
|
|
|
2019-12-04 21:26:03 +00:00
|
|
|
caddr_t __read_mostly unmapped_buf;
|
1994-05-26 08:45:29 +00:00
|
|
|
|
2014-08-04 22:03:58 +00:00
|
|
|
/* Used below and for softdep flushing threads in ufs/ffs/ffs_softdep.c */
|
|
|
|
struct proc *bufdaemonproc;
|
2003-11-04 06:30:00 +00:00
|
|
|
|
2010-07-11 20:11:44 +00:00
|
|
|
static void vm_hold_free_pages(struct buf *bp, int newbsize);
|
2004-09-15 20:54:23 +00:00
|
|
|
static void vm_hold_load_pages(struct buf *bp, vm_offset_t from,
|
1995-12-14 08:32:45 +00:00
|
|
|
vm_offset_t to);
|
2009-05-13 05:39:39 +00:00
|
|
|
static void vfs_page_set_valid(struct buf *bp, vm_ooffset_t off, vm_page_t m);
|
|
|
|
static void vfs_page_set_validclean(struct buf *bp, vm_ooffset_t off,
|
2007-12-02 01:28:35 +00:00
|
|
|
vm_page_t m);
|
2010-06-08 17:54:28 +00:00
|
|
|
static void vfs_clean_pages_dirty_buf(struct buf *bp);
|
2019-10-29 20:37:59 +00:00
|
|
|
static void vfs_setdirty_range(struct buf *bp);
|
2015-09-22 23:57:52 +00:00
|
|
|
static void vfs_vmio_invalidate(struct buf *bp);
|
|
|
|
static void vfs_vmio_truncate(struct buf *bp, int npages);
|
|
|
|
static void vfs_vmio_extend(struct buf *bp, int npages, int size);
|
2003-02-09 09:47:31 +00:00
|
|
|
static int vfs_bio_clcheck(struct vnode *vp, int size,
|
|
|
|
daddr_t lblkno, daddr_t blkno);
|
2017-09-22 12:45:15 +00:00
|
|
|
static void breada(struct vnode *, daddr_t *, int *, int, struct ucred *, int,
|
|
|
|
void (*)(struct buf *));
|
2018-03-17 18:14:49 +00:00
|
|
|
static int buf_flush(struct vnode *vp, struct bufdomain *, int);
|
|
|
|
static int flushbufqueues(struct vnode *, struct bufdomain *, int, int);
|
2002-03-19 21:25:46 +00:00
|
|
|
static void buf_daemon(void);
|
2013-06-05 23:53:00 +00:00
|
|
|
static __inline void bd_wakeup(void);
|
2013-11-15 15:29:53 +00:00
|
|
|
static int sysctl_runningspace(SYSCTL_HANDLER_ARGS);
|
2015-10-14 02:10:07 +00:00
|
|
|
static void bufkva_reclaim(vmem_t *, int);
|
|
|
|
static void bufkva_free(struct buf *);
|
2018-01-12 23:25:05 +00:00
|
|
|
static int buf_import(void *, void **, int, int, int);
|
2015-10-14 02:10:07 +00:00
|
|
|
static void buf_release(void *, void **, int);
|
2017-06-17 22:24:19 +00:00
|
|
|
static void maxbcachebuf_adjust(void);
|
2018-03-17 18:14:49 +00:00
|
|
|
static inline struct bufdomain *bufdomain(struct buf *);
|
|
|
|
static void bq_remove(struct bufqueue *bq, struct buf *bp);
|
|
|
|
static void bq_insert(struct bufqueue *bq, struct buf *bp, bool unlock);
|
|
|
|
static int buf_recycle(struct bufdomain *, bool kva);
|
|
|
|
static void bq_init(struct bufqueue *bq, int qindex, int cpu,
|
|
|
|
const char *lockname);
|
|
|
|
static void bd_init(struct bufdomain *bd);
|
|
|
|
static int bd_flushall(struct bufdomain *bd);
|
|
|
|
static int sysctl_bufdomain_long(SYSCTL_HANDLER_ARGS);
|
|
|
|
static int sysctl_bufdomain_int(SYSCTL_HANDLER_ARGS);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
2009-03-10 15:26:50 +00:00
|
|
|
static int sysctl_bufspace(SYSCTL_HANDLER_ARGS);
|
2002-03-05 15:38:49 +00:00
|
|
|
int vmiodirenable = TRUE;
|
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, vmiodirenable, CTLFLAG_RW, &vmiodirenable, 0,
|
|
|
|
"Use the VM system for directory writes");
|
Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.
MFC after: 1 month
2009-03-09 19:35:20 +00:00
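The commit message above reports that a modest buffer count could overflow `int` arithmetic on space sizes. The message does not say which per-buffer size was multiplied, so the sketch below assumes a hypothetical 64 KiB per buffer purely to make the arithmetic concrete. It computes the product in 64 bits and checks whether it would still fit in a 32-bit signed integer.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_BUF_BYTES 65536	/* hypothetical 64 KiB per buffer */

/*
 * Returns 1 if nbuf * SKETCH_BUF_BYTES fits in a 32-bit signed int,
 * 0 otherwise.  The product is formed in 64 bits so that the check
 * itself cannot overflow.
 */
static int
sketch_fits_int32(int nbuf)
{
	int64_t total = (int64_t)nbuf * SKETCH_BUF_BYTES;

	return (total >= INT32_MIN && total <= INT32_MAX);
}
```

Under this assumption the product stops fitting at 32768 buffers, since 32768 * 65536 equals 2^31. Widening the variables to `long` removes the problem on platforms where `long` is 64 bits.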
|
|
|
long runningbufspace;
|
|
|
|
SYSCTL_LONG(_vfs, OID_AUTO, runningbufspace, CTLFLAG_RD, &runningbufspace, 0,
|
2002-03-05 15:38:49 +00:00
|
|
|
"Amount of presently outstanding async buffer io");
|
2009-03-10 15:26:50 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, bufspace, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RD,
|
2018-02-20 00:06:07 +00:00
|
|
|
NULL, 0, sysctl_bufspace, "L", "Physical memory used for buffers");
|
|
|
|
static counter_u64_t bufkvaspace;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, bufkvaspace, CTLFLAG_RD, &bufkvaspace,
|
2015-07-23 19:13:41 +00:00
|
|
|
"Kernel virtual memory used for buffers");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long maxbufspace;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, maxbufspace,
|
|
|
|
CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &maxbufspace,
|
|
|
|
__offsetof(struct bufdomain, bd_maxbufspace), sysctl_bufdomain_long, "L",
|
2015-10-14 02:10:07 +00:00
|
|
|
"Maximum allowed value of bufspace (including metadata)");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long bufmallocspace;
|
|
|
|
SYSCTL_LONG(_vfs, OID_AUTO, bufmallocspace, CTLFLAG_RD, &bufmallocspace, 0,
|
2002-03-05 15:38:49 +00:00
|
|
|
"Amount of malloced memory for buffers");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long maxbufmallocspace;
|
2015-10-14 02:10:07 +00:00
|
|
|
SYSCTL_LONG(_vfs, OID_AUTO, maxmallocbufspace, CTLFLAG_RW, &maxbufmallocspace,
|
|
|
|
0, "Maximum amount of malloced memory for buffers");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long lobufspace;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, lobufspace,
|
|
|
|
CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &lobufspace,
|
|
|
|
__offsetof(struct bufdomain, bd_lobufspace), sysctl_bufdomain_long, "L",
|
2002-03-05 15:38:49 +00:00
|
|
|
"Minimum amount of buffers we want to have");
|
2009-03-09 19:35:20 +00:00
|
|
|
long hibufspace;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, hibufspace,
|
|
|
|
CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &hibufspace,
|
|
|
|
__offsetof(struct bufdomain, bd_hibufspace), sysctl_bufdomain_long, "L",
|
2015-10-14 02:10:07 +00:00
|
|
|
"Maximum allowed value of bufspace (excluding metadata)");
|
|
|
|
long bufspacethresh;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, bufspacethresh,
|
|
|
|
CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &bufspacethresh,
|
|
|
|
__offsetof(struct bufdomain, bd_bufspacethresh), sysctl_bufdomain_long, "L",
|
|
|
|
"Bufspace consumed before waking the daemon to free some");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t buffreekvacnt;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, buffreekvacnt, CTLFLAG_RW, &buffreekvacnt,
|
2002-03-05 15:38:49 +00:00
|
|
|
"Number of times we have freed the KVA space from some buffer");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t bufdefragcnt;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, bufdefragcnt, CTLFLAG_RW, &bufdefragcnt,
|
2002-03-05 15:38:49 +00:00
|
|
|
"Number of times we have had to repeat buffer allocation to defragment");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long lorunningspace;
|
2013-11-15 15:29:53 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, lorunningspace, CTLTYPE_LONG | CTLFLAG_MPSAFE |
|
|
|
|
CTLFLAG_RW, &lorunningspace, 0, sysctl_runningspace, "L",
|
2002-03-05 15:38:49 +00:00
|
|
|
"Minimum preferred space used for in-progress I/O");
|
2009-03-09 19:35:20 +00:00
|
|
|
static long hirunningspace;
|
2013-11-15 15:29:53 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, hirunningspace, CTLTYPE_LONG | CTLFLAG_MPSAFE |
|
|
|
|
CTLFLAG_RW, &hirunningspace, 0, sysctl_runningspace, "L",
|
2002-03-05 15:38:49 +00:00
|
|
|
"Maximum amount of space to use for in-progress I/O");
|
Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquisition.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
|
|
|
int dirtybufferflushes;
|
2003-02-25 06:44:42 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, dirtybufferflushes, CTLFLAG_RW, &dirtybufferflushes,
|
|
|
|
0, "Number of bdwrite to bawrite conversions to limit dirty buffers");
|
2007-01-23 10:01:19 +00:00
|
|
|
int bdwriteskip;
|
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, bdwriteskip, CTLFLAG_RW, &bdwriteskip,
|
|
|
|
0, "Number of buffers supplied to bdwrite with snapshot deadlock risk");
|
|
|
|
int altbufferflushes;
|
2019-10-04 21:43:43 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, altbufferflushes, CTLFLAG_RW | CTLFLAG_STATS,
|
|
|
|
&altbufferflushes, 0, "Number of fsync flushes to limit dirty buffers");
|
2003-02-25 23:59:09 +00:00
|
|
|
static int recursiveflushes;
|
2019-10-04 21:43:43 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, recursiveflushes, CTLFLAG_RW | CTLFLAG_STATS,
|
|
|
|
&recursiveflushes, 0, "Number of flushes skipped due to being recursive");
|
2018-03-17 18:14:49 +00:00
|
|
|
static int sysctl_numdirtybuffers(SYSCTL_HANDLER_ARGS);
|
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, numdirtybuffers,
|
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RD, NULL, 0, sysctl_numdirtybuffers, "I",
|
2002-03-05 15:38:49 +00:00
|
|
|
"Number of buffers that are dirty (have unwritten changes) at the moment");
|
|
|
|
static int lodirtybuffers;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, lodirtybuffers,
|
2018-03-21 23:21:32 +00:00
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &lodirtybuffers,
|
2018-03-22 05:26:27 +00:00
|
|
|
__offsetof(struct bufdomain, bd_lodirtybuffers), sysctl_bufdomain_int, "I",
|
2002-03-05 15:38:49 +00:00
|
|
|
"How many buffers we want to have free before bufdaemon can sleep");
|
|
|
|
static int hidirtybuffers;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, hidirtybuffers,
|
2018-03-21 23:21:32 +00:00
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &hidirtybuffers,
|
2018-03-22 05:26:27 +00:00
|
|
|
__offsetof(struct bufdomain, bd_hidirtybuffers), sysctl_bufdomain_int, "I",
|
2002-03-05 15:38:49 +00:00
|
|
|
"When the number of dirty buffers is considered severe");
|
2007-01-23 10:01:19 +00:00
|
|
|
int dirtybufthresh;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, dirtybufthresh,
|
2018-03-21 23:21:32 +00:00
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &dirtybufthresh,
|
2018-03-22 05:26:27 +00:00
|
|
|
__offsetof(struct bufdomain, bd_dirtybufthresh), sysctl_bufdomain_int, "I",
|
2018-03-17 18:14:49 +00:00
|
|
|
"Number of bdwrite to bawrite conversions to clear dirty buffers");
|
2002-03-05 15:38:49 +00:00
|
|
|
static int numfreebuffers;
|
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, numfreebuffers, CTLFLAG_RD, &numfreebuffers, 0,
|
|
|
|
"Number of free buffers");
|
|
|
|
static int lofreebuffers;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, lofreebuffers,
|
2018-03-21 23:21:32 +00:00
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &lofreebuffers,
|
2018-03-22 05:26:27 +00:00
|
|
|
__offsetof(struct bufdomain, bd_lofreebuffers), sysctl_bufdomain_int, "I",
|
2015-10-14 02:10:07 +00:00
|
|
|
"Target number of free buffers");
|
2002-03-05 15:38:49 +00:00
|
|
|
static int hifreebuffers;
|
2018-03-17 18:14:49 +00:00
|
|
|
SYSCTL_PROC(_vfs, OID_AUTO, hifreebuffers,
|
2018-03-21 23:21:32 +00:00
|
|
|
CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &hifreebuffers,
|
2018-03-22 05:26:27 +00:00
|
|
|
__offsetof(struct bufdomain, bd_hifreebuffers), sysctl_bufdomain_int, "I",
|
2015-10-14 02:10:07 +00:00
|
|
|
"Threshold for clean buffer recycling");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t getnewbufcalls;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, getnewbufcalls, CTLFLAG_RD,
|
|
|
|
&getnewbufcalls, "Number of calls to getnewbuf");
|
|
|
|
static counter_u64_t getnewbufrestarts;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, getnewbufrestarts, CTLFLAG_RD,
|
|
|
|
&getnewbufrestarts,
|
2016-04-29 21:54:28 +00:00
|
|
|
"Number of times getnewbuf has had to restart a buffer acquisition");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t mappingrestarts;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, mappingrestarts, CTLFLAG_RD,
|
|
|
|
&mappingrestarts,
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bios carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify its readiness to accept unmapped bios,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
"Number of times getblk has had to restart a buffer mapping for "
|
|
|
|
"unmapped buffer");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t numbufallocfails;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, numbufallocfails, CTLFLAG_RW,
|
|
|
|
&numbufallocfails, "Number of times buffer allocations failed");
|
Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.
First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.
Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.
In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks
2009-03-16 15:39:46 +00:00
|
|
|
static int flushbufqtarget = 100;
|
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, flushbufqtarget, CTLFLAG_RW, &flushbufqtarget, 0,
|
|
|
|
"Amount of work to do in flushbufqueues when helping bufdaemon");
|
2018-02-20 00:06:07 +00:00
|
|
|
static counter_u64_t notbufdflushes;
|
|
|
|
SYSCTL_COUNTER_U64(_vfs, OID_AUTO, notbufdflushes, CTLFLAG_RD, &notbufdflushes,
|
2009-04-16 09:33:52 +00:00
|
|
|
"Number of dirty buffer flushes done by the bufdaemon helpers");
|
2013-02-16 14:51:30 +00:00
|
|
|
static long barrierwrites;
|
2019-10-04 21:43:43 +00:00
|
|
|
SYSCTL_LONG(_vfs, OID_AUTO, barrierwrites, CTLFLAG_RW | CTLFLAG_STATS,
|
|
|
|
&barrierwrites, 0, "Number of barrier writes");
|
2013-03-19 14:13:12 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, unmapped_buf_allowed, CTLFLAG_RD,
|
|
|
|
&unmapped_buf_allowed, 0,
|
|
|
|
"Permit the use of the unmapped i/o");
|
2017-06-17 22:24:19 +00:00
|
|
|
int maxbcachebuf = MAXBCACHEBUF;
|
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, maxbcachebuf, CTLFLAG_RDTUN, &maxbcachebuf, 0,
|
|
|
|
"Maximum size of a buffer cache block");
|
2002-03-05 15:38:49 +00:00
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
/*
|
|
|
|
* This lock synchronizes access to bd_request.
|
|
|
|
*/
|
2017-09-06 20:28:18 +00:00
|
|
|
static struct mtx_padalign __exclusive_cache_line bdlock;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This lock protects the runningbufreq and synchronizes runningbufwakeup and
|
|
|
|
* waitrunningbufspace().
|
|
|
|
*/
|
2017-09-06 20:28:18 +00:00
|
|
|
static struct mtx_padalign __exclusive_cache_line rbreqlock;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Lock that protects bdirtywait.
|
|
|
|
*/
|
2017-09-06 20:28:18 +00:00
|
|
|
static struct mtx_padalign __exclusive_cache_line bdirtylock;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
2022-01-19 00:26:16 +00:00
|
|
|
/*
|
|
|
|
* bufdaemon shutdown request and sleep channel.
|
|
|
|
*/
|
|
|
|
static bool bd_shutdown;
|
|
|
|
|
2002-03-05 15:38:49 +00:00
|
|
|
/*
|
|
|
|
* Wakeup point for bufdaemon, as well as indicator of whether it is already
|
|
|
|
* active. Set to 1 when the bufdaemon is already "on" the queue, 0 when it
|
|
|
|
* is idling.
|
|
|
|
*/
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
|
|
|
static int bd_request;
|
|
|
|
|
2010-04-24 07:05:35 +00:00
|
|
|
/*
|
|
|
|
* Request for the buf daemon to write more buffers than is indicated by
|
|
|
|
* lodirtybuf. This may be necessary to push out excess dependencies or
|
|
|
|
* defragment the address space where a simple count of the number of dirty
|
|
|
|
* buffers is insufficient to characterize the demand for flushing them.
|
|
|
|
*/
|
|
|
|
static int bd_speedupreq;
|
|
|
|
|
2002-03-05 15:38:49 +00:00
|
|
|
/*
|
|
|
|
* Synchronization (sleep/wakeup) variable for active buffer space requests.
|
|
|
|
* Set when wait starts, cleared prior to wakeup().
|
|
|
|
* Used in runningbufwakeup() and waitrunningbufspace().
|
|
|
|
*/
|
|
|
|
static int runningbufreq;
|
|
|
|
|
2003-02-09 09:47:31 +00:00
|
|
|
/*
|
2013-06-05 23:53:00 +00:00
|
|
|
* Synchronization for bwillwrite() waiters.
|
2003-02-09 09:47:31 +00:00
|
|
|
*/
|
2013-06-05 23:53:00 +00:00
|
|
|
static int bdirtywait;
|
2003-02-09 09:47:31 +00:00
|
|
|
|
2002-03-05 18:20:58 +00:00
|
|
|
/*
|
|
|
|
* Definitions for the buffer free lists.
|
|
|
|
*/
|
|
|
|
#define QUEUE_NONE 0 /* on no queue */
|
2015-10-14 02:10:07 +00:00
|
|
|
#define QUEUE_EMPTY 1 /* empty buffer headers */
|
2003-08-28 06:55:18 +00:00
|
|
|
#define QUEUE_DIRTY 2 /* B_DELWRI buffers */
|
2015-10-14 02:10:07 +00:00
|
|
|
#define QUEUE_CLEAN 3 /* non-B_DELWRI buffers */
|
2018-02-20 00:06:07 +00:00
|
|
|
#define QUEUE_SENTINEL 4 /* not a queue index, but a mark for the sentinel */
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
/* Maximum number of buffer domains. */
|
|
|
|
#define BUF_DOMAINS 8
|
2018-02-20 00:06:07 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
struct bufdomainset bdlodirty; /* Domains > lodirty */
|
|
|
|
struct bufdomainset bdhidirty; /* Domains > hidirty */
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/* Configured number of clean queues. */
|
2018-03-17 18:14:49 +00:00
|
|
|
static int __read_mostly buf_domains;
|
2015-10-14 02:10:07 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
BITSET_DEFINE(bufdomainset, BUF_DOMAINS);
|
|
|
|
struct bufdomain __exclusive_cache_line bdomain[BUF_DOMAINS];
|
|
|
|
struct bufqueue __exclusive_cache_line bqempty;
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* per-cpu empty buffer cache.
|
|
|
|
*/
|
|
|
|
uma_zone_t buf_zone;
|
|
|
|
|
2002-03-05 15:38:49 +00:00
|
|
|
/*
|
|
|
|
* Single global constant for BUF_WMESG, to avoid getting multiple references.
|
|
|
|
* buf_wmesg is referred from macros.
|
|
|
|
*/
|
2002-03-05 18:20:58 +00:00
|
|
|
const char *buf_wmesg = BUF_WMESG;
|
1996-01-19 04:00:31 +00:00
|
|
|
|
2013-11-15 15:29:53 +00:00
|
|
|
static int
|
|
|
|
sysctl_runningspace(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
long value;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
value = *(long *)arg1;
|
|
|
|
error = sysctl_handle_long(oidp, &value, 0, req);
|
|
|
|
if (error != 0 || req->newptr == NULL)
|
|
|
|
return (error);
|
|
|
|
mtx_lock(&rbreqlock);
|
|
|
|
if (arg1 == &hirunningspace) {
|
|
|
|
if (value < lorunningspace)
|
|
|
|
error = EINVAL;
|
|
|
|
else
|
|
|
|
hirunningspace = value;
|
|
|
|
} else {
|
|
|
|
KASSERT(arg1 == &lorunningspace,
|
|
|
|
("%s: unknown arg1", __func__));
|
|
|
|
if (value > hirunningspace)
|
|
|
|
error = EINVAL;
|
|
|
|
else
|
|
|
|
lorunningspace = value;
|
|
|
|
}
|
|
|
|
mtx_unlock(&rbreqlock);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
static int
|
|
|
|
sysctl_bufdomain_int(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
int value;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
value = *(int *)arg1;
|
|
|
|
error = sysctl_handle_int(oidp, &value, 0, req);
|
|
|
|
if (error != 0 || req->newptr == NULL)
|
|
|
|
return (error);
|
|
|
|
*(int *)arg1 = value;
|
|
|
|
for (i = 0; i < buf_domains; i++)
|
2018-03-20 02:01:30 +00:00
|
|
|
*(int *)(uintptr_t)(((uintptr_t)&bdomain[i]) + arg2) =
|
2018-03-17 18:14:49 +00:00
|
|
|
value / buf_domains;
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
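The handler above splits a new global value evenly across the configured domains, locating the per-domain field by adding a byte offset (arg2) to the address of each bdomain[i]. A minimal userspace sketch of that offset-based fan-out follows. `struct domain`, `domains`, `fan_out_int`, and `NDOMAINS` are hypothetical stand-ins for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDOMAINS 4	/* hypothetical domain count for this sketch */

/* Hypothetical stand-in for struct bufdomain. */
struct domain {
	int lodirty;
	int hidirty;
};

struct domain domains[NDOMAINS];

/*
 * Store value / ndomains into the int field that lives 'off' bytes into
 * each element of 'd', the same way the handler treats arg2 as a field
 * offset.
 */
void
fan_out_int(struct domain *d, int ndomains, size_t off, int value)
{
	int i;

	for (i = 0; i < ndomains; i++)
		*(int *)(uintptr_t)((uintptr_t)&d[i] + off) =
		    value / ndomains;
}
```

A caller would pass `offsetof(struct domain, hidirty)` as `off`. Because the division truncates, the per-domain values can sum to slightly less than the global value.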
|
|
|
|
|
|
|
|
static int
|
|
|
|
sysctl_bufdomain_long(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
long value;
|
|
|
|
int error;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
value = *(long *)arg1;
|
|
|
|
error = sysctl_handle_long(oidp, &value, 0, req);
|
|
|
|
if (error != 0 || req->newptr == NULL)
|
|
|
|
return (error);
|
|
|
|
*(long *)arg1 = value;
|
|
|
|
for (i = 0; i < buf_domains; i++)
|
2018-03-20 02:01:30 +00:00
|
|
|
*(long *)(uintptr_t)(((uintptr_t)&bdomain[i]) + arg2) =
|
2018-03-17 18:14:49 +00:00
|
|
|
value / buf_domains;
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2009-03-10 15:26:50 +00:00
|
|
|
#if defined(COMPAT_FREEBSD4) || defined(COMPAT_FREEBSD5) || \
|
|
|
|
defined(COMPAT_FREEBSD6) || defined(COMPAT_FREEBSD7)
|
|
|
|
static int
|
|
|
|
sysctl_bufspace(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
long lvalue;
|
|
|
|
int ivalue;
|
2018-02-20 00:06:07 +00:00
|
|
|
int i;
|
2009-03-10 15:26:50 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
lvalue = 0;
|
2018-03-17 18:14:49 +00:00
|
|
|
for (i = 0; i < buf_domains; i++)
|
|
|
|
lvalue += bdomain[i].bd_bufspace;
|
2009-05-21 16:18:45 +00:00
|
|
|
if (sizeof(int) == sizeof(long) || req->oldlen >= sizeof(long))
|
2018-02-20 00:06:07 +00:00
|
|
|
return (sysctl_handle_long(oidp, &lvalue, 0, req));
|
2009-03-10 21:27:15 +00:00
|
|
|
if (lvalue > INT_MAX)
|
|
|
|
/* On overflow, still write out a long to trigger ENOMEM. */
|
|
|
|
return (sysctl_handle_long(oidp, &lvalue, 0, req));
|
|
|
|
ivalue = lvalue;
|
2009-03-10 15:26:50 +00:00
|
|
|
return (sysctl_handle_int(oidp, &ivalue, 0, req));
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
#else
|
2015-10-14 02:10:07 +00:00
|
|
|
static int
|
2018-02-20 00:06:07 +00:00
|
|
|
sysctl_bufspace(SYSCTL_HANDLER_ARGS)
|
1999-07-08 06:06:00 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
long lvalue;
|
|
|
|
int i;
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
lvalue = 0;
|
2018-03-17 18:14:49 +00:00
|
|
|
for (i = 0; i < buf_domains; i++)
|
|
|
|
lvalue += bdomain[i].bd_bufspace;
|
2018-02-22 20:39:25 +00:00
|
|
|
return (sysctl_handle_long(oidp, &lvalue, 0, req));
|
2013-06-05 23:53:00 +00:00
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
#endif
|
2013-06-05 23:53:00 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
static int
|
|
|
|
sysctl_numdirtybuffers(SYSCTL_HANDLER_ARGS)
|
|
|
|
{
|
|
|
|
int value;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
value = 0;
|
|
|
|
for (i = 0; i < buf_domains; i++)
|
|
|
|
value += bdomain[i].bd_numdirtybuffers;
|
|
|
|
return (sysctl_handle_int(oidp, &value, 0, req));
|
|
|
|
}
|
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
/*
|
|
|
|
* bdirtywakeup:
|
|
|
|
*
|
|
|
|
* Wakeup any bwillwrite() waiters.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
bdirtywakeup(void)
|
|
|
|
{
|
|
|
|
mtx_lock(&bdirtylock);
|
|
|
|
if (bdirtywait) {
|
|
|
|
bdirtywait = 0;
|
|
|
|
wakeup(&bdirtywait);
|
1999-07-08 06:06:00 +00:00
|
|
|
}
|
2013-06-05 23:53:00 +00:00
|
|
|
mtx_unlock(&bdirtylock);
|
|
|
|
}
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
/*
|
|
|
|
* bd_clear:
|
|
|
|
*
|
|
|
|
* Clear a domain from the appropriate bitsets when dirtybuffers
|
|
|
|
* is decremented.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
bd_clear(struct bufdomain *bd)
|
|
|
|
{
|
|
|
|
|
|
|
|
mtx_lock(&bdirtylock);
|
|
|
|
if (bd->bd_numdirtybuffers <= bd->bd_lodirtybuffers)
|
|
|
|
BIT_CLR(BUF_DOMAINS, BD_DOMAIN(bd), &bdlodirty);
|
|
|
|
if (bd->bd_numdirtybuffers <= bd->bd_hidirtybuffers)
|
|
|
|
BIT_CLR(BUF_DOMAINS, BD_DOMAIN(bd), &bdhidirty);
|
|
|
|
mtx_unlock(&bdirtylock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bd_set:
|
|
|
|
*
|
|
|
|
* Set a domain in the appropriate bitsets when dirtybuffers
|
|
|
|
* is incremented.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
bd_set(struct bufdomain *bd)
|
|
|
|
{
|
|
|
|
|
|
|
|
mtx_lock(&bdirtylock);
|
|
|
|
if (bd->bd_numdirtybuffers > bd->bd_lodirtybuffers)
|
|
|
|
BIT_SET(BUF_DOMAINS, BD_DOMAIN(bd), &bdlodirty);
|
|
|
|
if (bd->bd_numdirtybuffers > bd->bd_hidirtybuffers)
|
|
|
|
BIT_SET(BUF_DOMAINS, BD_DOMAIN(bd), &bdhidirty);
|
|
|
|
mtx_unlock(&bdirtylock);
|
|
|
|
}
|
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
/*
|
|
|
|
* bdirtysub:
|
|
|
|
*
|
|
|
|
* Decrement the numdirtybuffers count by one and wakeup any
|
|
|
|
* threads blocked in bwillwrite().
|
|
|
|
*/
|
|
|
|
static void
|
2018-03-17 18:14:49 +00:00
|
|
|
bdirtysub(struct buf *bp)
|
2013-06-05 23:53:00 +00:00
|
|
|
{
|
2018-03-17 18:14:49 +00:00
|
|
|
struct bufdomain *bd;
|
|
|
|
int num;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = bufdomain(bp);
|
|
|
|
num = atomic_fetchadd_int(&bd->bd_numdirtybuffers, -1);
|
|
|
|
if (num == (bd->bd_lodirtybuffers + bd->bd_hidirtybuffers) / 2)
|
2013-06-05 23:53:00 +00:00
|
|
|
bdirtywakeup();
|
2018-03-17 18:14:49 +00:00
|
|
|
if (num == bd->bd_lodirtybuffers || num == bd->bd_hidirtybuffers)
|
|
|
|
bd_clear(bd);
|
2013-06-05 23:53:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bdirtyadd:
|
|
|
|
*
|
|
|
|
* Increment the numdirtybuffers count by one and wakeup the buf
|
|
|
|
* daemon if needed.
|
|
|
|
*/
|
|
|
|
static void
|
2018-03-17 18:14:49 +00:00
|
|
|
bdirtyadd(struct buf *bp)
|
2013-06-05 23:53:00 +00:00
|
|
|
{
|
2018-03-17 18:14:49 +00:00
|
|
|
struct bufdomain *bd;
|
|
|
|
int num;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Only do the wakeup once as we cross the boundary. The
|
|
|
|
* buf daemon will keep running until the condition clears.
|
|
|
|
*/
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = bufdomain(bp);
|
|
|
|
num = atomic_fetchadd_int(&bd->bd_numdirtybuffers, 1);
|
|
|
|
if (num == (bd->bd_lodirtybuffers + bd->bd_hidirtybuffers) / 2)
|
2013-06-05 23:53:00 +00:00
|
|
|
bd_wakeup();
|
2018-03-17 18:14:49 +00:00
|
|
|
if (num == bd->bd_lodirtybuffers || num == bd->bd_hidirtybuffers)
|
|
|
|
bd_set(bd);
|
1999-07-08 06:06:00 +00:00
|
|
|
}
|
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
/*
|
2018-02-20 00:06:07 +00:00
|
|
|
* bufspace_daemon_wakeup:
|
1999-03-12 02:24:58 +00:00
|
|
|
*
|
2018-02-20 00:06:07 +00:00
|
|
|
* Wakeup the daemons responsible for freeing clean bufs.
|
1999-03-12 02:24:58 +00:00
|
|
|
*/
|
2015-10-14 02:10:07 +00:00
|
|
|
static void
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_daemon_wakeup(struct bufdomain *bd)
|
1999-03-12 02:24:58 +00:00
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
/*
|
2018-02-20 00:06:07 +00:00
|
|
|
* Avoid the lock if the daemon is already running.
|
1999-03-12 02:24:58 +00:00
|
|
|
*/
|
2018-02-20 00:06:07 +00:00
|
|
|
if (atomic_fetchadd_int(&bd->bd_running, 1) == 0) {
|
|
|
|
BD_RUN_LOCK(bd);
|
|
|
|
atomic_store_int(&bd->bd_running, 1);
|
|
|
|
wakeup(&bd->bd_running);
|
|
|
|
BD_RUN_UNLOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
/*
|
2015-10-14 02:10:07 +00:00
|
|
|
* bufspace_adjust:
|
2015-07-23 19:13:41 +00:00
|
|
|
*
|
|
|
|
* Adjust the reported bufspace for a KVA managed buffer, possibly
|
|
|
|
* waking any waiters.
|
|
|
|
*/
|
|
|
|
static void
|
2015-10-14 02:10:07 +00:00
|
|
|
bufspace_adjust(struct buf *bp, int bufsize)
|
2015-07-23 19:13:41 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufdomain *bd;
|
2015-10-14 02:10:07 +00:00
|
|
|
long space;
|
2015-07-23 19:13:41 +00:00
|
|
|
int diff;
|
|
|
|
|
|
|
|
KASSERT((bp->b_flags & B_MALLOC) == 0,
|
2015-10-14 02:10:07 +00:00
|
|
|
("bufspace_adjust: malloc buf %p", bp));
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = bufdomain(bp);
|
2015-07-23 19:13:41 +00:00
|
|
|
diff = bufsize - bp->b_bufsize;
|
|
|
|
if (diff < 0) {
|
2018-02-20 00:06:07 +00:00
|
|
|
atomic_subtract_long(&bd->bd_bufspace, -diff);
|
2018-03-17 18:14:49 +00:00
|
|
|
} else if (diff > 0) {
|
2018-02-20 00:06:07 +00:00
|
|
|
space = atomic_fetchadd_long(&bd->bd_bufspace, diff);
|
2015-10-14 02:10:07 +00:00
|
|
|
/* Wake up the daemon on the transition. */
|
2018-02-20 00:06:07 +00:00
|
|
|
if (space < bd->bd_bufspacethresh &&
|
|
|
|
space + diff >= bd->bd_bufspacethresh)
|
|
|
|
bufspace_daemon_wakeup(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
2015-07-23 19:13:41 +00:00
|
|
|
bp->b_bufsize = bufsize;
|
|
|
|
}
|
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* bufspace_reserve:
|
|
|
|
*
|
|
|
|
* Reserve bufspace before calling allocbuf(). metadata has a
|
|
|
|
* different space limit than data.
|
|
|
|
*/
|
|
|
|
static int
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_reserve(struct bufdomain *bd, int size, bool metadata)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
long limit, new;
|
2015-10-14 02:10:07 +00:00
|
|
|
long space;
|
|
|
|
|
|
|
|
if (metadata)
|
2018-02-20 00:06:07 +00:00
|
|
|
limit = bd->bd_maxbufspace;
|
2015-10-14 02:10:07 +00:00
|
|
|
else
|
2018-02-20 00:06:07 +00:00
|
|
|
limit = bd->bd_hibufspace;
|
|
|
|
space = atomic_fetchadd_long(&bd->bd_bufspace, size);
|
|
|
|
new = space + size;
|
|
|
|
if (new > limit) {
|
|
|
|
atomic_subtract_long(&bd->bd_bufspace, size);
|
|
|
|
return (ENOSPC);
|
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/* Wake up the daemon on the transition. */
|
2018-02-20 00:06:07 +00:00
|
|
|
if (space < bd->bd_bufspacethresh && new >= bd->bd_bufspacethresh)
|
|
|
|
bufspace_daemon_wakeup(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
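bufspace_reserve() adds the request optimistically, backs it out if the limit is exceeded, and wakes the daemon only on the reservation that crosses the threshold. The same pattern can be sketched in userspace with C11 atomics. `struct space_acct`, `space_reserve`, and the `wakeups` counter are hypothetical names for illustration, and the counter stands in for the real daemon wakeup.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-domain space accounting. */
struct space_acct {
	atomic_long used;	/* bytes currently reserved */
	long limit;		/* hard limit for this kind of request */
	long thresh;		/* level at which the daemon should start */
	int wakeups;		/* counts daemon wakeups, illustration only */
};

/*
 * Optimistically add 'size', then back the addition out if it pushed the
 * total past the limit.  Only the reservation that crosses the threshold
 * signals the daemon.
 */
int
space_reserve(struct space_acct *sa, long size)
{
	long prev, total;

	prev = atomic_fetch_add(&sa->used, size);
	total = prev + size;
	if (total > sa->limit) {
		atomic_fetch_sub(&sa->used, size);
		return (ENOSPC);
	}
	if (prev < sa->thresh && total >= sa->thresh)
		sa->wakeups++;
	return (0);
}
```

The fetch-and-add returns the value before the addition, which is what lets a single atomic operation detect both the limit violation and the threshold transition without taking a lock.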
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bufspace_release:
|
|
|
|
*
|
|
|
|
* Release reserved bufspace after bufspace_adjust() has consumed it.
|
|
|
|
*/
|
|
|
|
static void
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_release(struct bufdomain *bd, int size)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
|
|
|
|
atomic_subtract_long(&bd->bd_bufspace, size);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bufspace_wait:
|
|
|
|
*
|
|
|
|
* Wait for bufspace, acting as the buf daemon if a locked vnode is
|
2018-02-20 00:06:07 +00:00
|
|
|
* supplied. bd_wanted must be set prior to polling for space. The
|
|
|
|
* operation must be re-tried on return.
|
2015-10-14 02:10:07 +00:00
|
|
|
*/
|
|
|
|
static void
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_wait(struct bufdomain *bd, struct vnode *vp, int gbflags,
|
|
|
|
int slpflag, int slptimeo)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
|
|
|
struct thread *td;
|
|
|
|
int error, fl, norunbuf;
|
|
|
|
|
|
|
|
if ((gbflags & GB_NOWAIT_BD) != 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
td = curthread;
|
2018-02-20 00:06:07 +00:00
|
|
|
BD_LOCK(bd);
|
|
|
|
while (bd->bd_wanted) {
|
2015-10-14 02:10:07 +00:00
|
|
|
if (vp != NULL && vp->v_type != VCHR &&
|
|
|
|
(td->td_pflags & TDP_BUFNEED) == 0) {
|
2018-02-20 00:06:07 +00:00
|
|
|
BD_UNLOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* getblk() is called with a vnode locked, and
|
|
|
|
* some majority of the dirty buffers may as
|
|
|
|
* well belong to the vnode. Flushing the
|
|
|
|
* buffers there would make progress that
|
|
|
|
* cannot be achieved by the buf_daemon, which
|
|
|
|
* cannot lock the vnode.
|
|
|
|
*/
|
|
|
|
norunbuf = ~(TDP_BUFNEED | TDP_NORUNNINGBUF) |
|
|
|
|
(td->td_pflags & TDP_NORUNNINGBUF);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Play bufdaemon. The getnewbuf() function
|
|
|
|
* may be called while the thread owns the lock
|
|
|
|
* for another dirty buffer for the same
|
|
|
|
* vnode, which makes it impossible to use
|
|
|
|
* VOP_FSYNC() there, due to the buffer lock
|
|
|
|
* recursion.
|
|
|
|
*/
|
|
|
|
td->td_pflags |= TDP_BUFNEED | TDP_NORUNNINGBUF;
|
2018-03-17 18:14:49 +00:00
|
|
|
fl = buf_flush(vp, bd, flushbufqtarget);
|
2015-10-14 02:10:07 +00:00
|
|
|
td->td_pflags &= norunbuf;
|
2018-02-20 00:06:07 +00:00
|
|
|
BD_LOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (fl != 0)
|
|
|
|
continue;
|
2018-02-20 00:06:07 +00:00
|
|
|
if (bd->bd_wanted == 0)
|
2015-10-14 02:10:07 +00:00
|
|
|
break;
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
error = msleep(&bd->bd_wanted, BD_LOCKPTR(bd),
|
2015-10-14 02:10:07 +00:00
|
|
|
(PRIBIO + 4) | slpflag, "newbuf", slptimeo);
|
|
|
|
if (error != 0)
|
|
|
|
break;
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
BD_UNLOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
|
2022-01-19 00:26:16 +00:00
|
|
|
static void
|
|
|
|
bufspace_daemon_shutdown(void *arg, int howto __unused)
|
|
|
|
{
|
|
|
|
struct bufdomain *bd = arg;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
BD_RUN_LOCK(bd);
|
|
|
|
bd->bd_shutdown = true;
|
|
|
|
wakeup(&bd->bd_running);
|
|
|
|
error = msleep(&bd->bd_shutdown, BD_RUN_LOCKPTR(bd), 0,
|
|
|
|
"bufspace_shutdown", 60 * hz);
|
|
|
|
BD_RUN_UNLOCK(bd);
|
|
|
|
if (error != 0)
|
|
|
|
printf("bufspacedaemon wait error: %d\n", error);
|
|
|
|
}
|
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* bufspace_daemon:
|
|
|
|
*
|
|
|
|
* Buffer space management daemon. Tries to maintain some marginal
|
|
|
|
* amount of free buffer space so that requesting processes neither
|
|
|
|
* block nor work to reclaim buffers.
|
|
|
|
*/
|
|
|
|
static void
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_daemon(void *arg)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
2022-01-19 00:26:16 +00:00
|
|
|
struct bufdomain *bd = arg;
|
2018-02-20 00:06:07 +00:00
|
|
|
|
2022-01-19 00:26:16 +00:00
|
|
|
EVENTHANDLER_REGISTER(shutdown_pre_sync, bufspace_daemon_shutdown, bd,
|
2018-04-22 16:05:29 +00:00
|
|
|
SHUTDOWN_PRI_LAST + 100);
|
|
|
|
|
2022-01-19 00:26:16 +00:00
|
|
|
BD_RUN_LOCK(bd);
|
|
|
|
while (!bd->bd_shutdown) {
|
|
|
|
BD_RUN_UNLOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Free buffers from the clean queue until we meet our
|
|
|
|
* targets.
|
|
|
|
*
|
|
|
|
* Theory of operation: The buffer cache is most efficient
|
|
|
|
* when some free buffer headers and space are always
|
|
|
|
* available to getnewbuf(). This daemon attempts to prevent
|
|
|
|
* the excessive blocking and synchronization associated
|
|
|
|
* with shortfall. It goes through three phases according
|
|
|
|
* to demand:
|
|
|
|
*
|
|
|
|
* 1) The daemon wakes up voluntarily once per second
|
|
|
|
* during idle periods when the counters are below
|
|
|
|
* the wakeup thresholds (bufspacethresh, lofreebuffers).
|
|
|
|
*
|
|
|
|
* 2) The daemon wakes up as we cross the thresholds
|
|
|
|
* ahead of any potential blocking. This may bounce
|
|
|
|
* slightly according to the rate of consumption and
|
|
|
|
* release.
|
|
|
|
*
|
|
|
|
* 3) The daemon and consumers are starved for working
|
|
|
|
* clean buffers. This is the 'bufspace' sleep below
|
|
|
|
* which will inefficiently trade bufs with bqrelse
|
|
|
|
* until we return to condition 2.
|
|
|
|
*/
|
2018-03-26 18:36:17 +00:00
|
|
|
while (bd->bd_bufspace > bd->bd_lobufspace ||
|
|
|
|
bd->bd_freebuffers < bd->bd_hifreebuffers) {
|
2018-02-20 00:06:07 +00:00
|
|
|
if (buf_recycle(bd, false) != 0) {
|
|
|
|
if (bd_flushall(bd))
|
|
|
|
continue;
|
2018-03-17 18:14:49 +00:00
|
|
|
/*
|
|
|
|
* Speedup dirty if we've run out of clean
|
|
|
|
* buffers. This is possible in particular
|
|
|
|
* because softdep may hold many bufs locked
|
|
|
|
* pending writes to other bufs which are
|
|
|
|
* marked for delayed write, exhausting
|
|
|
|
* clean space until they are written.
|
|
|
|
*/
|
|
|
|
bd_speedup();
|
2018-02-20 00:06:07 +00:00
|
|
|
BD_LOCK(bd);
|
|
|
|
if (bd->bd_wanted) {
|
|
|
|
msleep(&bd->bd_wanted, BD_LOCKPTR(bd),
|
|
|
|
PRIBIO|PDROP, "bufspace", hz/10);
|
|
|
|
} else
|
|
|
|
BD_UNLOCK(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
maybe_yield();
|
2018-03-26 18:36:17 +00:00
|
|
|
}
|
2022-01-19 00:26:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Re-check our limits and sleep. bd_running must be
|
|
|
|
* cleared prior to checking the limits to avoid missed
|
|
|
|
* wakeups. The waker will adjust one of bufspace or
|
|
|
|
* freebuffers prior to checking bd_running.
|
|
|
|
*/
|
|
|
|
BD_RUN_LOCK(bd);
|
|
|
|
if (bd->bd_shutdown)
|
|
|
|
break;
|
|
|
|
atomic_store_int(&bd->bd_running, 0);
|
|
|
|
if (bd->bd_bufspace < bd->bd_bufspacethresh &&
|
|
|
|
bd->bd_freebuffers > bd->bd_lofreebuffers) {
|
|
|
|
msleep(&bd->bd_running, BD_RUN_LOCKPTR(bd),
|
|
|
|
PRIBIO, "-", hz);
|
|
|
|
} else {
|
|
|
|
/* Avoid spurious wakeups while running. */
|
|
|
|
atomic_store_int(&bd->bd_running, 1);
|
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
2022-01-19 00:26:16 +00:00
|
|
|
wakeup(&bd->bd_shutdown);
|
|
|
|
BD_RUN_UNLOCK(bd);
|
|
|
|
kthread_exit();
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
/*
|
|
|
|
* bufmallocadjust:
|
|
|
|
*
|
|
|
|
* Adjust the reported bufspace for a malloc managed buffer, possibly
|
|
|
|
* waking any waiters.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
bufmallocadjust(struct buf *bp, int bufsize)
|
|
|
|
{
|
|
|
|
int diff;
|
|
|
|
|
|
|
|
KASSERT((bp->b_flags & B_MALLOC) != 0,
|
|
|
|
("bufmallocadjust: non-malloc buf %p", bp));
|
|
|
|
diff = bufsize - bp->b_bufsize;
|
2015-10-14 02:10:07 +00:00
|
|
|
if (diff < 0)
|
2015-07-23 19:13:41 +00:00
|
|
|
atomic_subtract_long(&bufmallocspace, -diff);
|
2015-10-14 02:10:07 +00:00
|
|
|
else
|
2015-07-23 19:13:41 +00:00
|
|
|
atomic_add_long(&bufmallocspace, diff);
|
|
|
|
bp->b_bufsize = bufsize;
|
|
|
|
}
|
|
|
|
|
2000-12-26 19:41:38 +00:00
|
|
|
/*
|
2013-06-05 23:53:00 +00:00
|
|
|
* runningwakeup:
|
2000-12-26 19:41:38 +00:00
|
|
|
*
|
2013-06-05 23:53:00 +00:00
|
|
|
* Wake up processes that are waiting on asynchronous writes to fall
|
|
|
|
* below lorunningspace.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
runningwakeup(void)
|
|
|
|
{
|
|
|
|
|
|
|
|
mtx_lock(&rbreqlock);
|
|
|
|
if (runningbufreq) {
|
|
|
|
runningbufreq = 0;
|
|
|
|
wakeup(&runningbufreq);
|
|
|
|
}
|
|
|
|
mtx_unlock(&rbreqlock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* runningbufwakeup:
|
|
|
|
*
|
|
|
|
* Decrement the outstanding write count accordingly.
|
2000-12-26 19:41:38 +00:00
|
|
|
*/
|
2005-09-30 01:30:01 +00:00
|
|
|
void
|
2000-12-26 19:41:38 +00:00
|
|
|
runningbufwakeup(struct buf *bp)
|
|
|
|
{
|
2013-06-05 23:53:00 +00:00
|
|
|
long space, bspace;
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
bspace = bp->b_runningbufspace;
|
2013-07-13 19:36:18 +00:00
|
|
|
if (bspace == 0)
|
|
|
|
return;
|
|
|
|
space = atomic_fetchadd_long(&runningbufspace, -bspace);
|
|
|
|
KASSERT(space >= bspace, ("runningbufspace underflow %ld %ld",
|
|
|
|
space, bspace));
|
2013-06-05 23:53:00 +00:00
|
|
|
bp->b_runningbufspace = 0;
|
|
|
|
/*
|
|
|
|
* Only acquire the lock and wakeup on the transition from exceeding
|
|
|
|
* the threshold to falling below it.
|
|
|
|
*/
|
|
|
|
if (space < lorunningspace)
|
|
|
|
return;
|
|
|
|
if (space - bspace > lorunningspace)
|
|
|
|
return;
|
|
|
|
runningwakeup();
|
2000-12-26 19:41:38 +00:00
|
|
|
}
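runningbufwakeup() takes the lock and issues a wakeup only for the completion that moves the running total from at or above lorunningspace to at or below it. The two early returns can be read as a single predicate, sketched below. `crossed_low_water` and its parameter names are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * 'before' is the running total prior to subtracting 'done' bytes and
 * 'lo' is the low-water mark.  Returns true only for the completion that
 * takes the total from at or above 'lo' down to at or below it.
 */
bool
crossed_low_water(long before, long done, long lo)
{
	if (before < lo)
		return (false);		/* already below: nothing to signal */
	if (before - done > lo)
		return (false);		/* still above: keep waiting */
	return (true);
}
```

Testing the value returned by the fetch-and-add, rather than re-reading the counter, means each completion evaluates the transition against its own before and after values.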
|
|
|
|
|
|
|
|
/*
|
|
|
|
* waitrunningbufspace()
|
|
|
|
*
|
|
|
|
* runningbufspace is a measure of the amount of I/O currently
|
|
|
|
* running. This routine is used in async-write situations to
|
|
|
|
* prevent creating huge backups of pending writes to a device.
|
|
|
|
* Only asynchronous writes are governed by this function.
|
|
|
|
*
|
|
|
|
* This does NOT turn an async write into a sync write. It waits
|
|
|
|
* for earlier writes to complete and generally returns before the
|
|
|
|
* caller's write has reached the device.
|
|
|
|
*/
|
2005-09-30 18:07:41 +00:00
|
|
|
void
|
2000-12-26 19:41:38 +00:00
|
|
|
waitrunningbufspace(void)
|
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2003-02-09 09:47:31 +00:00
|
|
|
mtx_lock(&rbreqlock);
|
2000-12-26 19:41:38 +00:00
|
|
|
while (runningbufspace > hirunningspace) {
|
2013-07-13 19:34:34 +00:00
|
|
|
runningbufreq = 1;
|
2003-02-09 09:47:31 +00:00
|
|
|
msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
|
2000-12-26 19:41:38 +00:00
|
|
|
}
|
2003-02-09 09:47:31 +00:00
|
|
|
mtx_unlock(&rbreqlock);
|
2000-12-26 19:41:38 +00:00
|
|
|
}
|
|
|
|
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* vfs_buf_test_cache:
|
|
|
|
*
|
|
|
|
* Called when a buffer is extended. This function clears the B_CACHE
|
|
|
|
* bit if the newly extended portion of the buffer does not contain
|
|
|
|
* valid data.
|
|
|
|
*/
|
2015-09-22 23:57:52 +00:00
|
|
|
static __inline void
|
|
|
|
vfs_buf_test_cache(struct buf *bp, vm_ooffset_t foff, vm_offset_t off,
|
|
|
|
vm_offset_t size, vm_page_t m)
|
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2019-10-15 03:45:41 +00:00
|
|
|
/*
|
|
|
|
* This function and its results are protected by higher level
|
|
|
|
* synchronization requiring vnode and buf locks to page in and
|
|
|
|
* validate pages.
|
|
|
|
*/
|
|
|
|
if (bp->b_flags & B_CACHE) {
|
|
|
|
int base = (foff + off) & PAGE_MASK;
|
|
|
|
if (vm_page_is_valid(m, base, size) == 0)
|
|
|
|
bp->b_flags &= ~B_CACHE;
|
|
|
|
}
|
|
|
|
}
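vfs_buf_test_cache() clears B_CACHE when the newly extended range of a page is not fully valid. A simplified model of that check is sketched below, assuming validity is tracked as one bit per 512-byte chunk of a 4096-byte page. `CHUNK_SIZE`, `PAGE_BYTES`, `FLAG_CACHE`, `range_bits`, and `test_cache` are hypothetical names for this sketch and are not the kernel's vm_page_is_valid() implementation.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE	512	/* assumed validity granularity */
#define PAGE_BYTES	4096	/* assumed page size: 8 chunks per page */
#define FLAG_CACHE	0x1	/* hypothetical stand-in for B_CACHE */

/* Build a mask with one bit per chunk touched by [base, base + size). */
uint8_t
range_bits(int base, int size)
{
	int first, last;

	if (size == 0)
		return (0);
	first = base / CHUNK_SIZE;
	last = (base + size - 1) / CHUNK_SIZE;
	return ((uint8_t)(((2u << last) - (1u << first)) & 0xff));
}

/* Clear the cache flag if any chunk in the extended range is invalid. */
int
test_cache(int flags, uint8_t valid, int base, int size)
{
	uint8_t bits = range_bits(base, size);

	if ((flags & FLAG_CACHE) != 0 && (valid & bits) != bits)
		flags &= ~FLAG_CACHE;
	return (flags);
}
```

As in the function above, the flag is only ever cleared, never set: a buffer that was not cached before the extension stays uncached regardless of page validity.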
|
|
|
|
|
2007-03-25 10:07:23 +00:00
|
|
|
/* Wake up the buffer daemon if necessary */
|
2018-02-20 00:06:07 +00:00
|
|
|
static void
|
2013-06-05 23:53:00 +00:00
|
|
|
bd_wakeup(void)
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
{
2004-09-15 20:54:23 +00:00

2003-02-09 09:47:31 +00:00
	mtx_lock(&bdlock);
2013-06-05 23:53:00 +00:00
	if (bd_request == 0) {
1999-07-04 00:25:38 +00:00
		bd_request = 1;
		wakeup(&bd_request);
	}
2003-02-09 09:47:31 +00:00
	mtx_unlock(&bdlock);
1999-07-04 00:25:38 +00:00
}

2017-06-17 22:24:19 +00:00
/*
 * Adjust the maxbcachebuf tunable.
 */
static void
maxbcachebuf_adjust(void)
{
	int i;

	/*
	 * maxbcachebuf must be a power of 2 >= MAXBSIZE.
	 */
	i = 2;
	while (i * 2 <= maxbcachebuf)
		i *= 2;
	maxbcachebuf = i;
	if (maxbcachebuf < MAXBSIZE)
		maxbcachebuf = MAXBSIZE;
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, and
dev/siis were either submitted by, or based on changes by, mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
	if (maxbcachebuf > maxphys)
		maxbcachebuf = maxphys;
2017-06-17 22:24:19 +00:00
	if (bootverbose != 0 && maxbcachebuf != MAXBCACHEBUF)
		printf("maxbcachebuf=%d\n", maxbcachebuf);
}
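The rounding done by maxbcachebuf_adjust() can be tried outside the kernel. What follows is a minimal user-space sketch, not the kernel code: `adjust_bcachebuf()` and its `min_size` and `max_size` parameters are hypothetical stand-ins for the global maxbcachebuf, MAXBSIZE and maxphys, and the loop condition is written as `i <= value / 2` (equivalent to `i * 2 <= value` for positive values) only to keep the sketch free of signed overflow.

```c
#include <assert.h>

/*
 * Round `value` down to the largest power of two that does not exceed it,
 * then clamp the result to [min_size, max_size], in the same order as the
 * checks in maxbcachebuf_adjust().  All names here are illustrative.
 */
static int
adjust_bcachebuf(int value, int min_size, int max_size)
{
	int i;

	assert(min_size <= max_size);

	/* Largest power of two <= value (2 if value is smaller than 4). */
	i = 2;
	while (i <= value / 2)
		i *= 2;
	value = i;

	/* Lower bound first, then upper bound, as in the kernel function. */
	if (value < min_size)
		value = min_size;
	if (value > max_size)
		value = max_size;
	return (value);
}
```

With assumed bounds of 65536 and 1048576, a request of 200000 rounds down to 131072, while a request of 1000 is raised to the lower bound. Note that the final clamp returns `max_size` unchanged, so the result is a power of two only if the upper bound is one.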
1999-12-20 20:28:40 +00:00
/*
 * bd_speedup - speed up the buffer cache flushing code
 */
void
bd_speedup(void)
{
2010-04-24 07:05:35 +00:00
	int needwake;
2004-09-15 20:54:23 +00:00

2010-04-24 07:05:35 +00:00
	mtx_lock(&bdlock);
	needwake = 0;
	if (bd_speedupreq == 0 || bd_request == 0)
		needwake = 1;
	bd_speedupreq = 1;
	bd_request = 1;
	if (needwake)
		wakeup(&bd_request);
	mtx_unlock(&bdlock);
1999-12-20 20:28:40 +00:00
}
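Both bd_wakeup() and bd_speedup() issue wakeup() only when a request flag changes from clear to set, so repeated callers do not generate redundant wakeups. The following is a simplified single-threaded model of that pattern, offered as a sketch only: `struct bd_state`, the `model_*` functions and the `wakeups` counter are hypothetical, the bdlock mutex is omitted, and `model_bd_consume()` is an assumption about how the daemon clears the flags before a flush pass.

```c
#include <stdbool.h>

/* Hypothetical model of the buffer daemon request state; not kernel code. */
struct bd_state {
	bool request;		/* models bd_request */
	bool speedupreq;	/* models bd_speedupreq */
	int wakeups;		/* number of times wakeup() would be called */
};

/* Models bd_wakeup(): wake the daemon only on the clear -> set transition. */
static void
model_bd_wakeup(struct bd_state *s)
{
	if (!s->request) {
		s->request = true;
		s->wakeups++;
	}
}

/* Models bd_speedup(): wake if either flag was clear before the call. */
static void
model_bd_speedup(struct bd_state *s)
{
	bool needwake = !s->speedupreq || !s->request;

	s->speedupreq = true;
	s->request = true;
	if (needwake)
		s->wakeups++;
}

/* Assumed daemon side: clear both flags before starting a flush pass. */
static void
model_bd_consume(struct bd_state *s)
{
	s->request = false;
	s->speedupreq = false;
}
```

In this model, two back-to-back calls to `model_bd_wakeup()` count one wakeup, and a later `model_bd_speedup()` counts one more because the speedup flag was still clear. In the kernel the flag tests and updates happen under bdlock, which is what makes the check-then-set sequence safe against concurrent callers.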
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00

2013-03-27 10:56:15 +00:00
#ifdef __i386__
#define TRANSIENT_DENOM 5
#else
#define TRANSIENT_DENOM 10
#endif

1994-05-24 10:09:53 +00:00
/*
2001-08-22 04:07:27 +00:00
 * Calculate buffer cache scaling values and reserve space for buffer
 * headers. This is called during low level kernel initialization and
 * may be called more than once. We CANNOT write to the memory area
 * being reserved at this time.
1994-05-24 10:09:53 +00:00
 */
1999-07-09 16:41:19 +00:00
caddr_t
2002-08-30 04:04:37 +00:00
kern_vfs_bio_buffer_alloc(caddr_t v, long physmem_est)
1999-07-08 06:06:00 +00:00
{
Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.
MFC after: 1 month
2009-03-09 19:35:20 +00:00
	int tuned_nbuf;
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For an unmapped buffer, no KVA reservation is
performed at all. The consumer might request an unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
Unmapped bios carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bios,
otherwise the g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by the kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; disabling it makes the buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
	long maxbuf, maxbuf_sz, buf_sz, biotmap_sz;
2004-09-15 20:54:23 +00:00

2021-04-13 20:30:05 +00:00
	/*
2021-08-10 20:52:36 +00:00
	 * With KASAN or KMSAN enabled, the kernel map is shadowed. Account for
	 * this when sizing maps based on the amount of physical memory
	 * available.
2021-04-13 20:30:05 +00:00
	 */
2021-08-10 20:52:36 +00:00
#if defined(KASAN)
2021-04-13 20:30:05 +00:00
	physmem_est = (physmem_est * KASAN_SHADOW_SCALE) /
	    (KASAN_SHADOW_SCALE + 1);
2021-08-10 20:52:36 +00:00
#elif defined(KMSAN)
	physmem_est /= 3;

	/*
	 * KMSAN cannot reliably determine whether buffer data is initialized
	 * unless it is updated through a KVA mapping.
	 */
	unmapped_buf_allowed = 0;
2021-04-13 20:30:05 +00:00
#endif

2001-12-08 20:37:08 +00:00
	/*
	 * physmem_est is in pages. Convert it to kilobytes (assumes
	 * PAGE_SIZE is >= 1K)
	 */
	physmem_est = physmem_est * (PAGE_SIZE / 1024);

2017-06-17 22:24:19 +00:00
	maxbcachebuf_adjust();
2001-08-22 04:07:27 +00:00
	/*
	 * The nominal buffer size (and minimum KVA allocation) is BKVASIZE.
	 * For the first 64MB of ram nominally allocate sufficient buffers to
	 * cover 1/4 of our ram. Beyond the first 64MB allocate additional
2007-09-26 11:22:23 +00:00
	 * buffers to cover 1/10 of our ram over 64MB. When auto-sizing
2001-08-22 04:07:27 +00:00
	 * the buffer cache we limit the eventual kva reservation to
	 * maxbcache bytes.
	 *
	 * factor represents the 1/4 x ram conversion.
	 */
	if (nbuf == 0) {
2001-12-08 20:37:08 +00:00
		int factor = 4 * BKVASIZE / 1024;
2001-08-22 04:07:27 +00:00

		nbuf = 50;
2001-12-08 20:37:08 +00:00
		if (physmem_est > 4096)
			nbuf += min((physmem_est - 4096) / factor,
			    65536 / factor);
		if (physmem_est > 65536)
2013-06-03 04:16:48 +00:00
			nbuf += min((physmem_est - 65536) * 2 / (factor * 5),
			    32 * 1024 * 1024 / (factor * 5));
2001-08-22 04:07:27 +00:00

		if (maxbcache && nbuf > maxbcache / BKVASIZE)
			nbuf = maxbcache / BKVASIZE;
2009-03-09 19:35:20 +00:00
		tuned_nbuf = 1;
	} else
		tuned_nbuf = 0;

	/* XXX Avoid unsigned long overflows later on with maxbufspace. */
	maxbuf = (LONG_MAX / 3) / BKVASIZE;
	if (nbuf > maxbuf) {
		if (!tuned_nbuf)
			printf("Warning: nbufs lowered from %d to %ld\n", nbuf,
			    maxbuf);
		nbuf = maxbuf;
2001-08-22 04:07:27 +00:00
	}
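The auto-sizing arithmetic above is easier to follow with concrete numbers. The sketch below repeats the calculation in user space under stated assumptions: `SKETCH_BKVASIZE` is assumed to be 16384 bytes (the real BKVASIZE comes from the kernel headers and may differ), `autosize_nbuf()` and `lmin()` are hypothetical helpers, and everything is computed in `long`, whereas the kernel keeps nbuf in an `int`.

```c
#include <limits.h>

#define SKETCH_BKVASIZE 16384L	/* assumed nominal buffer size, in bytes */

static long
lmin(long a, long b)
{
	return (a < b ? a : b);
}

/*
 * physmem_kb: physical memory estimate in kilobytes.
 * maxbcache:  cap on buffer KVA in bytes; 0 means no cap.
 * Returns the number of buffers the sizing rule would pick.
 */
static long
autosize_nbuf(long physmem_kb, long maxbcache)
{
	/* KB of RAM per buffer so that buffers cover 1/4 of that RAM. */
	long factor = 4 * SKETCH_BKVASIZE / 1024;
	long nbuf = 50;
	long maxbuf;

	/* First 64MB: enough buffers to cover 1/4 of RAM. */
	if (physmem_kb > 4096)
		nbuf += lmin((physmem_kb - 4096) / factor, 65536 / factor);
	/* Beyond 64MB: cover 1/10 of the excess, up to a fixed bound. */
	if (physmem_kb > 65536)
		nbuf += lmin((physmem_kb - 65536) * 2 / (factor * 5),
		    32L * 1024 * 1024 / (factor * 5));
	/* Respect the KVA cap when one is set. */
	if (maxbcache != 0 && nbuf > maxbcache / SKETCH_BKVASIZE)
		nbuf = maxbcache / SKETCH_BKVASIZE;

	/* Same overflow guard as the maxbuf check. */
	maxbuf = (LONG_MAX / 3) / SKETCH_BKVASIZE;
	if (nbuf > maxbuf)
		nbuf = maxbuf;
	return (nbuf);
}
```

With the assumed buffer size, factor is 64, so 1GB of RAM (1048576KB) and no cap gives 50 + 1024 + 6144 = 7218 buffers, and a 16MB maxbcache would cut that to 1024.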
2013-03-19 14:13:12 +00:00
	/*
2014-01-21 03:24:52 +00:00
	 * Ideal allocation size for the transient bio submap is 10%
2013-03-19 14:13:12 +00:00
	 * of the maximal space buffer map. This roughly corresponds
	 * to the amount of the buffer mapped for typical UFS load.
	 *
	 * Clip the buffer map to reserve space for the transient
2013-03-27 10:56:15 +00:00
	 * BIOs, if its extent is bigger than 90% (80% on i386) of the
	 * maximum buffer map extent on the platform.
2013-03-19 14:13:12 +00:00
	 *
	 * The fall-back to maxbuf when maxbcache is unset avoids
	 * trimming the buffer KVA on architectures with ample
	 * KVA space.
	 */
2013-03-21 07:28:15 +00:00
	if (bio_transient_maxcnt == 0 && unmapped_buf_allowed) {
Implement the concept of unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on I/O-intensive workloads.
The unmapped buffer should be explicitly requested by the consumer
with the GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation is
performed at all. With the GB_KVAALLOC flag, the consumer may request an
unmapped buffer which does have a KVA reserve, to manually map it
without recursing into the buffer cache and blocking.
When a mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, an offset and a
length instead of the data pointer. The provider which processes the bio
should explicitly declare its readiness to accept unmapped bios;
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it can be manually tuned with the kern.bio_transient_maxcnt
tunable, in units of transient mappings. Eventually, the
bio_transient_map could be removed once all geom classes and drivers
can accept unmapped I/O requests.
Unmapped support can be turned off with the vfs.unmapped_buf_allowed
tunable; disabling it makes buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the
maxbufspace limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
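The hand-off this message describes, where an unmapped bio carries a page array, an offset and a length, and is either passed through or transiently mapped depending on what the provider accepts, can be sketched in userland C. This is only an illustration under stated assumptions: every name prefixed sk_ or SK_ is hypothetical and is not the GEOM API, and the function only returns the decision instead of performing any mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: the request carries pages, not a data pointer. */
#define SK_BIO_UNMAPPED 0x1

/* Hypothetical stand-in for an I/O request. */
struct sk_bio {
	int     flags;
	char   *data;         /* valid only for a mapped request */
	void  **pages;        /* page array, valid only when unmapped */
	size_t  page_offset;  /* offset into the first page */
	size_t  length;       /* transfer length in bytes */
};

/* Hypothetical stand-in for a provider and its declared capability. */
struct sk_provider {
	bool accepts_unmapped;
};

enum sk_action { SK_PASS_THROUGH, SK_TRANSIENT_MAP };

/* Decide what must happen before the request reaches the provider. */
static enum sk_action
sk_dispatch(const struct sk_bio *bp, const struct sk_provider *pp)
{
	assert(bp != NULL && pp != NULL);
	if ((bp->flags & SK_BIO_UNMAPPED) == 0)
		return (SK_PASS_THROUGH);  /* already has a data pointer */
	if (pp->accepts_unmapped)
		return (SK_PASS_THROUGH);  /* provider works on pages directly */
	return (SK_TRANSIENT_MAP);         /* map the pages temporarily first */
}
```

Only the third case costs a temporary mapping, which is why the message expects the transient map to become unnecessary once every provider accepts unmapped requests.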
2013-03-19 14:13:12 +00:00
		maxbuf_sz = maxbcache != 0 ? maxbcache : maxbuf * BKVASIZE;
		buf_sz = (long)nbuf * BKVASIZE;
2013-03-27 10:56:15 +00:00
		if (buf_sz < maxbuf_sz / TRANSIENT_DENOM *
		    (TRANSIENT_DENOM - 1)) {
2013-03-19 14:13:12 +00:00
			/*
			 * There is more KVA than memory. Do not
			 * adjust buffer map size, and assign the rest
			 * of maxbuf to transient map.
			 */
			biotmap_sz = maxbuf_sz - buf_sz;
		} else {
			/*
			 * Buffer map spans all KVA we could afford on
2013-03-27 10:56:15 +00:00
			 * this platform. Give 10% (20% on i386) of
			 * the buffer map to the transient bio map.
2013-03-19 14:13:12 +00:00
			 */
2013-03-27 10:56:15 +00:00
			biotmap_sz = buf_sz / TRANSIENT_DENOM;
2013-03-19 14:13:12 +00:00
			buf_sz -= biotmap_sz;
		}
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
Make the b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and size b_pages[] for pbufs to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, and
dev/siis were either submitted by, or based on changes by, mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
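The memory saving from the flexible b_pages[] array can be illustrated with a small userland sketch. It assumes a 4096-byte page and uses stand-in names (sketch_buf, sketch_page_t, sketch_atop, sketch_nbufp) that are not the kernel's definitions; the point is only that each header's footprint becomes the struct size plus one page pointer per page of the chosen maximum I/O size, and that headers are then located by that stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL                   /* assumed page size */
#define sketch_atop(x) ((x) / SKETCH_PAGE_SIZE)   /* bytes -> pages */

typedef void *sketch_page_t;                      /* stand-in for vm_page_t */

struct sketch_buf {
	int           b_npages;
	sketch_page_t b_pages[];                  /* flexible array member */
};

/* Bytes occupied by one header plus its page array. */
static size_t
sketch_buf_stride(size_t maxio)
{
	return (sizeof(struct sketch_buf) +
	    sizeof(sketch_page_t) * sketch_atop(maxio));
}

/* Allocate a zeroed arena holding nbuf packed headers. */
static char *
sketch_arena_alloc(size_t nbuf, size_t maxio)
{
	assert(nbuf > 0 && maxio >= SKETCH_PAGE_SIZE);
	return (calloc(nbuf, sketch_buf_stride(maxio)));
}

/* Address of the i-th header in the arena (plays the role of nbufp()). */
static struct sketch_buf *
sketch_nbufp(char *arena, size_t maxio, size_t i)
{
	return ((struct sketch_buf *)(void *)
	    (arena + i * sketch_buf_stride(maxio)));
}
```

Under these assumptions a 64 KiB maximum needs 16 page pointers per header, while sizing every header for a 1 MiB maximum would need 256, which is the waste the change avoids for buffer cache buffers.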
2020-11-28 12:12:51 +00:00
		if (biotmap_sz / INT_MAX > maxphys)
2013-03-19 14:13:12 +00:00
			bio_transient_maxcnt = INT_MAX;
		else
2020-11-28 12:12:51 +00:00
			bio_transient_maxcnt = biotmap_sz / maxphys;
2013-03-19 14:13:12 +00:00
		/*
2016-04-29 21:54:28 +00:00
		 * Artificially limit to 1024 simultaneous in-flight I/Os
2013-03-19 14:13:12 +00:00
		 * using the transient mapping.
		 */
		if (bio_transient_maxcnt > 1024)
			bio_transient_maxcnt = 1024;
		if (tuned_nbuf)
			nbuf = buf_sz / BKVASIZE;
	}

2019-01-16 20:20:38 +00:00
	if (nswbuf == 0) {
		nswbuf = min(nbuf / 4, 256);
		if (nswbuf < NSWBUF_MIN)
			nswbuf = NSWBUF_MIN;
	}

2001-08-22 04:07:27 +00:00
	/*
	 * Reserve space for the buffer cache buffers
	 */
2020-11-28 12:12:51 +00:00
	buf = (char *)v;
	v = (caddr_t)buf + (sizeof(struct buf) + sizeof(vm_page_t) *
	    atop(maxbcachebuf)) * nbuf;
2001-08-22 04:07:27 +00:00

2020-11-28 12:12:51 +00:00
	return (v);
1999-07-08 06:06:00 +00:00
}

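The transient-map sizing done by the routine that ends above can be tried out in userland with the following sketch. It is a simplified model, not the kernel code: the 16384-byte BKVASIZE is an assumed value, the denominator of 10 follows the "10%" in the comment (the i386 20% case is not modelled), and the INT_MAX guard is folded into the 1024 cap by doing the division in long. All names are hypothetical.

```c
#include <assert.h>

#define SKETCH_BKVASIZE        16384L  /* assumed per-buffer KVA unit */
#define SKETCH_TRANSIENT_DENOM 10L     /* 10% share, per the comment */
#define SKETCH_MAXCNT_CAP      1024L   /* in-flight transient mappings */

struct transient_sizing {
	long buf_sz;      /* bytes left for the buffer map */
	long biotmap_sz;  /* bytes given to the transient map */
	int  maxcnt;      /* number of transient mappings */
};

static struct transient_sizing
size_transient_map(long nbuf, long maxbuf_sz, long maxphys)
{
	struct transient_sizing r;
	long cnt;

	assert(nbuf > 0 && maxbuf_sz > 0 && maxphys > 0);
	r.buf_sz = nbuf * SKETCH_BKVASIZE;
	if (r.buf_sz < maxbuf_sz / SKETCH_TRANSIENT_DENOM *
	    (SKETCH_TRANSIENT_DENOM - 1)) {
		/* KVA is plentiful: leave the buffer map untouched. */
		r.biotmap_sz = maxbuf_sz - r.buf_sz;
	} else {
		/* KVA is tight: carve the share out of the buffer map. */
		r.biotmap_sz = r.buf_sz / SKETCH_TRANSIENT_DENOM;
		r.buf_sz -= r.biotmap_sz;
	}
	/* One transient mapping covers up to maxphys bytes. */
	cnt = r.biotmap_sz / maxphys;
	if (cnt > SKETCH_MAXCNT_CAP)
		cnt = SKETCH_MAXCNT_CAP;
	r.maxcnt = (int)cnt;
	return (r);
}
```

With nbuf = 1000, a 16,384,000-byte budget and a 1 MiB maxphys, the tight branch applies: 1,638,400 bytes move to the transient map, 14,745,600 bytes stay with the buffer map, and only one full-sized transient mapping fits.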
2002-03-05 15:38:49 +00:00
/* Initialize the buffer subsystem. Called before use of any buffers. */
1994-09-25 19:34:02 +00:00
void
1999-07-08 06:06:00 +00:00
bufinit(void)
1994-05-24 10:09:53 +00:00
{
1994-05-25 09:21:21 +00:00
	struct buf *bp;
	int i;

2017-06-17 22:24:19 +00:00
	KASSERT(maxbcachebuf >= MAXBSIZE,
	    ("maxbcachebuf (%d) must be >= MAXBSIZE (%d)\n", maxbcachebuf,
	    MAXBSIZE));
2018-02-20 00:06:07 +00:00
	bq_init(&bqempty, QUEUE_EMPTY, -1, "bufq empty lock");
2003-02-09 09:47:31 +00:00
	mtx_init(&rbreqlock, "runningbufspace lock", NULL, MTX_DEF);
	mtx_init(&bdlock, "buffer daemon lock", NULL, MTX_DEF);
2013-06-05 23:53:00 +00:00
	mtx_init(&bdirtylock, "dirty buf lock", NULL, MTX_DEF);
1994-05-25 09:21:21 +00:00

2020-11-28 12:12:51 +00:00
	unmapped_buf = (caddr_t)kva_alloc(maxphys);
2015-07-23 19:13:41 +00:00

1994-05-25 09:21:21 +00:00
	/* finally, initialize each buffer header and stick on empty q */
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
	for (i = 0; i < nbuf; i++) {
2020-11-28 12:12:51 +00:00
		bp = nbufp(i);
		bzero(bp, sizeof(*bp) + sizeof(vm_page_t) * atop(maxbcachebuf));
2015-10-14 02:10:07 +00:00
		bp->b_flags = B_INVAL;
1994-05-24 10:09:53 +00:00
		bp->b_rcred = NOCRED;
		bp->b_wcred = NOCRED;
2018-02-20 00:06:07 +00:00
		bp->b_qindex = QUEUE_NONE;
		bp->b_domain = -1;
2018-02-25 00:35:21 +00:00
		bp->b_subqueue = mp_maxid + 1;
1998-10-31 15:31:29 +00:00
		bp->b_xflags = 0;
2015-07-23 19:13:41 +00:00
		bp->b_data = bp->b_kvabase = unmapped_buf;
1998-03-08 09:59:44 +00:00
		LIST_INIT(&bp->b_dep);
1999-06-26 02:47:16 +00:00
		BUF_LOCKINIT(bp);
2018-02-20 00:06:07 +00:00
		bq_insert(&bqempty, bp, false);
1994-05-24 10:09:53 +00:00
	}
1999-03-12 02:24:58 +00:00
|
|
|
|
|
|
|
/*
|
2000-03-27 21:29:33 +00:00
|
|
|
* maxbufspace is the absolute maximum amount of buffer space we are
|
|
|
|
* allowed to reserve in KVM and in real terms. The absolute maximum
|
2015-10-14 02:10:07 +00:00
|
|
|
* is nominally used by metadata. hibufspace is the nominal maximum
|
|
|
|
* used by most other requests. The differential is required to
|
|
|
|
* ensure that metadata deadlocks don't occur.
|
2000-03-27 21:29:33 +00:00
|
|
|
*
|
|
|
|
* maxbufspace is based on BKVASIZE. Allocating buffers larger than
|
|
|
|
* this may result in KVM fragmentation which is not handled optimally
|
2015-10-14 02:10:07 +00:00
|
|
|
* by the system. XXX This is less true with vmem. We could use
|
|
|
|
* PAGE_SIZE.
|
1999-03-12 02:24:58 +00:00
|
|
|
*/
|
Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for an nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.
MFC after: 1 month
2009-03-09 19:35:20 +00:00
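The "~ 44000" figure in the message above can be reproduced with a small calculation. This is a sketch under stated assumptions: BKVASIZE is taken to be 16384, hibufspace is approximated as nbuf * BKVASIZE, and the overflowing expression is assumed to be a 32-bit evaluation of `3 * hibufspace`. The `ex_` names are hypothetical, and the arithmetic is done in 64 bits so the sketch itself never overflows.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BKVASIZE 16384	/* assumed value, for illustration only */

/* True if v is representable in a 32-bit signed int. */
static int
ex_fits_int32(int64_t v)
{
	return (v >= INT32_MIN && v <= INT32_MAX);
}

/* 3 * hibufspace in 64 bits, approximating hibufspace as nbuf * BKVASIZE. */
static int64_t
ex_three_hibufspace(int64_t nbuf)
{
	assert(nbuf >= 0);
	return (3 * nbuf * EX_BKVASIZE);
}

/* Smallest nbuf for which the product no longer fits in 32 bits. */
static int64_t
ex_first_overflowing_nbuf(void)
{
	int64_t nbuf = 1;

	while (ex_fits_int32(ex_three_hibufspace(nbuf)))
		nbuf++;
	return (nbuf);
}
```

Under these assumptions the first overflowing value is 43691, consistent with the rough figure quoted in the commit message.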
|
|
|
maxbufspace = (long)nbuf * BKVASIZE;
|
2017-06-17 22:24:19 +00:00
|
|
|
hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - maxbcachebuf * 10);
|
2015-10-14 02:10:07 +00:00
|
|
|
lobufspace = (hibufspace / 20) * 19; /* 95% */
|
|
|
|
bufspacethresh = lobufspace + (hibufspace - lobufspace) / 2;
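As a worked example of the four watermarks computed above, the sketch below repeats the same arithmetic as a standalone function. The `ex_` names are hypothetical, and the sample inputs used in the note (BKVASIZE of 16384, maxbcachebuf of 65536) are assumptions, not values taken from this listing.

```c
#include <assert.h>
#include <stdint.h>

struct ex_bufspace {
	int64_t max;	/* absolute ceiling (maxbufspace) */
	int64_t hi;	/* nominal ceiling for most requests (hibufspace) */
	int64_t lo;	/* 95% of hi (lobufspace) */
	int64_t thresh;	/* midpoint between lo and hi (bufspacethresh) */
};

static int64_t
ex_max64(int64_t a, int64_t b)
{
	return (a > b ? a : b);
}

/* Same formulas as the listing, with the inputs passed in explicitly. */
static struct ex_bufspace
ex_bufspace_limits(int64_t nbuf, int64_t bkvasize, int64_t maxbcachebuf)
{
	struct ex_bufspace l;

	assert(nbuf > 0 && bkvasize > 0 && maxbcachebuf > 0);
	l.max = nbuf * bkvasize;
	l.hi = ex_max64(3 * l.max / 4, l.max - maxbcachebuf * 10);
	l.lo = (l.hi / 20) * 19;
	l.thresh = l.lo + (l.hi - l.lo) / 2;
	return (l);
}
```

For 1000 buffers with the assumed sizes, the ceiling is 16,384,000 bytes, hi is that ceiling minus ten maximum-sized buffers (15,728,640), lo is 14,942,208, and the threshold sits halfway between lo and hi at 15,335,424.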
|
2000-03-27 21:29:33 +00:00
|
|
|
|
2010-07-23 12:30:29 +00:00
|
|
|
/*
|
2010-10-25 14:05:25 +00:00
|
|
|
* Note: The 16 MiB upper limit for hirunningspace was chosen
|
2010-08-09 22:22:46 +00:00
|
|
|
* arbitrarily and may need further tuning. It corresponds to
|
|
|
|
* 128 outstanding write IO requests (if IO size is 128 KiB),
|
2010-08-09 23:32:37 +00:00
|
|
|
* which fits with many RAID controllers' tagged queuing limits.
|
2010-10-25 14:05:25 +00:00
|
|
|
* The lower 1 MiB limit is the historical upper limit for
|
2010-08-09 22:22:46 +00:00
|
|
|
* hirunningspace.
|
2010-07-23 12:30:29 +00:00
|
|
|
*/
|
2017-06-17 22:24:19 +00:00
|
|
|
hirunningspace = lmax(lmin(roundup(hibufspace / 64, maxbcachebuf),
|
2010-07-20 13:59:51 +00:00
|
|
|
16 * 1024 * 1024), 1024 * 1024);
|
2017-06-17 22:24:19 +00:00
|
|
|
lorunningspace = roundup((hirunningspace * 2) / 3, maxbcachebuf);
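The clamp above can be hard to read at a glance, so here is a small sketch of the same computation: take 1/64 of hibufspace, round it up to a multiple of maxbcachebuf, then force the result into the range 1 MiB to 16 MiB. The `ex_` names are hypothetical and the 65536 value used for maxbcachebuf in the note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

struct ex_running {
	int64_t hi;	/* hirunningspace */
	int64_t lo;	/* lorunningspace */
};

/* Round x up to the next multiple of y. */
static int64_t
ex_roundup(int64_t x, int64_t y)
{
	assert(y > 0);
	return (((x + y - 1) / y) * y);
}

static int64_t
ex_min64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

static int64_t
ex_max64(int64_t a, int64_t b)
{
	return (a > b ? a : b);
}

/* hi is clamped to [1 MiB, 16 MiB]; lo is 2/3 of hi, rounded up. */
static struct ex_running
ex_running_limits(int64_t hibufspace, int64_t maxbcachebuf)
{
	struct ex_running r;

	r.hi = ex_max64(ex_min64(ex_roundup(hibufspace / 64, maxbcachebuf),
	    16 * 1024 * 1024), 1024 * 1024);
	r.lo = ex_roundup((r.hi * 2) / 3, maxbcachebuf);
	return (r);
}
```

With a small hibufspace of 15,728,640 bytes the lower bound wins and hi is 1 MiB. With 4 GiB the upper bound wins and hi is 16 MiB. In both cases lo lands on a multiple of the assumed 64 KiB maxbcachebuf.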
|
2000-12-26 19:41:38 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* Limit the amount of malloc memory since it is wired permanently into
|
|
|
|
* the kernel space. Even though this is accounted for in the buffer
|
|
|
|
* allocation, we don't want the malloced region to grow uncontrolled.
|
|
|
|
* The malloc scheme improves memory utilization significantly on
|
|
|
|
* average (small) directories.
|
|
|
|
*/
|
1999-03-12 02:24:58 +00:00
|
|
|
maxbufmallocspace = hibufspace / 20;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
2016-04-29 21:54:28 +00:00
|
|
|
* Reduce the chance of a deadlock occurring by limiting the number
|
2015-10-14 02:10:07 +00:00
|
|
|
* of delayed-write dirty buffers we allow to stack up.
|
|
|
|
*/
|
1999-07-08 06:06:00 +00:00
|
|
|
hidirtybuffers = nbuf / 4 + 20;
|
2003-02-25 06:44:42 +00:00
|
|
|
dirtybufthresh = hidirtybuffers * 9 / 10;
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* To support extreme low-memory systems, make sure hidirtybuffers
|
|
|
|
* cannot eat up all available buffer space. This occurs when our
|
|
|
|
* minimum cannot be met. We try to size hidirtybuffers to 3/4 our
|
|
|
|
* buffer space assuming BKVASIZE'd buffers.
|
|
|
|
*/
|
2009-03-09 19:35:20 +00:00
|
|
|
while ((long)hidirtybuffers * BKVASIZE > 3 * hibufspace / 4) {
|
1999-10-24 03:27:28 +00:00
|
|
|
hidirtybuffers >>= 1;
|
|
|
|
}
|
2000-12-26 19:41:38 +00:00
|
|
|
lodirtybuffers = hidirtybuffers / 2;
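The halving loop above only matters on very small configurations. The sketch below mirrors the order of operations in the listing (threshold first, then the loop, then the low mark) so the effect can be checked by hand. The `ex_` names are hypothetical and the BKVASIZE of 16384 used in the note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

struct ex_dirty {
	int hi;		/* hidirtybuffers */
	int lo;		/* lodirtybuffers */
	int thresh;	/* dirtybufthresh, computed before hi is reduced */
};

static struct ex_dirty
ex_dirty_limits(int nbuf, int64_t hibufspace, int64_t bkvasize)
{
	struct ex_dirty d;

	assert(nbuf >= 0 && hibufspace >= 0 && bkvasize > 0);
	d.hi = nbuf / 4 + 20;
	d.thresh = d.hi * 9 / 10;
	/* Halve hi until the dirty buffers fit in 3/4 of hibufspace. */
	while ((int64_t)d.hi * bkvasize > 3 * hibufspace / 4)
		d.hi >>= 1;
	d.lo = d.hi / 2;
	return (d);
}
```

With 1000 buffers and a hibufspace of 15,728,640 bytes the loop never runs and hi stays at 270. With 40 buffers and a hibufspace of 491,520 bytes the initial hi of 30 is halved once to 15, while the threshold keeps the value 27 that was derived before the loop.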
|
1999-10-24 03:27:28 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* lofreebuffers should be sufficient to avoid stalling waiting on
|
|
|
|
* buf headers under heavy utilization. The bufs in per-cpu caches
|
|
|
|
* are counted as free but will be unavailable to threads executing
|
|
|
|
* on other cpus.
|
|
|
|
*
|
|
|
|
* hifreebuffers is the free target for the bufspace daemon. This
|
|
|
|
* should be set appropriately to limit work per-iteration.
|
|
|
|
*/
|
|
|
|
lofreebuffers = MIN((nbuf / 25) + (20 * mp_ncpus), 128 * mp_ncpus);
|
|
|
|
hifreebuffers = (3 * lofreebuffers) / 2;
|
1997-06-15 17:56:53 +00:00
|
|
|
numfreebuffers = nbuf;
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/* Set up the kva and free list allocators. */
|
|
|
|
vmem_set_reclaim(buffer_arena, bufkva_reclaim);
|
2020-11-28 12:12:51 +00:00
|
|
|
buf_zone = uma_zcache_create("buf free cache",
|
|
|
|
sizeof(struct buf) + sizeof(vm_page_t) * atop(maxbcachebuf),
|
2015-10-14 02:10:07 +00:00
|
|
|
NULL, NULL, NULL, NULL, buf_import, buf_release, NULL, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Size the clean queue according to the amount of buffer space.
|
|
|
|
* One queue per-256mb up to the max. More queues gives better
|
|
|
|
* concurrency but less accurate LRU.
|
|
|
|
*/
|
2018-03-17 18:14:49 +00:00
|
|
|
buf_domains = MIN(howmany(maxbufspace, 256*1024*1024), BUF_DOMAINS);
|
|
|
|
for (i = 0 ; i < buf_domains; i++) {
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufdomain *bd;
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = &bdomain[i];
|
2018-02-20 00:06:07 +00:00
|
|
|
bd_init(bd);
|
2018-03-17 18:14:49 +00:00
|
|
|
bd->bd_freebuffers = nbuf / buf_domains;
|
|
|
|
bd->bd_hifreebuffers = hifreebuffers / buf_domains;
|
|
|
|
bd->bd_lofreebuffers = lofreebuffers / buf_domains;
|
2018-02-20 00:06:07 +00:00
|
|
|
bd->bd_bufspace = 0;
|
2018-03-17 18:14:49 +00:00
|
|
|
bd->bd_maxbufspace = maxbufspace / buf_domains;
|
|
|
|
bd->bd_hibufspace = hibufspace / buf_domains;
|
|
|
|
bd->bd_lobufspace = lobufspace / buf_domains;
|
|
|
|
bd->bd_bufspacethresh = bufspacethresh / buf_domains;
|
|
|
|
bd->bd_numdirtybuffers = 0;
|
|
|
|
bd->bd_hidirtybuffers = hidirtybuffers / buf_domains;
|
|
|
|
bd->bd_lodirtybuffers = lodirtybuffers / buf_domains;
|
|
|
|
bd->bd_dirtybufthresh = dirtybufthresh / buf_domains;
|
2018-02-20 00:06:07 +00:00
|
|
|
/* Don't allow more than 2% of bufs in the per-cpu caches. */
|
2018-03-17 18:14:49 +00:00
|
|
|
bd->bd_lim = nbuf / buf_domains / 50 / mp_ncpus;
|
2018-02-20 00:06:07 +00:00
|
|
|
}
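The loop above divides each global limit evenly among the buffer domains. A minimal sketch of the two less obvious quantities follows: the number of domains (one per 256 MiB of buffer space, capped) and the per-CPU cache limit (2% of a domain's buffers, shared across CPUs). The `ex_` names are hypothetical, and the cap is passed as a parameter because the value of BUF_DOMAINS is not shown in this listing.

```c
#include <assert.h>
#include <stdint.h>

#define EX_DOMAIN_BYTES (256LL * 1024 * 1024)	/* one queue per 256 MiB */

/* Ceiling division, in the spirit of the kernel's howmany(). */
static int64_t
ex_howmany(int64_t x, int64_t y)
{
	assert(y > 0);
	return ((x + y - 1) / y);
}

/* Number of domains: one per 256 MiB of buffer space, capped. */
static int
ex_buf_domains(int64_t maxbufspace, int max_domains)
{
	int64_t n;

	n = ex_howmany(maxbufspace, EX_DOMAIN_BYTES);
	return ((int)(n < max_domains ? n : max_domains));
}

/* Per-CPU cache limit: 1/50 (2%) of a domain's buffers, split per CPU. */
static int
ex_percpu_lim(int nbuf, int domains, int ncpus)
{
	assert(domains > 0 && ncpus > 0);
	return (nbuf / domains / 50 / ncpus);
}
```

For example, 1 GiB of buffer space with a cap of 8 gives 4 domains, and 65536 buffers spread over those 4 domains on an 8-CPU machine gives a per-CPU limit of 40 buffers. Because every step is an integer division, small configurations can round the limit down to zero.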
|
|
|
|
getnewbufcalls = counter_u64_alloc(M_WAITOK);
|
|
|
|
getnewbufrestarts = counter_u64_alloc(M_WAITOK);
|
|
|
|
mappingrestarts = counter_u64_alloc(M_WAITOK);
|
|
|
|
numbufallocfails = counter_u64_alloc(M_WAITOK);
|
|
|
|
notbufdflushes = counter_u64_alloc(M_WAITOK);
|
|
|
|
buffreekvacnt = counter_u64_alloc(M_WAITOK);
|
|
|
|
bufdefragcnt = counter_u64_alloc(M_WAITOK);
|
|
|
|
bufkvaspace = counter_u64_alloc(M_WAITOK);
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For an unmapped buffer, no KVA reservation is
performed at all. The consumer might request an unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
Unmapped bios carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bios,
otherwise the g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
static inline void
|
|
|
|
vfs_buf_check_mapped(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(bp->b_kvabase != unmapped_buf,
|
|
|
|
("mapped buf: b_kvabase was not updated %p", bp));
|
|
|
|
KASSERT(bp->b_data != unmapped_buf,
|
|
|
|
("mapped buf: b_data was not updated %p", bp));
|
2015-07-30 15:43:26 +00:00
|
|
|
KASSERT(bp->b_data < unmapped_buf || bp->b_data >= unmapped_buf +
|
2020-11-28 12:12:51 +00:00
|
|
|
maxphys, ("b_data + b_offset unmapped %p", bp));
|
2013-03-19 14:13:12 +00:00
|
|
|
}
|
2015-07-29 09:57:34 +00:00
|
|
|
|
|
|
|
static inline void
|
|
|
|
vfs_buf_check_unmapped(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(bp->b_data == unmapped_buf,
|
|
|
|
("unmapped buf: corrupted b_data %p", bp));
|
|
|
|
}
|
|
|
|
|
|
|
|
#define BUF_CHECK_MAPPED(bp) vfs_buf_check_mapped(bp)
|
|
|
|
#define BUF_CHECK_UNMAPPED(bp) vfs_buf_check_unmapped(bp)
|
|
|
|
#else
|
|
|
|
#define BUF_CHECK_MAPPED(bp) do {} while (0)
|
|
|
|
#define BUF_CHECK_UNMAPPED(bp) do {} while (0)
|
|
|
|
#endif
|
|
|
|
|
2015-07-29 02:26:57 +00:00
|
|
|
static int
|
|
|
|
isbufbusy(struct buf *bp)
|
|
|
|
{
|
2016-04-29 16:32:28 +00:00
|
|
|
if (((bp->b_flags & B_INVAL) == 0 && BUF_ISLOCKED(bp)) ||
|
2015-07-29 02:26:57 +00:00
|
|
|
((bp->b_flags & (B_DELWRI | B_INVAL)) == B_DELWRI))
|
|
|
|
return (1);
|
|
|
|
return (0);
|
|
|
|
}
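The predicate in isbufbusy() combines two flags and the lock state, which is easier to follow as a small truth table. The sketch below restates the same boolean expression over plain integers. The flag values and `ex_`/`EX_` names are hypothetical placeholders, not the kernel's definitions, and the lock state is passed in as a parameter instead of being read from the buffer.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define EX_B_DELWRI 0x1u	/* delayed write pending */
#define EX_B_INVAL  0x2u	/* buffer contents are invalid */

/*
 * A buffer counts as busy if it is valid and locked, or if it holds a
 * delayed write and has not been invalidated.
 */
static int
ex_isbufbusy(unsigned int flags, int locked)
{
	if (((flags & EX_B_INVAL) == 0 && locked) ||
	    ((flags & (EX_B_DELWRI | EX_B_INVAL)) == EX_B_DELWRI))
		return (1);
	return (0);
}
```

In short, an invalidated buffer is never counted as busy, whether or not it is locked or marked for delayed write, and a valid buffer is busy when either condition holds.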
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Shut down the system cleanly to prepare for reboot, halt, or power off.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
bufshutdown(int show_busybufs)
|
|
|
|
{
|
|
|
|
static int first_buf_printf = 1;
|
|
|
|
struct buf *bp;
|
2020-11-28 12:12:51 +00:00
|
|
|
int i, iter, nbusy, pbusy;
|
2015-07-29 02:26:57 +00:00
|
|
|
#ifndef PREEMPTION
|
|
|
|
int subiter;
|
|
|
|
#endif
|
|
|
|
|
2020-07-10 09:01:36 +00:00
|
|
|
/*
|
2015-07-29 02:26:57 +00:00
|
|
|
* Sync filesystems for shutdown
|
|
|
|
*/
|
|
|
|
wdog_kern_pat(WD_LASTVAL);
|
2019-12-12 18:45:31 +00:00
|
|
|
kern_sync(curthread);
|
2015-07-29 02:26:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* With soft updates, some buffers that are
|
|
|
|
* written will be remarked as dirty until other
|
|
|
|
* buffers are written.
|
|
|
|
*/
|
|
|
|
for (iter = pbusy = 0; iter < 20; iter++) {
|
|
|
|
nbusy = 0;
|
2020-11-28 12:12:51 +00:00
|
|
|
for (i = nbuf - 1; i >= 0; i--) {
|
|
|
|
bp = nbufp(i);
|
2015-07-29 02:26:57 +00:00
|
|
|
if (isbufbusy(bp))
|
|
|
|
nbusy++;
|
2020-11-28 12:12:51 +00:00
|
|
|
}
|
2015-07-29 02:26:57 +00:00
|
|
|
if (nbusy == 0) {
|
|
|
|
if (first_buf_printf)
|
|
|
|
printf("All buffers synced.");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (first_buf_printf) {
|
|
|
|
printf("Syncing disks, buffers remaining... ");
|
|
|
|
first_buf_printf = 0;
|
|
|
|
}
|
|
|
|
printf("%d ", nbusy);
|
|
|
|
if (nbusy < pbusy)
|
|
|
|
iter = 0;
|
|
|
|
pbusy = nbusy;
|
|
|
|
|
|
|
|
wdog_kern_pat(WD_LASTVAL);
|
2019-12-12 18:45:31 +00:00
|
|
|
kern_sync(curthread);
|
2015-07-29 02:26:57 +00:00
|
|
|
|
|
|
|
#ifdef PREEMPTION
|
|
|
|
/*
|
2018-03-21 14:46:59 +00:00
|
|
|
* Spin for a while to allow interrupt threads to run.
|
2015-07-29 02:26:57 +00:00
|
|
|
*/
|
|
|
|
DELAY(50000 * iter);
|
|
|
|
#else
|
|
|
|
/*
|
2018-03-21 14:46:59 +00:00
|
|
|
* Context switch several times to allow interrupt
|
|
|
|
* threads to run.
|
2015-07-29 02:26:57 +00:00
|
|
|
*/
|
|
|
|
for (subiter = 0; subiter < 50 * iter; subiter++) {
|
|
|
|
thread_lock(curthread);
|
2019-12-15 21:26:50 +00:00
|
|
|
mi_switch(SW_VOL);
|
2015-07-29 02:26:57 +00:00
|
|
|
DELAY(1000);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
printf("\n");
|
|
|
|
/*
|
|
|
|
* Count only busy local buffers to prevent forcing
|
|
|
|
* a fsck if we're just a client of a wedged NFS server
|
|
|
|
*/
|
|
|
|
nbusy = 0;
|
2020-11-28 12:12:51 +00:00
|
|
|
for (i = nbuf - 1; i >= 0; i--) {
|
|
|
|
bp = nbufp(i);
|
2015-07-29 02:26:57 +00:00
|
|
|
if (isbufbusy(bp)) {
|
|
|
|
#if 0
|
|
|
|
/* XXX: This is bogus. We should probably have a BO_REMOTE flag instead */
|
|
|
|
if (bp->b_dev == NULL) {
|
|
|
|
TAILQ_REMOVE(&mountlist,
|
|
|
|
bp->b_vp->v_mount, mnt_list);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
nbusy++;
|
|
|
|
if (show_busybufs > 0) {
|
|
|
|
printf(
|
|
|
|
"%d: buf:%p, vnode:%p, flags:%0x, blkno:%jd, lblkno:%jd, buflock:",
|
|
|
|
nbusy, bp, bp->b_vp, bp->b_flags,
|
|
|
|
(intmax_t)bp->b_blkno,
|
|
|
|
(intmax_t)bp->b_lblkno);
|
|
|
|
BUF_LOCKPRINTINFO(bp);
|
|
|
|
if (show_busybufs > 1)
|
|
|
|
vn_printf(bp->b_vp,
|
|
|
|
"vnode content: ");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (nbusy) {
|
|
|
|
/*
|
|
|
|
* Failed to sync all blocks. Indicate this and don't
|
|
|
|
* unmount filesystems (thus forcing an fsck on reboot).
|
|
|
|
*/
|
|
|
|
printf("Giving up on %d buffers\n", nbusy);
|
|
|
|
DELAY(5000000); /* 5 seconds */
|
2021-11-28 01:52:46 +00:00
|
|
|
swapoff_all();
|
2015-07-29 02:26:57 +00:00
|
|
|
} else {
|
|
|
|
if (!first_buf_printf)
|
|
|
|
printf("Final sync complete\n");
|
2021-11-28 01:52:46 +00:00
|
|
|
|
2015-07-29 02:26:57 +00:00
|
|
|
/*
|
2021-11-29 19:11:33 +00:00
|
|
|
* Unmount filesystems and perform swapoff, to quiesce
|
|
|
|
* the system as much as possible. In particular, no
|
|
|
|
* I/O should be initiated from top levels since it
|
|
|
|
* might be abruptly terminated by reset, or otherwise
|
|
|
|
* erroneously handled because other parts of the
|
|
|
|
* system are disabled.
|
|
|
|
*
|
|
|
|
* Swapoff before unmount, because file-backed swap is
|
|
|
|
* non-operational after unmount of the underlying
|
|
|
|
* filesystem.
|
2015-07-29 02:26:57 +00:00
|
|
|
*/
|
2021-11-28 01:52:46 +00:00
|
|
|
if (!KERNEL_PANICKED()) {
|
|
|
|
swapoff_all();
|
2015-07-29 02:26:57 +00:00
|
|
|
vfs_unmountall();
|
2021-11-28 01:52:46 +00:00
|
|
|
}
|
2015-07-29 02:26:57 +00:00
|
|
|
}
|
|
|
|
DELAY(100000); /* wait for console output to finish */
|
|
|
|
}
|
2013-03-19 14:13:12 +00:00
|
|
|
|
|
|
|
static void
|
|
|
|
bpmap_qenter(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bp->b_data is relative to bp->b_offset, but
|
|
|
|
* bp->b_offset may be offset into the first page.
|
|
|
|
*/
|
|
|
|
bp->b_data = (caddr_t)trunc_page((vm_offset_t)bp->b_data);
|
|
|
|
pmap_qenter((vm_offset_t)bp->b_data, bp->b_pages, bp->b_npages);
|
|
|
|
bp->b_data = (caddr_t)((vm_offset_t)bp->b_data |
|
|
|
|
(vm_offset_t)(bp->b_offset & PAGE_MASK));
|
1994-05-24 10:09:53 +00:00
|
|
|
}
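The address arithmetic in bpmap_qenter() can be tried outside the kernel. What follows is a minimal user-space sketch, assuming 4 KiB pages; `SK_PAGE_SIZE`, `SK_PAGE_MASK`, `sk_trunc_page()` and `sk_data_addr()` are hypothetical stand-ins for illustration, not kernel interfaces, and the actual page mapping step is only marked by a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this sketch only; the kernel uses PAGE_SIZE. */
#define	SK_PAGE_SIZE	4096u
#define	SK_PAGE_MASK	((uintptr_t)SK_PAGE_SIZE - 1)

/* Round an address down to the start of its page. */
static uintptr_t
sk_trunc_page(uintptr_t va)
{

	return (va & ~SK_PAGE_MASK);
}

/*
 * Mirror the three steps of bpmap_qenter(): align the data pointer to
 * a page boundary, (conceptually) map the pages there, then add back
 * the offset of the buffer within its first page.
 */
static uintptr_t
sk_data_addr(uintptr_t data, int64_t offset)
{
	uintptr_t base;

	base = sk_trunc_page(data);
	/* The kernel would enter the pages at 'base' at this point. */
	return (base | ((uintptr_t)offset & SK_PAGE_MASK));
}
```

Because `base` is page aligned, OR-ing in the masked offset gives the same result as adding it.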
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
static inline struct bufdomain *
|
|
|
|
bufdomain(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (&bdomain[bp->b_domain]);
|
|
|
|
}
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
static struct bufqueue *
|
|
|
|
bufqueue(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
switch (bp->b_qindex) {
|
|
|
|
case QUEUE_NONE:
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
case QUEUE_SENTINEL:
|
|
|
|
return (NULL);
|
|
|
|
case QUEUE_EMPTY:
|
|
|
|
return (&bqempty);
|
|
|
|
case QUEUE_DIRTY:
|
2018-03-17 18:14:49 +00:00
|
|
|
return (&bufdomain(bp)->bd_dirtyq);
|
2018-02-20 00:06:07 +00:00
|
|
|
case QUEUE_CLEAN:
|
2018-03-17 18:14:49 +00:00
|
|
|
return (&bufdomain(bp)->bd_subq[bp->b_subqueue]);
|
2018-02-20 00:06:07 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
panic("bufqueue(%p): Unhandled type %d\n", bp, bp->b_qindex);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the locked bufqueue that bp is a member of.
|
|
|
|
*/
|
|
|
|
static struct bufqueue *
|
|
|
|
bufqueue_acquire(struct buf *bp)
|
|
|
|
{
|
|
|
|
struct bufqueue *bq, *nbq;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bp can be pushed from a per-cpu queue to the
|
|
|
|
* cleanq while we're waiting on the lock. Retry
|
|
|
|
* if the queues don't match.
|
|
|
|
*/
|
|
|
|
bq = bufqueue(bp);
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
for (;;) {
|
|
|
|
nbq = bufqueue(bp);
|
|
|
|
if (bq == nbq)
|
|
|
|
break;
|
|
|
|
BQ_UNLOCK(bq);
|
|
|
|
BQ_LOCK(nbq);
|
|
|
|
bq = nbq;
|
|
|
|
}
|
|
|
|
return (bq);
|
|
|
|
}
|
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
/*
|
|
|
|
* binsfree:
|
|
|
|
*
|
2018-02-20 00:06:07 +00:00
|
|
|
* Insert the buffer into the appropriate free list. Requires a
|
|
|
|
* locked buffer on entry and buffer is unlocked before return.
|
2013-06-05 23:53:00 +00:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
binsfree(struct buf *bp, int qindex)
|
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufdomain *bd;
|
|
|
|
struct bufqueue *bq;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
KASSERT(qindex == QUEUE_CLEAN || qindex == QUEUE_DIRTY,
|
|
|
|
("binsfree: Invalid qindex %d", qindex));
|
|
|
|
BUF_ASSERT_XLOCKED(bp);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle delayed bremfree() processing.
|
|
|
|
*/
|
2015-06-23 06:12:14 +00:00
|
|
|
if (bp->b_flags & B_REMFREE) {
|
2018-02-20 00:06:07 +00:00
|
|
|
if (bp->b_qindex == qindex) {
|
|
|
|
bp->b_flags |= B_REUSE;
|
|
|
|
bp->b_flags &= ~B_REMFREE;
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
return;
|
2015-06-23 06:12:14 +00:00
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
bq = bufqueue_acquire(bp);
|
|
|
|
bq_remove(bq, bp);
|
|
|
|
BQ_UNLOCK(bq);
|
|
|
|
}
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = bufdomain(bp);
|
2018-02-20 00:06:07 +00:00
|
|
|
if (qindex == QUEUE_CLEAN) {
|
|
|
|
if (bd->bd_lim != 0)
|
|
|
|
bq = &bd->bd_subq[PCPU_GET(cpuid)];
|
|
|
|
else
|
|
|
|
bq = bd->bd_cleanq;
|
2015-06-23 06:12:14 +00:00
|
|
|
} else
|
2018-03-17 18:14:49 +00:00
|
|
|
bq = &bd->bd_dirtyq;
|
2018-02-20 00:06:07 +00:00
|
|
|
bq_insert(bq, bp, true);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* buf_free:
|
|
|
|
*
|
|
|
|
* Free a buffer to the buf zone once it no longer has valid contents.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
buf_free(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (bp->b_flags & B_REMFREE)
|
|
|
|
bremfreef(bp);
|
|
|
|
if (bp->b_vflags & BV_BKGRDINPROG)
|
|
|
|
panic("losing buffer 1");
|
|
|
|
if (bp->b_rcred != NOCRED) {
|
|
|
|
crfree(bp->b_rcred);
|
|
|
|
bp->b_rcred = NOCRED;
|
|
|
|
}
|
|
|
|
if (bp->b_wcred != NOCRED) {
|
|
|
|
crfree(bp->b_wcred);
|
|
|
|
bp->b_wcred = NOCRED;
|
|
|
|
}
|
|
|
|
if (!LIST_EMPTY(&bp->b_dep))
|
|
|
|
buf_deallocate(bp);
|
|
|
|
bufkva_free(bp);
|
2018-03-17 18:14:49 +00:00
|
|
|
atomic_add_int(&bufdomain(bp)->bd_freebuffers, 1);
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
MPASS((bp->b_flags & B_MAXPHYS) == 0);
|
2015-10-14 02:10:07 +00:00
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
uma_zfree(buf_zone, bp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* buf_import:
|
|
|
|
*
|
|
|
|
* Import bufs into the uma cache from the buf list. The system still
|
|
|
|
* expects a static array of bufs and much of the synchronization
|
|
|
|
* around bufs assumes type stable storage. As a result, UMA is used
|
|
|
|
* only as a per-cpu cache of bufs still maintained on a global list.
|
|
|
|
*/
|
|
|
|
static int
|
2018-01-12 23:25:05 +00:00
|
|
|
buf_import(void *arg, void **store, int cnt, int domain, int flags)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
|
|
|
struct buf *bp;
|
|
|
|
int i;
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_LOCK(&bqempty);
|
2015-10-14 02:10:07 +00:00
|
|
|
for (i = 0; i < cnt; i++) {
|
2018-02-20 00:06:07 +00:00
|
|
|
bp = TAILQ_FIRST(&bqempty.bq_queue);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp == NULL)
|
|
|
|
break;
|
2018-02-20 00:06:07 +00:00
|
|
|
bq_remove(&bqempty, bp);
|
2015-10-14 02:10:07 +00:00
|
|
|
store[i] = bp;
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_UNLOCK(&bqempty);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
return (i);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* buf_release:
|
|
|
|
*
|
|
|
|
* Release bufs from the uma cache back to the buffer queues.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
buf_release(void *arg, void **store, int cnt)
|
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufqueue *bq;
|
|
|
|
struct buf *bp;
|
2015-10-14 02:10:07 +00:00
|
|
|
int i;
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
bq = &bqempty;
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
for (i = 0; i < cnt; i++) {
|
|
|
|
bp = store[i];
|
|
|
|
/* Inline bq_insert() to batch locking. */
|
|
|
|
TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist);
|
|
|
|
bp->b_flags &= ~(B_AGE | B_REUSE);
|
|
|
|
bq->bq_len++;
|
|
|
|
bp->b_qindex = bq->bq_index;
|
|
|
|
}
|
|
|
|
BQ_UNLOCK(bq);
|
2015-10-14 02:10:07 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* buf_alloc:
|
|
|
|
*
|
|
|
|
* Allocate an empty buffer header.
|
|
|
|
*/
|
|
|
|
static struct buf *
|
2018-02-20 00:06:07 +00:00
|
|
|
buf_alloc(struct bufdomain *bd)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
|
|
|
struct buf *bp;
|
2020-08-02 16:34:27 +00:00
|
|
|
int freebufs, error;
|
2015-10-14 02:10:07 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
/*
|
|
|
|
* We can only run out of bufs in the buf zone if the average buf
|
|
|
|
* is less than BKVASIZE. In this case the actual wait/block will
|
|
|
|
* come from buf_recycle() failing to flush one of these small bufs.
|
|
|
|
*/
|
|
|
|
bp = NULL;
|
|
|
|
freebufs = atomic_fetchadd_int(&bd->bd_freebuffers, -1);
|
|
|
|
if (freebufs > 0)
|
|
|
|
bp = uma_zalloc(buf_zone, M_NOWAIT);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp == NULL) {
|
2018-11-06 17:32:25 +00:00
|
|
|
atomic_add_int(&bd->bd_freebuffers, 1);
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_daemon_wakeup(bd);
|
|
|
|
counter_u64_add(numbufallocfails, 1);
|
2015-10-14 02:10:07 +00:00
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
/*
|
2018-02-20 00:06:07 +00:00
|
|
|
* Wake-up the bufspace daemon on transition below threshold.
|
2015-10-14 02:10:07 +00:00
|
|
|
*/
|
2018-02-20 00:06:07 +00:00
|
|
|
if (freebufs == bd->bd_lofreebuffers)
|
|
|
|
bufspace_daemon_wakeup(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
2020-08-02 16:34:27 +00:00
|
|
|
error = BUF_LOCK(bp, LK_EXCLUSIVE, NULL);
|
|
|
|
KASSERT(error == 0, ("%s: BUF_LOCK on free buf %p: %d.", __func__, bp,
|
|
|
|
error));
|
|
|
|
(void)error;
|
2020-07-10 09:01:36 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
KASSERT(bp->b_vp == NULL,
|
|
|
|
("bp: %p still has vnode %p.", bp, bp->b_vp));
|
|
|
|
KASSERT((bp->b_flags & (B_DELWRI | B_NOREUSE)) == 0,
|
|
|
|
("invalid buffer %p flags %#x", bp, bp->b_flags));
|
|
|
|
KASSERT((bp->b_xflags & (BX_VNCLEAN|BX_VNDIRTY)) == 0,
|
|
|
|
("bp: %p still on a buffer list. xflags %X", bp, bp->b_xflags));
|
|
|
|
KASSERT(bp->b_npages == 0,
|
|
|
|
("bp: %p still has %d vm pages\n", bp, bp->b_npages));
|
|
|
|
KASSERT(bp->b_kvasize == 0, ("bp: %p still has kva\n", bp));
|
|
|
|
KASSERT(bp->b_bufsize == 0, ("bp: %p still has bufspace\n", bp));
|
2020-11-28 12:12:51 +00:00
|
|
|
MPASS((bp->b_flags & B_MAXPHYS) == 0);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
bp->b_domain = BD_DOMAIN(bd);
|
2015-10-14 02:10:07 +00:00
|
|
|
bp->b_flags = 0;
|
|
|
|
bp->b_ioflags = 0;
|
|
|
|
bp->b_xflags = 0;
|
|
|
|
bp->b_vflags = 0;
|
|
|
|
bp->b_vp = NULL;
|
|
|
|
bp->b_blkno = bp->b_lblkno = 0;
|
|
|
|
bp->b_offset = NOOFFSET;
|
|
|
|
bp->b_iodone = 0;
|
|
|
|
bp->b_error = 0;
|
|
|
|
bp->b_resid = 0;
|
|
|
|
bp->b_bcount = 0;
|
|
|
|
bp->b_npages = 0;
|
|
|
|
bp->b_dirtyoff = bp->b_dirtyend = 0;
|
|
|
|
bp->b_bufobj = NULL;
|
|
|
|
bp->b_data = bp->b_kvabase = unmapped_buf;
|
|
|
|
bp->b_fsprivate1 = NULL;
|
|
|
|
bp->b_fsprivate2 = NULL;
|
|
|
|
bp->b_fsprivate3 = NULL;
|
|
|
|
LIST_INIT(&bp->b_dep);
|
|
|
|
|
|
|
|
return (bp);
|
|
|
|
}
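The free-buffer accounting at the top of buf_alloc() follows a reserve-then-undo pattern: decrement the counter first, inspect the value it held before the decrement, and give the reservation back on failure. The sketch below reproduces only that pattern with C11 atomics. `struct sk_domain`, `sk_reserve()` and the `wakeups` field are hypothetical names for illustration; the `wakeups` counter merely stands in for bufspace_daemon_wakeup() and is not itself thread safe. The uma_zalloc() failure path of the original is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few bufdomain fields used here. */
struct sk_domain {
	atomic_int	freebuffers;	/* buffers still available */
	int		lofreebuffers;	/* low-water mark */
	int		wakeups;	/* counts simulated daemon wakeups */
};

/* Try to reserve one buffer; return true if the reservation holds. */
static bool
sk_reserve(struct sk_domain *d)
{
	int old;

	/* atomic_fetch_add() returns the value before the decrement. */
	old = atomic_fetch_add(&d->freebuffers, -1);
	if (old <= 0) {
		/* Nothing was free: undo the decrement and ask for help. */
		atomic_fetch_add(&d->freebuffers, 1);
		d->wakeups++;
		return (false);
	}
	/* Wake the daemon once, on the transition at the low-water mark. */
	if (old == d->lofreebuffers)
		d->wakeups++;
	return (true);
}
```

Comparing against the pre-decrement value means exactly one caller observes the low-water transition, so the wakeup is not repeated by every allocation below the threshold.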
|
|
|
|
|
|
|
|
/*
|
2018-02-20 00:06:07 +00:00
|
|
|
* buf_recycle:
|
2015-10-14 02:10:07 +00:00
|
|
|
*
|
|
|
|
* Free a buffer from the given bufqueue. kva controls whether the
|
|
|
|
* freed buf must own some kva resources. This is used for
|
|
|
|
* defragmenting.
|
|
|
|
*/
|
|
|
|
static int
|
2018-02-20 00:06:07 +00:00
|
|
|
buf_recycle(struct bufdomain *bd, bool kva)
|
2015-10-14 02:10:07 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufqueue *bq;
|
2015-10-14 02:10:07 +00:00
|
|
|
struct buf *bp, *nbp;
|
|
|
|
|
|
|
|
if (kva)
|
2018-02-20 00:06:07 +00:00
|
|
|
counter_u64_add(bufdefragcnt, 1);
|
2015-10-14 02:10:07 +00:00
|
|
|
nbp = NULL;
|
2018-02-20 00:06:07 +00:00
|
|
|
bq = bd->bd_cleanq;
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
KASSERT(BQ_LOCKPTR(bq) == BD_LOCKPTR(bd),
|
|
|
|
("buf_recycle: Locks don't match"));
|
|
|
|
nbp = TAILQ_FIRST(&bq->bq_queue);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Run scan, possibly freeing data and/or kva mappings on the fly
|
|
|
|
* where needed.
|
|
|
|
*/
|
|
|
|
while ((bp = nbp) != NULL) {
|
|
|
|
/*
|
|
|
|
* Calculate next bp (we can only use it if we do not
|
|
|
|
* release the bqlock).
|
|
|
|
*/
|
|
|
|
nbp = TAILQ_NEXT(bp, b_freelist);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are defragging then we need a buffer with
|
|
|
|
* some kva to reclaim.
|
|
|
|
*/
|
|
|
|
if (kva && bp->b_kvasize == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0)
|
|
|
|
continue;
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
/*
|
|
|
|
* Implement a second chance algorithm for frequently
|
|
|
|
* accessed buffers.
|
|
|
|
*/
|
|
|
|
if ((bp->b_flags & B_REUSE) != 0) {
|
|
|
|
TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist);
|
|
|
|
TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist);
|
|
|
|
bp->b_flags &= ~B_REUSE;
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* Skip buffers with background writes in progress.
|
|
|
|
*/
|
|
|
|
if ((bp->b_vflags & BV_BKGRDINPROG) != 0) {
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
KASSERT(bp->b_qindex == QUEUE_CLEAN,
|
|
|
|
("buf_recycle: inconsistent queue %d bp %p",
|
|
|
|
bp->b_qindex, bp));
|
|
|
|
KASSERT(bp->b_domain == BD_DOMAIN(bd),
|
|
|
|
("buf_recycle: queue domain %d doesn't match request %d",
|
|
|
|
bp->b_domain, (int)BD_DOMAIN(bd)));
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* NOTE: nbp is now entirely invalid. We can only restart
|
|
|
|
* the scan from this point on.
|
|
|
|
*/
|
2018-02-20 00:06:07 +00:00
|
|
|
bq_remove(bq, bp);
|
|
|
|
BQ_UNLOCK(bq);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Requeue the background write buffer with error and
|
|
|
|
* restart the scan.
|
|
|
|
*/
|
|
|
|
if ((bp->b_vflags & BV_BKGRDERR) != 0) {
|
|
|
|
bqrelse(bp);
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_LOCK(bq);
|
|
|
|
nbp = TAILQ_FIRST(&bq->bq_queue);
|
2015-10-14 02:10:07 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
bp->b_flags |= B_INVAL;
|
|
|
|
brelse(bp);
|
|
|
|
return (0);
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
bd->bd_wanted = 1;
|
|
|
|
BQ_UNLOCK(bq);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
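The second-chance scan in buf_recycle() can be isolated from locking and buffer state. The following is a simplified user-space sketch using the TAILQ macros from `<sys/queue.h>`; `struct sk_buf`, `SK_REUSE` and `sk_recycle()` are hypothetical names, and the buffer lock, the background-write checks and the restart after dropping the queue lock are all omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

#define	SK_REUSE	0x1	/* hypothetical analogue of B_REUSE */

struct sk_buf {
	int			id;
	int			flags;
	TAILQ_ENTRY(sk_buf)	link;
};
TAILQ_HEAD(sk_queue, sk_buf);

/*
 * Scan from the head.  A buffer marked SK_REUSE loses the mark and
 * moves to the tail (its second chance); the first unmarked buffer is
 * removed and returned.  Returns NULL if the scan finds no victim.
 */
static struct sk_buf *
sk_recycle(struct sk_queue *q)
{
	struct sk_buf *bp, *nbp;

	nbp = TAILQ_FIRST(q);
	while ((bp = nbp) != NULL) {
		/* Fetch the successor before bp is moved or removed. */
		nbp = TAILQ_NEXT(bp, link);
		if ((bp->flags & SK_REUSE) != 0) {
			TAILQ_REMOVE(q, bp, link);
			TAILQ_INSERT_TAIL(q, bp, link);
			bp->flags &= ~SK_REUSE;
			continue;
		}
		TAILQ_REMOVE(q, bp, link);
		return (bp);
	}
	return (NULL);
}
```

As in the original loop, the successor is read before the current entry is requeued, so a requeued buffer is visited again later in the same pass only if the scan reaches the tail.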
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1999-03-12 02:24:58 +00:00
|
|
|
* bremfree:
|
|
|
|
*
|
2013-06-05 23:53:00 +00:00
|
|
|
* Mark the buffer for removal from the appropriate free list.
|
2020-07-10 09:01:36 +00:00
|
|
|
*
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bremfree(struct buf *bp)
|
2003-02-16 10:43:06 +00:00
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bremfree(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2005-06-13 00:45:05 +00:00
|
|
|
KASSERT((bp->b_flags & B_REMFREE) == 0,
|
|
|
|
("bremfree: buffer %p already marked for delayed removal.", bp));
|
|
|
|
KASSERT(bp->b_qindex != QUEUE_NONE,
|
2005-01-24 10:47:04 +00:00
|
|
|
("bremfree: buffer %p not on a queue.", bp));
|
2013-05-31 00:43:41 +00:00
|
|
|
BUF_ASSERT_XLOCKED(bp);
|
2004-11-18 08:44:09 +00:00
|
|
|
|
|
|
|
bp->b_flags |= B_REMFREE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bremfreef:
|
|
|
|
*
|
|
|
|
* Force an immediate removal from a free list. Used only in nfs when
|
|
|
|
* it abuses the b_freelist pointer.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
bremfreef(struct buf *bp)
|
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufqueue *bq;
|
2013-06-05 23:53:00 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
bq = bufqueue_acquire(bp);
|
|
|
|
bq_remove(bq, bp);
|
|
|
|
BQ_UNLOCK(bq);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
bq_init(struct bufqueue *bq, int qindex, int subqueue, const char *lockname)
|
|
|
|
{
|
|
|
|
|
|
|
|
mtx_init(&bq->bq_lock, lockname, NULL, MTX_DEF);
|
|
|
|
TAILQ_INIT(&bq->bq_queue);
|
|
|
|
bq->bq_len = 0;
|
|
|
|
bq->bq_index = qindex;
|
|
|
|
bq->bq_subqueue = subqueue;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
bd_init(struct bufdomain *bd)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2018-02-25 00:35:21 +00:00
|
|
|
bd->bd_cleanq = &bd->bd_subq[mp_maxid + 1];
|
|
|
|
bq_init(bd->bd_cleanq, QUEUE_CLEAN, mp_maxid + 1, "bufq clean lock");
|
2018-03-17 18:14:49 +00:00
|
|
|
bq_init(&bd->bd_dirtyq, QUEUE_DIRTY, -1, "bufq dirty lock");
|
2018-02-20 00:06:07 +00:00
|
|
|
for (i = 0; i <= mp_maxid; i++)
|
|
|
|
bq_init(&bd->bd_subq[i], QUEUE_CLEAN, i,
|
|
|
|
"bufq clean subqueue lock");
|
|
|
|
mtx_init(&bd->bd_run_lock, "bufspace daemon run lock", NULL, MTX_DEF);
|
2003-02-16 10:43:06 +00:00
|
|
|
}
|
|
|
|
|
2004-11-18 08:44:09 +00:00
|
|
|
/*
|
2018-02-20 00:06:07 +00:00
|
|
|
* bq_remove:
|
2004-11-18 08:44:09 +00:00
|
|
|
*
|
|
|
|
* Removes a buffer from the free list, must be called with the
|
2013-06-05 23:53:00 +00:00
|
|
|
* correct qlock held.
|
2004-11-18 08:44:09 +00:00
|
|
|
*/
|
2005-02-10 12:28:58 +00:00
|
|
|
static void
|
2018-02-20 00:06:07 +00:00
|
|
|
bq_remove(struct bufqueue *bq, struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2010-06-11 17:03:26 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
CTR3(KTR_BUF, "bq_remove(%p) vp %p flags %X",
|
2005-01-24 10:47:04 +00:00
|
|
|
bp, bp->b_vp, bp->b_flags);
|
|
|
|
KASSERT(bp->b_qindex != QUEUE_NONE,
|
2018-02-20 00:06:07 +00:00
|
|
|
("bq_remove: buffer %p not on a queue.", bp));
|
|
|
|
KASSERT(bufqueue(bp) == bq,
|
|
|
|
("bq_remove: Remove buffer %p from wrong queue.", bp));
|
|
|
|
|
|
|
|
BQ_ASSERT_LOCKED(bq);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp->b_qindex != QUEUE_EMPTY) {
|
|
|
|
BUF_ASSERT_XLOCKED(bp);
|
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
KASSERT(bq->bq_len >= 1,
|
|
|
|
("queue %d underflow", bp->b_qindex));
|
|
|
|
TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist);
|
|
|
|
bq->bq_len--;
|
2005-01-24 10:47:04 +00:00
|
|
|
bp->b_qindex = QUEUE_NONE;
|
2018-02-20 00:06:07 +00:00
|
|
|
bp->b_flags &= ~(B_REMFREE | B_REUSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
bd_flush(struct bufdomain *bd, struct bufqueue *bq)
|
|
|
|
{
|
|
|
|
struct buf *bp;
|
|
|
|
|
|
|
|
BQ_ASSERT_LOCKED(bq);
|
|
|
|
if (bq != bd->bd_cleanq) {
|
|
|
|
BD_LOCK(bd);
|
|
|
|
while ((bp = TAILQ_FIRST(&bq->bq_queue)) != NULL) {
|
|
|
|
TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist);
|
|
|
|
TAILQ_INSERT_TAIL(&bd->bd_cleanq->bq_queue, bp,
|
|
|
|
b_freelist);
|
2018-02-25 00:35:21 +00:00
|
|
|
bp->b_subqueue = bd->bd_cleanq->bq_subqueue;
|
2018-02-20 00:06:07 +00:00
|
|
|
}
|
|
|
|
bd->bd_cleanq->bq_len += bq->bq_len;
|
|
|
|
bq->bq_len = 0;
|
|
|
|
}
|
|
|
|
if (bd->bd_wanted) {
|
|
|
|
bd->bd_wanted = 0;
|
|
|
|
wakeup(&bd->bd_wanted);
|
|
|
|
}
|
|
|
|
if (bq != bd->bd_cleanq)
|
|
|
|
BD_UNLOCK(bd);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
bd_flushall(struct bufdomain *bd)
|
|
|
|
{
|
|
|
|
struct bufqueue *bq;
|
|
|
|
int flushed;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (bd->bd_lim == 0)
|
|
|
|
return (0);
|
|
|
|
flushed = 0;
|
2018-02-25 00:35:21 +00:00
|
|
|
for (i = 0; i <= mp_maxid; i++) {
|
2018-02-20 00:06:07 +00:00
|
|
|
bq = &bd->bd_subq[i];
|
|
|
|
if (bq->bq_len == 0)
|
|
|
|
continue;
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
bd_flush(bd, bq);
|
|
|
|
BQ_UNLOCK(bq);
|
|
|
|
flushed++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (flushed);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
bq_insert(struct bufqueue *bq, struct buf *bp, bool unlock)
|
|
|
|
{
|
|
|
|
struct bufdomain *bd;
|
|
|
|
|
|
|
|
if (bp->b_qindex != QUEUE_NONE)
|
|
|
|
panic("bq_insert: free buffer %p onto another queue?", bp);
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = bufdomain(bp);
|
2018-02-20 00:06:07 +00:00
|
|
|
if (bp->b_flags & B_AGE) {
|
|
|
|
/* Place this buf directly on the real queue. */
|
|
|
|
if (bq->bq_index == QUEUE_CLEAN)
|
|
|
|
bq = bd->bd_cleanq;
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
TAILQ_INSERT_HEAD(&bq->bq_queue, bp, b_freelist);
|
|
|
|
} else {
|
|
|
|
BQ_LOCK(bq);
|
|
|
|
TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist);
|
|
|
|
}
|
|
|
|
bp->b_flags &= ~(B_AGE | B_REUSE);
|
|
|
|
bq->bq_len++;
|
|
|
|
bp->b_qindex = bq->bq_index;
|
|
|
|
bp->b_subqueue = bq->bq_subqueue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unlock before we notify so that we don't wakeup a waiter that
|
|
|
|
* fails a trylock on the buf and sleeps again.
|
|
|
|
*/
|
|
|
|
if (unlock)
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
|
|
|
|
if (bp->b_qindex == QUEUE_CLEAN) {
|
|
|
|
/*
|
|
|
|
* Flush the per-cpu queue and notify any waiters.
|
|
|
|
*/
|
|
|
|
if (bd->bd_wanted || (bq != bd->bd_cleanq &&
|
|
|
|
bq->bq_len >= bd->bd_lim))
|
|
|
|
bd_flush(bd, bq);
|
|
|
|
}
|
|
|
|
BQ_UNLOCK(bq);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
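The interplay between bq_insert() and bd_flush() amounts to batching: buffers accumulate on a per-CPU subqueue and are spliced onto the shared clean queue once the subqueue reaches a limit. Below is a single-threaded sketch of that policy under simplifying assumptions. `struct sk_item`, `struct sk_q`, `sk_q_init()`, `sk_flush()` and `sk_insert()` are hypothetical names; locking, the B_AGE head insertion and the bd_wanted wakeup are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sk_item {
	int			subqueue;	/* id of the owning queue */
	TAILQ_ENTRY(sk_item)	link;
};
TAILQ_HEAD(sk_list, sk_item);

struct sk_q {
	struct sk_list	head;
	int		len;
	int		id;
};

static void
sk_q_init(struct sk_q *q, int id)
{

	TAILQ_INIT(&q->head);
	q->len = 0;
	q->id = id;
}

/* Move every item from the subqueue to the tail of the clean queue. */
static void
sk_flush(struct sk_q *sub, struct sk_q *clean)
{
	struct sk_item *it;

	while ((it = TAILQ_FIRST(&sub->head)) != NULL) {
		TAILQ_REMOVE(&sub->head, it, link);
		TAILQ_INSERT_TAIL(&clean->head, it, link);
		it->subqueue = clean->id;
	}
	clean->len += sub->len;
	sub->len = 0;
}

/* Append to the subqueue and flush it once it holds 'lim' items. */
static void
sk_insert(struct sk_q *sub, struct sk_q *clean, struct sk_item *it, int lim)
{

	TAILQ_INSERT_TAIL(&sub->head, it, link);
	it->subqueue = sub->id;
	sub->len++;
	if (sub != clean && sub->len >= lim)
		sk_flush(sub, clean);
}
```

Batching in this way means the shared queue is touched once per `lim` insertions rather than on every insertion, which is presumably the point of keeping per-CPU subqueues.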
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
/*
|
2015-10-14 02:10:07 +00:00
|
|
|
* bufkva_free:
|
2015-07-23 19:13:41 +00:00
|
|
|
*
|
|
|
|
* Free the kva allocation for a buffer.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static void
|
2015-10-14 02:10:07 +00:00
|
|
|
bufkva_free(struct buf *bp)
|
2015-07-23 19:13:41 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
if (bp->b_kvasize == 0) {
|
|
|
|
KASSERT(bp->b_kvabase == unmapped_buf &&
|
|
|
|
bp->b_data == unmapped_buf,
|
|
|
|
("Leaked KVA space on %p", bp));
|
|
|
|
} else if (buf_mapped(bp))
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
else
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
|
|
|
#endif
|
|
|
|
if (bp->b_kvasize == 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
vmem_free(buffer_arena, (vm_offset_t)bp->b_kvabase, bp->b_kvasize);
|
2018-02-20 00:06:07 +00:00
|
|
|
counter_u64_add(bufkvaspace, -bp->b_kvasize);
|
|
|
|
counter_u64_add(buffreekvacnt, 1);
|
2015-07-23 19:13:41 +00:00
|
|
|
bp->b_data = bp->b_kvabase = unmapped_buf;
|
|
|
|
bp->b_kvasize = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-10-14 02:10:07 +00:00
|
|
|
* bufkva_alloc:
|
2015-07-23 19:13:41 +00:00
|
|
|
*
|
|
|
|
* Allocate the buffer KVA and set b_kvasize and b_kvabase.
|
|
|
|
*/
|
|
|
|
static int
|
2015-10-14 02:10:07 +00:00
|
|
|
bufkva_alloc(struct buf *bp, int maxsize, int gbflags)
|
2015-07-23 19:13:41 +00:00
|
|
|
{
|
|
|
|
vm_offset_t addr;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
KASSERT((gbflags & GB_UNMAPPED) == 0 || (gbflags & GB_KVAALLOC) != 0,
|
|
|
|
("Invalid gbflags 0x%x in %s", gbflags, __func__));
|
2020-11-28 12:12:51 +00:00
|
|
|
MPASS((bp->b_flags & B_MAXPHYS) == 0);
|
|
|
|
KASSERT(maxsize <= maxbcachebuf,
|
|
|
|
("bufkva_alloc kva too large %d %u", maxsize, maxbcachebuf));
|
2015-07-23 19:13:41 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
bufkva_free(bp);
|
2015-07-23 19:13:41 +00:00
|
|
|
|
|
|
|
addr = 0;
|
|
|
|
error = vmem_alloc(buffer_arena, maxsize, M_BESTFIT | M_NOWAIT, &addr);
|
|
|
|
if (error != 0) {
|
|
|
|
/*
|
|
|
|
* Buffer map is too fragmented. Request the caller
|
|
|
|
* to defragment the map.
|
|
|
|
*/
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
bp->b_kvabase = (caddr_t)addr;
|
|
|
|
bp->b_kvasize = maxsize;
|
2018-02-20 00:06:07 +00:00
|
|
|
counter_u64_add(bufkvaspace, bp->b_kvasize);
|
2015-07-23 19:13:41 +00:00
|
|
|
if ((gbflags & GB_UNMAPPED) != 0) {
|
|
|
|
bp->b_data = unmapped_buf;
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
|
|
|
} else {
|
|
|
|
bp->b_data = bp->b_kvabase;
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
/*
|
|
|
|
* bufkva_reclaim:
|
|
|
|
*
|
|
|
|
* Reclaim buffer kva by freeing buffers holding kva. This is a vmem
|
|
|
|
* callback that fires to avoid returning failure.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
bufkva_reclaim(vmem_t *vmem, int flags)
|
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
bool done;
|
|
|
|
int q;
|
2015-10-14 02:10:07 +00:00
|
|
|
int i;
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
done = false;
|
|
|
|
for (i = 0; i < 5; i++) {
|
2018-03-17 18:14:49 +00:00
|
|
|
for (q = 0; q < buf_domains; q++)
|
|
|
|
if (buf_recycle(&bdomain[q], true) != 0)
|
2018-02-20 00:06:07 +00:00
|
|
|
done = true;
|
|
|
|
if (done)
|
2015-10-14 02:10:07 +00:00
|
|
|
break;
|
2018-02-20 00:06:07 +00:00
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2005-12-07 03:39:08 +00:00
|
|
|
/*
|
|
|
|
* Attempt to initiate asynchronous I/O on read-ahead blocks. We must
|
|
|
|
* clear BIO_ERROR and B_INVAL prior to initiating I/O. If B_CACHE is set,
|
|
|
|
* the buffer is valid and we do not have to do anything.
|
|
|
|
*/
|
2017-09-22 12:45:15 +00:00
|
|
|
static void
|
|
|
|
breada(struct vnode * vp, daddr_t * rablkno, int * rabsize, int cnt,
|
|
|
|
struct ucred * cred, int flags, void (*ckhashfunc)(struct buf *))
|
2005-12-07 03:39:08 +00:00
|
|
|
{
|
|
|
|
struct buf *rabp;
|
2019-04-29 13:23:32 +00:00
|
|
|
struct thread *td;
|
2005-12-07 03:39:08 +00:00
|
|
|
int i;
|
|
|
|
|
2019-04-29 13:23:32 +00:00
|
|
|
td = curthread;
|
|
|
|
|
2005-12-07 03:39:08 +00:00
|
|
|
for (i = 0; i < cnt; i++, rablkno++, rabsize++) {
|
|
|
|
if (inmem(vp, *rablkno))
|
|
|
|
continue;
|
|
|
|
rabp = getblk(vp, *rablkno, *rabsize, 0, 0, 0);
|
2017-09-22 12:45:15 +00:00
|
|
|
if ((rabp->b_flags & B_CACHE) != 0) {
|
|
|
|
brelse(rabp);
|
|
|
|
continue;
|
|
|
|
}
|
2016-04-07 04:23:25 +00:00
|
|
|
#ifdef RACCT
|
2019-04-29 13:23:32 +00:00
|
|
|
if (racct_enable) {
|
|
|
|
PROC_LOCK(curproc);
|
|
|
|
racct_add_buf(curproc, rabp, 0);
|
|
|
|
PROC_UNLOCK(curproc);
|
2017-09-22 12:45:15 +00:00
|
|
|
}
|
2019-04-29 13:23:32 +00:00
|
|
|
#endif /* RACCT */
|
|
|
|
td->td_ru.ru_inblock++;
|
2017-09-22 12:45:15 +00:00
|
|
|
rabp->b_flags |= B_ASYNC;
|
|
|
|
rabp->b_flags &= ~B_INVAL;
|
|
|
|
if ((flags & GB_CKHASH) != 0) {
|
|
|
|
rabp->b_flags |= B_CKHASH;
|
|
|
|
rabp->b_ckhashcalc = ckhashfunc;
|
2005-12-07 03:39:08 +00:00
|
|
|
}
|
2017-09-22 12:45:15 +00:00
|
|
|
rabp->b_ioflags &= ~BIO_ERROR;
|
|
|
|
rabp->b_iocmd = BIO_READ;
|
|
|
|
if (rabp->b_rcred == NOCRED && cred != NOCRED)
|
|
|
|
rabp->b_rcred = crhold(cred);
|
|
|
|
vfs_busy_pages(rabp, 0);
|
|
|
|
BUF_KERNPROC(rabp);
|
|
|
|
rabp->b_iooffset = dbtob(rabp->b_blkno);
|
|
|
|
bstrategy(rabp);
|
2005-12-07 03:39:08 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2012-03-01 18:45:25 +00:00
|
|
|
* Entry point for bread() and breadn() via #defines in sys/buf.h.
|
|
|
|
*
|
|
|
|
* Get a buffer with the specified data. Look in the cache first. We
|
|
|
|
* must clear BIO_ERROR and B_INVAL prior to initiating I/O. If B_CACHE
|
|
|
|
* is set, the buffer is valid and we do not have to do anything, see
|
|
|
|
* getblk(). Also starts asynchronous I/O on read-ahead blocks.
|
2016-01-27 21:23:01 +00:00
|
|
|
*
|
|
|
|
* Always return a NULL buffer pointer (in bpp) when returning an error.
|
2019-12-03 23:07:09 +00:00
|
|
|
*
|
|
|
|
* The blkno parameter is the logical block being requested. Normally
|
|
|
|
* the mapping of logical block number to disk block address is done
|
|
|
|
* by calling VOP_BMAP(). However, if the mapping is already known, the
|
|
|
|
* disk block address can be passed using the dblkno parameter. If the
|
|
|
|
* disk block address is not known, then the same value should be passed
|
|
|
|
* for blkno and dblkno.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
int
|
2019-12-03 23:07:09 +00:00
|
|
|
breadn_flags(struct vnode *vp, daddr_t blkno, daddr_t dblkno, int size,
|
|
|
|
daddr_t *rablkno, int *rabsize, int cnt, struct ucred *cred, int flags,
|
2017-09-22 12:45:15 +00:00
|
|
|
void (*ckhashfunc)(struct buf *), struct buf **bpp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2005-12-07 03:39:08 +00:00
|
|
|
struct buf *bp;
|
2018-05-13 09:47:28 +00:00
|
|
|
struct thread *td;
|
|
|
|
int error, readwait, rv;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "breadn(%p, %jd, %d)", vp, blkno, size);
|
2018-05-13 09:47:28 +00:00
|
|
|
td = curthread;
|
2012-03-01 18:45:25 +00:00
|
|
|
/*
|
2018-05-13 09:47:28 +00:00
|
|
|
* Can only return NULL if GB_LOCK_NOWAIT or GB_SPARSE flags
|
|
|
|
* are specified.
|
2012-03-01 18:45:25 +00:00
|
|
|
*/
|
2019-12-03 23:07:09 +00:00
|
|
|
error = getblkx(vp, blkno, dblkno, size, 0, 0, flags, &bp);
|
2018-05-13 09:47:28 +00:00
|
|
|
if (error != 0) {
|
|
|
|
*bpp = NULL;
|
|
|
|
return (error);
|
|
|
|
}
|
2019-12-03 23:07:09 +00:00
|
|
|
KASSERT(blkno == bp->b_lblkno,
|
|
|
|
("getblkx returned buffer for blkno %jd instead of blkno %jd",
|
|
|
|
(intmax_t)bp->b_lblkno, (intmax_t)blkno));
|
2018-05-13 09:47:28 +00:00
|
|
|
flags &= ~GB_NOSPARSE;
|
|
|
|
*bpp = bp;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2017-09-22 12:45:15 +00:00
|
|
|
/*
|
|
|
|
* If not found in cache, do some I/O
|
|
|
|
*/
|
|
|
|
readwait = 0;
|
1994-05-25 09:21:21 +00:00
|
|
|
if ((bp->b_flags & B_CACHE) == 0) {
|
2016-04-07 04:23:25 +00:00
|
|
|
#ifdef RACCT
|
2019-04-29 13:23:32 +00:00
|
|
|
if (racct_enable) {
|
|
|
|
PROC_LOCK(td->td_proc);
|
|
|
|
racct_add_buf(td->td_proc, bp, 0);
|
|
|
|
PROC_UNLOCK(td->td_proc);
|
2016-04-07 04:23:25 +00:00
|
|
|
}
|
2019-04-29 13:23:32 +00:00
|
|
|
#endif /* RACCT */
|
|
|
|
td->td_ru.ru_inblock++;
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_iocmd = BIO_READ;
|
2000-04-02 15:24:56 +00:00
|
|
|
bp->b_flags &= ~B_INVAL;
|
2017-09-22 12:45:15 +00:00
|
|
|
if ((flags & GB_CKHASH) != 0) {
|
|
|
|
bp->b_flags |= B_CKHASH;
|
|
|
|
bp->b_ckhashcalc = ckhashfunc;
|
|
|
|
}
|
This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.
The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.
There are two cases for disk I/O errors:
- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.
- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.
For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.
Clearing some dependencies requires a read. For example, if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.
Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.
The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.
Co-authored by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088
2020-05-25 23:47:31 +00:00
|
|
|
if ((flags & GB_CVTENXIO) != 0)
|
|
|
|
bp->b_xflags |= BX_CVTENXIO;
|
2000-04-02 15:24:56 +00:00
|
|
|
bp->b_ioflags &= ~BIO_ERROR;
|
2001-10-11 23:38:17 +00:00
|
|
|
if (bp->b_rcred == NOCRED && cred != NOCRED)
|
|
|
|
bp->b_rcred = crhold(cred);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
vfs_busy_pages(bp, 0);
|
2003-10-18 19:49:46 +00:00
|
|
|
bp->b_iooffset = dbtob(bp->b_blkno);
|
2004-10-24 20:03:41 +00:00
|
|
|
bstrategy(bp);
|
1994-05-25 09:21:21 +00:00
|
|
|
++readwait;
|
|
|
|
}
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2017-09-22 12:45:15 +00:00
|
|
|
/*
|
|
|
|
* Attempt to initiate asynchronous I/O on read-ahead blocks.
|
|
|
|
*/
|
|
|
|
breada(vp, rablkno, rabsize, cnt, cred, flags, ckhashfunc);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2017-09-22 12:45:15 +00:00
|
|
|
rv = 0;
|
1995-01-09 16:06:02 +00:00
|
|
|
if (readwait) {
|
2000-04-29 16:25:22 +00:00
|
|
|
rv = bufwait(bp);
|
2016-01-27 21:23:01 +00:00
|
|
|
if (rv != 0) {
|
|
|
|
brelse(bp);
|
|
|
|
*bpp = NULL;
|
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
|
|
|
return (rv);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
|
|
|
* Write, release buffer on completion. (Done by iodone
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
* if async). Do not bother writing anything if the buffer
|
|
|
|
* is invalid.
|
|
|
|
*
|
|
|
|
* Note that we set B_CACHE here, indicating that the buffer is
|
|
|
|
* fully valid and thus cacheable. This is true even of NFS
|
|
|
|
* now so we set it generally. This could be set either here
|
|
|
|
* or in biodone() since the I/O is synchronous. We put it
|
|
|
|
* here.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
int
|
2004-10-24 20:03:41 +00:00
|
|
|
bufwrite(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2005-04-30 12:18:50 +00:00
|
|
|
int oldflags;
|
2006-12-20 09:22:31 +00:00
|
|
|
struct vnode *vp;
|
2013-06-05 23:53:00 +00:00
|
|
|
long space;
|
2006-12-20 09:22:31 +00:00
|
|
|
int vp_md;
|
1998-03-08 09:59:44 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bufwrite(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2015-07-11 11:21:56 +00:00
|
|
|
if ((bp->b_bufobj->bo_flag & BO_DEAD) != 0) {
|
|
|
|
bp->b_flags |= B_INVAL | B_RELBUF;
|
|
|
|
bp->b_flags &= ~B_CACHE;
|
|
|
|
brelse(bp);
|
|
|
|
return (ENXIO);
|
|
|
|
}
|
1995-01-09 16:06:02 +00:00
|
|
|
if (bp->b_flags & B_INVAL) {
|
1994-05-25 09:21:21 +00:00
|
|
|
brelse(bp);
|
|
|
|
return (0);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
In kern_physio.c fix tsleep priority messup.
In vfs_bio.c, remove b_generation count usage,
remove redundant reassignbuf,
remove redundant spl(s),
manage page PG_ZERO flags more correctly,
utilize an invalid value for b_offset until it
is properly initialized. Add asserts
for #ifdef DIAGNOSTIC, when b_offset is
improperly used.
when a process is not performing I/O, and just waiting
on a buffer generally, make the sleep priority
low.
only check page validity in getblk for B_VMIO buffers.
In vfs_cluster, add b_offset asserts, correct pointer calculation
for clustered reads. Improve readability of certain parts of
the code. Remove redundant spl(s).
In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
<gallatin@cs.duke.edu>). More vtruncbuf problems fixed.
1998-03-19 22:48:16 +00:00
|
|
|
|
2013-02-16 14:51:30 +00:00
|
|
|
if (bp->b_flags & B_BARRIER)
|
2018-12-02 12:53:39 +00:00
|
|
|
atomic_add_long(&barrierwrites, 1);
|
2013-02-16 14:51:30 +00:00
|
|
|
|
1998-03-19 22:48:16 +00:00
|
|
|
oldflags = bp->b_flags;
|
|
|
|
|
2005-02-08 20:29:10 +00:00
|
|
|
KASSERT(!(bp->b_vflags & BV_BKGRDINPROG),
|
|
|
|
("FFS background buffer should not get here %p", bp));
|
2000-01-10 00:24:24 +00:00
|
|
|
|
2006-12-20 09:22:31 +00:00
|
|
|
vp = bp->b_vp;
|
|
|
|
if (vp)
|
|
|
|
vp_md = vp->v_vflag & VV_MD;
|
|
|
|
else
|
|
|
|
vp_md = 0;
|
|
|
|
|
2013-03-20 21:08:00 +00:00
|
|
|
/*
|
|
|
|
* Mark the buffer clean. Increment the bufobj write count
|
|
|
|
* before the bundirty() call, to prevent other threads from seeing
|
|
|
|
* an empty dirty list and a zero counter for writes in progress,
|
|
|
|
* falsely indicating that the bufobj is clean.
|
|
|
|
*/
|
|
|
|
bufobj_wref(bp->b_bufobj);
|
1999-03-12 02:24:58 +00:00
|
|
|
bundirty(bp);
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2000-04-02 15:24:56 +00:00
|
|
|
bp->b_flags &= ~B_DONE;
|
|
|
|
bp->b_ioflags &= ~BIO_ERROR;
|
2004-09-15 21:49:22 +00:00
|
|
|
bp->b_flags |= B_CACHE;
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_iocmd = BIO_WRITE;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
1995-01-09 16:06:02 +00:00
|
|
|
vfs_busy_pages(bp, 1);
|
2001-02-28 04:13:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Normal bwrites pipeline writes
|
|
|
|
*/
|
|
|
|
bp->b_runningbufspace = bp->b_bufsize;
|
2013-06-05 23:53:00 +00:00
|
|
|
space = atomic_fetchadd_long(&runningbufspace, bp->b_runningbufspace);
|
2001-02-28 04:13:11 +00:00
|
|
|
|
2016-04-07 04:23:25 +00:00
|
|
|
#ifdef RACCT
|
2019-04-29 13:23:32 +00:00
|
|
|
if (racct_enable) {
|
|
|
|
PROC_LOCK(curproc);
|
|
|
|
racct_add_buf(curproc, bp, 1);
|
|
|
|
PROC_UNLOCK(curproc);
|
2016-04-07 04:23:25 +00:00
|
|
|
}
|
2019-04-29 13:23:32 +00:00
|
|
|
#endif /* RACCT */
|
|
|
|
curthread->td_ru.ru_oublock++;
|
1999-06-29 05:59:47 +00:00
|
|
|
if (oldflags & B_ASYNC)
|
|
|
|
BUF_KERNPROC(bp);
|
2003-10-18 19:49:46 +00:00
|
|
|
bp->b_iooffset = dbtob(bp->b_blkno);
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
2004-10-24 20:03:41 +00:00
|
|
|
bstrategy(bp);
|
1994-05-25 09:21:21 +00:00
|
|
|
|
1996-09-13 03:15:45 +00:00
|
|
|
if ((oldflags & B_ASYNC) == 0) {
|
2000-04-29 16:25:22 +00:00
|
|
|
int rtval = bufwait(bp);
|
1994-05-25 09:21:21 +00:00
|
|
|
brelse(bp);
|
|
|
|
return (rtval);
|
2013-06-05 23:53:00 +00:00
|
|
|
} else if (space > hirunningspace) {
|
2000-12-26 19:41:38 +00:00
|
|
|
/*
|
|
|
|
* Don't allow the async write to saturate the I/O
|
2003-05-31 16:42:45 +00:00
|
|
|
* system. We will not deadlock here because
|
2001-11-05 18:48:54 +00:00
|
|
|
* we are blocking waiting for I/O that is already in-progress
|
2003-11-04 06:30:00 +00:00
|
|
|
* to complete. We do not block here if it is the update
|
|
|
|
* or syncer daemon trying to clean up as that can lead
|
|
|
|
* to deadlock.
|
2000-12-26 19:41:38 +00:00
|
|
|
*/
|
2006-12-20 09:22:31 +00:00
|
|
|
if ((curthread->td_pflags & TDP_NORUNNINGBUF) == 0 && !vp_md)
|
2003-11-04 06:30:00 +00:00
|
|
|
waitrunningbufspace();
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
return (0);
|
1997-06-15 17:56:53 +00:00
|
|
|
}
|
|
|
|
|
Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquisition.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
|
|
|
void
|
|
|
|
bufbdflush(struct bufobj *bo, struct buf *bp)
|
|
|
|
{
|
|
|
|
struct buf *nbp;
|
buf: Fix the dirtybufthresh check
dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers. When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached. This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).
When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated. Fix this.
Reported by: Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by: rlibby, mckusick, kib, bdrewery
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Fixes: 3cec5c77d6
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28901
2021-02-25 15:04:44 +00:00
|
|
|
struct bufdomain *bd;
|
2007-01-23 10:01:19 +00:00
|
|
|
|
2021-02-25 15:04:44 +00:00
|
|
|
bd = &bdomain[bo->bo_domain];
|
|
|
|
if (bo->bo_dirty.bv_cnt > bd->bd_dirtybufthresh + 10) {
|
2007-01-23 10:01:19 +00:00
|
|
|
(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
|
|
|
|
altbufferflushes++;
|
2021-02-25 15:04:44 +00:00
|
|
|
} else if (bo->bo_dirty.bv_cnt > bd->bd_dirtybufthresh) {
|
2007-01-23 10:01:19 +00:00
|
|
|
BO_LOCK(bo);
|
|
|
|
/*
|
|
|
|
* Try to find a buffer to flush.
|
|
|
|
*/
|
|
|
|
TAILQ_FOREACH(nbp, &bo->bo_dirty.bv_hd, b_bobufs) {
|
|
|
|
if ((nbp->b_vflags & BV_BKGRDINPROG) ||
|
|
|
|
BUF_LOCK(nbp,
|
|
|
|
LK_EXCLUSIVE | LK_NOWAIT, NULL))
|
|
|
|
continue;
|
|
|
|
if (bp == nbp)
|
|
|
|
panic("bdwrite: found ourselves");
|
|
|
|
BO_UNLOCK(bo);
|
|
|
|
/* Don't countdeps with the bo lock held. */
|
|
|
|
if (buf_countdeps(nbp, 0)) {
|
|
|
|
BO_LOCK(bo);
|
|
|
|
BUF_UNLOCK(nbp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (nbp->b_flags & B_CLUSTEROK) {
|
|
|
|
vfs_bio_awrite(nbp);
|
|
|
|
} else {
|
|
|
|
bremfree(nbp);
|
|
|
|
bawrite(nbp);
|
|
|
|
}
|
|
|
|
dirtybufferflushes++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (nbp == NULL)
|
|
|
|
BO_UNLOCK(bo);
|
|
|
|
}
|
|
|
|
}
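The two-tier check in the flush routine above can be modeled in user space. This is a minimal sketch, not the kernel code: `choose_flush_action`, `enum flush_action`, and `FLUSH_ALL_SLACK` are hypothetical names introduced for illustration, and the per-domain threshold is passed in as a plain integer rather than read from a buffer domain.

```c
#include <assert.h>

/*
 * Hypothetical user-space model of the two-tier dirty buffer check.
 * None of these names exist in the kernel.
 */
enum flush_action { FLUSH_NONE, FLUSH_ONE, FLUSH_ALL };

#define FLUSH_ALL_SLACK 10	/* mirrors the "+ 10" in the check above */

enum flush_action
choose_flush_action(int dirty_cnt, int dirtybufthresh)
{
	assert(dirty_cnt >= 0 && dirtybufthresh >= 0);

	/* Well past the watermark: sync the whole vnode. */
	if (dirty_cnt > dirtybufthresh + FLUSH_ALL_SLACK)
		return (FLUSH_ALL);
	/* Just past the watermark: push out a single dirty buffer. */
	if (dirty_cnt > dirtybufthresh)
		return (FLUSH_ONE);
	return (FLUSH_NONE);
}
```

Both comparisons are strict, so a count exactly equal to the threshold triggers no flush, and a count exactly ten above it still flushes only one buffer.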
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
* Delayed write. (Buffer is marked dirty). Do not bother writing
|
|
|
|
* anything if the buffer is marked invalid.
|
|
|
|
*
|
|
|
|
* Note that since the buffer must be completely valid, we can safely
|
|
|
|
* set B_CACHE. In fact, we have to set B_CACHE here rather than in
|
|
|
|
* biodone() in order to prevent getblk from writing the buffer
|
|
|
|
* out synchronously.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bdwrite(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2003-02-25 06:44:42 +00:00
|
|
|
struct thread *td = curthread;
|
|
|
|
struct vnode *vp;
|
2004-10-22 08:47:20 +00:00
|
|
|
struct bufobj *bo;
|
2003-02-25 06:44:42 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bdwrite(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2004-10-22 08:47:20 +00:00
|
|
|
KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp));
|
2013-02-16 14:51:30 +00:00
|
|
|
KASSERT((bp->b_flags & B_BARRIER) == 0,
|
|
|
|
("Barrier request in delayed write %p", bp));
|
1997-06-15 17:56:53 +00:00
|
|
|
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
if (bp->b_flags & B_INVAL) {
|
1994-05-25 09:21:21 +00:00
|
|
|
brelse(bp);
|
|
|
|
return;
|
|
|
|
}
|
1995-05-21 21:39:31 +00:00
|
|
|
|
2003-02-25 06:44:42 +00:00
|
|
|
/*
|
|
|
|
* If we have too many dirty buffers, don't create any more.
|
|
|
|
* If we are wildly over our limit, then force a complete
|
|
|
|
* cleanup. Otherwise, just keep the situation from getting
|
2003-02-25 23:59:09 +00:00
|
|
|
* out of control. Note that we have to avoid a recursive
|
|
|
|
* disaster and not try to clean up after our own cleanup!
|
2003-02-25 06:44:42 +00:00
|
|
|
*/
|
|
|
|
vp = bp->b_vp;
|
2004-10-22 08:47:20 +00:00
|
|
|
bo = bp->b_bufobj;
|
2007-04-24 10:59:21 +00:00
|
|
|
if ((td->td_pflags & (TDP_COWINPROGRESS|TDP_INBDFLUSH)) == 0) {
|
|
|
|
td->td_pflags |= TDP_INBDFLUSH;
|
2007-01-23 10:01:19 +00:00
|
|
|
BO_BDFLUSH(bo, bp);
|
2007-04-24 10:59:21 +00:00
|
|
|
td->td_pflags &= ~TDP_INBDFLUSH;
|
|
|
|
} else
|
2005-01-24 10:47:04 +00:00
|
|
|
recursiveflushes++;
|
2003-02-25 06:44:42 +00:00
|
|
|
|
|
|
|
bdirty(bp);
|
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* Set B_CACHE, indicating that the buffer is fully valid. This is
|
|
|
|
* true even of NFS now.
|
|
|
|
*/
|
|
|
|
bp->b_flags |= B_CACHE;
|
|
|
|
|
1995-05-21 21:39:31 +00:00
|
|
|
/*
|
|
|
|
* This bmap keeps the system from needing to do the bmap later,
|
|
|
|
* perhaps when the system is attempting to do a sync. Since it
|
|
|
|
* is likely that the indirect block -- or whatever other datastructure
|
|
|
|
* that the filesystem needs is still in memory now, it is a good
|
|
|
|
* thing to do this. Note also, that if the pageout daemon is
|
|
|
|
* requesting a sync -- there might not be enough memory to do
|
|
|
|
* the bmap then... So, this is important to do.
|
|
|
|
*/
|
2003-02-25 06:44:42 +00:00
|
|
|
if (vp->v_type != VCHR && bp->b_lblkno == bp->b_blkno) {
|
|
|
|
VOP_BMAP(vp, bp->b_lblkno, NULL, &bp->b_blkno, NULL, NULL);
|
1995-03-03 22:13:16 +00:00
|
|
|
}
|
1995-05-21 21:39:31 +00:00
|
|
|
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
|
|
|
|
1995-05-21 21:39:31 +00:00
|
|
|
/*
|
2010-06-08 17:54:28 +00:00
|
|
|
* Set the *dirty* buffer range based upon the VM system dirty
|
|
|
|
* pages.
|
|
|
|
*
|
|
|
|
* Mark the buffer pages as clean. We need to do this here to
|
|
|
|
* satisfy the vnode_pager and the pageout daemon, so that it
|
|
|
|
* thinks that the pages have been "cleaned". Note that since
|
|
|
|
* the pages are in a delayed write buffer -- the VFS layer
|
|
|
|
* "will" see that the pages get written out on the next sync,
|
|
|
|
* or perhaps the cluster will be completed.
|
1995-05-21 21:39:31 +00:00
|
|
|
*/
|
2010-06-08 17:54:28 +00:00
|
|
|
vfs_clean_pages_dirty_buf(bp);
|
1996-01-19 04:00:31 +00:00
|
|
|
bqrelse(bp);
|
1997-06-15 17:56:53 +00:00
|
|
|
|
1999-07-08 06:06:00 +00:00
|
|
|
/*
|
|
|
|
* note: we cannot initiate I/O from a bdwrite even if we wanted to,
|
|
|
|
* due to the softdep code.
|
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
}
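The recursion guard in bdwrite() (set a per-thread flag, flush, clear the flag, and count the calls that arrive while the flag is set) can be sketched in user space. This is a simplified model under stated assumptions: `struct thread_model`, `FLAG_INFLUSH`, `flush_model`, and `delayed_write` are hypothetical names, the flush is simulated by a function that may call back into the write path, and the TDP_COWINPROGRESS half of the real test is omitted.

```c
#include <assert.h>

#define FLAG_INFLUSH 0x1u	/* hypothetical stand-in for TDP_INBDFLUSH */

/* Hypothetical per-thread state; not a kernel structure. */
struct thread_model {
	unsigned pflags;	/* private flags, touched only by the owner */
	int flushes;		/* flushes actually started */
	int recursive_skips;	/* nested calls that skipped the flush */
};

void delayed_write(struct thread_model *td, int depth);

/* Simulated flush: may re-enter the delayed-write path while running. */
static void
flush_model(struct thread_model *td, int depth)
{
	td->flushes++;
	if (depth > 0)
		delayed_write(td, depth - 1);
}

void
delayed_write(struct thread_model *td, int depth)
{
	if ((td->pflags & FLAG_INFLUSH) == 0) {
		td->pflags |= FLAG_INFLUSH;
		flush_model(td, depth);
		td->pflags &= ~FLAG_INFLUSH;
	} else
		td->recursive_skips++;	/* already flushing: do not recurse */
}
```

Because the flag lives in per-thread state, the sketch needs no locking. A nested call sees the flag, bumps the counter, and returns, so the flush runs once however deep the re-entry would otherwise go.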
|
|
|
|
|
1998-03-08 09:59:44 +00:00
|
|
|
/*
|
1999-03-12 02:24:58 +00:00
|
|
|
* bdirty:
|
|
|
|
*
|
2000-03-20 10:44:49 +00:00
|
|
|
* Turn buffer into delayed write request. We must clear BIO_READ and
|
1999-03-12 02:24:58 +00:00
|
|
|
* B_RELBUF, and we must set B_DELWRI. We reassign the buffer to
|
|
|
|
* itself to properly update it in the dirty/clean lists. We mark it
|
|
|
|
* B_DONE to ensure that any asynchronization of the buffer properly
|
1999-05-02 23:57:16 +00:00
|
|
|
* clears B_DONE ( else a panic will occur later ).
|
|
|
|
*
|
|
|
|
* bdirty() is kinda like bdwrite() - we have to clear B_INVAL which
|
|
|
|
* might have been set pre-getblk(). Unlike bwrite/bdwrite, bdirty()
|
|
|
|
* should only be called if the buffer is known-good.
|
1999-03-12 02:24:58 +00:00
|
|
|
*
|
|
|
|
* Since the buffer is not on a queue, we do not update the numfreebuffers
|
|
|
|
* count.
|
|
|
|
*
|
|
|
|
* The buffer must be on QUEUE_NONE.
|
1998-03-08 09:59:44 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bdirty(struct buf *bp)
|
1998-03-08 09:59:44 +00:00
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bdirty(%p) vp %p flags %X",
|
|
|
|
bp, bp->b_vp, bp->b_flags);
|
2004-10-22 08:47:20 +00:00
|
|
|
KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp));
|
2004-11-18 08:44:09 +00:00
|
|
|
KASSERT(bp->b_flags & B_REMFREE || bp->b_qindex == QUEUE_NONE,
|
2002-03-05 15:38:49 +00:00
|
|
|
("bdirty: buffer %p still on queue %d", bp, bp->b_qindex));
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_flags &= ~(B_RELBUF);
|
|
|
|
bp->b_iocmd = BIO_WRITE;
|
1999-03-12 02:24:58 +00:00
|
|
|
|
1998-03-08 09:59:44 +00:00
|
|
|
if ((bp->b_flags & B_DELWRI) == 0) {
|
2005-01-24 10:47:04 +00:00
|
|
|
bp->b_flags |= /* XXX B_DONE | */ B_DELWRI;
|
2004-07-25 21:24:23 +00:00
|
|
|
reassignbuf(bp);
|
2018-03-17 18:14:49 +00:00
|
|
|
bdirtyadd(bp);
|
1998-03-08 09:59:44 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1999-03-12 02:24:58 +00:00
|
|
|
* bundirty:
|
|
|
|
*
|
|
|
|
* Clear B_DELWRI for buffer.
|
|
|
|
*
|
|
|
|
* Since the buffer is not on a queue, we do not update the numfreebuffers
|
|
|
|
* count.
|
2020-07-10 09:01:36 +00:00
|
|
|
*
|
1999-03-12 02:24:58 +00:00
|
|
|
* The buffer must be on QUEUE_NONE.
|
|
|
|
*/
|
|
|
|
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bundirty(struct buf *bp)
|
1999-03-12 02:24:58 +00:00
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bundirty(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2004-10-22 08:47:20 +00:00
|
|
|
KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp));
|
2004-11-18 08:44:09 +00:00
|
|
|
KASSERT(bp->b_flags & B_REMFREE || bp->b_qindex == QUEUE_NONE,
|
2002-03-05 15:38:49 +00:00
|
|
|
("bundirty: buffer %p still on queue %d", bp, bp->b_qindex));
|
1999-03-12 02:24:58 +00:00
|
|
|
|
|
|
|
if (bp->b_flags & B_DELWRI) {
|
|
|
|
bp->b_flags &= ~B_DELWRI;
|
2004-07-25 21:24:23 +00:00
|
|
|
reassignbuf(bp);
|
2018-03-17 18:14:49 +00:00
|
|
|
bdirtysub(bp);
|
1999-03-12 02:24:58 +00:00
|
|
|
}
|
2000-01-10 00:24:24 +00:00
|
|
|
/*
|
|
|
|
* Since it is now being written, we can clear its deferred write flag.
|
|
|
|
*/
|
|
|
|
bp->b_flags &= ~B_DEFERRED;
|
1999-03-12 02:24:58 +00:00
|
|
|
}
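bdirty() and bundirty() change the dirty accounting only when the B_DELWRI flag actually changes state, so repeated calls are harmless. The sketch below models that idempotence in user space. It assumes a single counter stands in for the bdirtyadd()/bdirtysub() bookkeeping, and `struct buf_model`, `struct dirty_domain`, `MB_DELWRI`, and `MB_DEFERRED` are hypothetical names; the reassignbuf() step is left out.

```c
#include <assert.h>

#define MB_DELWRI	0x1u	/* hypothetical stand-in for B_DELWRI */
#define MB_DEFERRED	0x2u	/* hypothetical stand-in for B_DEFERRED */

/* Hypothetical accounting domain: just a dirty buffer count. */
struct dirty_domain {
	int numdirty;
};

struct buf_model {
	unsigned flags;
	struct dirty_domain *dom;
};

/* Mark the buffer delayed-write; count it only on the 0 -> 1 transition. */
void
model_bdirty(struct buf_model *bp)
{
	if ((bp->flags & MB_DELWRI) == 0) {
		bp->flags |= MB_DELWRI;
		bp->dom->numdirty++;
	}
}

/* Clear delayed-write; uncount it only on the 1 -> 0 transition. */
void
model_bundirty(struct buf_model *bp)
{
	if (bp->flags & MB_DELWRI) {
		bp->flags &= ~MB_DELWRI;
		assert(bp->dom->numdirty > 0);
		bp->dom->numdirty--;
	}
	/* The deferred flag is cleared unconditionally, as in bundirty(). */
	bp->flags &= ~MB_DEFERRED;
}
```

Guarding the counter update with the flag test keeps the count equal to the number of buffers that currently carry the flag, regardless of how many times either routine is called.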
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bawrite:
|
|
|
|
*
|
|
|
|
* Asynchronous write. Start output on a buffer, but do not wait for
|
|
|
|
* it to complete. The buffer is released when the output completes.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* bwrite() ( or the VOP routine anyway ) is responsible for handling
|
|
|
|
* B_INVAL buffers. Not us.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bawrite(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
1995-02-22 09:30:13 +00:00
|
|
|
bp->b_flags |= B_ASYNC;
|
2004-03-11 18:02:36 +00:00
|
|
|
(void) bwrite(bp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2013-02-16 14:51:30 +00:00
|
|
|
/*
|
|
|
|
* babarrierwrite:
|
|
|
|
*
|
|
|
|
* Asynchronous barrier write. Start output on a buffer, but do not
|
|
|
|
* wait for it to complete. Place a write barrier after this write so
|
|
|
|
* that this buffer and all buffers written before it are committed to
|
|
|
|
* the disk before any buffers written after this write are committed
|
|
|
|
* to the disk. The buffer is released when the output completes.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
babarrierwrite(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
bp->b_flags |= B_ASYNC | B_BARRIER;
|
|
|
|
(void) bwrite(bp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* bbarrierwrite:
|
|
|
|
*
|
|
|
|
* Synchronous barrier write. Start output on a buffer and wait for
|
|
|
|
* it to complete. Place a write barrier after this write so that
|
|
|
|
* this buffer and all buffers written before it are committed to
|
|
|
|
* the disk before any buffers written after this write are committed
|
|
|
|
* to the disk. The buffer is released when the output completes.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
bbarrierwrite(struct buf *bp)
|
|
|
|
{
|
|
|
|
|
|
|
|
bp->b_flags |= B_BARRIER;
|
|
|
|
return (bwrite(bp));
|
|
|
|
}
|
|
|
|
|
1999-07-08 06:06:00 +00:00
|
|
|
/*
|
|
|
|
* bwillwrite:
|
|
|
|
*
|
|
|
|
* Called prior to the locking of any vnodes when we are expecting to
|
|
|
|
* write. We do not want to starve the buffer cache with too many
|
|
|
|
* dirty buffers so we block here. By blocking prior to the locking
|
|
|
|
* of any vnodes we attempt to avoid the situation where a locked vnode
|
|
|
|
* prevents the various system daemons from flushing related buffers.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
bwillwrite(void)
|
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
if (buf_dirty_count_severe()) {
|
2013-06-05 23:53:00 +00:00
|
|
|
mtx_lock(&bdirtylock);
|
2018-03-17 18:14:49 +00:00
|
|
|
while (buf_dirty_count_severe()) {
|
2013-06-05 23:53:00 +00:00
|
|
|
bdirtywait = 1;
|
|
|
|
msleep(&bdirtywait, &bdirtylock, (PRIBIO + 4),
|
|
|
|
"flswai", 0);
|
1999-07-08 06:06:00 +00:00
|
|
|
}
|
2013-06-05 23:53:00 +00:00
|
|
|
mtx_unlock(&bdirtylock);
|
1999-07-08 06:06:00 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
/*
|
|
|
|
* Return true if we have too many dirty buffers.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
buf_dirty_count_severe(void)
|
|
|
|
{
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
return (!BIT_EMPTY(BUF_DOMAINS, &bdhidirty));
|
2000-11-18 23:06:26 +00:00
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1999-03-12 02:24:58 +00:00
|
|
|
* brelse:
|
|
|
|
*
|
|
|
|
* Release a busy buffer and, if requested, free its resources. The
|
|
|
|
* buffer will be stashed in the appropriate bufqueue[] allowing it
|
|
|
|
* to be accessed later as a cache entity or reused for other purposes.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
brelse(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2018-03-14 22:11:45 +00:00
|
|
|
struct mount *v_mnt;
|
2013-06-05 23:53:00 +00:00
|
|
|
int qindex;
|
|
|
|
|
2016-01-27 21:23:01 +00:00
|
|
|
/*
|
2016-02-07 16:18:12 +00:00
|
|
|
* Many functions erroneously call brelse with a NULL bp under rare
|
2016-01-27 21:23:01 +00:00
|
|
|
* error conditions. Simply return when called with a NULL bp.
|
|
|
|
*/
|
|
|
|
if (bp == NULL)
|
|
|
|
return;
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "brelse(%p) vp %p flags %X",
|
|
|
|
bp, bp->b_vp, bp->b_flags);
|
2002-03-05 15:38:49 +00:00
|
|
|
KASSERT(!(bp->b_flags & (B_CLUSTER|B_PAGING)),
|
|
|
|
("brelse: inappropriate B_PAGING or B_CLUSTER bp %p", bp));
|
2015-09-30 23:06:29 +00:00
|
|
|
KASSERT((bp->b_flags & B_VMIO) != 0 || (bp->b_flags & B_NOREUSE) == 0,
|
|
|
|
("brelse: non-VMIO buffer marked NOREUSE"));
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2013-02-27 07:34:09 +00:00
|
|
|
if (BUF_LOCKRECURSED(bp)) {
|
|
|
|
/*
|
|
|
|
* Do not process, in particular, do not handle the
|
|
|
|
* B_INVAL/B_RELBUF and do not release to free list.
|
|
|
|
*/
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2005-12-07 03:39:08 +00:00
|
|
|
if (bp->b_flags & B_MANAGED) {
|
|
|
|
bqrelse(bp);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-01-30 02:10:34 +00:00
|
|
|
if (LIST_EMPTY(&bp->b_dep)) {
|
|
|
|
bp->b_flags &= ~B_IOSTARTED;
|
|
|
|
} else {
|
|
|
|
KASSERT((bp->b_flags & B_IOSTARTED) == 0,
|
|
|
|
("brelse: SU io not finished bp %p", bp));
|
|
|
|
}
|
|
|
|
|
Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp's lock. Add BV_BKGRDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks
2015-06-27 09:44:14 +00:00
|
|
|
if ((bp->b_vflags & (BV_BKGRDINPROG | BV_BKGRDERR)) == BV_BKGRDERR) {
|
|
|
|
BO_LOCK(bp->b_bufobj);
|
|
|
|
bp->b_vflags &= ~BV_BKGRDERR;
|
|
|
|
BO_UNLOCK(bp->b_bufobj);
|
|
|
|
bdirty(bp);
|
|
|
|
}
|
2019-09-11 21:24:14 +00:00
|
|
|
|
|
|
|
if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) &&
|
|
|
|
(bp->b_flags & B_INVALONERR)) {
|
|
|
|
/*
|
|
|
|
* Forced invalidation of dirty buffer contents, to be used
|
|
|
|
* after a failed write in the rare case that the loss of the
|
|
|
|
* contents is acceptable. The buffer is invalidated and
|
|
|
|
* freed.
|
|
|
|
*/
|
|
|
|
bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
|
|
|
|
bp->b_flags &= ~(B_ASYNC | B_CACHE);
|
|
|
|
}
|
|
|
|
|
2007-12-30 05:53:45 +00:00
|
|
|
if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) &&
|
2017-04-14 20:15:34 +00:00
|
|
|
(bp->b_error != ENXIO || !LIST_EMPTY(&bp->b_dep)) &&
|
2015-10-31 04:53:07 +00:00
|
|
|
!(bp->b_flags & B_INVAL)) {
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
2017-04-14 20:15:34 +00:00
|
|
|
* Failed write, redirty. All errors except ENXIO (which
|
2017-11-20 20:53:03 +00:00
|
|
|
* means the device is gone) are treated as being
|
|
|
|
* transient.
|
|
|
|
*
|
|
|
|
* XXX Treating EIO as transient is not correct; the
|
|
|
|
* contract with the local storage device drivers is that
|
|
|
|
* they will only return EIO once the I/O is no longer
|
|
|
|
* retriable. Network I/O also respects this through the
|
|
|
|
* guarantees of TCP and/or the internal retries of NFS.
|
|
|
|
* ENOMEM might be transient, but we also have no way of
|
|
|
|
* knowing when it's OK to retry/reschedule. In general,
|
|
|
|
* this entire case should be made obsolete through better
|
|
|
|
* error handling/recovery and resource scheduling.
|
2017-04-14 20:15:34 +00:00
|
|
|
*
|
|
|
|
* Do this also for buffers that failed with ENXIO, but have
|
|
|
|
* non-empty dependencies - the soft updates code might need
|
|
|
|
* to access the buffer to untangle them.
|
|
|
|
*
|
|
|
|
* Must clear BIO_ERROR to prevent pages from being scrapped.
|
1999-05-02 23:57:16 +00:00
|
|
|
*/
|
2000-04-02 15:24:56 +00:00
|
|
|
bp->b_ioflags &= ~BIO_ERROR;
|
1999-01-22 08:59:05 +00:00
|
|
|
bdirty(bp);
|
2000-04-02 15:24:56 +00:00
|
|
|
} else if ((bp->b_flags & (B_NOCACHE | B_INVAL)) ||
|
2004-09-13 06:50:42 +00:00
|
|
|
(bp->b_ioflags & BIO_ERROR) || (bp->b_bufsize <= 0)) {
|
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
2017-04-14 20:15:34 +00:00
|
|
|
* Either a failed read I/O, or we were asked to free or not
|
|
|
|
* cache the buffer, or we failed to write to a device that's
|
|
|
|
* no longer present.
|
1999-05-02 23:57:16 +00:00
|
|
|
*/
|
1994-05-25 09:21:21 +00:00
|
|
|
bp->b_flags |= B_INVAL;
|
2007-02-22 14:52:59 +00:00
|
|
|
if (!LIST_EMPTY(&bp->b_dep))
|
2000-06-16 08:48:51 +00:00
|
|
|
buf_deallocate(bp);
|
2013-06-05 23:53:00 +00:00
|
|
|
if (bp->b_flags & B_DELWRI)
|
2018-03-17 18:14:49 +00:00
|
|
|
bdirtysub(bp);
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_flags &= ~(B_DELWRI | B_CACHE);
|
This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
1998-03-07 21:37:31 +00:00
|
|
|
if ((bp->b_flags & B_VMIO) == 0) {
|
2015-09-27 05:16:06 +00:00
|
|
|
allocbuf(bp, 0);
|
1998-03-07 21:37:31 +00:00
|
|
|
if (bp->b_vp)
|
2008-03-28 12:30:12 +00:00
|
|
|
brelvp(bp);
|
1996-01-19 04:00:31 +00:00
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1995-05-30 08:16:23 +00:00
|
|
|
|
1998-09-26 00:12:35 +00:00
|
|
|
/*
|
2015-09-27 05:16:06 +00:00
|
|
|
* We must clear B_RELBUF if B_DELWRI is set. If vfs_vmio_truncate()
|
1998-09-26 00:12:35 +00:00
|
|
|
* is called with B_DELWRI set, the underlying pages may wind up
|
|
|
|
* getting freed causing a previous write (bdwrite()) to get 'lost'
|
|
|
|
* because pages associated with a B_DELWRI bp are marked clean.
|
|
|
|
*
|
2015-09-27 05:16:06 +00:00
|
|
|
* We still allow the B_INVAL case to call vfs_vmio_truncate(), even
|
1998-09-26 00:12:35 +00:00
|
|
|
* if B_DELWRI is set.
|
|
|
|
*/
|
|
|
|
if (bp->b_flags & B_DELWRI)
|
|
|
|
bp->b_flags &= ~B_RELBUF;
|
|
|
|
|
1995-02-22 09:16:07 +00:00
|
|
|
/*
|
|
|
|
* VMIO buffer rundown. It is not very necessary to keep a VMIO buffer
|
1999-05-02 23:57:16 +00:00
|
|
|
* constituted, not even NFS buffers now. Two flags affect this. If
|
|
|
|
* B_INVAL, the struct buf is invalidated but the VM object is kept
|
|
|
|
* around (i.e. so it is trivial to reconstitute the buffer later).
|
1997-05-19 14:36:56 +00:00
|
|
|
*
|
2000-04-02 15:24:56 +00:00
|
|
|
* If BIO_ERROR or B_NOCACHE is set, pages in the VM object will be
|
|
|
|
* invalidated. BIO_ERROR cannot be set for a failed write unless the
|
1999-05-02 23:57:16 +00:00
|
|
|
* buffer is also B_INVAL because it hits the re-dirtying code above.
|
1998-12-22 18:57:30 +00:00
|
|
|
*
|
1999-05-02 23:57:16 +00:00
|
|
|
* Normally we can do this whether a buffer is B_DELWRI or not. If
|
|
|
|
* the buffer is an NFS buffer, it is tracking piecemeal writes or
|
2000-01-10 00:24:24 +00:00
|
|
|
* the commit state and we cannot afford to lose the buffer. If the
|
|
|
|
* buffer has a background write in progress, we need to keep it
|
|
|
|
* around to prevent it from being reconstituted and starting a second
|
|
|
|
* background write.
|
1995-02-22 09:16:07 +00:00
|
|
|
*/
|
2018-03-14 22:11:45 +00:00
|
|
|
|
|
|
|
v_mnt = bp->b_vp != NULL ? bp->b_vp->v_mount : NULL;
|
|
|
|
|
2015-09-22 23:57:52 +00:00
|
|
|
if ((bp->b_flags & B_VMIO) && (bp->b_flags & B_NOCACHE ||
|
|
|
|
(bp->b_ioflags & BIO_ERROR && bp->b_iocmd == BIO_READ)) &&
|
2018-03-14 22:11:45 +00:00
|
|
|
(v_mnt == NULL || (v_mnt->mnt_vfc->vfc_flags & VFCF_NETWORK) == 0 ||
|
2020-08-19 02:51:17 +00:00
|
|
|
vn_isdisk(bp->b_vp) || (bp->b_flags & B_DELWRI) == 0)) {
|
2015-09-22 23:57:52 +00:00
|
|
|
vfs_vmio_invalidate(bp);
|
2015-09-27 05:16:06 +00:00
|
|
|
allocbuf(bp, 0);
|
|
|
|
}
|
2015-09-22 23:57:52 +00:00
|
|
|
|
2015-09-30 23:06:29 +00:00
|
|
|
if ((bp->b_flags & (B_INVAL | B_RELBUF)) != 0 ||
|
|
|
|
(bp->b_flags & (B_DELWRI | B_NOREUSE)) == B_NOREUSE) {
|
2015-09-27 05:16:06 +00:00
|
|
|
allocbuf(bp, 0);
|
2015-09-30 23:06:29 +00:00
|
|
|
bp->b_flags &= ~B_NOREUSE;
|
2006-02-02 21:37:39 +00:00
|
|
|
if (bp->b_vp != NULL)
|
2008-03-28 12:30:12 +00:00
|
|
|
brelvp(bp);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2020-07-10 09:01:36 +00:00
|
|
|
|
2009-03-17 16:30:49 +00:00
|
|
|
/*
|
|
|
|
* If the buffer has junk contents signal it and eventually
|
|
|
|
* clean up B_DELWRI and disassociate the vnode so that gbincore()
|
|
|
|
* doesn't find it.
|
|
|
|
*/
|
|
|
|
if (bp->b_bufsize == 0 || (bp->b_ioflags & BIO_ERROR) != 0 ||
|
|
|
|
(bp->b_flags & (B_INVAL | B_NOCACHE | B_RELBUF)) != 0)
|
|
|
|
bp->b_flags |= B_INVAL;
|
|
|
|
if (bp->b_flags & B_INVAL) {
|
|
|
|
if (bp->b_flags & B_DELWRI)
|
|
|
|
bundirty(bp);
|
|
|
|
if (bp->b_vp)
|
|
|
|
brelvp(bp);
|
|
|
|
}
|
|
|
|
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
|
|
|
|
1994-08-06 09:15:42 +00:00
|
|
|
/* buffers with no memory */
|
1995-01-09 16:06:02 +00:00
|
|
|
if (bp->b_bufsize == 0) {
|
2015-10-14 02:10:07 +00:00
|
|
|
buf_free(bp);
|
|
|
|
return;
|
|
|
|
}
|
1997-06-15 17:56:53 +00:00
|
|
|
/* buffers with junk contents */
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp->b_flags & (B_INVAL | B_NOCACHE | B_RELBUF) ||
|
2002-03-05 15:38:49 +00:00
|
|
|
(bp->b_ioflags & BIO_ERROR)) {
|
Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).
The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.
Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags to the high sixteen bits. This works because the high sixteen
bits of the IO_ word are reserved for read-ahead and help with
write clustering, so they will never be used for flags. This merge
lets us get away from code of the form:
if (ioflags & IO_SYNC)
flags |= BA_SYNC;
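A minimal sketch of the shared flags word, assuming hypothetical bit values: the EX_ constants and ex_ helpers below are invented for illustration and do not reproduce the kernel's IO_ or BA_ definitions. It shows only that two flag groups kept in disjoint halves of one word can be passed along without translation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: IO_-style flags live in the low sixteen bits... */
#define EX_IO_SYNC	0x00000080u
#define EX_IO_EXT	0x00000400u
/* ...and BA_-style flags live in the high sixteen bits of the same word. */
#define EX_BA_CLRBUF	0x00010000u

#define EX_IO_MASK	0x0000ffffu
#define EX_BA_MASK	0xffff0000u

/* Both groups travel in one word, so no IO_ to BA_ translation is needed. */
uint32_t
ex_combine(uint32_t ioflags, uint32_t baflags)
{
	return ((ioflags & EX_IO_MASK) | (baflags & EX_BA_MASK));
}

/* A callee can test the sync bit directly in the merged word. */
int
ex_wants_sync(uint32_t flags)
{
	return ((flags & EX_IO_SYNC) != 0);
}

/* Small usage check. */
void
ex_flags_demo(void)
{
	uint32_t w = ex_combine(EX_IO_SYNC, EX_BA_CLRBUF);

	assert(w == 0x00010080u);
	assert(ex_wants_sync(w));
}
```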
For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).
I am also contemplating adding a pathconf parameter (for
concreteness, let's call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
attribute storage.
Sponsored by: DARPA & NAI Labs.
2002-07-19 07:29:39 +00:00
|
|
|
bp->b_xflags &= ~(BX_BKGRDWRITE | BX_ALTDATA);
|
2003-08-28 06:55:18 +00:00
|
|
|
if (bp->b_vflags & BV_BKGRDINPROG)
|
2000-01-10 00:24:24 +00:00
|
|
|
panic("losing buffer 2");
|
2013-06-05 23:53:00 +00:00
|
|
|
qindex = QUEUE_CLEAN;
|
|
|
|
bp->b_flags |= B_AGE;
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
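The four-queue split described above can be sketched as a small classification function. This is an assumption-laden illustration, not the kernel's code: the EX_ names, the flag value, and the use of plain size arguments are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit and queue ids; the kernel's values differ. */
#define EX_B_DELWRI	0x1u

enum ex_queue {
	EX_QUEUE_EMPTY,		/* no memory and no KVA attached */
	EX_QUEUE_EMPTYKVA,	/* no memory, but a KVA assignment remains */
	EX_QUEUE_CLEAN,		/* contents can be discarded at once */
	EX_QUEUE_DIRTY		/* contents must be written before reuse */
};

/* Pick a queue from the buffer's dirty state, memory size and KVA size. */
enum ex_queue
ex_pick_queue(uint32_t flags, long bufsize, long kvasize)
{
	if (bufsize == 0)
		return (kvasize != 0 ? EX_QUEUE_EMPTYKVA : EX_QUEUE_EMPTY);
	if (flags & EX_B_DELWRI)
		return (EX_QUEUE_DIRTY);
	return (EX_QUEUE_CLEAN);
}

/* Small usage check. */
void
ex_queue_demo(void)
{
	assert(ex_pick_queue(0, 0, 0) == EX_QUEUE_EMPTY);
	assert(ex_pick_queue(EX_B_DELWRI, 8192, 8192) == EX_QUEUE_DIRTY);
}
```

Keeping clean and dirty buffers on separate queues is what lets an allocator look only at the queue that can satisfy its request.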
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
|
|
|
/* remaining buffers */
|
2013-06-05 23:53:00 +00:00
|
|
|
} else if (bp->b_flags & B_DELWRI)
|
|
|
|
qindex = QUEUE_DIRTY;
|
|
|
|
else
|
|
|
|
qindex = QUEUE_CLEAN;
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2000-07-11 22:07:57 +00:00
|
|
|
if ((bp->b_flags & B_DELWRI) == 0 && (bp->b_xflags & BX_VNDIRTY))
|
|
|
|
panic("brelse: not dirty");
|
2018-02-20 00:06:07 +00:00
|
|
|
|
|
|
|
bp->b_flags &= ~(B_ASYNC | B_NOCACHE | B_RELBUF | B_DIRECT);
|
This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.
The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.
There are two cases for disk I/O errors:
- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.
- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.
For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
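The two error cases above can be sketched as follows. This is a user-space illustration under stated assumptions: struct ex_fs_state and ex_handle_write_error() are hypothetical names, and the two booleans merely stand in for the geom_vfs rejection and the asynchronous forced unmount that the text describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-filesystem state; not a kernel structure. */
struct ex_fs_state {
	bool reject_future_io;	/* lower layer should fail later I/O with ENXIO */
	bool unmount_requested;	/* asynchronous forced unmount was requested */
};

/*
 * Handle the result of a write.  Returns the error reported to the file
 * system code, which is always 0 because the error is cleared.
 */
int
ex_handle_write_error(struct ex_fs_state *fs, int error)
{
	if (error == 0)
		return (0);
	/* ENXIO: the disk is gone, nothing more can reach it. */
	if (error != ENXIO)
		fs->reject_future_io = true;	/* EIO etc.: later I/O might succeed */
	fs->unmount_requested = true;
	return (0);
}

/* Small usage check. */
void
ex_error_demo(void)
{
	struct ex_fs_state fs = { false, false };

	assert(ex_handle_write_error(&fs, EIO) == 0);
	assert(fs.reject_future_io && fs.unmount_requested);
}
```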
This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.
Clearing some dependencies requires a read. For example, if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.
Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.
The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.
Coauthored by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088
2020-05-25 23:47:31 +00:00
|
|
|
bp->b_xflags &= ~(BX_CVTENXIO);
|
2018-02-20 00:06:07 +00:00
|
|
|
/* binsfree unlocks bp. */
|
|
|
|
binsfree(bp, qindex);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
|
|
|
|
1996-01-19 04:00:31 +00:00
|
|
|
/*
|
1999-03-12 02:24:58 +00:00
|
|
|
* Release a buffer back to the appropriate queue but do not try to free
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block, they continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
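The hazard behind the ~PAGE_MASK casts can be shown with a small sketch. It assumes a 32-bit unsigned int and uses a hypothetical unsigned mask, EX_PAGE_MASK, chosen to expose the problem; with a signed mask the inverted value is sign-extended, which is why the older code happened to work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4 KiB page mask, deliberately unsigned to show the hazard. */
#define EX_PAGE_MASK	0xfffu

/* Inverts in 32 bits, then zero-extends: the upper 32 offset bits are lost. */
int64_t
ex_trunc_narrow(int64_t off)
{
	return (off & ~EX_PAGE_MASK);
}

/* Widens the mask to 64 bits first, then inverts: the upper bits survive. */
int64_t
ex_trunc_wide(int64_t off)
{
	return (off & ~(int64_t)EX_PAGE_MASK);
}

/* Small usage check with an offset above 4 GiB. */
void
ex_mask_demo(void)
{
	assert(ex_trunc_wide(0x100001234LL) == 0x100001000LL);
	assert(ex_trunc_narrow(0x100001234LL) == 0x1000LL);
}
```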
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often; it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
* it. The buffer is expected to be used again soon.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* bqrelse() is used by bdwrite() to requeue a delayed write, and used by
|
|
|
|
* biodone() to requeue an async I/O on completion. It is also used when
|
|
|
|
* known good buffers need to be requeued but we think we may need the data
|
|
|
|
* again soon.
|
2001-05-24 07:22:27 +00:00
|
|
|
*
|
|
|
|
* XXX we should be able to leave the B_RELBUF hint set on completion.
|
1996-01-19 04:00:31 +00:00
|
|
|
*/
|
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
bqrelse(struct buf *bp)
|
1996-01-19 04:00:31 +00:00
|
|
|
{
|
2013-06-05 23:53:00 +00:00
|
|
|
int qindex;
|
2010-08-12 08:36:23 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bqrelse(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2004-09-15 20:54:23 +00:00
|
|
|
KASSERT(!(bp->b_flags & (B_CLUSTER|B_PAGING)),
|
|
|
|
("bqrelse: inappropriate B_PAGING or B_CLUSTER bp %p", bp));
|
1996-01-19 04:00:31 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
qindex = QUEUE_NONE;
|
2008-01-19 17:36:23 +00:00
|
|
|
if (BUF_LOCKRECURSED(bp)) {
|
1999-06-26 02:47:16 +00:00
|
|
|
/* do not release to free list */
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
return;
|
|
|
|
}
|
2013-06-05 23:53:00 +00:00
|
|
|
bp->b_flags &= ~(B_ASYNC | B_NOCACHE | B_AGE | B_RELBUF);
|
2020-05-25 23:47:31 +00:00
|
|
|
bp->b_xflags &= ~(BX_CVTENXIO);
|
2005-12-07 03:39:08 +00:00
|
|
|
|
2021-01-30 02:10:34 +00:00
|
|
|
if (LIST_EMPTY(&bp->b_dep)) {
|
|
|
|
bp->b_flags &= ~B_IOSTARTED;
|
|
|
|
} else {
|
|
|
|
KASSERT((bp->b_flags & B_IOSTARTED) == 0,
|
|
|
|
("bqrelse: SU io not finished bp %p", bp));
|
|
|
|
}
|
|
|
|
|
2005-12-07 03:39:08 +00:00
|
|
|
if (bp->b_flags & B_MANAGED) {
|
2013-06-05 23:53:00 +00:00
|
|
|
if (bp->b_flags & B_REMFREE)
|
|
|
|
bremfreef(bp);
|
|
|
|
goto out;
|
2005-12-07 03:39:08 +00:00
|
|
|
}
|
|
|
|
|
2003-08-28 06:55:18 +00:00
|
|
|
/* buffers with stale but valid contents */
|
Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is a lost write, and if
the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp's lock. Add the BV_BKGRDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks
2015-06-27 09:44:14 +00:00
|
|
|
if ((bp->b_flags & B_DELWRI) != 0 || (bp->b_vflags & (BV_BKGRDINPROG |
|
|
|
|
BV_BKGRDERR)) == BV_BKGRDERR) {
|
|
|
|
BO_LOCK(bp->b_bufobj);
|
|
|
|
bp->b_vflags &= ~BV_BKGRDERR;
|
|
|
|
BO_UNLOCK(bp->b_bufobj);
|
2013-06-05 23:53:00 +00:00
|
|
|
qindex = QUEUE_DIRTY;
|
2003-08-28 06:55:18 +00:00
|
|
|
} else {
|
2013-06-05 23:53:00 +00:00
|
|
|
if ((bp->b_flags & B_DELWRI) == 0 &&
|
|
|
|
(bp->b_xflags & BX_VNDIRTY))
|
|
|
|
panic("bqrelse: not dirty");
|
2015-09-30 23:06:29 +00:00
|
|
|
if ((bp->b_flags & B_NOREUSE) != 0) {
|
|
|
|
brelse(bp);
|
|
|
|
return;
|
|
|
|
}
|
2013-06-05 23:53:00 +00:00
|
|
|
qindex = QUEUE_CLEAN;
|
1997-06-15 17:56:53 +00:00
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
buf_track(bp, __func__);
|
|
|
|
/* binsfree unlocks bp. */
|
2013-06-05 23:53:00 +00:00
|
|
|
binsfree(bp, qindex);
|
2018-02-20 00:06:07 +00:00
|
|
|
return;
|
1996-01-19 04:00:31 +00:00
|
|
|
|
2013-06-05 23:53:00 +00:00
|
|
|
out:
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
2003-02-09 09:47:31 +00:00
|
|
|
/* unlock */
|
|
|
|
BUF_UNLOCK(bp);
|
1996-01-19 04:00:31 +00:00
|
|
|
}
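The queue selection made by bqrelse() above can be sketched in userspace. This is an illustration only, not kernel code: the SK_* bit values and the names enum sk_queue and pick_queue() are hypothetical placeholders, the B_MANAGED early exit is left out, and an assert() stands in for the kernel's panic("bqrelse: not dirty").

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for this sketch; NOT the kernel's definitions. */
#define SK_B_DELWRI		0x01u
#define SK_B_NOREUSE		0x02u
#define SK_BV_BKGRDINPROG	0x01u
#define SK_BV_BKGRDERR		0x02u
#define SK_BX_VNDIRTY		0x01u

/* Hypothetical result type: which path the buffer would take. */
enum sk_queue { SK_QUEUE_DIRTY, SK_QUEUE_CLEAN, SK_VIA_BRELSE };

static enum sk_queue
pick_queue(uint32_t b_flags, uint32_t b_vflags, uint32_t b_xflags)
{
	/*
	 * Dirty if a delayed write is pending, or if a background write
	 * failed (BKGRDERR set) and is no longer in progress.
	 */
	if ((b_flags & SK_B_DELWRI) != 0 ||
	    (b_vflags & (SK_BV_BKGRDINPROG | SK_BV_BKGRDERR)) ==
	    SK_BV_BKGRDERR)
		return (SK_QUEUE_DIRTY);

	/* A clean buffer must not still sit on the vnode's dirty list. */
	assert((b_xflags & SK_BX_VNDIRTY) == 0);

	/* B_NOREUSE buffers are handed to brelse() instead of being queued. */
	if ((b_flags & SK_B_NOREUSE) != 0)
		return (SK_VIA_BRELSE);
	return (SK_QUEUE_CLEAN);
}
```

Note that a buffer with both background-write bits set is treated as clean here, because the masked comparison requires BKGRDERR alone.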
|
|
|
|
|
2015-09-22 23:57:52 +00:00
|
|
|
/*
|
|
|
|
* Complete I/O to a VMIO backed page. Validate the pages as appropriate,
|
|
|
|
* restore bogus pages.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_vmio_iodone(struct buf *bp)
|
|
|
|
{
|
|
|
|
vm_ooffset_t foff;
|
|
|
|
vm_page_t m;
|
|
|
|
vm_object_t obj;
|
2018-05-19 04:59:39 +00:00
|
|
|
struct vnode *vp __unused;
|
2017-03-19 23:06:11 +00:00
|
|
|
int i, iosize, resid;
|
|
|
|
bool bogus;
|
2015-09-22 23:57:52 +00:00
|
|
|
|
|
|
|
obj = bp->b_bufobj->bo_object;
|
2020-02-28 16:05:18 +00:00
|
|
|
KASSERT(blockcount_read(&obj->paging_in_progress) >= bp->b_npages,
|
2015-09-22 23:57:52 +00:00
|
|
|
("vfs_vmio_iodone: paging in progress(%d) < b_npages(%d)",
|
2020-02-28 16:05:18 +00:00
|
|
|
blockcount_read(&obj->paging_in_progress), bp->b_npages));
|
2015-09-22 23:57:52 +00:00
|
|
|
|
|
|
|
vp = bp->b_vp;
|
2020-02-03 14:25:32 +00:00
|
|
|
VNPASS(vp->v_holdcnt > 0, vp);
|
|
|
|
VNPASS(vp->v_object != NULL, vp);
|
2015-09-22 23:57:52 +00:00
|
|
|
|
|
|
|
foff = bp->b_offset;
|
|
|
|
KASSERT(bp->b_offset != NOOFFSET,
|
|
|
|
("vfs_vmio_iodone: bp %p has no buffer offset", bp));
|
|
|
|
|
2017-03-19 23:06:11 +00:00
|
|
|
bogus = false;
|
2015-09-22 23:57:52 +00:00
|
|
|
iosize = bp->b_bcount - bp->b_resid;
|
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
resid = ((foff + PAGE_SIZE) & ~(off_t)PAGE_MASK) - foff;
|
|
|
|
if (resid > iosize)
|
|
|
|
resid = iosize;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cleanup bogus pages, restoring the originals
|
|
|
|
*/
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
if (m == bogus_page) {
|
2020-02-28 21:42:48 +00:00
|
|
|
bogus = true;
|
|
|
|
m = vm_page_relookup(obj, OFF_TO_IDX(foff));
|
2015-09-22 23:57:52 +00:00
|
|
|
if (m == NULL)
|
|
|
|
panic("biodone: page disappeared!");
|
|
|
|
bp->b_pages[i] = m;
|
|
|
|
} else if ((bp->b_iocmd == BIO_READ) && resid > 0) {
|
|
|
|
/*
|
|
|
|
* In the write case, the valid and clean bits are
|
|
|
|
* already changed correctly ( see bdwrite() ), so we
|
|
|
|
* only need to do this here in the read case.
|
|
|
|
*/
|
|
|
|
KASSERT((m->dirty & vm_page_bits(foff & PAGE_MASK,
|
|
|
|
resid)) == 0, ("vfs_vmio_iodone: page %p "
|
|
|
|
"has unexpected dirty bits", m));
|
|
|
|
vfs_page_set_valid(bp, foff, m);
|
|
|
|
}
|
|
|
|
KASSERT(OFF_TO_IDX(foff) == m->pindex,
|
|
|
|
("vfs_vmio_iodone: foff(%jd)/pindex(%ju) mismatch",
|
|
|
|
(intmax_t)foff, (uintmax_t)m->pindex));
|
|
|
|
|
|
|
|
vm_page_sunbusy(m);
|
|
|
|
foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK;
|
|
|
|
iosize -= resid;
|
|
|
|
}
|
2015-10-03 17:04:52 +00:00
|
|
|
vm_object_pip_wakeupn(obj, bp->b_npages);
|
2015-09-22 23:57:52 +00:00
|
|
|
if (bogus && buf_mapped(bp)) {
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
pmap_qenter(trunc_page((vm_offset_t)bp->b_data),
|
|
|
|
bp->b_pages, bp->b_npages);
|
|
|
|
}
|
|
|
|
}
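The per-page byte accounting in the loop of vfs_vmio_iodone() above can be tried out in userspace. The following is a minimal sketch, not kernel code: a 4096-byte page is assumed, and the helper names page_resid() and next_page_start() are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry; the kernel takes these from machine headers. */
#define SK_PAGE_SIZE	4096
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: bytes of the remaining transfer that land in the
 * page containing foff, capped by the bytes still to account (iosize).
 */
static int
page_resid(int64_t foff, int iosize)
{
	int resid;

	assert(foff >= 0 && iosize >= 0);
	resid = (int)(((foff + SK_PAGE_SIZE) & ~(int64_t)SK_PAGE_MASK) - foff);
	if (resid > iosize)
		resid = iosize;
	return (resid);
}

/* Hypothetical helper: start of the page after the one that holds foff. */
static int64_t
next_page_start(int64_t foff)
{
	return ((foff + SK_PAGE_SIZE) & ~(int64_t)SK_PAGE_MASK);
}
```

For a transfer of 8192 bytes starting at file offset 6144, stepping with these helpers accounts 2048, 4096 and 2048 bytes to three consecutive pages.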
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Perform page invalidation when a buffer is released. The fully invalid
|
2015-09-27 05:16:06 +00:00
|
|
|
* pages will be reclaimed later in vfs_vmio_truncate().
|
2015-09-22 23:57:52 +00:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_vmio_invalidate(struct buf *bp)
|
|
|
|
{
|
|
|
|
vm_object_t obj;
|
|
|
|
vm_page_t m;
|
2019-07-29 22:01:28 +00:00
|
|
|
int flags, i, resid, poffset, presid;
|
2015-09-22 23:57:52 +00:00
|
|
|
|
2015-09-27 05:16:06 +00:00
|
|
|
if (buf_mapped(bp)) {
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
pmap_qremove(trunc_page((vm_offset_t)bp->b_data), bp->b_npages);
|
|
|
|
} else
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
2015-09-22 23:57:52 +00:00
|
|
|
/*
|
|
|
|
* Get the base offset and length of the buffer. Note that
|
|
|
|
* in the VMIO case if the buffer block size is not
|
|
|
|
* page-aligned then b_data pointer may not be page-aligned.
|
|
|
|
* But our b_pages[] array *IS* page aligned.
|
|
|
|
*
|
|
|
|
* block sizes less than DEV_BSIZE (usually 512) are not
|
|
|
|
* supported due to the page granularity bits (m->valid,
|
|
|
|
* m->dirty, etc...).
|
|
|
|
*
|
|
|
|
* See man buf(9) for more information
|
|
|
|
*/
|
2019-07-29 22:01:28 +00:00
|
|
|
flags = (bp->b_flags & B_NOREUSE) != 0 ? VPR_NOREUSE : 0;
|
2015-09-22 23:57:52 +00:00
|
|
|
obj = bp->b_bufobj->bo_object;
|
|
|
|
resid = bp->b_bufsize;
|
|
|
|
poffset = bp->b_offset & PAGE_MASK;
|
|
|
|
VM_OBJECT_WLOCK(obj);
|
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
if (m == bogus_page)
|
|
|
|
panic("vfs_vmio_invalidate: Unexpected bogus page.");
|
2015-09-27 05:16:06 +00:00
|
|
|
bp->b_pages[i] = NULL;
|
2015-09-22 23:57:52 +00:00
|
|
|
|
2015-09-23 07:44:07 +00:00
|
|
|
presid = resid > (PAGE_SIZE - poffset) ?
|
|
|
|
(PAGE_SIZE - poffset) : resid;
|
2015-09-22 23:57:52 +00:00
|
|
|
KASSERT(presid >= 0, ("brelse: extra page"));
|
2019-10-15 03:35:11 +00:00
|
|
|
vm_page_busy_acquire(m, VM_ALLOC_SBUSY);
|
2015-09-22 23:57:52 +00:00
|
|
|
if (pmap_page_wired_mappings(m) == 0)
|
|
|
|
vm_page_set_invalid(m, poffset, presid);
|
2019-10-15 03:35:11 +00:00
|
|
|
vm_page_sunbusy(m);
|
2019-07-29 22:01:28 +00:00
|
|
|
vm_page_release_locked(m, flags);
|
2015-09-22 23:57:52 +00:00
|
|
|
resid -= presid;
|
|
|
|
poffset = 0;
|
|
|
|
}
|
|
|
|
VM_OBJECT_WUNLOCK(obj);
|
1996-01-19 04:00:31 +00:00
|
|
|
bp->b_npages = 0;
|
2015-09-22 23:57:52 +00:00
|
|
|
}
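The resid/poffset/presid arithmetic in vfs_vmio_invalidate() above decides how many bytes of each page are invalidated. A minimal userspace sketch follows, assuming a 4096-byte page; the helper name invalidate_spans() and its output array are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry; the kernel takes these from machine headers. */
#define SK_PAGE_SIZE	4096
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: fill presid[i] with the byte count that would be
 * invalidated in page i. Returns the bytes left over after npages pages.
 */
static int
invalidate_spans(int64_t b_offset, int b_bufsize, int npages, int *presid)
{
	int i, poffset, resid;

	resid = b_bufsize;
	poffset = (int)(b_offset & SK_PAGE_MASK);
	for (i = 0; i < npages; i++) {
		presid[i] = resid > (SK_PAGE_SIZE - poffset) ?
		    (SK_PAGE_SIZE - poffset) : resid;
		assert(presid[i] >= 0);
		resid -= presid[i];
		poffset = 0;	/* only the first page can start mid-page */
	}
	return (resid);
}
```

With a buffer of 8192 bytes whose offset lies 1024 bytes into a page, the three pages receive spans of 3072, 4096 and 1024 bytes.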
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Page-granular truncation of an existing VMIO buffer.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_vmio_truncate(struct buf *bp, int desiredpages)
|
|
|
|
{
|
2015-09-27 05:16:06 +00:00
|
|
|
vm_object_t obj;
|
2015-09-22 23:57:52 +00:00
|
|
|
vm_page_t m;
|
2019-07-29 22:01:28 +00:00
|
|
|
int flags, i;
|
2015-09-22 23:57:52 +00:00
|
|
|
|
|
|
|
if (bp->b_npages == desiredpages)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (buf_mapped(bp)) {
|
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
pmap_qremove((vm_offset_t)trunc_page((vm_offset_t)bp->b_data) +
|
|
|
|
(desiredpages << PAGE_SHIFT), bp->b_npages - desiredpages);
|
|
|
|
} else
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
2018-03-21 21:15:43 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The object lock is needed only if we will attempt to free pages.
|
|
|
|
*/
|
2019-07-29 22:01:28 +00:00
|
|
|
flags = (bp->b_flags & B_NOREUSE) != 0 ? VPR_NOREUSE : 0;
|
|
|
|
if ((bp->b_flags & B_DIRECT) != 0) {
|
|
|
|
flags |= VPR_TRYFREE;
|
|
|
|
obj = bp->b_bufobj->bo_object;
|
2015-09-27 05:16:06 +00:00
|
|
|
VM_OBJECT_WLOCK(obj);
|
2019-07-29 22:01:28 +00:00
|
|
|
} else {
|
|
|
|
obj = NULL;
|
|
|
|
}
|
2015-09-22 23:57:52 +00:00
|
|
|
for (i = desiredpages; i < bp->b_npages; i++) {
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
KASSERT(m != bogus_page, ("allocbuf: bogus page found"));
|
|
|
|
bp->b_pages[i] = NULL;
|
2019-07-29 22:01:28 +00:00
|
|
|
if (obj != NULL)
|
|
|
|
vm_page_release_locked(m, flags);
|
|
|
|
else
|
|
|
|
vm_page_release(m, flags);
|
2015-09-22 23:57:52 +00:00
|
|
|
}
|
2015-09-27 05:16:06 +00:00
|
|
|
if (obj != NULL)
|
|
|
|
VM_OBJECT_WUNLOCK(obj);
|
2015-09-22 23:57:52 +00:00
|
|
|
bp->b_npages = desiredpages;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Byte granular extension of VMIO buffers.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_vmio_extend(struct buf *bp, int desiredpages, int size)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We are growing the buffer, possibly in a
|
|
|
|
* byte-granular fashion.
|
|
|
|
*/
|
|
|
|
vm_object_t obj;
|
|
|
|
vm_offset_t toff;
|
|
|
|
vm_offset_t tinc;
|
|
|
|
vm_page_t m;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Step 1, bring in the VM pages from the object, allocating
|
|
|
|
* them if necessary. We must clear B_CACHE if these pages
|
|
|
|
* are not valid for the range covered by the buffer.
|
|
|
|
*/
|
|
|
|
obj = bp->b_bufobj->bo_object;
|
2017-08-09 04:23:04 +00:00
|
|
|
if (bp->b_npages < desiredpages) {
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
KASSERT(desiredpages <= atop(maxbcachebuf),
|
|
|
|
("vfs_vmio_extend past maxbcachebuf %p %d %u",
|
|
|
|
bp, desiredpages, maxbcachebuf));
|
|
|
|
|
2015-09-22 23:57:52 +00:00
|
|
|
/*
|
|
|
|
* We must allocate system pages since blocking
|
|
|
|
* here could interfere with paging I/O, no
|
|
|
|
* matter which process we are.
|
|
|
|
*
|
|
|
|
* Only exclusive busy can be tested here.
|
|
|
|
* Blocking on shared busy might lead to
|
|
|
|
* deadlocks once allocbuf() is called after
|
|
|
|
* pages are vfs_busy_pages().
|
|
|
|
*/
|
2020-02-28 20:34:30 +00:00
|
|
|
(void)vm_page_grab_pages_unlocked(obj,
|
2017-08-09 04:23:04 +00:00
|
|
|
OFF_TO_IDX(bp->b_offset) + bp->b_npages,
|
|
|
|
VM_ALLOC_SYSTEM | VM_ALLOC_IGN_SBUSY |
|
|
|
|
VM_ALLOC_NOBUSY | VM_ALLOC_WIRED,
|
|
|
|
&bp->b_pages[bp->b_npages], desiredpages - bp->b_npages);
|
|
|
|
bp->b_npages = desiredpages;
|
2015-09-22 23:57:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Step 2. We've loaded the pages into the buffer,
|
|
|
|
* we have to figure out if we can still have B_CACHE
|
|
|
|
* set. Note that B_CACHE is set according to the
|
|
|
|
* byte-granular range ( bcount and size ), not the
|
|
|
|
* aligned range ( newbsize ).
|
|
|
|
*
|
|
|
|
* The VM test is against m->valid, which is DEV_BSIZE
|
|
|
|
* aligned. Needless to say, the validity of the data
|
|
|
|
* needs to also be DEV_BSIZE aligned. Note that this
|
|
|
|
* fails with NFS if the server or some other client
|
|
|
|
* extends the file's EOF. If our buffer is resized,
|
|
|
|
* B_CACHE may remain set! XXX
|
|
|
|
*/
|
|
|
|
toff = bp->b_bcount;
|
|
|
|
tinc = PAGE_SIZE - ((bp->b_offset + toff) & PAGE_MASK);
|
|
|
|
while ((bp->b_flags & B_CACHE) && toff < size) {
|
|
|
|
vm_pindex_t pi;
|
|
|
|
|
|
|
|
if (tinc > (size - toff))
|
|
|
|
tinc = size - toff;
|
|
|
|
pi = ((bp->b_offset & PAGE_MASK) + toff) >> PAGE_SHIFT;
|
|
|
|
m = bp->b_pages[pi];
|
|
|
|
vfs_buf_test_cache(bp, bp->b_offset, toff, tinc, m);
|
|
|
|
toff += tinc;
|
|
|
|
tinc = PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Step 3, fixup the KVA pmap.
|
|
|
|
*/
|
|
|
|
if (buf_mapped(bp))
|
|
|
|
bpmap_qenter(bp);
|
|
|
|
else
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
1996-01-19 04:00:31 +00:00
|
|
|
}
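Step 2 of vfs_vmio_extend() above walks the newly covered byte range one page-bounded chunk at a time. The sketch below reproduces only that walk in userspace, under stated assumptions: a 4096-byte page, a hypothetical helper name cache_walk(), and output arrays that record what would be passed on for each chunk. The kernel loop additionally stops as soon as B_CACHE is cleared, which is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry; the kernel takes these from machine headers. */
#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	(1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: record, for each chunk between b_bcount and size,
 * the b_pages[] index (pi) and the chunk length (len). Returns the number
 * of chunks written, at most max.
 */
static int
cache_walk(int64_t b_offset, int b_bcount, int size, int *pi, int *len,
    int max)
{
	int64_t toff, tinc;
	int n;

	assert(b_offset >= 0 && b_bcount >= 0 && max >= 0);
	n = 0;
	toff = b_bcount;
	/* The first chunk runs only to the end of the current page. */
	tinc = SK_PAGE_SIZE - ((b_offset + toff) & SK_PAGE_MASK);
	while (toff < size && n < max) {
		if (tinc > size - toff)
			tinc = size - toff;
		pi[n] = (int)(((b_offset & SK_PAGE_MASK) + toff) >>
		    SK_PAGE_SHIFT);
		len[n] = (int)tinc;
		n++;
		toff += tinc;
		tinc = SK_PAGE_SIZE;	/* later chunks are whole pages */
	}
	return (n);
}
```

Growing a buffer at offset 2048 from 1024 to 8192 bytes yields three chunks: 1024 bytes in page 0, 4096 bytes in page 1 and 2048 bytes in page 2.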
|
|
|
|
|
2003-02-09 09:47:31 +00:00
|
|
|
/*
|
|
|
|
* Check to see if a block at a particular lbn is available for a clustered
|
|
|
|
* write.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
vfs_bio_clcheck(struct vnode *vp, int size, daddr_t lblkno, daddr_t blkno)
|
|
|
|
{
|
|
|
|
struct buf *bpa;
|
|
|
|
int match;
|
|
|
|
|
|
|
|
match = 0;
|
|
|
|
|
|
|
|
/* If the buf isn't in core skip it */
|
2004-10-22 08:47:20 +00:00
|
|
|
if ((bpa = gbincore(&vp->v_bufobj, lblkno)) == NULL)
|
2003-02-09 09:47:31 +00:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
/* If the buf is busy we don't want to wait for it */
|
2003-02-25 03:37:48 +00:00
|
|
|
if (BUF_LOCK(bpa, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0)
|
2003-02-09 09:47:31 +00:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
/* Only cluster with valid clusterable delayed write buffers */
|
|
|
|
if ((bpa->b_flags & (B_DELWRI | B_CLUSTEROK | B_INVAL)) !=
|
|
|
|
(B_DELWRI | B_CLUSTEROK))
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
if (bpa->b_bufsize != size)
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check to see if it is in the expected place on disk and that the
|
|
|
|
* block has been mapped.
|
|
|
|
*/
|
|
|
|
if ((bpa->b_blkno != bpa->b_lblkno) && (bpa->b_blkno == blkno))
|
|
|
|
match = 1;
|
|
|
|
done:
|
|
|
|
BUF_UNLOCK(bpa);
|
|
|
|
return (match);
|
|
|
|
}
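The eligibility test in vfs_bio_clcheck() above relies on a single masked comparison to require two flags and forbid a third. A minimal userspace sketch, assuming placeholder bit values (the SK_* constants are NOT the kernel's definitions) and a hypothetical helper name clusterable(); locking and the buffer lookup are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values chosen for this sketch only. */
#define SK_DELWRI	0x1u
#define SK_CLUSTEROK	0x2u
#define SK_INVAL	0x4u

/*
 * Hypothetical helper: would a buffer with these fields be accepted as
 * part of a write cluster expected at disk block want_blkno?
 */
static bool
clusterable(uint32_t flags, int bufsize, int size, int64_t blkno,
    int64_t lblkno, int64_t want_blkno)
{
	assert(size > 0);
	/* DELWRI and CLUSTEROK both set, INVAL clear, in one comparison. */
	if ((flags & (SK_DELWRI | SK_CLUSTEROK | SK_INVAL)) !=
	    (SK_DELWRI | SK_CLUSTEROK))
		return (false);
	if (bufsize != size)
		return (false);
	/* blkno != lblkno is read here as "the block has been mapped". */
	return (blkno != lblkno && blkno == want_blkno);
}
```

Comparing the masked value against the required pair rejects a buffer with B_INVAL set without a separate test for that bit.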
|
|
|
|
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
1999-07-08 06:06:00 +00:00
|
|
|
* vfs_bio_awrite:
|
|
|
|
*
|
|
|
|
* Implement clustered async writes for clearing out B_DELWRI buffers.
|
|
|
|
* This is much better than the old way of writing only one buffer at
|
|
|
|
* a time. Note that we may not be presented with the buffers in the
|
|
|
|
* correct order, so we search for the cluster in both directions.
|
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
1995-12-11 04:58:34 +00:00
|
|
|
int
|
2004-09-15 20:54:23 +00:00
|
|
|
vfs_bio_awrite(struct buf *bp)
|
1995-01-09 16:06:02 +00:00
|
|
|
{
|
2008-03-22 09:15:16 +00:00
|
|
|
struct bufobj *bo;
|
1995-01-09 16:06:02 +00:00
|
|
|
int i;
|
1999-07-08 06:06:00 +00:00
|
|
|
int j;
|
1995-01-09 16:06:02 +00:00
|
|
|
daddr_t lblkno = bp->b_lblkno;
|
|
|
|
struct vnode *vp = bp->b_vp;
|
|
|
|
int ncl;
|
1995-12-11 04:58:34 +00:00
|
|
|
int nwritten;
|
Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_objects are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics as the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.
1998-01-06 05:26:17 +00:00
|
|
|
int size;
|
|
|
|
int maxcl;
|
2013-03-19 14:13:12 +00:00
|
|
|
int gbflags;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2008-03-22 09:15:16 +00:00
|
|
|
bo = &vp->v_bufobj;
|
2015-07-23 19:13:41 +00:00
|
|
|
gbflags = (bp->b_data == unmapped_buf) ? GB_UNMAPPED : 0;
|
1995-12-13 03:47:01 +00:00
|
|
|
/*
|
1999-07-08 06:06:00 +00:00
|
|
|
* right now we support clustered writing only to regular files. If
|
|
|
|
* we find a clusterable block we could be in the middle of a cluster
|
|
|
|
* rather than at the beginning.
|
1995-12-13 03:47:01 +00:00
|
|
|
*/
|
|
|
|
if ((vp->v_type == VREG) &&
|
|
|
|
(vp->v_mount != 0) && /* Only on nodes that have the size info */
|
1995-08-24 13:28:16 +00:00
|
|
|
(bp->b_flags & (B_CLUSTEROK | B_INVAL)) == B_CLUSTEROK) {
|
1995-12-11 04:58:34 +00:00
|
|
|
size = vp->v_mount->mnt_stat.f_iosize;
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
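The +1 can be checked with a little arithmetic. The sketch below is an illustration only, not kernel code: `pages_spanned()` and the `SKETCH_` constants are hypothetical names, and the 4 KiB page size is an assumption; the 1M transfer size matches the value mentioned above. It counts how many pages a buffer touches when it starts at an arbitrary byte offset inside a page.

```c
#include <stddef.h>

/* Assumed values for illustration; the kernel derives the real ones elsewhere. */
#define SKETCH_PAGE_SIZE	((size_t)4096)		/* assumed 4 KiB pages */
#define SKETCH_MAXPHYS		((size_t)1024 * 1024)	/* 1M, as in the text above */

/*
 * Number of pages touched by a buffer of len bytes that begins
 * page_off bytes into its first page (hypothetical helper).
 */
static size_t
pages_spanned(size_t page_off, size_t len)
{
	if (len == 0)
		return (0);
	return ((page_off % SKETCH_PAGE_SIZE + len + SKETCH_PAGE_SIZE - 1) /
	    SKETCH_PAGE_SIZE);
}
```

Under these assumed values a page-aligned 1M buffer spans 256 pages, while the same buffer starting 512 bytes into a page spans 257, which is the one extra slot reserved for pbufs.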
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
maxcl = maxphys / size;
|
1995-01-24 10:00:46 +00:00
|
|
|
|
2013-05-31 00:43:41 +00:00
|
|
|
BO_RLOCK(bo);
|
2003-02-09 09:47:31 +00:00
|
|
|
for (i = 1; i < maxcl; i++)
|
|
|
|
if (vfs_bio_clcheck(vp, size, lblkno + i,
|
|
|
|
bp->b_blkno + ((i * size) >> DEV_BSHIFT)) == 0)
|
1995-01-09 16:06:02 +00:00
|
|
|
break;
|
2003-02-09 09:47:31 +00:00
|
|
|
|
|
|
|
for (j = 1; i + j <= maxcl && j <= lblkno; j++)
|
|
|
|
if (vfs_bio_clcheck(vp, size, lblkno - j,
|
|
|
|
bp->b_blkno - ((j * size) >> DEV_BSHIFT)) == 0)
|
1999-07-08 06:06:00 +00:00
|
|
|
break;
|
2013-05-31 00:43:41 +00:00
|
|
|
BO_RUNLOCK(bo);
|
1999-07-08 06:06:00 +00:00
|
|
|
--j;
|
|
|
|
ncl = i + j;
|
1995-01-11 01:53:18 +00:00
|
|
|
/*
|
|
|
|
* this is a possible cluster write
|
|
|
|
*/
|
|
|
|
if (ncl != 1) {
|
2003-03-13 07:19:23 +00:00
|
|
|
BUF_UNLOCK(bp);
|
2013-03-14 20:28:26 +00:00
|
|
|
nwritten = cluster_wbuild(vp, size, lblkno - j, ncl,
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For an unmapped buffer, no KVA reservation is
performed at all. The consumer might request an unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bios,
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; disabling it makes the buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
gbflags);
|
2013-03-14 20:28:26 +00:00
|
|
|
return (nwritten);
|
1995-01-11 01:53:18 +00:00
|
|
|
}
|
1998-01-17 09:17:02 +00:00
|
|
|
}
|
1995-11-19 19:54:31 +00:00
|
|
|
bremfree(bp);
|
1999-06-26 02:47:16 +00:00
|
|
|
bp->b_flags |= B_ASYNC;
|
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
1995-01-11 01:53:18 +00:00
|
|
|
* default (old) behavior, writing out only one block
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* XXX returns b_bufsize instead of b_bcount for nwritten?
|
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
1995-12-11 04:58:34 +00:00
|
|
|
nwritten = bp->b_bufsize;
|
2004-03-11 18:02:36 +00:00
|
|
|
(void) bwrite(bp);
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2013-03-14 20:31:39 +00:00
|
|
|
return (nwritten);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
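The cluster sizing done by the function above can be sketched in isolation. This is a simplified model, not the kernel routine: `cluster_extent()` is a hypothetical name, and a plain boolean array (`ok[]`, bounded by `nblocks`) stands in for the per-block check that the real code performs with vfs_bio_clcheck(). It scans forward from the block being written, then backward, and reports where the cluster starts and how many blocks it holds.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the forward/backward scan.
 * ok[b] says whether logical block b may join a cluster (an assumption
 * standing in for the real per-buffer check). Returns the cluster length
 * in blocks and stores the first block of the cluster in *start.
 */
static int
cluster_extent(const bool *ok, long nblocks, long lblkno, int maxcl,
    long *start)
{
	int i, j;

	/* Extend forward while the following blocks qualify. */
	for (i = 1; i < maxcl; i++)
		if (lblkno + i >= nblocks || !ok[lblkno + i])
			break;

	/* Extend backward, never past block 0 or beyond maxcl blocks. */
	for (j = 1; i + j <= maxcl && j <= lblkno; j++)
		if (!ok[lblkno - j])
			break;
	--j;	/* j overshoots by one when the loop stops */

	*start = lblkno - j;
	return (i + j);
}
```

A result of 1 means no neighbour qualified, which corresponds to the single-block write path; anything larger corresponds to the range handed to the cluster write.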
|
|
|
|
|
2013-03-19 14:13:12 +00:00
|
|
|
/*
|
2015-10-14 02:10:07 +00:00
|
|
|
* getnewbuf_kva:
|
|
|
|
*
|
|
|
|
* Allocate KVA for an empty buf header according to gbflags.
|
2013-03-19 14:13:12 +00:00
|
|
|
*/
|
2015-10-14 02:10:07 +00:00
|
|
|
static int
|
|
|
|
getnewbuf_kva(struct buf *bp, int gbflags, int maxsize)
|
2013-03-19 14:13:12 +00:00
|
|
|
{
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
if ((gbflags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_UNMAPPED) {
|
2006-10-02 02:06:27 +00:00
|
|
|
/*
|
2015-10-14 02:10:07 +00:00
|
|
|
* In order to keep fragmentation sane we only allocate kva
|
|
|
|
* in BKVASIZE chunks. XXX with vmem we can do page size.
|
2006-10-02 02:06:27 +00:00
|
|
|
*/
|
2015-10-14 02:10:07 +00:00
|
|
|
maxsize = (maxsize + BKVAMASK) & ~BKVAMASK;
|
2013-03-19 14:13:12 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
if (maxsize != bp->b_kvasize &&
|
|
|
|
bufkva_alloc(bp, maxsize, gbflags))
|
|
|
|
return (ENOSPC);
|
1996-11-30 22:41:49 +00:00
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
return (0);
|
2013-03-19 14:13:12 +00:00
|
|
|
}
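The two decisions made in getnewbuf_kva() above, whether KVA is needed at all and how the request is rounded, can be tried out in a small stand-alone sketch. All names prefixed with `SK_` and both helper functions are hypothetical, and the flag bits and the 16 KiB chunk size are placeholder values chosen for illustration rather than taken from the kernel headers.

```c
/* Placeholder values, assumed for illustration only. */
#define SK_GB_UNMAPPED	0x0008
#define SK_GB_KVAALLOC	0x0010
#define SK_BKVASIZE	16384			/* assumed chunk size, power of two */
#define SK_BKVAMASK	(SK_BKVASIZE - 1)

/*
 * KVA is skipped only when the caller asked for an unmapped buffer
 * and did not also ask for a KVA reservation.
 */
static int
sk_needs_kva(int gbflags)
{
	return ((gbflags & (SK_GB_UNMAPPED | SK_GB_KVAALLOC)) !=
	    SK_GB_UNMAPPED);
}

/* Round a size up to the next multiple of the chunk size. */
static int
sk_round_kva(int maxsize)
{
	return ((maxsize + SK_BKVAMASK) & ~SK_BKVAMASK);
}
```

The mask trick only works because the chunk size is a power of two; with the assumed 16 KiB chunk, a request of 16385 bytes rounds up to 32768.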
|
|
|
|
|
|
|
|
/*
|
|
|
|
* getnewbuf:
|
|
|
|
*
|
|
|
|
* Find and initialize a new buffer header, freeing up existing buffers
|
|
|
|
* in the bufqueues as necessary. The new buffer is returned locked.
|
|
|
|
*
|
|
|
|
* We block if:
|
|
|
|
* We have insufficient buffer headers
|
|
|
|
* We have insufficient buffer space
|
2013-06-28 03:51:20 +00:00
|
|
|
* buffer_arena is too fragmented ( space reservation fails )
|
2013-03-19 14:13:12 +00:00
|
|
|
* If we have to flush dirty buffers ( but we try to avoid this )
|
2015-10-14 02:10:07 +00:00
|
|
|
*
|
|
|
|
* The caller is responsible for releasing the reserved bufspace after
|
|
|
|
* allocbuf() is called.
|
2013-03-19 14:13:12 +00:00
|
|
|
*/
|
|
|
|
static struct buf *
|
2015-10-14 02:10:07 +00:00
|
|
|
getnewbuf(struct vnode *vp, int slpflag, int slptimeo, int maxsize, int gbflags)
|
2013-03-19 14:13:12 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufdomain *bd;
|
2013-03-19 14:13:12 +00:00
|
|
|
struct buf *bp;
|
2015-10-14 02:10:07 +00:00
|
|
|
bool metadata, reserved;
|
2013-03-19 14:13:12 +00:00
|
|
|
|
2015-10-29 19:02:24 +00:00
|
|
|
bp = NULL;
|
2013-03-19 14:13:12 +00:00
|
|
|
KASSERT((gbflags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_KVAALLOC,
|
|
|
|
("GB_KVAALLOC only makes sense with GB_UNMAPPED"));
|
|
|
|
if (!unmapped_buf_allowed)
|
|
|
|
gbflags &= ~(GB_UNMAPPED | GB_KVAALLOC);
|
|
|
|
|
|
|
|
if (vp == NULL || (vp->v_vflag & (VV_MD | VV_SYSTEM)) != 0 ||
|
|
|
|
vp->v_type == VCHR)
|
2015-10-14 02:10:07 +00:00
|
|
|
metadata = true;
|
2013-03-19 14:13:12 +00:00
|
|
|
else
|
2015-10-14 02:10:07 +00:00
|
|
|
metadata = false;
|
2018-02-20 00:06:07 +00:00
|
|
|
if (vp == NULL)
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = &bdomain[0];
|
2018-02-20 00:06:07 +00:00
|
|
|
else
|
2018-03-17 18:14:49 +00:00
|
|
|
bd = &bdomain[vp->v_bufobj.bo_domain];
|
2018-02-20 00:06:07 +00:00
|
|
|
|
|
|
|
counter_u64_add(getnewbufcalls, 1);
|
2015-10-14 02:10:07 +00:00
|
|
|
reserved = false;
|
|
|
|
do {
|
|
|
|
if (reserved == false &&
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_reserve(bd, maxsize, metadata) != 0) {
|
|
|
|
counter_u64_add(getnewbufrestarts, 1);
|
2015-10-14 02:10:07 +00:00
|
|
|
continue;
|
2018-02-20 00:06:07 +00:00
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
reserved = true;
|
2018-02-20 00:06:07 +00:00
|
|
|
if ((bp = buf_alloc(bd)) == NULL) {
|
|
|
|
counter_u64_add(getnewbufrestarts, 1);
|
2015-10-14 02:10:07 +00:00
|
|
|
continue;
|
2018-02-20 00:06:07 +00:00
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
if (getnewbuf_kva(bp, gbflags, maxsize) == 0)
|
|
|
|
return (bp);
|
|
|
|
break;
|
2018-02-20 00:06:07 +00:00
|
|
|
} while (buf_recycle(bd, false) == 0);
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2015-10-14 02:10:07 +00:00
|
|
|
if (reserved)
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_release(bd, maxsize);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp != NULL) {
|
|
|
|
bp->b_flags |= B_INVAL;
|
|
|
|
brelse(bp);
|
1996-12-01 15:46:40 +00:00
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
bufspace_wait(bd, vp, gbflags, slpflag, slptimeo);
|
2015-10-14 02:10:07 +00:00
|
|
|
|
|
|
|
return (NULL);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
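The control flow of getnewbuf() above hinges on one C detail: a `continue` inside a `do { ... } while (cond)` loop jumps to the loop condition, so every failed step runs buf_recycle() before retrying. The following is a minimal userspace sketch of that reserve, allocate, recycle pattern. It is not the kernel code: `struct pool` and the `pool_*` helpers are hypothetical stand-ins for the buffer domain, bufspace_reserve(), buf_alloc(), buf_recycle() and bufspace_release(), and the KVA step is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct pool {
	long space_used;	/* bytes currently reserved */
	long space_max;		/* reservation limit */
	int free_headers;	/* buffer headers available */
	int recyclable;		/* clean buffers that can be reclaimed */
	long recycle_size;	/* bytes given back per reclaimed buffer */
};

/* Stand-in for bufspace_reserve(): fails when the limit would be exceeded. */
static int
pool_reserve(struct pool *p, long size)
{
	if (p->space_used + size > p->space_max)
		return (-1);
	p->space_used += size;
	return (0);
}

/* Stand-in for bufspace_release(). */
static void
pool_release(struct pool *p, long size)
{
	p->space_used -= size;
}

/* Stand-in for buf_alloc(): fails when no header is free. */
static int
pool_alloc_header(struct pool *p)
{
	if (p->free_headers == 0)
		return (-1);
	p->free_headers--;
	return (0);
}

/* Stand-in for buf_recycle(): reclaims one buffer, or fails if none is left. */
static int
pool_recycle(struct pool *p)
{
	if (p->recyclable == 0)
		return (-1);
	p->recyclable--;
	p->space_used -= p->recycle_size;
	p->free_headers++;
	return (0);
}

/* Returns true if space and a header were obtained, false otherwise. */
static bool
pool_get(struct pool *p, long size)
{
	bool reserved = false;

	do {
		/* "continue" jumps to the while condition, i.e. to pool_recycle(). */
		if (!reserved && pool_reserve(p, size) != 0)
			continue;
		reserved = true;
		if (pool_alloc_header(p) != 0)
			continue;
		return (true);
	} while (pool_recycle(p) == 0);

	/* Nothing left to recycle: undo the reservation before giving up. */
	if (reserved)
		pool_release(p, size);
	return (false);
}
```

Keeping the `reserved` flag outside the loop means a reservation that succeeded on an earlier pass is not repeated after a recycle, and it is released exactly once on the failure path, mirroring the `if (reserved) bufspace_release(...)` step in the function above.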
|
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
/*
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
|
|
|
* buf_daemon:
|
1999-03-12 02:24:58 +00:00
|
|
|
*
|
1999-07-04 00:25:38 +00:00
|
|
|
* buffer flushing daemon. Buffers are normally flushed by the
|
|
|
|
* update daemon but if it cannot keep up this process starts to
|
|
|
|
* take the load in an attempt to prevent getnewbuf() from blocking.
|
1999-03-12 02:24:58 +00:00
|
|
|
*/
|
1999-07-04 00:25:38 +00:00
|
|
|
static struct kproc_desc buf_kp = {
|
|
|
|
"bufdaemon",
|
|
|
|
buf_daemon,
|
|
|
|
&bufdaemonproc
|
|
|
|
};
|
2008-03-16 10:58:09 +00:00
|
|
|
SYSINIT(bufdaemon, SI_SUB_KTHREAD_BUF, SI_ORDER_FIRST, kproc_start, &buf_kp);
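The buf_daemon comment and the 1999 commit message describe a daemon that starts flushing once the dirty buffer count exceeds a high limit (vfs.hidirtybuffers). A minimal sketch of such watermark-driven flushing is given below. It is an illustration under stated assumptions, not the daemon's actual logic: `struct dirty_state` and `flush_pass()` are hypothetical names, and flushing down to the low limit in a single pass is an assumption rather than something the source states.

```c
#include <assert.h>

/* Hypothetical model of watermark-driven flushing; not the kernel's code. */
struct dirty_state {
	int ndirty;	/* dirty buffers outstanding */
	int lo;		/* assumed: stop flushing at or below this count */
	int hi;		/* start flushing only above this count */
};

/*
 * Run one flushing pass.  Nothing happens until the high watermark is
 * exceeded; after that, buffers are flushed one at a time until the count
 * drops to the low watermark.  Returns the number of buffers flushed.
 */
static int
flush_pass(struct dirty_state *s)
{
	int flushed = 0;

	if (s->ndirty <= s->hi)
		return (0);
	while (s->ndirty > s->lo) {
		s->ndirty--;	/* stands in for writing out one dirty buffer */
		flushed++;
	}
	return (flushed);
}
```

The gap between the two limits gives hysteresis: the pass does not trigger again on every newly dirtied buffer, only once the count has climbed back above the high limit.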
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
|
|
|
|
Fix two issues with bufdaemon that often cause processes to hang in
the "nbufkv" sleep.
First, the ffs background cylinder group block write requests a new
buffer for the shadow copy. When ffs_bufwrite() is called from the
bufdaemon due to a buffer shortage, requesting the buffer deadlocks
the bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request that
getblk() not block while allocating the buffer and return failure
instead. Add a flags argument to geteblk() so that the flags can be
passed to getblk(). Do not repeat the getnewbuf() call from geteblk()
if buffer allocation failed and either GB_NOWAIT_BD is specified or
geteblk() is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to a synchronous cg block write if shadow
block allocation failed.
Since r107847, buffer write assumes that the vnode owning the buffer
is locked. The second problem is that the buffer cache may accumulate
many buffers belonging to a limited number of vnodes. With such a
workload, threads that own the locks of those vnodes quite often try
to read another block from the vnodes and, due to buffer cache
exhaustion, ask bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing
a flush pass over the buffer cache before getnewbuf() goes into an
uninterruptible sleep. Move the flushing code from buf_daemon() to a
new helper function, buf_do_flush(), that is called from getnewbuf().
The number of buffers flushed by a single call to buf_do_flush() from
getnewbuf() is limited by the new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and the
threads that temporarily help it with the TDP_BUFNEED flag.
In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks
2009-03-16 15:39:46 +00:00
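The commit message above describes guarding the flush helper against recursion by marking the calling thread with a flag. The following is a minimal userland sketch of that idea only, not the kernel code: `struct sketch_thread`, `SKETCH_BUFNEED`, `sketch_flush_pass()` and `sketch_try_help_flush()` are hypothetical names invented for illustration, and the "flush" merely counts calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-thread private flag such as TDP_BUFNEED. */
#define	SKETCH_BUFNEED	0x1u

/* Hypothetical thread record; the kernel keeps such flags in td_pflags. */
struct sketch_thread {
	unsigned int pflags;	/* thread-private flags */
	int flush_calls;	/* number of flush passes that actually ran */
};

static bool sketch_try_help_flush(struct sketch_thread *td);

/*
 * Model of one flush pass.  Writing a buffer may itself need a new buffer,
 * which is modelled here by asking for help again from inside the pass.
 */
static void
sketch_flush_pass(struct sketch_thread *td)
{
	/* Invariant of the sketch: a pass only runs while the flag is set. */
	assert(td->pflags & SKETCH_BUFNEED);
	td->flush_calls++;
	(void)sketch_try_help_flush(td);	/* refused by the guard */
}

/*
 * Run a flush pass on behalf of the daemon unless this thread is already
 * inside one.  Returns true if a pass was started.
 */
static bool
sketch_try_help_flush(struct sketch_thread *td)
{
	if (td->pflags & SKETCH_BUFNEED)
		return (false);		/* already flushing: do not recurse */
	td->pflags |= SKETCH_BUFNEED;
	sketch_flush_pass(td);
	td->pflags &= ~SKETCH_BUFNEED;
	return (true);
}
```

Without the flag test at the top of `sketch_try_help_flush()`, the nested request inside the pass would recurse without bound; with it, the nested request is simply refused and the outer pass completes.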
|
|
|
static int
|
2018-03-17 18:14:49 +00:00
|
|
|
buf_flush(struct vnode *vp, struct bufdomain *bd, int target)
|
2009-03-16 15:39:46 +00:00
|
|
|
{
|
|
|
|
int flushed;
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
flushed = flushbufqueues(vp, bd, target, 0);
|
2009-03-16 15:39:46 +00:00
|
|
|
if (flushed == 0) {
|
|
|
|
/*
|
|
|
|
* Could not find any buffers without rollback
|
|
|
|
* dependencies, so just write the first one
|
|
|
|
* in the hopes of eventually making progress.
|
|
|
|
*/
|
2015-04-27 11:13:19 +00:00
|
|
|
if (vp != NULL && target > 2)
|
|
|
|
target /= 2;
|
2018-03-17 18:14:49 +00:00
|
|
|
flushbufqueues(vp, bd, target, 1);
|
2009-03-16 15:39:46 +00:00
|
|
|
}
|
|
|
|
return (flushed);
|
|
|
|
}
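The two-pass structure of buf_flush() above can be modelled in userland as follows. This is a hedged sketch under stated assumptions, not the kernel implementation: `struct sketch_buf`, `sketch_flushqueues()` and `sketch_buf_flush()` are hypothetical names, the `has_deps` field stands in for "has rollback dependencies", and the assumption that the last argument of flushbufqueues() decides whether such buffers may be written is inferred from the comment in the function body.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dirty-buffer record used only by this sketch. */
struct sketch_buf {
	bool dirty;
	bool has_deps;	/* stands in for "has rollback dependencies" */
};

/*
 * Clean up to 'target' dirty buffers.  Buffers with dependencies are
 * skipped unless 'flushdeps' is true.  Returns the number cleaned.
 */
static int
sketch_flushqueues(struct sketch_buf *bufs, size_t n, int target,
    bool flushdeps)
{
	int flushed = 0;

	for (size_t i = 0; i < n && flushed < target; i++) {
		if (!bufs[i].dirty)
			continue;
		if (bufs[i].has_deps && !flushdeps)
			continue;
		bufs[i].dirty = false;
		flushed++;
	}
	return (flushed);
}

/*
 * First try only buffers without dependencies.  If none could be written,
 * fall back to a second pass that also writes buffers with dependencies,
 * halving the target when flushing on behalf of a single vnode.
 */
static int
sketch_buf_flush(struct sketch_buf *bufs, size_t n, bool single_vnode,
    int target)
{
	int flushed;

	flushed = sketch_flushqueues(bufs, n, target, false);
	if (flushed == 0) {
		if (single_vnode && target > 2)
			target /= 2;
		/* As in the code above, this result is not returned. */
		(void)sketch_flushqueues(bufs, n, target, true);
	}
	return (flushed);
}
```

Mirroring the listing, the return value reflects only the first pass, so a caller that sees 0 knows that no dependency-free buffer was available even though the fallback pass may have written some.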
|
|
|
|
|
2022-01-19 00:26:16 +00:00
|
|
|
static void
|
|
|
|
buf_daemon_shutdown(void *arg __unused, int howto __unused)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mtx_lock(&bdlock);
|
|
|
|
bd_shutdown = true;
|
|
|
|
wakeup(&bd_request);
|
|
|
|
error = msleep(&bd_shutdown, &bdlock, 0, "buf_daemon_shutdown",
|
|
|
|
60 * hz);
|
|
|
|
mtx_unlock(&bdlock);
|
|
|
|
if (error != 0)
|
|
|
|
printf("bufdaemon wait error: %d\n", error);
|
|
|
|
}
|
|
|
|
|
1997-06-15 17:56:53 +00:00
|
|
|
static void
|
1999-07-04 00:25:38 +00:00
|
|
|
buf_daemon()
|
1999-03-12 02:24:58 +00:00
|
|
|
{
|
2018-03-17 18:14:49 +00:00
|
|
|
struct bufdomain *bd;
|
|
|
|
int speedupreq;
|
2013-06-05 23:53:00 +00:00
|
|
|
int lodirty;
|
2018-02-20 00:06:07 +00:00
|
|
|
int i;
|
2000-09-07 01:33:02 +00:00
|
|
|
|
2000-01-07 08:36:44 +00:00
|
|
|
/*
|
|
|
|
* This process needs to be suspended prior to shutdown sync.
|
|
|
|
*/
|
2022-01-19 00:26:16 +00:00
|
|
|
EVENTHANDLER_REGISTER(shutdown_pre_sync, buf_daemon_shutdown, NULL,
|
2018-04-22 16:05:29 +00:00
|
|
|
SHUTDOWN_PRI_LAST + 100);
|
2000-01-07 08:36:44 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
/*
|
|
|
|
* Start the buf clean daemons as child threads.
|
|
|
|
*/
|
2018-03-17 18:14:49 +00:00
|
|
|
for (i = 0 ; i < buf_domains; i++) {
|
2018-02-20 00:06:07 +00:00
|
|
|
int error;
|
|
|
|
|
|
|
|
error = kthread_add((void (*)(void *))bufspace_daemon,
|
2018-03-17 18:14:49 +00:00
|
|
|
&bdomain[i], curproc, NULL, 0, 0, "bufspacedaemon-%d", i);
|
2018-02-20 00:06:07 +00:00
|
|
|
if (error)
|
|
|
|
panic("error %d spawning bufspace daemon", error);
|
|
|
|
}
|
|
|
|
|
1999-07-04 00:25:38 +00:00
|
|
|
/*
|
|
|
|
* This process is allowed to take the buffer cache to the limit
|
|
|
|
*/
|
2009-03-16 15:39:46 +00:00
|
|
|
curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED;
|
2003-02-09 09:47:31 +00:00
|
|
|
mtx_lock(&bdlock);
|
2022-01-19 00:26:16 +00:00
|
|
|
while (!bd_shutdown) {
|
1999-07-04 00:25:38 +00:00
|
|
|
bd_request = 0;
|
2003-02-09 09:47:31 +00:00
|
|
|
mtx_unlock(&bdlock);
|
|
|
|
|
1999-07-04 00:25:38 +00:00
|
|
|
/*
|
2018-03-17 18:14:49 +00:00
|
|
|
* Save speedupreq for this pass and reset to capture new
|
|
|
|
* requests.
|
1999-07-04 00:25:38 +00:00
|
|
|
*/
|
2018-03-17 18:14:49 +00:00
|
|
|
speedupreq = bd_speedupreq;
|
|
|
|
bd_speedupreq = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flush each domain sequentially according to its level and
|
|
|
|
* the speedup request.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < buf_domains; i++) {
|
|
|
|
bd = &bdomain[i];
|
|
|
|
if (speedupreq)
|
|
|
|
lodirty = bd->bd_numdirtybuffers / 2;
|
|
|
|
else
|
|
|
|
lodirty = bd->bd_lodirtybuffers;
|
|
|
|
while (bd->bd_numdirtybuffers > lodirty) {
|
|
|
|
if (buf_flush(NULL, bd,
|
|
|
|
bd->bd_numdirtybuffers - lodirty) == 0)
|
|
|
|
break;
|
|
|
|
kern_yield(PRI_USER);
|
|
|
|
}
|
1999-07-04 00:25:38 +00:00
|
|
|
}
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2000-12-26 19:41:38 +00:00
|
|
|
/*
|
|
|
|
* Only clear bd_request if we have reached our low water
|
2002-03-05 15:38:49 +00:00
|
|
|
* mark. The buf_daemon normally waits 1 second and
|
2000-12-26 19:41:38 +00:00
|
|
|
* then incrementally flushes any dirty buffers that have
|
|
|
|
* built up, within reason.
|
|
|
|
*
|
|
|
|
* If we were unable to hit our low water mark and couldn't
|
2013-06-05 23:53:00 +00:00
|
|
|
* find any flushable buffers, we sleep for a short period
|
|
|
|
* to avoid endless loops on unlockable buffers.
|
2000-12-26 19:41:38 +00:00
|
|
|
*/
|
2003-02-09 09:47:31 +00:00
|
|
|
mtx_lock(&bdlock);
|
2022-01-19 00:26:16 +00:00
|
|
|
if (bd_shutdown)
|
|
|
|
break;
|
2022-01-16 00:32:36 +00:00
|
|
|
if (BIT_EMPTY(BUF_DOMAINS, &bdlodirty)) {
|
1999-12-20 20:28:40 +00:00
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* We reached our low water mark, reset the
|
|
|
|
* request and sleep until we are needed again.
|
|
|
|
* The sleep is just so the suspend code works.
|
1999-12-20 20:28:40 +00:00
|
|
|
*/
|
2000-12-26 19:41:38 +00:00
|
|
|
bd_request = 0;
|
2013-06-05 23:53:00 +00:00
|
|
|
/*
|
|
|
|
* Do an extra wakeup in case dirty threshold
|
|
|
|
* changed via sysctl and the explicit transition
|
|
|
|
* out of shortfall was missed.
|
|
|
|
*/
|
|
|
|
bdirtywakeup();
|
|
|
|
if (runningbufspace <= lorunningspace)
|
|
|
|
runningwakeup();
|
2003-02-09 09:47:31 +00:00
|
|
|
msleep(&bd_request, &bdlock, PVM, "psleep", hz);
|
1999-12-20 20:28:40 +00:00
|
|
|
} else {
|
|
|
|
/*
|
2000-12-26 19:41:38 +00:00
|
|
|
* We couldn't find any flushable dirty buffers but
|
|
|
|
* still have too many dirty buffers, so we
|
|
|
|
* have to sleep and try again. (rare)
|
1999-12-20 20:28:40 +00:00
|
|
|
*/
|
2003-02-09 09:47:31 +00:00
|
|
|
msleep(&bd_request, &bdlock, PVM, "qsleep", hz / 10);
|
1999-12-20 20:28:40 +00:00
|
|
|
}
|
1999-07-04 00:25:38 +00:00
|
|
|
}
|
2022-01-19 00:26:16 +00:00
|
|
|
wakeup(&bd_shutdown);
|
|
|
|
mtx_unlock(&bdlock);
|
|
|
|
kthread_exit();
|
1999-03-12 02:24:58 +00:00
|
|
|
}
|
|
|
|
|
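The water-mark behaviour described in the buf_daemon comments above can be sketched in isolation. This is only a simplified model under stated assumptions: `struct flusher`, `flusher_note_dirty()` and `flusher_pass()` are hypothetical names that do not exist in the kernel, a single counter stands in for the per-domain dirty sets, and the returned tick counts only mirror the `hz` and `hz / 10` sleeps seen in the loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state; not the kernel's data structures. */
struct flusher {
	int dirty;	/* current number of dirty buffers */
	int lo_mark;	/* low water mark: stop flushing at or below this */
	int hi_mark;	/* high water mark: wake the daemon above this */
	bool request;	/* plays the role of bd_request */
};

/*
 * Writer side: account for one newly dirtied buffer.  Returns true when
 * the caller should wake the daemon (the request was not already pending).
 */
static bool
flusher_note_dirty(struct flusher *f)
{
	f->dirty++;
	if (f->dirty > f->hi_mark && !f->request) {
		f->request = true;
		return (true);
	}
	return (false);
}

/*
 * Daemon side: one pass.  `flushable` models how many buffers could be
 * written this pass.  Returns how long to sleep, in ticks: a long sleep
 * once the low water mark is reached, a short one otherwise.
 */
static int
flusher_pass(struct flusher *f, int flushable, int hz)
{
	assert(f->lo_mark <= f->hi_mark);
	while (f->dirty > f->lo_mark && flushable > 0) {
		f->dirty--;	/* stands in for writing one buffer */
		flushable--;
	}
	if (f->dirty <= f->lo_mark) {
		/* Only clear the request once the low mark is reached. */
		f->request = false;
		return (hz);
	}
	return (hz / 10);
}
```

The gap between the two marks gives hysteresis: the daemon is asked for help only above the high mark, but once running it keeps going until the low mark, so it does not flap on every write.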
1999-07-08 06:06:00 +00:00
|
|
|
/*
|
|
|
|
* flushbufqueues:
|
|
|
|
*
|
|
|
|
* Try to flush a buffer in the dirty queue. We must be careful to
|
|
|
|
* free up B_INVAL buffers instead of writing them, which NFS is
|
|
|
|
* particularly sensitive to.
|
|
|
|
*/
|
2005-02-10 12:28:58 +00:00
|
|
|
static int flushwithdeps = 0;
|
2019-10-04 21:43:43 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, flushwithdeps, CTLFLAG_RW | CTLFLAG_STATS,
|
|
|
|
&flushwithdeps, 0,
|
2021-11-30 06:28:40 +00:00
|
|
|
"Number of buffers flushed with dependencies that require rollbacks");
|
2004-09-15 20:54:23 +00:00
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
static int
|
2018-03-17 18:14:49 +00:00
|
|
|
flushbufqueues(struct vnode *lvp, struct bufdomain *bd, int target,
|
|
|
|
int flushdeps)
|
1999-03-12 02:24:58 +00:00
|
|
|
{
|
2018-02-20 00:06:07 +00:00
|
|
|
struct bufqueue *bq;
|
2009-04-16 09:37:48 +00:00
|
|
|
struct buf *sentinel;
|
2002-10-18 01:29:59 +00:00
|
|
|
struct vnode *vp;
|
2003-10-05 22:16:08 +00:00
|
|
|
struct mount *mp;
|
1999-03-12 02:24:58 +00:00
|
|
|
struct buf *bp;
|
2003-03-13 07:19:23 +00:00
|
|
|
int hasdeps;
|
2005-06-08 20:26:05 +00:00
|
|
|
int flushed;
|
2013-09-29 18:04:57 +00:00
|
|
|
int error;
|
2015-04-27 11:13:19 +00:00
|
|
|
bool unlock;
|
2005-06-08 20:26:05 +00:00
|
|
|
|
|
|
|
flushed = 0;
|
2018-03-17 18:14:49 +00:00
|
|
|
bq = &bd->bd_dirtyq;
|
2005-06-08 20:26:05 +00:00
|
|
|
bp = NULL;
|
2009-04-16 09:37:48 +00:00
|
|
|
sentinel = malloc(sizeof(struct buf), M_TEMP, M_WAITOK | M_ZERO);
|
|
|
|
sentinel->b_qindex = QUEUE_SENTINEL;
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_LOCK(bq);
|
|
|
|
TAILQ_INSERT_HEAD(&bq->bq_queue, sentinel, b_freelist);
|
|
|
|
BQ_UNLOCK(bq);
|
2005-06-08 20:26:05 +00:00
|
|
|
while (flushed != target) {
|
When helping the bufdaemon from the buffer allocation context, there
is no point in walking the whole dirty buffer queue. We are only
interested in, and can operate on, the buffers owned by the current
vnode [1]. Instead of calling the generic queue flush routine, do
VOP_FSYNC() if possible.
Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
especially easy to reproduce on big-memory machines where large files
are written, causing almost all dirty buffers to accumulate in several
big files whose vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but iterates over the whole dirty queue
continuously. Since the dirty queue mutex is not dropped, bufdone() in
the g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make progress.
Discussed with: Jeff [1]
Reported by: pho [2]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)
2013-10-02 06:00:34 +00:00
|
|
|
maybe_yield();
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_LOCK(bq);
|
2009-04-16 09:37:48 +00:00
|
|
|
bp = TAILQ_NEXT(sentinel, b_freelist);
|
Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.
First, the ffs background cg block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to geteblk() to allow passing the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk() if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to a synchronous cg block write if shadow
block allocation failed.
Since r107847, buffer write assumes that the vnode owning the buffer is
locked. The second problem is that the buffer cache may accumulate many
buffers belonging to a limited number of vnodes. With such a workload,
quite often threads that own the mentioned vnode locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() goes into an
uninterruptible sleep. Move the flushing code from buf_daemon() to a new
helper function, buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by a single call to buf_do_flush() from
getnewbuf() is limited by the new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon with the TDP_BUFNEED flag.
In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks
2009-03-16 15:39:46 +00:00
|
|
|
if (bp != NULL) {
|
2018-02-20 00:06:07 +00:00
|
|
|
TAILQ_REMOVE(&bq->bq_queue, sentinel, b_freelist);
|
|
|
|
TAILQ_INSERT_AFTER(&bq->bq_queue, bp, sentinel,
|
2009-03-16 15:39:46 +00:00
|
|
|
b_freelist);
|
2013-10-02 06:00:34 +00:00
|
|
|
} else {
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_UNLOCK(bq);
|
2005-06-08 20:26:05 +00:00
|
|
|
break;
|
2013-10-02 06:00:34 +00:00
|
|
|
}
|
2015-04-27 11:13:19 +00:00
|
|
|
/*
|
|
|
|
* Skip sentinels inserted by other invocations of the
|
|
|
|
* flushbufqueues(), taking care to not reorder them.
|
|
|
|
*
|
|
|
|
* Only flush the buffers that belong to the
|
|
|
|
* vnode locked by the curthread.
|
|
|
|
*/
|
|
|
|
if (bp->b_qindex == QUEUE_SENTINEL || (lvp != NULL &&
|
|
|
|
bp->b_vp != lvp)) {
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_UNLOCK(bq);
|
2016-10-05 23:42:02 +00:00
|
|
|
continue;
|
2015-04-27 11:13:19 +00:00
|
|
|
}
|
2013-10-02 06:00:34 +00:00
|
|
|
error = BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL);
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_UNLOCK(bq);
|
2013-10-02 06:00:34 +00:00
|
|
|
if (error != 0)
|
2002-12-14 01:35:30 +00:00
|
|
|
continue;
|
2016-08-11 07:58:23 +00:00
|
|
|
|
2013-05-31 00:43:41 +00:00
|
|
|
/*
|
|
|
|
* BKGRDINPROG can only be set with the buf and bufobj
|
|
|
|
* locks both held. We tolerate a race to clear it here.
|
|
|
|
*/
|
2005-01-24 10:47:04 +00:00
|
|
|
if ((bp->b_vflags & BV_BKGRDINPROG) != 0 ||
|
|
|
|
(bp->b_flags & B_DELWRI) == 0) {
|
2003-03-13 07:19:23 +00:00
|
|
|
BUF_UNLOCK(bp);
|
2002-12-14 01:35:30 +00:00
|
|
|
continue;
|
2003-03-13 07:19:23 +00:00
|
|
|
}
|
2002-12-14 01:35:30 +00:00
|
|
|
if (bp->b_flags & B_INVAL) {
|
2013-10-02 06:00:34 +00:00
|
|
|
bremfreef(bp);
|
2002-12-14 01:35:30 +00:00
|
|
|
brelse(bp);
|
2005-06-08 20:26:05 +00:00
|
|
|
flushed++;
|
|
|
|
continue;
|
2002-12-14 01:35:30 +00:00
|
|
|
}
|
2003-03-13 07:19:23 +00:00
|
|
|
|
2007-02-22 14:52:59 +00:00
|
|
|
if (!LIST_EMPTY(&bp->b_dep) && buf_countdeps(bp, 0)) {
|
2003-03-13 07:19:23 +00:00
|
|
|
if (flushdeps == 0) {
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
hasdeps = 1;
|
|
|
|
} else
|
|
|
|
hasdeps = 0;
|
2002-12-14 01:35:30 +00:00
|
|
|
/*
|
|
|
|
* We must hold the lock on a vnode before writing
|
|
|
|
* one of its buffers. Otherwise we may confuse, or
|
|
|
|
* in the case of a snapshot vnode, deadlock the
|
|
|
|
* system.
|
2003-03-13 07:19:23 +00:00
|
|
|
*
|
|
|
|
* The lock order here is the reverse of the normal
|
|
|
|
* order of vnode followed by buf lock. This is ok because
|
|
|
|
* the NOWAIT will prevent deadlock.
|
2002-12-14 01:35:30 +00:00
|
|
|
*/
|
2003-10-05 22:16:08 +00:00
|
|
|
vp = bp->b_vp;
|
|
|
|
if (vn_start_write(vp, &mp, V_NOWAIT) != 0) {
|
|
|
|
BUF_UNLOCK(bp);
|
|
|
|
continue;
|
|
|
|
}
|
2015-04-27 11:13:19 +00:00
|
|
|
if (lvp == NULL) {
|
|
|
|
unlock = true;
|
|
|
|
error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT);
|
|
|
|
} else {
|
|
|
|
ASSERT_VOP_LOCKED(vp, "getbuf");
|
|
|
|
unlock = false;
|
|
|
|
error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 :
|
|
|
|
vn_lock(vp, LK_TRYUPGRADE);
|
|
|
|
}
|
2013-09-29 18:04:57 +00:00
|
|
|
if (error == 0) {
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X",
|
|
|
|
bp, bp->b_vp, bp->b_flags);
|
2015-04-27 11:13:19 +00:00
|
|
|
if (curproc == bufdaemonproc) {
|
|
|
|
vfs_bio_awrite(bp);
|
|
|
|
} else {
|
|
|
|
bremfree(bp);
|
|
|
|
bwrite(bp);
|
2018-02-20 00:06:07 +00:00
|
|
|
counter_u64_add(notbufdflushes, 1);
|
2015-04-27 11:13:19 +00:00
|
|
|
}
|
2003-10-05 22:16:08 +00:00
|
|
|
vn_finished_write(mp);
|
2015-04-27 11:13:19 +00:00
|
|
|
if (unlock)
|
2020-01-03 22:29:58 +00:00
|
|
|
VOP_UNLOCK(vp);
|
2003-03-13 07:19:23 +00:00
|
|
|
flushwithdeps += hasdeps;
|
2005-06-08 20:26:05 +00:00
|
|
|
flushed++;
|
2015-04-27 11:13:19 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Sleeping on runningbufspace while holding
|
|
|
|
* vnode lock leads to deadlock.
|
|
|
|
*/
|
|
|
|
if (curproc == bufdaemonproc &&
|
|
|
|
runningbufspace > hirunningspace)
|
2009-03-16 15:39:46 +00:00
|
|
|
waitrunningbufspace();
|
2005-06-08 20:26:05 +00:00
|
|
|
continue;
|
2002-12-14 01:35:30 +00:00
|
|
|
}
|
2003-10-14 00:38:34 +00:00
|
|
|
vn_finished_write(mp);
|
2003-03-13 07:19:23 +00:00
|
|
|
BUF_UNLOCK(bp);
|
2002-12-14 01:35:30 +00:00
|
|
|
}
|
2018-02-20 00:06:07 +00:00
|
|
|
BQ_LOCK(bq);
|
|
|
|
TAILQ_REMOVE(&bq->bq_queue, sentinel, b_freelist);
|
|
|
|
BQ_UNLOCK(bq);
|
2009-04-16 09:37:48 +00:00
|
|
|
free(sentinel, M_TEMP);
|
2005-06-08 20:26:05 +00:00
|
|
|
return (flushed);
|
1997-06-15 17:56:53 +00:00
|
|
|
}
|
|
|
|
|
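The sentinel technique used by flushbufqueues() above can be illustrated with a small userland sketch. It assumes a hypothetical `struct item` in place of `struct buf`, uses an `is_sentinel` flag where the kernel code checks `b_qindex == QUEUE_SENTINEL`, and only marks with comments where the queue lock would be taken and dropped. It is not the kernel implementation, just the traversal pattern: a marker element is kept in the queue so the walker can release the lock between steps and still resume from the right place.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct buf; only what the sketch needs. */
struct item {
	TAILQ_ENTRY(item) link;
	int is_sentinel;	/* marker element, never flushed */
	int dirty;		/* nonzero if the item needs writing */
};
TAILQ_HEAD(itemq, item);

/*
 * Clean up to `target` dirty items, walking the queue from the head.
 * The sentinel is moved past each examined item while the (imaginary)
 * queue lock is held, so the position survives dropping the lock.
 * Returns the number of items cleaned.
 */
static int
flush_with_sentinel(struct itemq *q, int target)
{
	struct item sentinel = { .is_sentinel = 1 };
	struct item *it;
	int flushed = 0;

	/* queue lock would be taken here */
	TAILQ_INSERT_HEAD(q, &sentinel, link);
	/* queue lock would be dropped here */

	while (flushed != target) {
		/* queue lock would be taken here */
		it = TAILQ_NEXT(&sentinel, link);
		if (it == NULL) {
			/* queue lock would be dropped here */
			break;
		}
		/* Step the marker past the item we are about to examine. */
		TAILQ_REMOVE(q, &sentinel, link);
		TAILQ_INSERT_AFTER(q, it, &sentinel, link);
		/* queue lock would be dropped here */

		/* Skip other walkers' markers and already clean items. */
		if (it->is_sentinel || !it->dirty)
			continue;
		it->dirty = 0;	/* stands in for writing the buffer */
		flushed++;
	}

	/* queue lock would be taken here */
	TAILQ_REMOVE(q, &sentinel, link);
	/* queue lock would be dropped here */
	assert(flushed <= target || target < 0);
	return (flushed);
}
```

Because the marker is only ever moved forward past one element at a time, the relative order of the real items (and of any other walker's marker) is left unchanged.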
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
|
|
|
* Check to see if a block is currently memory resident.
|
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
struct buf *
|
2004-10-22 08:47:20 +00:00
|
|
|
incore(struct bufobj *bo, daddr_t blkno)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2020-07-24 17:34:44 +00:00
|
|
|
return (gbincore_unlocked(bo, blkno));
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
1995-02-22 09:16:07 +00:00
|
|
|
* Returns true if no I/O is needed to access the
|
|
|
|
* associated VM object. This is like incore except
|
|
|
|
* it also hunts around in the VM system for the data.
|
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
2020-10-09 23:49:42 +00:00
|
|
|
bool
|
1995-01-09 16:06:02 +00:00
|
|
|
inmem(struct vnode * vp, daddr_t blkno)
|
|
|
|
{
|
|
|
|
vm_object_t obj;
|
1998-12-22 14:43:58 +00:00
|
|
|
vm_offset_t toff, tinc, size;
|
2020-10-09 23:49:42 +00:00
|
|
|
vm_page_t m, n;
|
1995-12-11 04:58:34 +00:00
|
|
|
vm_ooffset_t off;
|
2020-10-09 23:49:42 +00:00
|
|
|
int valid;
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
2002-09-25 02:11:37 +00:00
|
|
|
ASSERT_VOP_LOCKED(vp, "inmem");
|
2001-07-04 16:20:28 +00:00
|
|
|
|
2004-10-22 08:47:20 +00:00
|
|
|
if (incore(&vp->v_bufobj, blkno))
|
2020-10-09 23:49:42 +00:00
|
|
|
return (true);
|
1995-08-24 13:59:14 +00:00
|
|
|
if (vp->v_mount == NULL)
|
2020-10-09 23:49:42 +00:00
|
|
|
return (false);
|
2005-01-25 00:40:01 +00:00
|
|
|
obj = vp->v_object;
|
|
|
|
if (obj == NULL)
|
2020-10-09 23:49:42 +00:00
|
|
|
return (false);
|
1995-01-09 16:06:02 +00:00
|
|
|
|
1998-12-22 14:43:58 +00:00
|
|
|
size = PAGE_SIZE;
|
|
|
|
if (size > vp->v_mount->mnt_stat.f_iosize)
|
|
|
|
size = vp->v_mount->mnt_stat.f_iosize;
|
1998-12-14 21:17:37 +00:00
|
|
|
off = (vm_ooffset_t)blkno * (vm_ooffset_t)vp->v_mount->mnt_stat.f_iosize;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
|
|
|
for (toff = 0; toff < vp->v_mount->mnt_stat.f_iosize; toff += tinc) {
|
2020-10-09 23:49:42 +00:00
|
|
|
m = vm_page_lookup_unlocked(obj, OFF_TO_IDX(off + toff));
|
|
|
|
recheck:
|
|
|
|
if (m == NULL)
|
|
|
|
return (false);
|
|
|
|
|
1998-12-22 14:43:58 +00:00
|
|
|
tinc = size;
|
|
|
|
if (tinc > PAGE_SIZE - ((toff + off) & PAGE_MASK))
|
|
|
|
tinc = PAGE_SIZE - ((toff + off) & PAGE_MASK);
|
2020-10-09 23:49:42 +00:00
|
|
|
/*
|
|
|
|
* Consider page validity only if page mapping didn't change
|
|
|
|
* during the check.
|
|
|
|
*/
|
|
|
|
valid = vm_page_is_valid(m,
|
|
|
|
(vm_offset_t)((toff + off) & PAGE_MASK), tinc);
|
|
|
|
n = vm_page_lookup_unlocked(obj, OFF_TO_IDX(off + toff));
|
|
|
|
if (m != n) {
|
|
|
|
m = n;
|
|
|
|
goto recheck;
|
|
|
|
}
|
|
|
|
if (!valid)
|
|
|
|
return (false);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2020-10-09 23:49:42 +00:00
|
|
|
return (true);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
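The loop in inmem() above walks one filesystem block in chunks that never cross a page boundary. A minimal user-space sketch of that stepping follows. It is an illustration only: the SK_PAGE_SIZE constant and the sk_count_chunks() helper are hypothetical names, a 4096-byte page is assumed, and the page lookup and validity check are left out, so it only counts how many chunks the loop would visit.

```c
#include <stdio.h>

/* Assumed page geometry for the sketch; the kernel uses PAGE_SIZE/PAGE_MASK. */
#define SK_PAGE_SIZE 4096UL
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: count the chunks the inmem()-style loop visits
 * for a block that starts at byte offset "off" and is "iosize" bytes long.
 */
static unsigned
sk_count_chunks(unsigned long off, unsigned long iosize)
{
	unsigned long size, toff, tinc;
	unsigned n = 0;

	/* A chunk is at most one page and at most one filesystem block. */
	size = SK_PAGE_SIZE;
	if (size > iosize)
		size = iosize;

	for (toff = 0; toff < iosize; toff += tinc) {
		tinc = size;
		/* Clip the chunk so that it ends at the next page boundary. */
		if (tinc > SK_PAGE_SIZE - ((toff + off) & SK_PAGE_MASK))
			tinc = SK_PAGE_SIZE - ((toff + off) & SK_PAGE_MASK);
		n++;	/* the real code checks page validity here */
	}
	return (n);
}
```

With these assumptions a 16 KB block at offset 0 is visited as four page-sized chunks, while a 512-byte block is visited as a single chunk inside one page.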
1995-05-21 21:39:31 +00:00
|
|
|
/*
|
2010-06-08 17:54:28 +00:00
|
|
|
* Set the dirty range for a buffer based on the status of the dirty
|
|
|
|
* bits in the pages comprising the buffer. The range is limited
|
|
|
|
* to the size of the buffer.
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
*
|
2010-06-08 17:54:28 +00:00
|
|
|
* Tell the VM system that the pages associated with this buffer
|
|
|
|
* are clean. This is used for delayed writes where the data is
|
|
|
|
* going to go to disk eventually without additional VM intervention.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
2010-06-08 17:54:28 +00:00
|
|
|
* Note that while we only really need to clean through to b_bcount, we
|
|
|
|
* just go ahead and clean through to b_bufsize.
|
1995-05-21 21:39:31 +00:00
|
|
|
*/
|
|
|
|
static void
|
2010-06-08 17:54:28 +00:00
|
|
|
vfs_clean_pages_dirty_buf(struct buf *bp)
|
1999-03-12 02:24:58 +00:00
|
|
|
{
|
2010-06-08 17:54:28 +00:00
|
|
|
vm_ooffset_t foff, noff, eoff;
|
|
|
|
vm_page_t m;
|
|
|
|
int i;
|
1999-05-02 23:57:16 +00:00
|
|
|
|
2010-06-08 17:54:28 +00:00
|
|
|
if ((bp->b_flags & B_VMIO) == 0 || bp->b_bufsize == 0)
|
1999-05-02 23:57:16 +00:00
|
|
|
return;
|
1999-01-12 11:59:34 +00:00
|
|
|
|
2010-06-08 17:54:28 +00:00
|
|
|
foff = bp->b_offset;
|
|
|
|
KASSERT(bp->b_offset != NOOFFSET,
|
|
|
|
("vfs_clean_pages_dirty_buf: no buffer offset"));
|
1999-03-12 02:24:58 +00:00
|
|
|
|
2019-10-15 03:35:11 +00:00
|
|
|
vfs_busy_pages_acquire(bp);
|
2019-10-29 20:37:59 +00:00
|
|
|
vfs_setdirty_range(bp);
|
2010-06-08 17:54:28 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
noff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK;
|
|
|
|
eoff = noff;
|
|
|
|
if (eoff > bp->b_offset + bp->b_bufsize)
|
|
|
|
eoff = bp->b_offset + bp->b_bufsize;
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
vfs_page_set_validclean(bp, foff, m);
|
|
|
|
/* vm_page_clear_dirty(m, foff & PAGE_MASK, eoff - foff); */
|
|
|
|
foff = noff;
|
|
|
|
}
|
2019-10-15 03:35:11 +00:00
|
|
|
vfs_busy_pages_release(bp);
|
2006-10-29 00:04:39 +00:00
|
|
|
}
|
|
|
|
|
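The loop in vfs_clean_pages_dirty_buf() above advances foff to the next page boundary on each pass and clamps the end of the span to the end of the buffer. A minimal user-space sketch of that arithmetic follows. It is an illustration only: sk_span_total() and the SK_ constants are hypothetical names, a 4096-byte page is assumed, and the call to vfs_page_set_validclean() is replaced by summing the span lengths.

```c
#include <stdint.h>

/* Assumed page geometry for the sketch; the kernel uses PAGE_SIZE/PAGE_MASK. */
#define SK_PAGE_SIZE 4096
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: add up the per-page spans [foff, eoff) for a
 * buffer that starts at byte offset b_offset, holds b_bufsize bytes,
 * and is backed by npages pages.
 */
static int64_t
sk_span_total(int64_t b_offset, int64_t b_bufsize, int npages)
{
	int64_t foff, noff, eoff, total = 0;
	int i;

	foff = b_offset;
	for (i = 0; i < npages; i++) {
		/* Round up to the start of the next page. */
		noff = (foff + SK_PAGE_SIZE) & ~(int64_t)SK_PAGE_MASK;
		eoff = noff;
		/* Never run past the end of the buffer. */
		if (eoff > b_offset + b_bufsize)
			eoff = b_offset + b_bufsize;
		total += eoff - foff;	/* bytes of this page covered by the buffer */
		foff = noff;
	}
	return (total);
}
```

Under these assumptions the spans add up to b_bufsize whenever npages covers the whole buffer, including when b_offset is not page aligned.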
|
|
|
static void
|
2019-10-29 20:37:59 +00:00
|
|
|
vfs_setdirty_range(struct buf *bp)
|
2006-10-29 00:04:39 +00:00
|
|
|
{
|
2019-10-29 20:37:59 +00:00
|
|
|
vm_offset_t boffset;
|
|
|
|
vm_offset_t eoffset;
|
2006-10-29 00:04:39 +00:00
|
|
|
int i;
|
|
|
|
|
2009-06-01 06:12:08 +00:00
|
|
|
/*
|
2019-10-29 20:37:59 +00:00
|
|
|
* test the pages to see if they have been modified directly
|
|
|
|
* by users through the VM system.
|
2009-06-01 06:12:08 +00:00
|
|
|
*/
|
2019-10-29 20:37:59 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++)
|
|
|
|
vm_page_test_dirty(bp->b_pages[i]);
|
1995-05-21 21:39:31 +00:00
|
|
|
|
2019-10-29 20:37:59 +00:00
|
|
|
/*
|
|
|
|
* Calculate the encompassing dirty range, boffset and eoffset,
|
|
|
|
* (eoffset - boffset) bytes.
|
|
|
|
*/
|
1999-05-02 23:57:16 +00:00
|
|
|
|
2019-10-29 20:37:59 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
if (bp->b_pages[i]->dirty)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
boffset = (i << PAGE_SHIFT) - (bp->b_offset & PAGE_MASK);
|
1995-05-21 21:39:31 +00:00
|
|
|
|
2019-10-29 20:37:59 +00:00
|
|
|
for (i = bp->b_npages - 1; i >= 0; --i) {
|
|
|
|
if (bp->b_pages[i]->dirty) {
|
|
|
|
break;
|
1995-05-21 21:39:31 +00:00
|
|
|
}
|
2019-10-29 20:37:59 +00:00
|
|
|
}
|
|
|
|
eoffset = ((i + 1) << PAGE_SHIFT) - (bp->b_offset & PAGE_MASK);
|
1999-05-02 23:57:16 +00:00
|
|
|
|
2019-10-29 20:37:59 +00:00
|
|
|
/*
|
|
|
|
* Fit it to the buffer.
|
|
|
|
*/
|
1999-05-02 23:57:16 +00:00
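The B_CACHE rule stated in the commit message above can be sketched as a small pure function. This is only an illustration under stated assumptions: the SK_* flag values and the sk_update_cache_flag() helper are hypothetical and are not taken from sys/buf.h, and the validity of the backing store is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the values from sys/buf.h. */
#define SK_B_CACHE	0x01u
#define SK_B_INVAL	0x02u
#define SK_B_VMIO	0x04u

/*
 * Recompute the cache bit following the rule in the commit message:
 * a VMIO buffer follows the validity of its backing store, a non-VMIO
 * buffer has B_CACHE set exactly when B_INVAL is clear.
 */
static unsigned
sk_update_cache_flag(unsigned flags, bool backing_valid)
{
	bool cached;

	/* Sanity check: only the three sketch flags are understood here. */
	assert((flags & ~(SK_B_CACHE | SK_B_INVAL | SK_B_VMIO)) == 0);

	if ((flags & SK_B_VMIO) != 0)
		cached = backing_valid;
	else
		cached = (flags & SK_B_INVAL) == 0;
	if (cached)
		return (flags | SK_B_CACHE);
	return (flags & ~SK_B_CACHE);
}
```

The real kernel tracks validity per page rather than per buffer, so the boolean is a deliberate simplification.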
	if (eoffset > bp->b_bcount)
		eoffset = bp->b_bcount;
	/*
	 * If we have a good dirty range, merge with the existing
	 * dirty range.
	 */
	if (boffset < eoffset) {
		if (bp->b_dirtyoff > boffset)
			bp->b_dirtyoff = boffset;
		if (bp->b_dirtyend < eoffset)
			bp->b_dirtyend = eoffset;
	}
}
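The clamp-and-merge step at the end of the function above can be tried outside the kernel with a small stand-in. This is a minimal sketch, assuming a hypothetical struct sk_buf that carries only the three fields involved and a hypothetical helper sk_merge_dirty_range(); neither name exists in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three struct buf fields used here. */
struct sk_buf {
	long b_bcount;		/* valid byte count of the buffer */
	long b_dirtyoff;	/* start of the dirty range */
	long b_dirtyend;	/* end of the dirty range */
};

/*
 * Clamp the candidate range [boffset, eoffset) to the buffer, then
 * widen the existing dirty range to cover it.  An empty candidate
 * range leaves the buffer untouched.
 */
static void
sk_merge_dirty_range(struct sk_buf *bp, long boffset, long eoffset)
{
	assert(bp != NULL && boffset >= 0);

	/* Fit the range to the buffer. */
	if (eoffset > bp->b_bcount)
		eoffset = bp->b_bcount;

	/* Merge only if something is left after clamping. */
	if (boffset < eoffset) {
		if (bp->b_dirtyoff > boffset)
			bp->b_dirtyoff = boffset;
		if (bp->b_dirtyend < eoffset)
			bp->b_dirtyend = eoffset;
	}
}
```

The merge only ever grows the dirty range, so repeated calls with overlapping ranges are safe.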
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the consumer
with the GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation
is performed at all. With the GB_KVAALLOC flag, the consumer might
request an unmapped buffer which does have a KVA reserve, to manually
map it without recursing into the buffer cache and blocking.
When a mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in
g_vfs_strategy(). An unmapped bio carries a pointer to the vm_page_t
array, offset and length instead of the data pointer. The provider
which processes the bio should explicitly declare its readiness to
accept unmapped bios, otherwise the g_down geom thread performs a
transient upgrade of the bio request by mapping the pages into the
new bio_transient_map KVA submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it can be manually tuned by the kern.bio_transient_maxcnt
tunable, in units of transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; disabling it makes the buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the
maxbufspace limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
/*
 * Allocate the KVA mapping for an existing buffer.
 * If an unmapped buffer is provided but a mapped buffer is requested, take
 * also care to properly setup mappings between pages and KVA.
 */
static void
bp_unmapped_get_kva(struct buf *bp, daddr_t blkno, int size, int gbflags)
{
	int bsize, maxsize, need_mapping, need_kva;
	off_t offset;

	need_mapping = bp->b_data == unmapped_buf &&
	    (gbflags & GB_UNMAPPED) == 0;
	need_kva = bp->b_kvabase == unmapped_buf &&
	    bp->b_data == unmapped_buf &&
	    (gbflags & GB_KVAALLOC) != 0;
	if (!need_mapping && !need_kva)
		return;

	BUF_CHECK_UNMAPPED(bp);

	if (need_mapping && bp->b_kvabase != unmapped_buf) {
		/*
		 * Buffer is not mapped, but the KVA was already
		 * reserved at the time of the instantiation. Use the
		 * allocated space.
		 */
		goto has_addr;
	}

	/*
	 * Calculate the amount of the address space we would reserve
	 * if the buffer was mapped.
	 */
	bsize = vn_isdisk(bp->b_vp) ? DEV_BSIZE : bp->b_bufobj->bo_bsize;
	KASSERT(bsize != 0, ("bsize == 0, check bo->bo_bsize"));
	offset = blkno * bsize;
	maxsize = size + (offset & PAGE_MASK);
	maxsize = imax(maxsize, bsize);

	while (bufkva_alloc(bp, maxsize, gbflags) != 0) {
		if ((gbflags & GB_NOWAIT_BD) != 0) {
			/*
			 * XXXKIB: defragmentation cannot
			 * succeed, not sure what else to do.
			 */
			panic("GB_NOWAIT_BD and GB_UNMAPPED %p", bp);
		}
		counter_u64_add(mappingrestarts, 1);
		bufspace_wait(bufdomain(bp), bp->b_vp, gbflags, 0, 0);
	}
has_addr:
	if (need_mapping) {
		/* b_offset is handled by bpmap_qenter. */
		bp->b_data = bp->b_kvabase;
		BUF_CHECK_MAPPED(bp);
		bpmap_qenter(bp);
	}
2013-03-19 14:13:12 +00:00
|
|
|
}
|
|
|
|
|
2018-05-13 09:47:28 +00:00
|
|
|
struct buf *
|
|
|
|
getblk(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo,
|
|
|
|
int flags)
|
|
|
|
{
|
|
|
|
struct buf *bp;
|
|
|
|
int error;
|
|
|
|
|
2019-12-03 23:07:09 +00:00
|
|
|
error = getblkx(vp, blkno, blkno, size, slpflag, slptimeo, flags, &bp);
|
2018-05-13 09:47:28 +00:00
|
|
|
if (error != 0)
|
|
|
|
return (NULL);
|
|
|
|
return (bp);
|
|
|
|
}
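The wrapper above follows a common pattern: an error-returning routine with an out-parameter (getblkx()) is adapted into a pointer-returning one (getblk()) by mapping every error to NULL. A minimal userland sketch of that pattern follows. The names `toy_buf`, `toy_getblkx` and `toy_getblk` and the array-backed "cache" are hypothetical stand-ins for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buf; only the block number is modelled. */
struct toy_buf {
	long lblkno;
};

/*
 * Hypothetical model of an error-returning lookup in the style of getblkx():
 * returns 0 with *bpp set on success, or an errno value with *bpp untouched.
 */
static int
toy_getblkx(struct toy_buf *cache, size_t n, long blkno, struct toy_buf **bpp)
{
	assert(bpp != NULL);
	for (size_t i = 0; i < n; i++) {
		if (cache[i].lblkno == blkno) {
			*bpp = &cache[i];
			return (0);
		}
	}
	return (ENOENT);
}

/* Pointer-returning wrapper in the style of getblk(): any error becomes NULL. */
static struct toy_buf *
toy_getblk(struct toy_buf *cache, size_t n, long blkno)
{
	struct toy_buf *bp;

	if (toy_getblkx(cache, n, blkno, &bp) != 0)
		return (NULL);
	return (bp);
}
```

The trade-off is that callers of the wrapper cannot tell why a lookup failed; callers that need the error code would use the out-parameter form directly.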
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2018-05-13 09:47:28 +00:00
|
|
|
* getblkx:
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other RPCs in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* Get a block given a specified block and offset into a file/device.
|
|
|
|
* The buffer's B_DONE bit will be cleared on return, making it almost
|
|
|
|
* ready for an I/O initiation. B_INVAL may or may not be set on
|
|
|
|
* return. The caller should clear B_INVAL prior to initiating a
|
|
|
|
* READ.
|
|
|
|
*
|
|
|
|
* For a non-VMIO buffer, B_CACHE is set to the opposite of B_INVAL for
|
|
|
|
* an existing buffer.
|
|
|
|
*
|
|
|
|
* For a VMIO buffer, B_CACHE is modified according to the backing VM.
|
|
|
|
* If getblk()ing a previously 0-sized invalid buffer, B_CACHE is set
|
|
|
|
* and then cleared based on the backing VM. If the previous buffer is
|
|
|
|
* non-0-sized but invalid, B_CACHE will be cleared.
|
|
|
|
*
|
|
|
|
* If getblk() must create a new buffer, the new buffer is returned with
|
|
|
|
* both B_INVAL and B_CACHE clear unless it is a VMIO buffer, in which
|
|
|
|
* case it is returned with B_INVAL clear and B_CACHE set based on the
|
|
|
|
* backing VM.
|
|
|
|
*
|
2019-12-03 23:07:09 +00:00
|
|
|
* getblk() also forces a bwrite() for any B_DELWRI buffer whose
|
1999-05-02 23:57:16 +00:00
|
|
|
* B_CACHE bit is clear.
|
2020-07-10 09:01:36 +00:00
|
|
|
*
|
1999-05-02 23:57:16 +00:00
|
|
|
* What this means, basically, is that the caller should use B_CACHE to
|
|
|
|
* determine whether the buffer is fully valid or not and should clear
|
|
|
|
* B_INVAL prior to issuing a read. If the caller intends to validate
|
|
|
|
* the buffer by loading its data area with something, the caller needs
|
|
|
|
* to clear B_INVAL. If the caller does this without issuing an I/O,
|
|
|
|
* the caller should set B_CACHE (as an optimization), else the caller
|
|
|
|
* should issue the I/O and biodone() will set B_CACHE if the I/O was
|
2016-04-29 21:54:28 +00:00
|
|
|
* a write attempt or if it was a successful read. If the caller
|
2000-04-02 15:24:56 +00:00
|
|
|
* intends to issue a READ, the caller must clear B_INVAL and BIO_ERROR
|
1999-05-02 23:57:16 +00:00
|
|
|
* prior to issuing the READ. biodone() will *not* clear B_INVAL.
|
2019-12-03 23:07:09 +00:00
|
|
|
*
|
|
|
|
* The blkno parameter is the logical block being requested. Normally
|
|
|
|
* the mapping of logical block number to disk block address is done
|
|
|
|
* by calling VOP_BMAP(). However, if the mapping is already known, the
|
|
|
|
* disk block address can be passed using the dblkno parameter. If the
|
|
|
|
* disk block address is not known, then the same value should be passed
|
|
|
|
* for blkno and dblkno.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
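The B_CACHE / B_INVAL contract described in the comment above can be modelled in userland. This is a sketch under stated assumptions: the `TOY_B_*` values, `toy_buf`, `toy_biodone` and `toy_bread` are hypothetical and only mirror the rules stated in the comment (the caller checks B_CACHE, clears B_INVAL before a read, and the completion routine sets B_CACHE after a write attempt or a successful read without ever clearing B_INVAL). Clearing BIO_ERROR is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real B_* constants live in the kernel headers. */
#define TOY_B_CACHE 0x1u
#define TOY_B_INVAL 0x2u

struct toy_buf {
	unsigned flags;
};

/*
 * Completion model: B_CACHE is set after any write attempt or after a
 * successful read. B_INVAL is deliberately left alone.
 */
static void
toy_biodone(struct toy_buf *bp, bool is_read, bool ok)
{
	if (!is_read || ok)
		bp->flags |= TOY_B_CACHE;
}

/*
 * Caller model: skip the I/O when the buffer is fully valid, otherwise
 * clear B_INVAL and issue the read. Returns true if a read was issued.
 */
static bool
toy_bread(struct toy_buf *bp, bool io_ok)
{
	if ((bp->flags & TOY_B_CACHE) != 0)
		return (false);
	bp->flags &= ~TOY_B_INVAL;	/* the caller's job, not biodone()'s */
	toy_biodone(bp, true, io_ok);
	return (true);
}
```

In this model a failed read leaves B_CACHE clear, so a later call would retry the I/O rather than trust stale contents.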
|
2018-05-13 09:47:28 +00:00
|
|
|
int
|
2019-12-03 23:07:09 +00:00
|
|
|
getblkx(struct vnode *vp, daddr_t blkno, daddr_t dblkno, int size, int slpflag,
|
|
|
|
int slptimeo, int flags, struct buf **bpp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1994-05-25 09:21:21 +00:00
|
|
|
struct buf *bp;
|
2004-10-22 08:47:20 +00:00
|
|
|
struct bufobj *bo;
|
2018-05-13 09:47:28 +00:00
|
|
|
daddr_t d_blkno;
|
2020-07-31 00:07:01 +00:00
|
|
|
int bsize, error, maxsize, vmio;
|
2013-03-19 14:13:12 +00:00
|
|
|
off_t offset;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "getblk(%p, %ld, %d)", vp, (long)blkno, size);
|
2013-03-19 14:13:12 +00:00
|
|
|
KASSERT((flags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_KVAALLOC,
|
|
|
|
("GB_KVAALLOC only makes sense with GB_UNMAPPED"));
|
2021-11-01 07:14:01 +00:00
|
|
|
if (vp->v_type != VCHR)
|
|
|
|
ASSERT_VOP_LOCKED(vp, "getblk");
|
2017-06-17 22:24:19 +00:00
|
|
|
if (size > maxbcachebuf)
|
|
|
|
panic("getblk: size(%d) > maxbcachebuf(%d)\n", size,
|
|
|
|
maxbcachebuf);
|
2013-03-19 14:13:12 +00:00
|
|
|
if (!unmapped_buf_allowed)
|
|
|
|
flags &= ~(GB_UNMAPPED | GB_KVAALLOC);
|
1996-11-28 04:26:04 +00:00
|
|
|
|
2004-10-22 08:47:20 +00:00
|
|
|
bo = &vp->v_bufobj;
|
2019-12-03 23:07:09 +00:00
|
|
|
d_blkno = dblkno;
|
2020-07-24 17:34:04 +00:00
|
|
|
|
|
|
|
/* Attempt lockless lookup first. */
|
|
|
|
bp = gbincore_unlocked(bo, blkno);
|
2021-01-27 17:59:50 +00:00
|
|
|
if (bp == NULL) {
|
|
|
|
/*
|
|
|
|
* With GB_NOCREAT we must be sure about not finding the buffer
|
|
|
|
* as it may have been reassigned during unlocked lookup.
|
|
|
|
*/
|
|
|
|
if ((flags & GB_NOCREAT) != 0)
|
|
|
|
goto loop;
|
2020-07-24 17:34:04 +00:00
|
|
|
goto newbuf_unlocked;
|
2021-01-27 17:59:50 +00:00
|
|
|
}
|
2020-07-24 17:34:04 +00:00
|
|
|
|
2020-07-31 00:07:01 +00:00
|
|
|
error = BUF_TIMELOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL, "getblku", 0,
|
|
|
|
0);
|
|
|
|
if (error != 0)
|
2020-07-24 17:34:04 +00:00
|
|
|
goto loop;
|
|
|
|
|
|
|
|
/* Verify buf identity has not changed since lookup. */
|
|
|
|
if (bp->b_bufobj == bo && bp->b_lblkno == blkno)
|
|
|
|
goto foundbuf_fastpath;
|
|
|
|
|
|
|
|
/* It changed, fall back to locked lookup. */
|
|
|
|
BUF_UNLOCK_RAW(bp);
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
loop:
|
2013-05-31 00:43:41 +00:00
|
|
|
BO_RLOCK(bo);
|
2004-10-22 08:47:20 +00:00
|
|
|
bp = gbincore(bo, blkno);
|
|
|
|
if (bp != NULL) {
|
2020-07-31 00:07:01 +00:00
|
|
|
int lockflags;
|
|
|
|
|
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
2012-05-15 09:55:15 +00:00
|
|
|
* Buffer is in-core. If the buffer is neither busy nor managed,
|
|
|
|
* it must be on a queue.
|
1999-05-02 23:57:16 +00:00
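The B_CACHE rule stated in the commit message above (cleared for an invalid buffer, set for a valid non-VMIO buffer, left to the backing store for a VMIO buffer) can be sketched as a small pure function. This is only an illustration: the `SK_` flag values and the function name `sk_adjust_cache` are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not the kernel's real values. */
#define SK_B_CACHE 0x01u
#define SK_B_INVAL 0x02u
#define SK_B_VMIO  0x04u

/*
 * Apply the rule to the flags of a buffer that was found and locked:
 *   invalid buffer        -> B_CACHE is cleared
 *   valid non-VMIO buffer -> B_CACHE is set
 *   valid VMIO buffer     -> B_CACHE is left alone here; it is adjusted
 *                            later from the validity of the backing pages
 */
static unsigned
sk_adjust_cache(unsigned flags)
{
	if (flags & SK_B_INVAL)
		flags &= ~SK_B_CACHE;
	else if ((flags & SK_B_VMIO) == 0)
		flags |= SK_B_CACHE;
	return (flags);
}
```

For example, `sk_adjust_cache(SK_B_INVAL | SK_B_CACHE)` drops the cache bit, while `sk_adjust_cache(SK_B_VMIO)` returns its argument unchanged.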
|
|
|
*/
|
2020-07-31 00:13:40 +00:00
|
|
|
lockflags = LK_EXCLUSIVE | LK_INTERLOCK |
|
|
|
|
((flags & GB_LOCK_NOWAIT) ? LK_NOWAIT : LK_SLEEPFAIL);
|
2003-03-04 00:04:44 +00:00
|
|
|
|
|
|
|
error = BUF_TIMELOCK(bp, lockflags,
|
2013-05-31 00:43:41 +00:00
|
|
|
BO_LOCKPTR(bo), "getblk", slpflag, slptimeo);
|
2003-02-25 03:37:48 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we slept and got the lock we have to restart in case
|
|
|
|
* the buffer changed identities.
|
|
|
|
*/
|
|
|
|
if (error == ENOLCK)
|
|
|
|
goto loop;
|
|
|
|
/* We timed out or were interrupted. */
|
2018-05-13 09:47:28 +00:00
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2020-07-24 17:34:04 +00:00
|
|
|
|
|
|
|
foundbuf_fastpath:
|
2013-02-27 07:34:09 +00:00
|
|
|
/* If recursed, assume caller knows the rules. */
|
2020-07-24 17:34:04 +00:00
|
|
|
if (BUF_LOCKRECURSED(bp))
|
2013-02-27 07:34:09 +00:00
|
|
|
goto end;
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
|
|
|
|
/*
|
1999-06-26 02:47:16 +00:00
|
|
|
* The buffer is locked. B_CACHE is cleared if the buffer is
|
2002-03-05 15:38:49 +00:00
|
|
|
* invalid. Otherwise, for a non-VMIO buffer, B_CACHE is set
|
1999-05-02 23:57:16 +00:00
|
|
|
* and for a VMIO buffer B_CACHE is adjusted according to the
|
|
|
|
* backing VM cache.
|
|
|
|
*/
|
|
|
|
if (bp->b_flags & B_INVAL)
|
|
|
|
bp->b_flags &= ~B_CACHE;
|
1999-06-26 02:47:16 +00:00
|
|
|
else if ((bp->b_flags & (B_VMIO | B_INVAL)) == 0)
|
1999-05-02 23:57:16 +00:00
|
|
|
bp->b_flags |= B_CACHE;
|
2012-05-15 09:55:15 +00:00
|
|
|
if (bp->b_flags & B_MANAGED)
|
|
|
|
MPASS(bp->b_qindex == QUEUE_NONE);
|
2013-05-31 00:43:41 +00:00
|
|
|
else
|
2012-05-15 09:55:15 +00:00
|
|
|
bremfree(bp);
|
1997-06-15 17:56:53 +00:00
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2013-03-14 20:31:39 +00:00
|
|
|
* check for size inconsistencies for non-VMIO case.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
if (bp->b_bcount != size) {
|
1999-01-21 08:29:12 +00:00
|
|
|
if ((bp->b_flags & B_VMIO) == 0 ||
|
The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
1999-07-04 00:25:38 +00:00
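The four-queue split described in the commit message above (clean and dirty buffers kept apart, and empty buffers that still hold a KVM assignment kept apart from truly empty ones) can be sketched as a classification function. This is a minimal sketch under assumptions: the struct `sk_qbuf`, its fields, and the test used to decide that a buffer is "empty" are hypothetical stand-ins and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative queue names mirroring the ones in the commit message. */
enum sk_queue {
	SK_QUEUE_CLEAN,
	SK_QUEUE_DIRTY,
	SK_QUEUE_EMPTY,
	SK_QUEUE_EMPTYKVA
};

/* Hypothetical, simplified view of a buffer header. */
struct sk_qbuf {
	long bufsize;	/* bytes of storage currently backing the buffer */
	long kvasize;	/* bytes of kernel virtual address space reserved */
	bool dirty;	/* delayed-write data is pending */
};

/* Choose the queue a released buffer would be placed on. */
static enum sk_queue
sk_pick_queue(const struct sk_qbuf *bp)
{
	if (bp->bufsize == 0)	/* assumption: no storage means "empty" */
		return (bp->kvasize != 0 ? SK_QUEUE_EMPTYKVA : SK_QUEUE_EMPTY);
	return (bp->dirty ? SK_QUEUE_DIRTY : SK_QUEUE_CLEAN);
}
```

Keeping the empty-with-KVA case separate lets an allocator that needs address space pick a buffer that already has some, instead of scanning a mixed list.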
|
|
|
(size > bp->b_kvasize)) {
|
Some VM improvements, including elimination of a lot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
|
|
|
if (bp->b_flags & B_DELWRI) {
|
1998-03-17 17:36:05 +00:00
|
|
|
bp->b_flags |= B_NOCACHE;
|
2004-03-11 18:02:36 +00:00
|
|
|
bwrite(bp);
|
1998-03-16 01:56:03 +00:00
|
|
|
} else {
|
2007-02-22 14:52:59 +00:00
|
|
|
if (LIST_EMPTY(&bp->b_dep)) {
|
1998-03-17 17:36:05 +00:00
|
|
|
bp->b_flags |= B_RELBUF;
|
|
|
|
brelse(bp);
|
|
|
|
} else {
|
|
|
|
bp->b_flags |= B_NOCACHE;
|
2004-03-11 18:02:36 +00:00
|
|
|
bwrite(bp);
|
1998-03-17 17:36:05 +00:00
|
|
|
}
|
1998-03-16 01:56:03 +00:00
|
|
|
}
|
1995-09-23 21:12:45 +00:00
|
|
|
goto loop;
|
|
|
|
}
|
|
|
|
}
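The size check that ends here can be summarized as a decision function: a buffer whose size does not match the request is discarded and the lookup retried, unless it is a VMIO buffer that fits in its existing KVA, in which case it is resized in place afterwards. The sketch below restates that control flow only. The struct `sk_szbuf`, its fields, and the `sk_action` names are hypothetical stand-ins for the real `struct buf` fields and the `bwrite()`/`brelse()` calls.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_action {
	SK_KEEP,		/* size already matches */
	SK_RESIZE,		/* VMIO buffer that fits: resize in place */
	SK_WRITE_AND_RETRY,	/* mark no-cache, write it out, retry lookup */
	SK_RELEASE_AND_RETRY	/* mark for release, drop it, retry lookup */
};

/* Hypothetical, simplified view of the fields the check looks at. */
struct sk_szbuf {
	long bcount;	/* current logical size */
	long kvasize;	/* reserved kernel virtual address space */
	bool vmio;	/* backed by VM pages */
	bool delwri;	/* delayed write pending */
	bool has_deps;	/* dependency list is non-empty */
};

static enum sk_action
sk_size_check(const struct sk_szbuf *bp, long size)
{
	if (bp->bcount == size)
		return (SK_KEEP);
	if (bp->vmio && size <= bp->kvasize)
		return (SK_RESIZE);
	/* Dirty data or pending dependencies must reach the disk first. */
	if (bp->delwri || bp->has_deps)
		return (SK_WRITE_AND_RETRY);
	return (SK_RELEASE_AND_RETRY);
}
```

Both retry outcomes correspond to the `goto loop` in the listing: the buffer is given up and the lookup starts over.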
|
1999-01-21 08:29:12 +00:00
|
|
|
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; when it is disabled, buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
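The upgrade rules in the commit message above (an unmapped buffer stays untouched when only GB_UNMAPPED is requested, gains a KVA reservation when GB_KVAALLOC is also requested, and becomes mapped when a mapped buffer is requested) can be modeled with two booleans. This is a sketch under assumptions: `sk_ubuf`, `sk_get_kva`, and the `SK_GB_` values are hypothetical, and the model ignores allocation failure and blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request flags; not the kernel's real values. */
#define SK_GB_UNMAPPED 0x1
#define SK_GB_KVAALLOC 0x2

/* Hypothetical model of the mapping state of an existing buffer. */
struct sk_ubuf {
	bool has_kva;	/* a KVA range is reserved for the buffer */
	bool mapped;	/* the pages are mapped into that range */
};

/*
 * Bring an existing buffer in line with what the caller asked for.
 * Returns true if the buffer state had to change.
 */
static bool
sk_get_kva(struct sk_ubuf *bp, int gbflags)
{
	bool need_mapping = !bp->mapped && (gbflags & SK_GB_UNMAPPED) == 0;
	bool need_kva = !bp->has_kva && !bp->mapped &&
	    (gbflags & SK_GB_KVAALLOC) != 0;

	if (!need_mapping && !need_kva)
		return (false);
	bp->has_kva = true;	/* reserve KVA, or reuse the existing reservation */
	if (need_mapping)
		bp->mapped = true;
	return (true);
}
```

A buffer first obtained with `SK_GB_UNMAPPED | SK_GB_KVAALLOC` and later requested with no flags is upgraded to mapped while keeping the reservation it already had.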
|
|
|
/*
|
|
|
|
* Handle the case of unmapped buffer which should
|
|
|
|
* become mapped, or the buffer for which KVA
|
|
|
|
* reservation is requested.
|
|
|
|
*/
|
|
|
|
bp_unmapped_get_kva(bp, blkno, size, flags);
|
|
|
|
|
1999-01-21 08:29:12 +00:00
|
|
|
/*
|
2016-04-29 21:54:28 +00:00
|
|
|
* If the size is inconsistent in the VMIO case, we can resize
|
1999-05-02 23:57:16 +00:00
|
|
|
* the buffer. This might lead to B_CACHE getting set or
|
|
|
|
* cleared. If the size has not changed, B_CACHE remains
|
|
|
|
* unchanged from its previous state.
|
1999-01-21 08:29:12 +00:00
|
|
|
*/
|
2015-09-27 05:16:06 +00:00
|
|
|
allocbuf(bp, size);
|
1999-01-21 08:29:12 +00:00
|
|
|
|
1999-01-08 17:31:30 +00:00
|
|
|
KASSERT(bp->b_offset != NOOFFSET,
|
1999-01-10 01:58:29 +00:00
|
|
|
("getblk: no buffer offset"));
|
1999-01-21 08:29:12 +00:00
|
|
|
|
Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, make page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
1998-03-16 01:56:03 +00:00
		/*
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
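The non-VMIO rule described above (B_CACHE set if B_INVAL is clear, and the reverse) can be sketched in a few lines of C. This is only an illustration: the flag values and the helper name toy_sync_cache_flag() are assumptions made for the example and are not the definitions in sys/buf.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define TOY_B_CACHE 0x01u
#define TOY_B_INVAL 0x02u

static_assert(TOY_B_CACHE != TOY_B_INVAL, "flag bits must be distinct");

/*
 * Apply the non-VMIO rule from the text: B_CACHE is set when B_INVAL
 * is clear, and cleared when B_INVAL is set. Other bits are untouched.
 */
static inline uint32_t
toy_sync_cache_flag(uint32_t flags)
{
	if (flags & TOY_B_INVAL)
		return (flags & ~TOY_B_CACHE);
	return (flags | TOY_B_CACHE);
}
```

A caller would pass the current flag word and store the result back; the VMIO case is not covered here because it depends on the validity of the backing pages.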
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other RPCs in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
		 * A buffer with B_DELWRI set and B_CACHE clear must
		 * be committed before we can return the buffer in
		 * order to prevent the caller from issuing a read
		 * ( due to B_CACHE not being set ) and overwriting
		 * it.
1999-01-23 06:36:15 +00:00
		 *
1999-05-02 23:57:16 +00:00
		 * Most callers, including NFS and FFS, need this to
		 * operate properly either because they assume they
		 * can issue a read if B_CACHE is not set, or because
		 * ( for example ) an uncached B_DELWRI might loop due
		 * to softupdates re-dirtying the buffer. In the latter
		 * case, B_CACHE is set after the first write completes,
		 * preventing further loops.
2001-12-14 01:16:57 +00:00
		 * NOTE! b*write() sets B_CACHE. If we cleared B_CACHE
		 * above while extending the buffer, we cannot allow the
		 * buffer to remain with B_CACHE set after the write
		 * completes or it will represent a corrupt state. To
		 * deal with this we set B_NOCACHE to scrap the buffer
		 * after the write.
		 *
		 * We might be able to do something fancy, like setting
		 * B_CACHE in bwrite() except if B_DELWRI is already set,
		 * so the below call doesn't set B_CACHE, but that gets real
		 * confusing. This is much easier.
1999-01-21 08:29:12 +00:00
		 */

		if ((bp->b_flags & (B_CACHE|B_DELWRI)) == B_DELWRI) {
2001-12-14 01:16:57 +00:00
			bp->b_flags |= B_NOCACHE;
2004-03-11 18:02:36 +00:00
			bwrite(bp);
1999-01-21 08:29:12 +00:00
			goto loop;
		}
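The mask-and-compare test above is true only when B_DELWRI is set and B_CACHE is clear, regardless of any other bits. A minimal sketch of that idiom follows; the flag values and the helper name toy_must_flush() are assumptions for the example, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define TOY_B_CACHE  0x01u
#define TOY_B_DELWRI 0x02u

static_assert(TOY_B_CACHE != TOY_B_DELWRI, "flag bits must be distinct");

/*
 * Mask out everything except the two bits of interest, then compare
 * against the one combination we care about: DELWRI set, CACHE clear.
 */
static inline bool
toy_must_flush(uint32_t flags)
{
	return ((flags & (TOY_B_CACHE | TOY_B_DELWRI)) == TOY_B_DELWRI);
}
```

Comparing against the masked value, rather than testing each bit separately, keeps the "one set, one clear" condition in a single expression.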
1999-05-02 23:57:16 +00:00
		bp->b_flags &= ~B_DONE;
1994-05-25 09:21:21 +00:00
	} else {
1999-05-02 23:57:16 +00:00
		/*
		 * Buffer is not in-core, create new buffer. The buffer
1999-06-26 02:47:16 +00:00
		 * returned by getnewbuf() is locked. Note that the returned
		 * buffer is also considered valid (not marked B_INVAL).
1999-05-02 23:57:16 +00:00
		 */
2013-05-31 00:43:41 +00:00
		BO_RUNLOCK(bo);
2020-07-24 17:34:04 +00:00
newbuf_unlocked:
2003-08-31 08:50:11 +00:00
		/*
		 * If the user does not want us to create the buffer, bail out
		 * here.
		 */
2005-04-30 12:18:50 +00:00
		if (flags & GB_NOCREAT)
2018-05-13 09:47:28 +00:00
			return (EEXIST);
2013-06-05 23:53:00 +00:00
2020-08-19 02:51:17 +00:00
		bsize = vn_isdisk(vp) ? DEV_BSIZE : bo->bo_bsize;
2014-09-04 00:10:06 +00:00
		KASSERT(bsize != 0, ("bsize == 0, check bo->bo_bsize"));
2002-06-21 06:18:05 +00:00
		offset = blkno * bsize;
2005-01-25 00:40:01 +00:00
		vmio = vp->v_object != NULL;
Implement the concept of unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the consumer
with the GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation is
performed at all. With the GB_KVAALLOC flag, the consumer may request an
unmapped buffer which does have a KVA reserve, so that it can map the
buffer manually without recursing into the buffer cache and blocking.
When a mapped buffer is requested and an unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, an offset and a
length instead of the data pointer. The provider which processes the bio
should explicitly declare that it is ready to accept unmapped bios;
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
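As a rough illustration of what a request described by a page array, an offset and a length (rather than a data pointer) implies for a consumer, the sketch below computes how many pages such a request touches. The struct toy_unmapped_req, its field names and the 4096-byte page size are assumptions made for the example; they are not the layout of struct bio.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical descriptor: pages plus (offset, length), no data pointer. */
struct toy_unmapped_req {
	void	**pages;	/* stand-in for the page array */
	size_t	  offset;	/* byte offset into the first page */
	size_t	  length;	/* number of bytes to transfer */
};

/* Number of pages covered by the byte range [offset, offset + length). */
static inline size_t
toy_req_npages(const struct toy_unmapped_req *req)
{
	/* The offset is assumed to fall inside the first page. */
	assert(req->offset < TOY_PAGE_SIZE);
	if (req->length == 0)
		return (0);
	return ((req->offset + req->length + TOY_PAGE_SIZE - 1) /
	    TOY_PAGE_SIZE);
}
```

A consumer that cannot work from the page array directly would need that many pages mapped before it could use an ordinary data pointer, which is the job the text assigns to the transient mapping.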
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it can be tuned manually with the kern.bio_transient_maxcnt
tunable, in units of transient mappings. Eventually, the
bio_transient_map could be removed once all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off with the vfs.unmapped_buf_allowed
tunable; disabling it makes buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the
maxbufspace limit. Since metadata buffers are always mapped, they
still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on them. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation did not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
		if (vmio) {
			maxsize = size + (offset & PAGE_MASK);
		} else {
			maxsize = size;
			/* Do not allow non-VMIO unmapped buffers. */
2015-07-23 19:13:41 +00:00
			flags &= ~(GB_UNMAPPED | GB_KVAALLOC);
2013-03-19 14:13:12 +00:00
		}
1998-12-22 14:43:58 +00:00
		maxsize = imax(maxsize, bsize);
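The size calculation above can be worked through with concrete numbers. The sketch below mirrors the arithmetic only: toy_maxsize() and the 4096-byte page size are assumptions for the example, and the real code takes bsize from the vnode or buffer object rather than as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096		/* assumed page size */
#define TOY_PAGE_MASK (TOY_PAGE_SIZE - 1)

/*
 * For a VM-backed (vmio) buffer, the space needed grows by however far
 * the block's byte offset sits past a page boundary. The result is
 * never smaller than one block.
 */
static inline int
toy_maxsize(int64_t blkno, int bsize, int size, bool vmio)
{
	int64_t offset;
	int maxsize;

	assert(bsize > 0);
	offset = blkno * bsize;
	maxsize = size;
	if (vmio)
		maxsize += (int)(offset & TOY_PAGE_MASK);
	return (maxsize > bsize ? maxsize : bsize);
}
```

For example, with 512-byte blocks, block 3 starts at byte 1536, which is 1536 bytes into its page, so a 512-byte VM-backed request needs 2048 bytes; block 8 starts exactly on a page boundary and needs only 512.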
2018-05-13 09:47:28 +00:00
		if ((flags & GB_NOSPARSE) != 0 && vmio &&
2020-08-19 02:51:17 +00:00
		    !vn_isdisk(vp)) {
2018-05-13 09:47:28 +00:00
			error = VOP_BMAP(vp, blkno, NULL, &d_blkno, 0, 0);
			KASSERT(error != EOPNOTSUPP,
			    ("GB_NOSPARSE from fs not supporting bmap, vp %p",
			    vp));
			if (error != 0)
				return (error);
			if (d_blkno == -1)
				return (EJUSTRETURN);
		}
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. They
represent the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size; we no longer have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
2015-10-14 02:10:07 +00:00
		bp = getnewbuf(vp, slpflag, slptimeo, maxsize, flags);
2004-10-22 08:47:20 +00:00
		if (bp == NULL) {
2005-04-30 12:18:50 +00:00
			if (slpflag || slptimeo)
2018-05-13 09:47:28 +00:00
				return (ETIMEDOUT);
2015-11-07 04:04:00 +00:00
			/*
			 * XXX This is here until the sleep path is diagnosed
			 * enough to work under very low memory conditions.
			 *
			 * There's an issue on low memory, 4BSD+non-preempt
			 * systems (e.g. MIPS routers with 32MB RAM) where buffer
			 * exhaustion occurs without sleeping for buffer
			 * reclamation. This just sticks in a loop and
			 * constantly attempts to allocate a buffer, which
			 * hits exhaustion and tries to wake up bufdaemon.
			 * This never happens because we never yield.
			 *
			 * The real solution is to identify and fix these cases
			 * so we aren't effectively busy-waiting in a loop
			 * until the reclamation path has cycles to run.
			 */
			kern_yield(PRI_USER);
1995-01-09 16:06:02 +00:00
			goto loop;
		}
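The retry path above (attempt an allocation, yield the CPU on failure, try again) can be sketched in user space. This is an analogy under stated assumptions, not the kernel code: POSIX sched_yield() stands in for kern_yield(), and toy_alloc_with_yield(), toy_pool and toy_pool_alloc() are hypothetical names invented for the example.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical allocator callback: returns NULL while exhausted. */
typedef void *(*toy_alloc_fn)(void *arg);

/* Toy resource that fails a fixed number of times before succeeding. */
struct toy_pool {
	int	failures_left;
	int	token;
};

static inline void *
toy_pool_alloc(void *arg)
{
	struct toy_pool *pool = arg;

	if (pool->failures_left > 0) {
		pool->failures_left--;
		return (NULL);
	}
	return (&pool->token);
}

/*
 * Retry until the allocator succeeds, yielding between attempts so
 * that whoever reclaims resources gets a chance to run.
 */
static inline void *
toy_alloc_with_yield(toy_alloc_fn alloc, void *arg, int *attempts)
{
	void *p;

	assert(alloc != NULL && attempts != NULL);
	*attempts = 0;
	for (;;) {
		(*attempts)++;
		p = alloc(arg);
		if (p != NULL)
			return (p);
		(void)sched_yield();
	}
}
```

As the comment in the source notes, yielding in a loop is a stopgap: it avoids starving the reclaiming thread but still amounts to polling rather than sleeping until a buffer is available.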
1995-04-09 06:02:46 +00:00
1995-02-22 09:30:13 +00:00
		/*
1995-04-09 06:02:46 +00:00
		 * This code is used to make sure that a buffer is not
1995-05-30 08:16:23 +00:00
		 * created while the getnewbuf routine is blocked.
1999-03-02 21:23:38 +00:00
		 * This can be a problem whether the vnode is locked or not.
1999-05-02 23:57:16 +00:00
|
|
|
* If the buffer is created out from under us, we have to
|
2005-04-30 12:18:50 +00:00
|
|
|
* throw away the one we just created.
|
2002-07-10 17:02:32 +00:00
|
|
|
*
|
|
|
|
* Note: this must occur before we associate the buffer
|
|
|
|
* with the vp especially considering limitations in
|
|
|
|
* the splay tree implementation when dealing with duplicate
|
|
|
|
* lblkno's.
|
1995-02-22 09:30:13 +00:00
|
|
|
*/
|
2004-10-22 08:47:20 +00:00
|
|
|
BO_LOCK(bo);
|
|
|
|
if (gbincore(bo, blkno)) {
|
|
|
|
BO_UNLOCK(bo);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
bp->b_flags |= B_INVAL;
|
2018-03-17 18:14:49 +00:00
|
|
|
bufspace_release(bufdomain(bp), maxsize);
|
1995-01-09 16:06:02 +00:00
|
|
|
brelse(bp);
|
1994-05-25 09:21:21 +00:00
|
|
|
goto loop;
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
1995-04-09 06:02:46 +00:00
|
|
|
|
1995-02-22 09:30:13 +00:00
|
|
|
/*
|
|
|
|
* Insert the buffer into the hash, so that it can
|
|
|
|
* be found by incore.
|
|
|
|
*/
|
2018-05-13 09:47:28 +00:00
|
|
|
bp->b_lblkno = blkno;
|
|
|
|
bp->b_blkno = d_blkno;
|
1998-12-22 14:43:58 +00:00
|
|
|
bp->b_offset = offset;
|
1994-05-25 09:21:21 +00:00
|
|
|
bgetvp(vp, bp);
|
2004-10-22 08:47:20 +00:00
|
|
|
BO_UNLOCK(bo);
|
1995-02-22 09:30:13 +00:00
|
|
|
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other RPCs in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
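
The B_CACHE rule stated in the commit message above can be illustrated with a small standalone sketch. This is not the kernel code: the flag values are made up for the example (the real definitions live in sys/buf.h), and update_b_cache() and its backing_valid parameter are hypothetical names standing in for whatever check the VMIO path performs on the backing store.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only; see sys/buf.h for the real ones. */
#define B_CACHE 0x01u
#define B_INVAL 0x02u
#define B_VMIO  0x04u

/*
 * Recompute B_CACHE for a buffer's flag word.
 * Non-VMIO: B_CACHE is set exactly when B_INVAL is clear.
 * VMIO: B_CACHE follows the validity of the backing store, which the
 * caller reports through backing_valid (an assumption of this sketch).
 */
static unsigned
update_b_cache(unsigned flags, bool backing_valid)
{
	bool cached;

	if ((flags & B_VMIO) != 0)
		cached = backing_valid;
	else
		cached = (flags & B_INVAL) == 0;

	return (cached ? (flags | B_CACHE) : (flags & ~B_CACHE));
}
```

Calling update_b_cache(B_INVAL | B_CACHE, true) on a non-VMIO buffer clears B_CACHE, because B_INVAL is set and the backing-store argument is ignored on that path.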
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* set B_VMIO bit. allocbuf() the buffer bigger. Since the
|
|
|
|
* buffer size starts out as 0, B_CACHE will be set by
|
|
|
|
* allocbuf() for the VMIO case prior to it testing the
|
|
|
|
* backing store for validity.
|
|
|
|
*/
|
|
|
|
|
1998-12-22 14:43:58 +00:00
|
|
|
if (vmio) {
|
1999-05-02 23:57:16 +00:00
|
|
|
bp->b_flags |= B_VMIO;
|
2005-01-25 00:40:01 +00:00
|
|
|
KASSERT(vp->v_object == bp->b_bufobj->bo_object,
|
2004-11-04 09:06:54 +00:00
|
|
|
("ARGH! different b_bufobj->bo_object %p %p %p\n",
|
2005-01-25 00:40:01 +00:00
|
|
|
bp, vp->v_object, bp->b_bufobj->bo_object));
|
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
|
|
|
bp->b_flags &= ~B_VMIO;
|
2004-11-04 09:06:54 +00:00
|
|
|
KASSERT(bp->b_bufobj->bo_object == NULL,
|
|
|
|
("ARGH! has b_bufobj->bo_object %p %p\n",
|
|
|
|
bp, bp->b_bufobj->bo_object));
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
1995-02-22 09:30:13 +00:00
|
|
|
|
1995-04-09 06:02:46 +00:00
|
|
|
allocbuf(bp, size);
|
2018-03-17 18:14:49 +00:00
|
|
|
bufspace_release(bufdomain(bp), maxsize);
|
1999-05-02 23:57:16 +00:00
|
|
|
bp->b_flags &= ~B_DONE;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR4(KTR_BUF, "getblk(%p, %ld, %d) = %p", vp, (long)blkno, size, bp);
|
2013-02-27 07:34:09 +00:00
|
|
|
end:
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
2004-10-22 08:47:20 +00:00
|
|
|
KASSERT(bp->b_bufobj == bo,
|
2005-06-14 20:32:27 +00:00
|
|
|
("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo));
|
2018-05-13 09:47:28 +00:00
|
|
|
*bpp = bp;
|
|
|
|
return (0);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
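
The "created out from under us" handling near the end of getblk() above follows a common pattern: prepare the new object without holding the lock, then recheck under the lock and throw the new object away if another thread won the race. Below is a minimal userland sketch of that pattern, assuming a tiny fixed table in place of the buffer object. All names (xbuf, xlookup, xinsert, xalloc, xgetblk) are hypothetical, the simulate_race argument fakes the competing thread, and the lock itself is only marked by comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NSLOTS 8		/* hypothetical table size */

struct xbuf {
	long lblkno;		/* logical block number used as the key */
};

static struct xbuf *slots[NSLOTS];
static int discarded;		/* counts buffers thrown away after losing a race */

static struct xbuf *
xlookup(long lblkno)
{
	for (size_t i = 0; i < NSLOTS; i++)
		if (slots[i] != NULL && slots[i]->lblkno == lblkno)
			return (slots[i]);
	return (NULL);
}

static int
xinsert(struct xbuf *bp)
{
	for (size_t i = 0; i < NSLOTS; i++) {
		if (slots[i] == NULL) {
			slots[i] = bp;
			return (0);
		}
	}
	return (-1);		/* table full */
}

static struct xbuf *
xalloc(long lblkno)
{
	struct xbuf *bp = malloc(sizeof(*bp));

	if (bp != NULL)
		bp->lblkno = lblkno;
	return (bp);
}

static struct xbuf *
xgetblk(long lblkno, int simulate_race)
{
	struct xbuf *bp;

loop:
	if ((bp = xlookup(lblkno)) != NULL)
		return (bp);
	if ((bp = xalloc(lblkno)) == NULL)	/* done unlocked; may block in a kernel */
		return (NULL);
	if (simulate_race) {
		/* Pretend another thread inserted the same block meanwhile. */
		struct xbuf *other = xalloc(lblkno);

		simulate_race = 0;
		if (other != NULL && xinsert(other) != 0)
			free(other);
	}
	/* A real implementation takes the lock here (BO_LOCK in the source). */
	if (xlookup(lblkno) != NULL) {
		/* Lost the race: discard our copy and start over. */
		free(bp);
		discarded++;
		goto loop;
	}
	if (xinsert(bp) != 0) {
		free(bp);
		return (NULL);
	}
	/* ...and drops the lock here (BO_UNLOCK in the source). */
	return (bp);
}
```

The recheck has to happen before the new buffer becomes visible in the table, which mirrors the note in the source comment that the check must occur before the buffer is associated with the vp.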
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1999-05-02 23:57:16 +00:00
|
|
|
* Get an empty, disassociated buffer of given size. The buffer is initially
|
|
|
|
* set to B_INVAL.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
struct buf *
|
Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.
First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.
Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.
In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks
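
The non-blocking allocation described in this message can be sketched outside the kernel as a retry loop with an early exit. The sketch follows the condition as it appears in geteblk() in this file, where the loop gives up only when GB_NOWAIT_BD is passed and the calling thread is marked TDP_BUFNEED. The flag values, struct xbuf, try_alloc() and get_empty_buf() are hypothetical stand-ins, and failures_left is just a knob that makes the fake allocator fail a chosen number of times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; only the names come from the source. */
#define GB_NOWAIT_BD 0x01
#define TDP_BUFNEED  0x01

struct xbuf {
	int size;
};

static struct xbuf pool_buf;	/* the single buffer this toy allocator hands out */
static int failures_left;	/* try_alloc() fails this many more times */
static int attempts;		/* number of try_alloc() calls made */

/* Stand-in for getnewbuf(): fails until failures_left reaches zero. */
static struct xbuf *
try_alloc(int size)
{
	attempts++;
	if (failures_left > 0) {
		failures_left--;
		return (NULL);
	}
	pool_buf.size = size;
	return (&pool_buf);
}

/*
 * Retry until a buffer is available, unless the caller asked not to
 * wait and the current thread is already flagged as helping the
 * buffer daemon, in which case failure is reported immediately.
 */
static struct xbuf *
get_empty_buf(int size, int flags, int td_pflags)
{
	struct xbuf *bp;

	while ((bp = try_alloc(size)) == NULL) {
		if ((flags & GB_NOWAIT_BD) != 0 &&
		    (td_pflags & TDP_BUFNEED) != 0)
			return (NULL);
	}
	return (bp);
}
```

A caller that gets NULL back is expected to have a fallback. The message names one: ffs_bufwrite() falls back to a synchronous cylinder-group block write when the shadow block cannot be allocated.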
2009-03-16 15:39:46 +00:00
|
|
|
geteblk(int size, int flags)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1994-05-25 09:21:21 +00:00
|
|
|
struct buf *bp;
|
2000-03-27 21:29:33 +00:00
|
|
|
int maxsize;
|
|
|
|
|
|
|
|
maxsize = (size + BKVAMASK) & ~BKVAMASK;
|
2015-10-14 02:10:07 +00:00
|
|
|
while ((bp = getnewbuf(NULL, 0, 0, maxsize, flags)) == NULL) {
|
2009-03-16 15:39:46 +00:00
|
|
|
if ((flags & GB_NOWAIT_BD) &&
|
|
|
|
(curthread->td_pflags & TDP_BUFNEED) != 0)
|
|
|
|
return (NULL);
|
|
|
|
}
|
1995-03-26 23:29:13 +00:00
|
|
|
allocbuf(bp, size);
|
2018-03-17 18:14:49 +00:00
|
|
|
bufspace_release(bufdomain(bp), maxsize);
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
bp->b_flags |= B_INVAL; /* b_dep cleared by getnewbuf() */
|
1994-05-25 09:21:21 +00:00
|
|
|
return (bp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2015-09-22 23:57:52 +00:00
|
|
|
/*
|
|
|
|
* Truncate the backing store for a non-VMIO buffer.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_nonvmio_truncate(struct buf *bp, int newbsize)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (bp->b_flags & B_MALLOC) {
|
|
|
|
/*
|
|
|
|
* Malloced buffers are not shrunk; they are only freed
* when the new size is zero.
|
|
|
|
*/
|
|
|
|
if (newbsize == 0) {
|
|
|
|
bufmallocadjust(bp, 0);
|
|
|
|
free(bp->b_data, M_BIOBUF);
|
|
|
|
bp->b_data = bp->b_kvabase;
|
|
|
|
bp->b_flags &= ~B_MALLOC;
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
vm_hold_free_pages(bp, newbsize);
|
2015-10-14 02:10:07 +00:00
|
|
|
bufspace_adjust(bp, newbsize);
|
2015-09-22 23:57:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extend the backing store for a non-VMIO buffer.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_nonvmio_extend(struct buf *bp, int newbsize)
|
|
|
|
{
|
|
|
|
caddr_t origbuf;
|
|
|
|
int origbufsize;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We only use malloced memory on the first allocation,
|
|
|
|
* and revert to page-allocated memory when the buffer
|
|
|
|
* grows.
|
|
|
|
*
|
|
|
|
* There is a potential smp race here that could lead
|
|
|
|
* to bufmallocspace slightly passing the max. It
|
|
|
|
* is probably extremely rare and not worth worrying
|
|
|
|
* over.
|
|
|
|
*/
|
|
|
|
if (bp->b_bufsize == 0 && newbsize <= PAGE_SIZE/2 &&
|
|
|
|
bufmallocspace < maxbufmallocspace) {
|
|
|
|
bp->b_data = malloc(newbsize, M_BIOBUF, M_WAITOK);
|
|
|
|
bp->b_flags |= B_MALLOC;
|
|
|
|
bufmallocadjust(bp, newbsize);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the buffer is growing on its other-than-first
|
|
|
|
* allocation then we revert to the page-allocation
|
|
|
|
* scheme.
|
|
|
|
*/
|
|
|
|
origbuf = NULL;
|
|
|
|
origbufsize = 0;
|
|
|
|
if (bp->b_flags & B_MALLOC) {
|
|
|
|
origbuf = bp->b_data;
|
|
|
|
origbufsize = bp->b_bufsize;
|
|
|
|
bp->b_data = bp->b_kvabase;
|
|
|
|
bufmallocadjust(bp, 0);
|
|
|
|
bp->b_flags &= ~B_MALLOC;
|
|
|
|
newbsize = round_page(newbsize);
|
|
|
|
}
|
|
|
|
vm_hold_load_pages(bp, (vm_offset_t) bp->b_data + bp->b_bufsize,
|
|
|
|
(vm_offset_t) bp->b_data + newbsize);
|
|
|
|
if (origbuf != NULL) {
|
|
|
|
bcopy(origbuf, bp->b_data, origbufsize);
|
|
|
|
free(origbuf, M_BIOBUF);
|
|
|
|
}
|
2015-10-14 02:10:07 +00:00
|
|
|
bufspace_adjust(bp, newbsize);
|
2015-09-22 23:57:52 +00:00
|
|
|
}
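The choice between malloc-backed and page-backed storage made in vfs_nonvmio_extend() above can be modelled in user space. The following is a minimal sketch, not the kernel code: `struct sk_buf`, `sk_use_malloc()` and the 4096-byte `SK_PAGE_SIZE` are hypothetical names and values introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size, platform dependent */

/* Hypothetical stand-in for the fields the decision reads. */
struct sk_buf {
	size_t	bufsize;	/* current backing size, 0 if none yet */
	bool	malloced;	/* backing currently comes from malloc */
};

/*
 * Mirror of the test in vfs_nonvmio_extend(): malloc is used only for
 * the first allocation, only for requests of at most half a page, and
 * only while the malloc budget has not been reached.
 */
static bool
sk_use_malloc(const struct sk_buf *bp, size_t newbsize,
    size_t mallocspace, size_t maxmallocspace)
{
	return (bp->bufsize == 0 && newbsize <= SK_PAGE_SIZE / 2 &&
	    mallocspace < maxmallocspace);
}
```

Under these assumptions a fresh buffer asking for 1024 bytes takes the malloc path, while the same buffer asking for a full page, or any buffer that already has backing store, falls through to page allocation.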
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
1995-02-22 09:16:07 +00:00
|
|
|
* This code constitutes the buffer memory from either anonymous system
|
|
|
|
* memory (in the case of non-VMIO operations) or from an associated
|
1999-01-21 08:29:12 +00:00
|
|
|
* VM object (in the case of VMIO operations). This code is able to
|
|
|
|
* resize a buffer up or down.
|
1995-02-22 09:16:07 +00:00
|
|
|
*
|
|
|
|
* Note that this code is tricky, and has many complications to resolve
|
2016-04-29 21:54:28 +00:00
|
|
|
* deadlock or inconsistent data situations. Tread lightly!!!
|
1999-01-21 08:29:12 +00:00
|
|
|
* There are B_CACHE and B_DELWRI interactions that must be dealt with by
|
|
|
|
* the caller. Calling this code willy-nilly can result in the loss of data.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* allocbuf() only adjusts B_CACHE for VMIO buffers. getblk() deals with
|
|
|
|
* B_CACHE for the non-VMIO case.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
int
|
1999-01-21 08:29:12 +00:00
|
|
|
allocbuf(struct buf *bp, int size)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-09-22 23:57:52 +00:00
|
|
|
int newbsize;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2015-09-27 05:16:06 +00:00
|
|
|
if (bp->b_bcount == size)
|
|
|
|
return (1);
|
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
if (bp->b_kvasize != 0 && bp->b_kvasize < size)
|
1996-11-30 22:41:49 +00:00
|
|
|
panic("allocbuf: buffer too small");
|
|
|
|
|
2016-04-21 19:57:40 +00:00
|
|
|
newbsize = roundup2(size, DEV_BSIZE);
|
1995-04-09 06:02:46 +00:00
|
|
|
if ((bp->b_flags & B_VMIO) == 0) {
|
2015-09-22 23:57:52 +00:00
|
|
|
if ((bp->b_flags & B_MALLOC) == 0)
|
|
|
|
newbsize = round_page(newbsize);
|
1995-02-22 09:16:07 +00:00
|
|
|
/*
|
1999-05-02 23:57:16 +00:00
|
|
|
* Just get anonymous memory from the kernel. Don't
|
|
|
|
* mess with B_CACHE.
|
1995-02-22 09:16:07 +00:00
|
|
|
*/
|
2015-09-22 23:57:52 +00:00
|
|
|
if (newbsize < bp->b_bufsize)
|
|
|
|
vfs_nonvmio_truncate(bp, newbsize);
|
|
|
|
else if (newbsize > bp->b_bufsize)
|
|
|
|
vfs_nonvmio_extend(bp, newbsize);
|
1995-01-09 16:06:02 +00:00
|
|
|
} else {
|
|
|
|
int desiredpages;
|
|
|
|
|
1998-12-22 14:43:58 +00:00
|
|
|
desiredpages = (size == 0) ? 0 :
|
2015-09-22 23:57:52 +00:00
|
|
|
num_pages((bp->b_offset & PAGE_MASK) + newbsize);
|
1995-01-09 16:06:02 +00:00
|
|
|
|
1996-03-02 04:40:56 +00:00
|
|
|
if (bp->b_flags & B_MALLOC)
|
|
|
|
panic("allocbuf: VMIO buffer can't be malloced");
|
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* Set B_CACHE initially if buffer is 0 length or will become
|
|
|
|
* 0-length.
|
|
|
|
*/
|
|
|
|
if (size == 0 || bp->b_bufsize == 0)
|
|
|
|
bp->b_flags |= B_CACHE;
|
1996-03-02 04:40:56 +00:00
|
|
|
|
2015-09-22 23:57:52 +00:00
|
|
|
if (newbsize < bp->b_bufsize)
|
|
|
|
vfs_vmio_truncate(bp, desiredpages);
|
|
|
|
/* XXX This looks as if it should be newbsize > b_bufsize */
|
|
|
|
else if (size > bp->b_bcount)
|
|
|
|
vfs_vmio_extend(bp, desiredpages, size);
|
2015-10-14 02:10:07 +00:00
|
|
|
bufspace_adjust(bp, newbsize);
|
2015-09-22 23:57:52 +00:00
|
|
|
}
|
2015-07-23 19:13:41 +00:00
|
|
|
bp->b_bcount = size; /* requested buffer size. */
|
2015-09-27 05:16:06 +00:00
|
|
|
return (1);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
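The size arithmetic in allocbuf() above (rounding to DEV_BSIZE, rounding non-VMIO buffers to whole pages, and counting the pages a VMIO buffer spans) can be checked in isolation. This is a user-space sketch under stated assumptions: the `sk_` helpers are hypothetical, and the 512-byte device block and 4096-byte page are assumed values that differ between platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; real kernels define these per platform. */
#define SK_DEV_BSIZE 512u
#define SK_PAGE_SIZE 4096u
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/* Round x up to a multiple of align, which must be a power of two. */
static size_t
sk_roundup2(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((x + align - 1) & ~(align - 1));
}

/* Backing size for a non-VMIO, non-malloced buffer: whole pages. */
static size_t
sk_nonvmio_bsize(size_t size)
{
	return (sk_roundup2(sk_roundup2(size, SK_DEV_BSIZE), SK_PAGE_SIZE));
}

/* Pages a VMIO buffer must span, given its byte offset in the object. */
static size_t
sk_vmio_desiredpages(size_t offset, size_t size)
{
	size_t newbsize;

	if (size == 0)
		return (0);
	newbsize = sk_roundup2(size, SK_DEV_BSIZE);
	return (sk_roundup2((offset & SK_PAGE_MASK) + newbsize,
	    SK_PAGE_SIZE) / SK_PAGE_SIZE);
}
```

With these assumed constants, a 4096-byte request that starts 2048 bytes into a page needs two pages, because the in-page offset is added before the page count is taken, while the same request at a page-aligned offset needs one.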
|
|
|
|
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
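The commit message above describes an unmapped bio as carrying a page array, an offset and a length in place of a data pointer, with a transient mapping created when the provider cannot take unmapped requests. A minimal sketch of that decision follows. Every name here (`struct sk_bio`, `SK_BIO_UNMAPPED`, `sk_needs_transient_map()`) is hypothetical and is not the GEOM API; only the rule itself is taken from the message.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_BIO_UNMAPPED 0x1u	/* hypothetical flag value */

struct sk_page;			/* opaque stand-in for a VM page */

/* Hypothetical request: either data or pages describes the payload. */
struct sk_bio {
	unsigned	  flags;
	void		 *data;		/* used when the request is mapped */
	struct sk_page	**pages;	/* used when the request is unmapped */
	size_t		  page_offset;	/* offset into the first page */
	size_t		  length;	/* payload length in bytes */
};

/*
 * An unmapped request needs a transient mapping only when the provider
 * has not declared that it accepts unmapped requests.
 */
static bool
sk_needs_transient_map(const struct sk_bio *bp, bool provider_accepts_unmapped)
{
	return ((bp->flags & SK_BIO_UNMAPPED) != 0 &&
	    !provider_accepts_unmapped);
}
```

A mapped request never needs the transient step, which is why the flag is tested first.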
2013-03-19 14:13:12 +00:00
|
|
|
extern int inflight_transient_maps;
|
|
|
|
|
2018-06-01 23:49:32 +00:00
|
|
|
static struct bio_queue nondump_bios;
|
|
|
|
|
2002-09-13 11:28:31 +00:00
|
|
|
void
|
|
|
|
biodone(struct bio *bp)
|
|
|
|
{
|
2008-03-21 10:00:05 +00:00
|
|
|
struct mtx *mtxp;
|
2005-09-29 10:37:20 +00:00
|
|
|
void (*done)(struct bio *);
|
2013-03-19 14:13:12 +00:00
|
|
|
vm_offset_t start, end;
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2016-10-31 23:09:52 +00:00
|
|
|
biotrack(bp, __func__);
|
2018-06-01 23:49:32 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Avoid completing I/O when dumping after a panic since that may
|
|
|
|
* result in a deadlock in the filesystem or pager code. Note that
|
|
|
|
* this doesn't affect dumps that were started manually since we aim
|
|
|
|
* to keep the system usable after it has been resumed.
|
|
|
|
*/
|
|
|
|
if (__predict_false(dumping && SCHEDULER_STOPPED())) {
|
|
|
|
TAILQ_INSERT_HEAD(&nondump_bios, bp, bio_queue);
|
|
|
|
return;
|
|
|
|
}
|
2013-03-19 14:13:12 +00:00
|
|
|
if ((bp->bio_flags & BIO_TRANSIENT_MAPPING) != 0) {
|
2013-10-21 06:44:55 +00:00
|
|
|
bp->bio_flags &= ~BIO_TRANSIENT_MAPPING;
|
|
|
|
bp->bio_flags |= BIO_UNMAPPED;
|
2013-03-19 14:13:12 +00:00
|
|
|
start = trunc_page((vm_offset_t)bp->bio_data);
|
|
|
|
end = round_page((vm_offset_t)bp->bio_data + bp->bio_length);
|
2015-03-16 20:00:09 +00:00
|
|
|
bp->bio_data = unmapped_buf;
|
2017-03-14 19:39:17 +00:00
|
|
|
pmap_qremove(start, atop(end - start));
|
2013-10-21 06:44:55 +00:00
|
|
|
vmem_free(transient_arena, start, end - start);
|
|
|
|
atomic_add_int(&inflight_transient_maps, -1);
|
2013-03-19 14:13:12 +00:00
|
|
|
}
|
2005-09-29 10:37:20 +00:00
|
|
|
done = bp->bio_done;
|
2021-10-23 14:25:49 +00:00
|
|
|
/*
|
|
|
|
* The check for done == biodone is to allow biodone to be
|
|
|
|
* used as a bio_done routine.
|
|
|
|
*/
|
|
|
|
if (done == NULL || done == biodone) {
|
2013-10-16 09:56:40 +00:00
|
|
|
mtxp = mtx_pool_find(mtxpool_sleep, bp);
|
|
|
|
mtx_lock(mtxp);
|
|
|
|
bp->bio_flags |= BIO_DONE;
|
2003-03-13 07:31:45 +00:00
|
|
|
wakeup(bp);
|
2013-10-16 09:56:40 +00:00
|
|
|
mtx_unlock(mtxp);
|
2017-01-10 21:41:28 +00:00
|
|
|
} else
|
2005-09-29 10:37:20 +00:00
|
|
|
done(bp);
|
2002-09-13 11:28:31 +00:00
|
|
|
}
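The transient-mapping teardown in biodone() above derives a page-aligned range from bio_data and bio_length before calling pmap_qremove() and vmem_free(). The arithmetic can be sketched in user space as follows. This is only an illustration: the 4096-byte page size and every sketch_* name are assumptions made for the example, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; the kernel uses the per-architecture PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK ((uintptr_t)SKETCH_PAGE_SIZE - 1)

/* Round an address down to the start of its page (hypothetical stand-in for trunc_page()). */
uintptr_t
sketch_trunc_page(uintptr_t va)
{
	return (va & ~SKETCH_PAGE_MASK);
}

/* Round an address up to the next page boundary (hypothetical stand-in for round_page()). */
uintptr_t
sketch_round_page(uintptr_t va)
{
	return ((va + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK);
}

/* Number of whole pages covering the byte range [data, data + length). */
size_t
sketch_span_pages(uintptr_t data, size_t length)
{
	uintptr_t start = sketch_trunc_page(data);
	uintptr_t end = sketch_round_page(data + length);

	assert(end >= start);
	return ((size_t)((end - start) / SKETCH_PAGE_SIZE));
}
```

A two-byte transfer that straddles a page boundary still spans two pages, which is why the range is rounded outward at both ends before the mapping is removed.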
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for a BIO to finish.
|
|
|
|
*/
|
|
|
|
int
|
2021-10-17 03:25:27 +00:00
|
|
|
biowait(struct bio *bp, const char *wmesg)
|
2002-09-13 11:28:31 +00:00
|
|
|
{
|
2008-03-21 10:00:05 +00:00
|
|
|
struct mtx *mtxp;
|
2002-09-13 11:28:31 +00:00
|
|
|
|
2008-03-21 10:00:05 +00:00
|
|
|
mtxp = mtx_pool_find(mtxpool_sleep, bp);
|
|
|
|
mtx_lock(mtxp);
|
2002-09-13 11:28:31 +00:00
|
|
|
while ((bp->bio_flags & BIO_DONE) == 0)
|
2021-10-17 03:25:27 +00:00
|
|
|
msleep(bp, mtxp, PRIBIO, wmesg, 0);
|
2008-03-21 10:00:05 +00:00
|
|
|
mtx_unlock(mtxp);
|
2002-09-15 17:52:35 +00:00
|
|
|
if (bp->bio_error != 0)
|
2002-09-13 11:28:31 +00:00
|
|
|
return (bp->bio_error);
|
2002-09-26 16:32:14 +00:00
|
|
|
if (!(bp->bio_flags & BIO_ERROR))
|
|
|
|
return (0);
|
2002-09-13 11:28:31 +00:00
|
|
|
return (EIO);
|
|
|
|
}
|
|
|
|
|
2002-09-14 19:34:11 +00:00
|
|
|
void
|
|
|
|
biofinish(struct bio *bp, struct devstat *stat, int error)
|
|
|
|
{
|
2020-07-10 09:01:36 +00:00
|
|
|
|
2002-09-14 19:34:11 +00:00
|
|
|
if (error) {
|
|
|
|
bp->bio_error = error;
|
|
|
|
bp->bio_flags |= BIO_ERROR;
|
|
|
|
}
|
|
|
|
if (stat != NULL)
|
|
|
|
devstat_end_transaction_bio(stat, bp);
|
|
|
|
biodone(bp);
|
|
|
|
}
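The way biofinish() records an error and biowait() later reports it can be sketched without any kernel machinery. The structure, flag values, and sketch_* names below are hypothetical stand-ins for illustration. Locking, sleeping, and the devstat accounting are left out, and the done flag is set unconditionally, whereas biodone() only sets BIO_DONE when no separate bio_done callback is installed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values; the real BIO_* constants live in the kernel headers. */
#define SKETCH_BIO_ERROR 0x01
#define SKETCH_BIO_DONE  0x02

/* Minimal stand-in for the two struct bio fields used here. */
struct sketch_bio {
	int error;
	int flags;
};

/* Record an error (if any) and mark the request complete, as biofinish() does. */
void
sketch_biofinish(struct sketch_bio *bp, int error)
{
	assert(bp != NULL);
	if (error) {
		bp->error = error;
		bp->flags |= SKETCH_BIO_ERROR;
	}
	bp->flags |= SKETCH_BIO_DONE;	/* stands in for the call to biodone() */
}

/* Status that biowait() would return once the request is done. */
int
sketch_biowait_status(const struct sketch_bio *bp)
{
	assert(bp->flags & SKETCH_BIO_DONE);
	if (bp->error != 0)
		return (bp->error);	/* a specific error code wins */
	if (!(bp->flags & SKETCH_BIO_ERROR))
		return (0);		/* no error flag: success */
	return (EIO);			/* flag set without a code: generic I/O error */
}
```

The order of the checks matters: an explicit error code is preferred, and EIO is only the fallback for a request flagged as failed without a code.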
|
|
|
|
|
2016-10-31 23:09:52 +00:00
|
|
|
#if defined(BUF_TRACKING) || defined(FULL_BUF_TRACKING)
|
|
|
|
void
|
|
|
|
biotrack_buf(struct bio *bp, const char *location)
|
|
|
|
{
|
|
|
|
|
|
|
|
buf_track(bp->bio_track_bp, location);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2000-04-30 06:16:03 +00:00
|
|
|
* bufwait:
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* Wait for buffer I/O completion, returning error status. The buffer
|
2003-01-01 18:49:04 +00:00
|
|
|
* is left locked and B_DONE on return. B_EINTR is converted into an EINTR
|
1999-05-02 23:57:16 +00:00
|
|
|
* error and cleared.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
|
|
|
int
|
2004-09-15 20:54:23 +00:00
|
|
|
bufwait(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2003-03-13 07:31:45 +00:00
|
|
|
if (bp->b_iocmd == BIO_READ)
|
|
|
|
bwait(bp, PRIBIO, "biord");
|
|
|
|
else
|
|
|
|
bwait(bp, PRIBIO, "biowr");
|
1995-04-09 06:02:46 +00:00
|
|
|
if (bp->b_flags & B_EINTR) {
|
|
|
|
bp->b_flags &= ~B_EINTR;
|
|
|
|
return (EINTR);
|
|
|
|
}
|
2000-04-02 15:24:56 +00:00
|
|
|
if (bp->b_ioflags & BIO_ERROR) {
|
1995-04-09 06:02:46 +00:00
|
|
|
return (bp->b_error ? bp->b_error : EIO);
|
1994-05-25 09:21:21 +00:00
|
|
|
} else {
|
|
|
|
return (0);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
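The status logic at the end of bufwait() has one side effect worth noting: the interrupted flag is consumed by the first caller that observes it. A minimal sketch of that precedence follows. The structure, flag values, and sketch_* names are assumptions for illustration only, and the actual wait (bwait()) is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values standing in for B_EINTR and BIO_ERROR. */
#define SKETCH_B_EINTR     0x01
#define SKETCH_BUF_IOERROR 0x01

/* Minimal stand-in for the struct buf fields consulted after the wait. */
struct sketch_buf {
	int flags;	/* plays the role of b_flags */
	int ioflags;	/* plays the role of b_ioflags */
	int error;	/* plays the role of b_error */
};

/* Status that bufwait() would return after the I/O has completed. */
int
sketch_bufwait_status(struct sketch_buf *bp)
{
	assert(bp != NULL);
	if (bp->flags & SKETCH_B_EINTR) {
		bp->flags &= ~SKETCH_B_EINTR;	/* converted to EINTR and cleared */
		return (EINTR);
	}
	if (bp->ioflags & SKETCH_BUF_IOERROR)
		return (bp->error ? bp->error : EIO);
	return (0);
}
```

Because the flag is cleared on the way out, a second status check on the same buffer falls through to the ordinary error handling instead of reporting EINTR again.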
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
/*
|
2000-05-01 13:36:25 +00:00
|
|
|
* bufdone:
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* Finish I/O on a buffer, optionally calling a completion function.
|
|
|
|
* This is usually called from an interrupt so process blocking is
|
|
|
|
* not allowed.
|
|
|
|
*
|
|
|
|
* bufdone is also responsible for setting B_CACHE in a B_VMIO bp.
|
|
|
|
* In a non-VMIO bp, B_CACHE will be set on the next getblk()
|
|
|
|
* assuming B_INVAL is clear.
|
|
|
|
*
|
|
|
|
* For the VMIO case, we set B_CACHE if the op was a read and no
|
2016-04-29 21:54:28 +00:00
|
|
|
* read error occurred, or if the op was a write. B_CACHE is never
|
1999-05-02 23:57:16 +00:00
|
|
|
* set if the buffer is invalid or otherwise uncacheable.
|
|
|
|
*
|
2019-04-26 15:00:59 +00:00
|
|
|
* bufdone does not mess with B_INVAL, allowing the I/O routine or the
|
2016-04-29 21:54:28 +00:00
|
|
|
* initiator to leave B_INVAL set to brelse the buffer out of existence
|
1999-05-02 23:57:16 +00:00
|
|
|
* in the biodone routine.
|
1994-05-25 09:21:21 +00:00
|
|
|
*/
|
2000-04-15 05:54:02 +00:00
|
|
|
void
|
|
|
|
bufdone(struct buf *bp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2005-01-28 17:48:58 +00:00
|
|
|
struct bufobj *dropobj;
|
2002-03-19 21:25:46 +00:00
|
|
|
void (*biodone)(struct buf *);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
|
2016-10-31 23:09:52 +00:00
|
|
|
buf_track(bp, __func__);
|
2005-01-24 10:47:04 +00:00
|
|
|
CTR3(KTR_BUF, "bufdone(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags);
|
2005-01-28 17:48:58 +00:00
|
|
|
dropobj = NULL;
|
1997-06-15 17:56:53 +00:00
|
|
|
|
1999-03-12 02:24:58 +00:00
|
|
|
KASSERT(!(bp->b_flags & B_DONE), ("bufdone: bp %p already done", bp));
|
1995-07-25 05:03:06 +00:00
|
|
|
|
2000-12-26 19:41:38 +00:00
|
|
|
runningbufwakeup(bp);
|
2005-01-28 17:48:58 +00:00
|
|
|
if (bp->b_iocmd == BIO_WRITE)
|
|
|
|
dropobj = bp->b_bufobj;
|
1994-05-25 09:21:21 +00:00
|
|
|
/* call optional completion function if requested */
|
2000-03-20 10:44:49 +00:00
|
|
|
if (bp->b_iodone != NULL) {
|
2000-04-02 09:26:51 +00:00
|
|
|
biodone = bp->b_iodone;
|
2000-03-20 10:44:49 +00:00
|
|
|
bp->b_iodone = NULL;
|
2000-04-02 09:26:51 +00:00
|
|
|
(*biodone) (bp);
|
2005-01-28 17:48:58 +00:00
|
|
|
if (dropobj)
|
|
|
|
bufobj_wdrop(dropobj);
|
1994-05-25 09:21:21 +00:00
|
|
|
return;
|
|
|
|
}
|
1995-01-09 16:06:02 +00:00
|
|
|
if (bp->b_flags & B_VMIO) {
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
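The zeroing step described above can be illustrated with a minimal userland
sketch. It assumes a page tracked by a per-chunk valid bitmask; the struct,
the sizes, and the sk_* names are hypothetical stand-ins, not the kernel's
vm_page definitions or helper routines.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for illustration only. */
#define SK_PAGE_SIZE 4096
#define SK_CHUNK     512
#define SK_NCHUNKS   (SK_PAGE_SIZE / SK_CHUNK)
#define SK_BITS_ALL  ((uint8_t)((1u << SK_NCHUNKS) - 1))

/* Toy page: one valid bit per SK_CHUNK-sized piece of data. */
struct sk_page {
	uint8_t       valid;
	unsigned char data[SK_PAGE_SIZE];
};

/*
 * Zero every chunk whose valid bit is clear, then mark the whole page
 * valid. Chunks that were already valid are left untouched.
 */
static void
sk_zero_invalid(struct sk_page *m)
{
	for (int i = 0; i < SK_NCHUNKS; i++) {
		if ((m->valid & (1u << i)) == 0)
			memset(m->data + i * SK_CHUNK, 0, SK_CHUNK);
	}
	m->valid = SK_BITS_ALL;
}
```

The point of the ordering is that a reader mapping the page through mmap()
never sees stale bytes past file EOF once the page is reported all-valid.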
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
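The B_CACHE rule stated above can be summarized in a small sketch. The flag
values, the struct, and the helper name are hypothetical and chosen only for
illustration; the single boolean stands in for "the backing store is valid"
and is not how the kernel represents it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the kernel's values differ. */
#define SK_B_CACHE 0x01u
#define SK_B_INVAL 0x02u
#define SK_B_VMIO  0x04u

struct sk_cache_buf {
	uint32_t flags;          /* SK_B_* bits */
	bool     backing_valid;  /* stand-in for backing-store validity */
};

/* Recompute B_CACHE following the rule in the commit message. */
static void
sk_update_bcache(struct sk_cache_buf *bp)
{
	bool cached;

	if (bp->flags & SK_B_VMIO)
		cached = bp->backing_valid;          /* VMIO: follow backing store */
	else
		cached = !(bp->flags & SK_B_INVAL);  /* non-VMIO: inverse of B_INVAL */

	if (cached)
		bp->flags |= SK_B_CACHE;
	else
		bp->flags &= ~SK_B_CACHE;
}
```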
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
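To see why the two fields are not interchangeable, here is a minimal sketch.
It assumes the allocated size is the requested byte count rounded up to a
512-byte device block; the constant and the helper name are hypothetical
stand-ins for DEV_BSIZE and the kernel's rounding.

```c
#include <stddef.h>

/* Assumed device block size for this sketch. */
#define SK_DEV_BSIZE 512

/* Round a byte count up to the next multiple of SK_DEV_BSIZE. */
static size_t
sk_round_to_dev_bsize(size_t bcount)
{
	return ((bcount + SK_DEV_BSIZE - 1) / SK_DEV_BSIZE) * SK_DEV_BSIZE;
}
```

Under that assumption a 1000-byte request occupies 1024 bytes of buffer
space, so code that copies or validates the aligned size touches 24 bytes
the caller never asked for.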
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other RPCs in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* Set B_CACHE if the op was a normal read and no error
|
2016-04-29 21:54:28 +00:00
|
|
|
* occurred. B_CACHE is set for writes in the b*write()
|
1999-05-02 23:57:16 +00:00
|
|
|
* routines.
|
|
|
|
*/
|
2000-03-20 10:44:49 +00:00
|
|
|
if (bp->b_iocmd == BIO_READ &&
|
2000-04-02 15:24:56 +00:00
|
|
|
!(bp->b_flags & (B_INVAL|B_NOCACHE)) &&
|
2015-09-22 23:57:52 +00:00
|
|
|
!(bp->b_ioflags & BIO_ERROR))
|
1999-05-02 23:57:16 +00:00
|
|
|
bp->b_flags |= B_CACHE;
|
2015-09-22 23:57:52 +00:00
|
|
|
vfs_vmio_iodone(bp);
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
Merge biodone_finish() back into biodone(). The primary purpose is
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.
Doing some sleuthing through the SVN repository, it appears that
bufdone_finish() was added to support XFS:
------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines
Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
- 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
- b_pin_count, count of pinned buffer
- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions
Patches provided by: kan
Reviewed by: phk
Silence on: arch@
------------------------------------------------------------------------
It does not appear to ever have been used for anything else. XFS was
disconnected in r241607:
------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines
Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.
This is not targeted for MFC.
------------------------------------------------------------------------
and removed entirely in r247631:
------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines
Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.
This is not targeted for MFC.
------------------------------------------------------------------------
Since XFS support is gone, there is no reason to retain biodone_finish().
Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)
2018-02-09 19:50:47 +00:00
|
|
|
if (!LIST_EMPTY(&bp->b_dep))
|
|
|
|
buf_complete(bp);
|
Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.
The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:
- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
never mark it dirty, thus do not write it out, and hence never
update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
the old check hash is still in the VM cache, but some other pages
are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
error to be printed.
The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.
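The first failure mode can be reproduced in miniature. This sketch uses a
toy structure and a toy hash; the field names and the sk_* helpers are
hypothetical and are not the UFS cylinder-group layout or its check-hash
function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy record: a timestamp, some payload, and a stored check hash. */
struct sk_cg {
	uint32_t time;
	uint32_t payload;
	uint32_t ckhash;
};

/* Toy hash covering every field except the stored hash itself. */
static uint32_t
sk_cg_hash(const struct sk_cg *cg)
{
	return cg->time * 31u + cg->payload;
}

/* True when the stored hash still matches the in-memory contents. */
static bool
sk_cg_hash_ok(const struct sk_cg *cg)
{
	return cg->ckhash == sk_cg_hash(cg);
}

/* Write path: stamp the time and recompute the hash together. */
static void
sk_cg_write(struct sk_cg *cg, uint32_t now)
{
	cg->time = now;
	cg->ckhash = sk_cg_hash(cg);
}
```

Updating the time field anywhere other than sk_cg_write() leaves the stored
hash stale, which is the pattern the fix removes.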
The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.
- When a read completes we had the following sequence:
- bufdone()
-- b_ckhashcalc (calculates check hash)
-- bufdone_finish()
--- vfs_vmio_iodone() (replaces bogus pages with the cached ones)
- When we are reading a buffer where one or more pages are already
in memory (but not all pages, or we wouldn't be doing the read),
the I/O is done with bogus_page mapped in for the pages that exist
in the VM cache. This mapping is done to avoid corrupting the
cached pages if there is any I/O overrun. The vfs_vmio_iodone()
function is responsible for replacing the bogus_page(s) with the
cached ones. But we were calculating the check hash before the
bogus_page(s) were replaced. Hence, when we were calculating the
check hash, we were partly reading from bogus_page, which means
we calculated a bad check hash (e.g., because multiple pages have
been mapped to bogus_page, so its contents are indeterminate).
The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.
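The ordering requirement can be sketched in userland as follows. Everything
here is a toy model: the page size, the struct, and the sk_* helpers are
hypothetical, and FNV-1a is swapped in for the real check hash purely to
have something to compute.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_NPAGES 2   /* pages per buffer in this toy model */
#define SK_PGSZ   8   /* bytes per toy "page" */

/* Shared placeholder page; its contents are meaningless by design. */
static unsigned char sk_bogus_page[SK_PGSZ];

struct sk_read_buf {
	const unsigned char *pages[SK_NPAGES];   /* what the buffer maps now */
	const unsigned char *cached[SK_NPAGES];  /* valid cached page, or NULL */
};

/* Stand-in for the replacement step: swap placeholders for cached pages. */
static void
sk_replace_bogus(struct sk_read_buf *bp)
{
	for (size_t i = 0; i < SK_NPAGES; i++) {
		if (bp->pages[i] == sk_bogus_page && bp->cached[i] != NULL)
			bp->pages[i] = bp->cached[i];
	}
}

/* Toy check hash (32-bit FNV-1a) over the pages currently mapped. */
static uint32_t
sk_ckhash(const struct sk_read_buf *bp)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < SK_NPAGES; i++) {
		for (size_t j = 0; j < SK_PGSZ; j++) {
			h ^= bp->pages[i][j];
			h *= 16777619u;
		}
	}
	return h;
}

/* Read completion in the corrected order: replace first, hash second. */
static uint32_t
sk_read_done(struct sk_read_buf *bp)
{
	sk_replace_bogus(bp);
	return sk_ckhash(bp);
}
```

Calling sk_ckhash() before sk_replace_bogus() would fold the placeholder's
bytes into the result, which mirrors the second bug described above.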
With these two changes, the occasional cylinder-group check-hash
errors are gone.
Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
|
|
|
if ((bp->b_flags & B_CKHASH) != 0) {
|
|
|
|
KASSERT(bp->b_iocmd == BIO_READ,
|
2018-02-09 19:50:47 +00:00
|
|
|
("bufdone: b_iocmd %d not BIO_READ", bp->b_iocmd));
|
|
|
|
KASSERT(buf_mapped(bp), ("bufdone: bp %p not mapped", bp));
|
2018-02-06 00:19:46 +00:00
|
|
|
(*bp->b_ckhashcalc)(bp);
|
|
|
|
}
|
1995-01-09 16:06:02 +00:00
|
|
|
/*
|
|
|
|
* For asynchronous completions, release the buffer now. The brelse
|
1999-06-26 02:47:16 +00:00
|
|
|
* will do a wakeup there if necessary - so no need to do a wakeup
|
|
|
|
* here in the async case. The sync case always needs to do a wakeup.
|
1995-01-09 16:06:02 +00:00
|
|
|
*/
|
1994-05-25 09:21:21 +00:00
|
|
|
if (bp->b_flags & B_ASYNC) {
|
2015-09-22 23:57:52 +00:00
|
|
|
if ((bp->b_flags & (B_NOCACHE | B_INVAL | B_RELBUF)) ||
|
|
|
|
(bp->b_ioflags & BIO_ERROR))
|
1997-09-21 22:00:25 +00:00
|
|
|
brelse(bp);
|
|
|
|
else
|
|
|
|
bqrelse(bp);
|
2005-01-28 17:48:58 +00:00
|
|
|
} else
|
2003-03-13 07:31:45 +00:00
|
|
|
bdone(bp);
|
2018-02-09 19:50:47 +00:00
|
|
|
if (dropobj)
|
|
|
|
bufobj_wdrop(dropobj);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
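The release decision at the end of the function above can be restated as a
pure function for clarity. The flag values, the enum, and the helper name
are hypothetical; the sketch only mirrors the branch structure shown in the
code and performs no buffer operations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the kernel's values differ. */
#define SK_B_ASYNC   0x01u
#define SK_B_NOCACHE 0x02u
#define SK_B_INVAL   0x04u
#define SK_B_RELBUF  0x08u

/* Which routine the completion path would call. */
enum sk_action { SK_BRELSE, SK_BQRELSE, SK_BDONE };

static enum sk_action
sk_completion_action(uint32_t flags, bool io_error)
{
	if (flags & SK_B_ASYNC) {
		/* Async: release now; the release routine handles any wakeup. */
		if ((flags & (SK_B_NOCACHE | SK_B_INVAL | SK_B_RELBUF)) ||
		    io_error)
			return SK_BRELSE;
		return SK_BQRELSE;
	}
	/* Sync: a waiter exists, so always signal completion. */
	return SK_BDONE;
}
```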
|
|
|
|
|
1995-02-22 09:16:07 +00:00
|
|
|
/*
|
|
|
|
* This routine is called in lieu of iodone in the case of
|
|
|
|
* incomplete I/O. This keeps the busy status for pages
|
2016-04-29 21:54:28 +00:00
|
|
|
* consistent.
|
1995-02-22 09:16:07 +00:00
|
|
|
*/
|
1995-01-09 16:06:02 +00:00
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
vfs_unbusy_pages(struct buf *bp)
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
{
|
1998-10-25 17:44:59 +00:00
|
|
|
int i;
|
2004-09-15 21:08:58 +00:00
|
|
|
vm_object_t obj;
|
|
|
|
vm_page_t m;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2000-12-26 19:41:38 +00:00
|
|
|
runningbufwakeup(bp);
|
2004-09-15 21:08:58 +00:00
|
|
|
if (!(bp->b_flags & B_VMIO))
|
|
|
|
return;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2004-11-04 09:06:54 +00:00
|
|
|
obj = bp->b_bufobj->bo_object;
|
2004-09-15 21:08:58 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
if (m == bogus_page) {
|
2020-02-28 21:42:48 +00:00
|
|
|
m = vm_page_relookup(obj, OFF_TO_IDX(bp->b_offset) + i);
|
2004-10-22 08:47:20 +00:00
|
|
|
if (!m)
|
2004-09-15 21:08:58 +00:00
|
|
|
panic("vfs_unbusy_pages: page missing\n");
|
|
|
|
bp->b_pages[i] = m;
|
2015-07-23 19:13:41 +00:00
|
|
|
if (buf_mapped(bp)) {
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
pmap_qenter(trunc_page((vm_offset_t)bp->b_data),
|
|
|
|
bp->b_pages, bp->b_npages);
|
|
|
|
} else
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2013-08-09 11:11:11 +00:00
|
|
|
vm_page_sunbusy(m);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2015-10-03 17:04:52 +00:00
|
|
|
vm_object_pip_wakeupn(obj, bp->b_npages);
|
1995-01-09 16:06:02 +00:00
|
|
|
}
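The bogus-page fixup in vfs_unbusy_pages() looks each page up again at index OFF_TO_IDX(bp->b_offset) + i. The following is a minimal user-space sketch of that index arithmetic only. The `sketch_` names and the 4 KiB page size are assumptions made for illustration; the kernel takes its page shift from machine-dependent headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Byte offset within an object -> index of the page that holds it. */
static uint64_t
sketch_off_to_idx(int64_t off)
{
	assert(off >= 0);
	return ((uint64_t)off >> SKETCH_PAGE_SHIFT);
}

/* Index of the i-th page backing a buffer that starts at buf_off. */
static uint64_t
sketch_buf_page_idx(int64_t buf_off, int i)
{
	assert(i >= 0);
	return (sketch_off_to_idx(buf_off) + (uint64_t)i);
}
```

Under these assumptions, a buffer starting at byte offset 16384 has its third page (i = 2) at page index 6.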
|
|
|
|
|
1997-05-19 14:36:56 +00:00
|
|
|
/*
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
* vfs_page_set_valid:
|
1999-01-21 08:29:12 +00:00
|
|
|
*
|
1999-05-02 23:57:16 +00:00
|
|
|
* Set the valid bits in a page based on the supplied offset. The
|
|
|
|
* range is restricted to the buffer's size.
|
|
|
|
*
|
|
|
|
* This routine is typically called after a read completes.
|
1997-05-19 14:36:56 +00:00
|
|
|
*/
|
|
|
|
static void
|
2007-12-02 01:28:35 +00:00
|
|
|
vfs_page_set_valid(struct buf *bp, vm_ooffset_t off, vm_page_t m)
|
2009-05-13 05:39:39 +00:00
|
|
|
{
|
|
|
|
vm_ooffset_t eoff;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the end offset, eoff, such that [off, eoff) does not span a
|
|
|
|
* page boundary and eoff is not greater than the end of the buffer.
|
|
|
|
* The end of the buffer, in this case, is our file EOF, not the
|
|
|
|
* allocation size of the buffer.
|
|
|
|
*/
|
|
|
|
eoff = (off + PAGE_SIZE) & ~(vm_ooffset_t)PAGE_MASK;
|
|
|
|
if (eoff > bp->b_offset + bp->b_bcount)
|
|
|
|
eoff = bp->b_offset + bp->b_bcount;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set valid range. This is typically the entire buffer and thus the
|
|
|
|
* entire page.
|
|
|
|
*/
|
|
|
|
if (eoff > off)
|
2011-11-30 17:39:00 +00:00
|
|
|
vm_page_set_valid_range(m, off & PAGE_MASK, eoff - off);
|
2009-05-13 05:39:39 +00:00
|
|
|
}
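The end-offset computation in vfs_page_set_valid() can be tried outside the kernel. This is a hedged sketch, not the kernel routine: `sketch_clamped_eoff` and the `SKETCH_` constants are hypothetical names, the page size is assumed to be 4 KiB, and plain `int64_t` stands in for `vm_ooffset_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Return the end of the range that starts at off, stays inside off's page,
 * and does not run past the buffer's data, which ends at buf_off + bcount.
 */
static int64_t
sketch_clamped_eoff(int64_t off, int64_t buf_off, int64_t bcount)
{
	int64_t eoff;

	assert(off >= 0 && buf_off >= 0 && bcount >= 0);
	/* Round up to the start of the next page. */
	eoff = (off + SKETCH_PAGE_SIZE) & ~(int64_t)SKETCH_PAGE_MASK;
	/* Clamp to the end of the buffer's data. */
	if (eoff > buf_off + bcount)
		eoff = buf_off + bcount;
	return (eoff);
}
```

A caller would mark [off, eoff) valid only when the returned value is greater than off, mirroring the `eoff > off` check in the function above.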
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vfs_page_set_validclean:
|
|
|
|
*
|
|
|
|
* Set the valid bits and clear the dirty bits in a page based on the
|
|
|
|
* supplied offset. The range is restricted to the buffer's size.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
vfs_page_set_validclean(struct buf *bp, vm_ooffset_t off, vm_page_t m)
|
1997-05-19 14:36:56 +00:00
|
|
|
{
|
1997-05-30 22:25:35 +00:00
|
|
|
vm_ooffset_t soff, eoff;
|
1997-05-19 14:36:56 +00:00
|
|
|
|
1999-04-05 19:38:30 +00:00
|
|
|
/*
|
|
|
|
* Start and end offsets in buffer. eoff - soff may not cross a
|
2016-04-29 21:54:28 +00:00
|
|
|
* page boundary or cross the end of the buffer. The end of the
|
1999-05-02 23:57:16 +00:00
|
|
|
* buffer, in this case, is our file EOF, not the allocation size
|
|
|
|
* of the buffer.
|
1999-04-05 19:38:30 +00:00
|
|
|
*/
|
1997-05-19 14:36:56 +00:00
|
|
|
soff = off;
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
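The note about casting PAGE_MASK before inverting it can be illustrated in user space. This sketch deliberately assumes an unsigned 32-bit mask to show the hazard the cast guards against; as the commit message says, with a signed mask the old code happened to work because sign extension fills the high bits. The `sketch_` names and the mask value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages and an unsigned 32-bit mask. */
#define SKETCH_PAGE_MASK 0xfffu

/*
 * Invert the 32-bit mask first: the result is zero-extended to 64 bits,
 * so the AND also clears the high 32 bits of off.
 */
static int64_t
sketch_trunc_narrow(int64_t off)
{
	assert(off >= 0);
	return (off & ~SKETCH_PAGE_MASK);
}

/* Widen the mask to 64 bits before inverting: the high bits survive. */
static int64_t
sketch_trunc_wide(int64_t off)
{
	assert(off >= 0);
	return (off & ~(int64_t)SKETCH_PAGE_MASK);
}
```

For offsets below 4 GiB the two functions agree; for an offset such as 0x100001234 only the widened form preserves the upper bits.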
|
|
|
eoff = (off + PAGE_SIZE) & ~(off_t)PAGE_MASK;
|
1999-05-02 23:57:16 +00:00
|
|
|
if (eoff > bp->b_offset + bp->b_bcount)
|
|
|
|
eoff = bp->b_offset + bp->b_bcount;
|
1999-04-05 19:38:30 +00:00
|
|
|
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
/*
|
|
|
|
* Set valid range. This is typically the entire buffer and thus the
|
|
|
|
* entire page.
|
|
|
|
*/
|
|
|
|
if (eoff > soff) {
|
|
|
|
vm_page_set_validclean(
|
|
|
|
m,
|
|
|
|
(vm_offset_t) (soff & PAGE_MASK),
|
|
|
|
(vm_offset_t) (eoff - soff)
|
|
|
|
);
|
|
|
|
}
|
1997-05-19 14:36:56 +00:00
|
|
|
}
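The clamp and valid-range computation above can be tried in isolation. The following is a minimal user-space sketch, not the kernel routine: the `sk_` names, the fixed 4096-byte page size, and the way `eoff` is first rounded up to the end of the page are assumptions made for illustration. Only the clamp against `b_offset + b_bcount` and the `(soff & PAGE_MASK, eoff - soff)` arguments come from the fragment shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for this sketch; the kernel takes these from machine headers. */
#define SK_PAGE_SIZE 4096
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/* Hypothetical result type: a byte range inside a single page. */
struct sk_range {
	int64_t base;	/* offset of the range within the page */
	int64_t size;	/* length in bytes; 0 if nothing should be marked valid */
};

/*
 * Compute which part of the page containing file offset 'off' is covered by
 * a buffer spanning [b_offset, b_offset + b_bcount).
 */
static struct sk_range
sk_valid_range(int64_t b_offset, int64_t b_bcount, int64_t off)
{
	struct sk_range r = { 0, 0 };
	int64_t soff = off;
	/* Assumption: eoff starts at the end of the page that contains 'off'. */
	int64_t eoff = (off + SK_PAGE_SIZE) & ~(int64_t)SK_PAGE_MASK;

	/* Clamp to the end of the buffer's data, as in the fragment above. */
	if (eoff > b_offset + b_bcount)
		eoff = b_offset + b_bcount;

	/* Only a non-empty range is reported; otherwise size stays 0. */
	if (eoff > soff) {
		r.base = soff & SK_PAGE_MASK;
		r.size = eoff - soff;
	}
	return r;
}
```

With a 6144-byte buffer at offset 0, the first page is fully covered and the second page is covered only for its first 2048 bytes, which is the case the clamp exists for.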
|
|
|
|
|
2010-06-08 17:54:28 +00:00
|
|
|
/*
|
2019-10-15 03:35:11 +00:00
|
|
|
* Acquire a shared busy on all pages in the buf.
|
2010-06-08 17:54:28 +00:00
|
|
|
*/
|
2013-08-22 18:26:45 +00:00
|
|
|
void
|
2019-10-15 03:35:11 +00:00
|
|
|
vfs_busy_pages_acquire(struct buf *bp)
|
2010-06-08 17:54:28 +00:00
|
|
|
{
|
2019-10-15 03:35:11 +00:00
|
|
|
int i;
|
2010-06-08 17:54:28 +00:00
|
|
|
|
2019-10-15 03:35:11 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++)
|
|
|
|
vm_page_busy_acquire(bp->b_pages[i], VM_ALLOC_SBUSY);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vfs_busy_pages_release(struct buf *bp)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < bp->b_npages; i++)
|
The vm_pageout_flush() function sbusies pages in the passed pages
run. After that, the pager put method is called, usually translated
to VOP_WRITE(). For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain. The latter is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).
Since vfs_drain_busy_pages() can only wait for one page at a time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them. Until now,
this was done by xbusying the pages, which is incompatible with
the sbusy state in the new implementation of busy. Switch to sbusy.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
2013-09-05 12:56:08 +00:00
|
|
|
vm_page_sunbusy(bp->b_pages[i]);
|
2010-06-08 17:54:28 +00:00
|
|
|
}
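The acquire/release pair above relies on shared busy being a counted state, so a page that is already sbusied by one path (for example the pageout path described in the commit message) can be sbusied again by the buffer code. The toy model below illustrates only that counting behaviour. The `toy_` types and the plain integer counter are assumptions for this sketch; the real `vm_page_busy_acquire()` may sleep and also interacts with the exclusive-busy state, which is not modelled here.

```c
#include <assert.h>

#define TOY_MAXPAGES 8

/* Hypothetical stand-in for a VM page: just a shared-busy holder count. */
struct toy_page {
	int sbusy;
};

/* Hypothetical stand-in for struct buf: only the page array and its length. */
struct toy_buf {
	struct toy_page *b_pages[TOY_MAXPAGES];
	int b_npages;
};

/* Take one shared-busy reference on every page of the buffer. */
static void
toy_busy_pages_acquire(struct toy_buf *bp)
{
	for (int i = 0; i < bp->b_npages; i++)
		bp->b_pages[i]->sbusy++;
}

/* Drop the reference taken by toy_busy_pages_acquire(). */
static void
toy_busy_pages_release(struct toy_buf *bp)
{
	for (int i = 0; i < bp->b_npages; i++) {
		assert(bp->b_pages[i]->sbusy > 0);
		bp->b_pages[i]->sbusy--;
	}
}
```

A page that enters with one existing holder ends up with two after the acquire and returns to one after the release, so the earlier holder is never disturbed.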
|
|
|
|
|
1995-02-22 09:16:07 +00:00
|
|
|
/*
|
|
|
|
* This routine is called before a device strategy routine.
|
|
|
|
* It is used to tell the VM system that paging I/O is in
|
|
|
|
* progress, and treat the pages associated with the buffer
|
2013-08-09 11:11:11 +00:00
|
|
|
* almost as being exclusive busy. Also the object paging_in_progress
|
1995-02-22 09:16:07 +00:00
|
|
|
* flag is handled to make sure that the object doesn't become
|
2016-04-29 21:54:28 +00:00
|
|
|
* inconsistent.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* Since I/O has not been initiated yet, certain buffer flags
|
2016-04-29 21:54:28 +00:00
|
|
|
* such as BIO_ERROR or B_INVAL may be in an inconsistent state
|
1999-05-02 23:57:16 +00:00
|
|
|
* and should be ignored.
|
1995-02-22 09:16:07 +00:00
|
|
|
*/
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
void
|
2004-09-15 20:54:23 +00:00
|
|
|
vfs_busy_pages(struct buf *bp, int clear_modify)
|
1995-01-09 16:06:02 +00:00
|
|
|
{
|
2004-09-15 21:08:58 +00:00
|
|
|
vm_object_t obj;
|
|
|
|
vm_ooffset_t foff;
|
|
|
|
vm_page_t m;
|
2017-03-19 23:06:11 +00:00
|
|
|
int i;
|
|
|
|
bool bogus;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2004-09-15 21:08:58 +00:00
|
|
|
if (!(bp->b_flags & B_VMIO))
|
|
|
|
return;
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2004-11-04 09:06:54 +00:00
|
|
|
obj = bp->b_bufobj->bo_object;
|
2004-09-15 21:08:58 +00:00
|
|
|
foff = bp->b_offset;
|
|
|
|
KASSERT(bp->b_offset != NOOFFSET,
|
|
|
|
("vfs_busy_pages: no buffer offset"));
|
2019-10-15 03:35:11 +00:00
|
|
|
if ((bp->b_flags & B_CLUSTER) == 0) {
|
|
|
|
vm_object_pip_add(obj, bp->b_npages);
|
|
|
|
vfs_busy_pages_acquire(bp);
|
|
|
|
}
|
2006-10-29 00:04:39 +00:00
|
|
|
if (bp->b_bufsize != 0)
|
2019-10-29 20:37:59 +00:00
|
|
|
vfs_setdirty_range(bp);
|
2017-03-19 23:06:11 +00:00
|
|
|
bogus = false;
|
2004-09-15 21:08:58 +00:00
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
m = bp->b_pages[i];
|
2019-10-15 03:35:11 +00:00
|
|
|
vm_page_assert_sbusied(m);
|
1995-01-09 16:06:02 +00:00
|
|
|
|
2004-09-15 21:08:58 +00:00
|
|
|
/*
|
|
|
|
* When readying a buffer for a read ( i.e
|
|
|
|
* clear_modify == 0 ), it is important to do
|
|
|
|
* bogus_page replacement for valid pages in
|
|
|
|
* partially instantiated buffers. Partially
|
|
|
|
* instantiated buffers can, in turn, occur when
|
|
|
|
* reconstituting a buffer from its VM backing store
|
|
|
|
* base. We only have to do this if B_CACHE is
|
|
|
|
* clear ( which causes the I/O to occur in the
|
|
|
|
* first place ). The replacement prevents the read
|
|
|
|
* I/O from overwriting potentially dirty VM-backed
|
|
|
|
* pages. XXX bogus page replacement is, uh, bogus.
|
|
|
|
* It may not work properly with small-block devices.
|
|
|
|
* We need to find a better way.
|
|
|
|
*/
|
2009-05-11 05:16:57 +00:00
|
|
|
if (clear_modify) {
|
|
|
|
pmap_remove_write(m);
|
2009-05-13 05:39:39 +00:00
|
|
|
vfs_page_set_validclean(bp, foff, m);
|
2019-10-15 03:45:41 +00:00
|
|
|
} else if (vm_page_all_valid(m) &&
|
2004-11-04 09:06:54 +00:00
|
|
|
(bp->b_flags & B_CACHE) == 0) {
|
2004-09-15 21:08:58 +00:00
|
|
|
bp->b_pages[i] = bogus_page;
|
2017-03-19 23:06:11 +00:00
|
|
|
bogus = true;
|
2004-09-15 21:08:58 +00:00
|
|
|
}
|
|
|
|
foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK;
|
1995-01-09 16:06:02 +00:00
|
|
|
}
|
2015-07-23 19:13:41 +00:00
|
|
|
if (bogus && buf_mapped(bp)) {
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitly requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
2004-09-15 21:08:58 +00:00
|
|
|
pmap_qenter(trunc_page((vm_offset_t)bp->b_data),
|
|
|
|
bp->b_pages, bp->b_npages);
|
2013-03-19 14:13:12 +00:00
|
|
|
}
|
These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
|
|
|
}
|
|
|
|
|
2009-05-17 20:26:00 +00:00
|
|
|
/*
|
|
|
|
* vfs_bio_set_valid:
|
|
|
|
*
|
|
|
|
* Set the range within the buffer to valid. The range is
|
|
|
|
* relative to the beginning of the buffer, b_offset. Note that
|
|
|
|
* b_offset itself may be offset from the beginning of the first
|
|
|
|
* page.
|
|
|
|
*/
|
2020-07-10 09:01:36 +00:00
|
|
|
void
|
2009-05-17 20:26:00 +00:00
|
|
|
vfs_bio_set_valid(struct buf *bp, int base, int size)
|
|
|
|
{
|
|
|
|
int i, n;
|
|
|
|
vm_page_t m;
|
|
|
|
|
|
|
|
if (!(bp->b_flags & B_VMIO))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fixup base to be relative to beginning of first page.
|
|
|
|
* Set initial n to be the maximum number of bytes in the
|
|
|
|
* first page that can be validated.
|
|
|
|
*/
|
|
|
|
base += (bp->b_offset & PAGE_MASK);
|
|
|
|
n = PAGE_SIZE - (base & PAGE_MASK);
|
|
|
|
|
2019-10-15 03:45:41 +00:00
|
|
|
/*
|
|
|
|
* Busy may not be strictly necessary here because the pages are
|
|
|
|
* unlikely to be fully valid and the vnode lock will synchronize
|
|
|
|
* their access via getpages. It is grabbed for consistency with
|
|
|
|
* other page validation.
|
|
|
|
*/
|
|
|
|
vfs_busy_pages_acquire(bp);
|
2009-05-17 20:26:00 +00:00
|
|
|
for (i = base / PAGE_SIZE; size > 0 && i < bp->b_npages; ++i) {
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
if (n > size)
|
|
|
|
n = size;
|
2011-11-30 17:39:00 +00:00
|
|
|
vm_page_set_valid_range(m, base & PAGE_MASK, n);
|
2009-05-17 20:26:00 +00:00
|
|
|
base += n;
|
|
|
|
size -= n;
|
|
|
|
n = PAGE_SIZE;
|
|
|
|
}
|
2019-10-15 03:45:41 +00:00
|
|
|
vfs_busy_pages_release(bp);
|
2009-05-17 20:26:00 +00:00
|
|
|
}
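The page-walking arithmetic in vfs_bio_set_valid() can be tried out in user space. The following is a minimal sketch, not kernel code: the names `sk_chunk` and `sk_split_range` and the 4096-byte page size are assumptions made for illustration, and the sketch only records the (page, offset, length) triples that the kernel loop would pass to vm_page_set_valid_range().

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096			/* assumed page size for this sketch */
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/* Hypothetical record of one per-page piece of the range. */
struct sk_chunk {
	int page;	/* index into the buffer's page array */
	int off;	/* byte offset within that page */
	int len;	/* number of bytes covered in that page */
};

/*
 * Split [base, base + size) of a buffer whose data begins at file offset
 * b_offset into per-page chunks, mirroring the loop in vfs_bio_set_valid().
 * Returns the number of chunks stored in out[].
 */
static int
sk_split_range(long b_offset, int base, int size, int npages,
    struct sk_chunk *out, int maxout)
{
	int cnt, i, n;

	assert(out != NULL && maxout > 0);
	cnt = 0;
	/* Make base relative to the start of the first page. */
	base += (int)(b_offset & SK_PAGE_MASK);
	/* At most this many bytes fit into the first page. */
	n = SK_PAGE_SIZE - (base & SK_PAGE_MASK);
	for (i = base / SK_PAGE_SIZE; size > 0 && i < npages && cnt < maxout;
	    ++i) {
		if (n > size)
			n = size;
		out[cnt].page = i;
		out[cnt].off = base & SK_PAGE_MASK;
		out[cnt].len = n;
		cnt++;
		base += n;
		size -= n;
		n = SK_PAGE_SIZE;	/* later pages start at offset 0 */
	}
	return (cnt);
}
```

With b_offset = 512, base = 3000 and size = 5000 over three pages, the range splits into 584 bytes at offset 3512 of page 0, all of page 1, and the first 320 bytes of page 2.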
|
The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* vfs_bio_clrbuf:
|
|
|
|
*
|
2009-05-17 23:25:53 +00:00
|
|
|
* If the specified buffer is a non-VMIO buffer, clear the entire
|
|
|
|
* buffer. If the specified buffer is a VMIO buffer, clear and
|
|
|
|
* validate only the previously invalid portions of the buffer.
|
|
|
|
* This routine essentially fakes an I/O, so we need to clear
|
|
|
|
* BIO_ERROR and B_INVAL.
|
1999-05-02 23:57:16 +00:00
|
|
|
*
|
|
|
|
* Note that while we only theoretically need to clear through b_bcount,
|
|
|
|
* we go ahead and clear through b_bufsize.
|
|
|
|
*/
|
1995-04-09 06:02:46 +00:00
|
|
|
void
|
2002-06-22 19:09:35 +00:00
|
|
|
vfs_bio_clrbuf(struct buf *bp)
|
|
|
|
{
|
2013-03-14 19:48:25 +00:00
|
|
|
int i, j, mask, sa, ea, slide;
|
2001-05-19 01:28:09 +00:00
|
|
|
|
2004-09-15 21:08:58 +00:00
|
|
|
if ((bp->b_flags & (B_VMIO | B_MALLOC)) != B_VMIO) {
|
|
|
|
clrbuf(bp);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
bp->b_flags &= ~B_INVAL;
|
|
|
|
bp->b_ioflags &= ~BIO_ERROR;
|
2019-10-15 03:45:41 +00:00
|
|
|
vfs_busy_pages_acquire(bp);
|
2013-03-14 19:48:25 +00:00
|
|
|
sa = bp->b_offset & PAGE_MASK;
|
|
|
|
slide = 0;
|
|
|
|
for (i = 0; i < bp->b_npages; i++, sa = 0) {
|
|
|
|
slide = imin(slide + PAGE_SIZE, bp->b_offset + bp->b_bufsize);
|
|
|
|
ea = slide & PAGE_MASK;
|
|
|
|
if (ea == 0)
|
|
|
|
ea = PAGE_SIZE;
|
2004-09-15 21:08:58 +00:00
|
|
|
if (bp->b_pages[i] == bogus_page)
|
|
|
|
continue;
|
2013-03-14 19:48:25 +00:00
|
|
|
j = sa / DEV_BSIZE;
|
2004-09-15 21:08:58 +00:00
|
|
|
mask = ((1 << ((ea - sa) / DEV_BSIZE)) - 1) << j;
|
|
|
|
if ((bp->b_pages[i]->valid & mask) == mask)
|
|
|
|
continue;
|
2009-05-17 23:25:53 +00:00
|
|
|
if ((bp->b_pages[i]->valid & mask) == 0)
|
2013-03-14 19:48:25 +00:00
|
|
|
pmap_zero_page_area(bp->b_pages[i], sa, ea - sa);
|
2009-05-17 23:25:53 +00:00
|
|
|
else {
|
2004-09-15 21:08:58 +00:00
|
|
|
for (; sa < ea; sa += DEV_BSIZE, j++) {
|
2013-03-14 19:48:25 +00:00
|
|
|
if ((bp->b_pages[i]->valid & (1 << j)) == 0) {
|
|
|
|
pmap_zero_page_area(bp->b_pages[i],
|
|
|
|
sa, DEV_BSIZE);
|
|
|
|
}
|
1995-04-09 06:02:46 +00:00
|
|
|
}
|
|
|
|
}
|
2019-10-29 20:37:59 +00:00
|
|
|
vm_page_set_valid_range(bp->b_pages[i], j * DEV_BSIZE,
|
|
|
|
roundup2(ea - sa, DEV_BSIZE));
|
1995-04-09 06:02:46 +00:00
|
|
|
}
|
2019-10-15 03:45:41 +00:00
|
|
|
vfs_busy_pages_release(bp);
|
2004-09-15 21:08:58 +00:00
|
|
|
bp->b_resid = 0;
|
1995-04-09 06:02:46 +00:00
|
|
|
}
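The valid-bit test in vfs_bio_clrbuf() relies on one bit per DEV_BSIZE block of a page. As a rough user-space sketch (the `sk_` names, the 4096-byte page and the 512-byte block size are assumptions for illustration, and the range is assumed to be block-aligned), the mask construction and the three-way decision look like this:

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096	/* assumed page size */
#define SK_DEV_BSIZE 512	/* assumed device block size */

/* What the loop body does with one page, as a hypothetical enum. */
enum sk_action {
	SK_SKIP,	/* every block in range already valid */
	SK_ZERO_ALL,	/* no block valid: zero the whole range at once */
	SK_ZERO_SOME	/* mixed: zero only the invalid blocks */
};

/* Build the mask of valid bits covering byte range [sa, ea) of a page. */
static unsigned int
sk_valid_mask(int sa, int ea)
{

	assert(sa >= 0 && sa < ea && ea <= SK_PAGE_SIZE);
	assert(sa % SK_DEV_BSIZE == 0 && ea % SK_DEV_BSIZE == 0);
	return (((1u << ((ea - sa) / SK_DEV_BSIZE)) - 1) <<
	    (sa / SK_DEV_BSIZE));
}

/* Classify a page given its current valid bits and the range of interest. */
static enum sk_action
sk_classify(unsigned int valid, int sa, int ea)
{
	unsigned int mask;

	mask = sk_valid_mask(sa, ea);
	if ((valid & mask) == mask)
		return (SK_SKIP);
	if ((valid & mask) == 0)
		return (SK_ZERO_ALL);
	return (SK_ZERO_SOME);
}
```

For a full page the mask is 0xff (eight 512-byte blocks); for the range [1024, 2048) it is 0x0c, that is, blocks 2 and 3.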
|
|
|
|
|
2013-03-19 14:27:14 +00:00
|
|
|
void
|
|
|
|
vfs_bio_bzero_buf(struct buf *bp, int base, int size)
|
|
|
|
{
|
|
|
|
vm_page_t m;
|
|
|
|
int i, n;
|
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
if (buf_mapped(bp)) {
|
2013-03-19 14:27:14 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
bzero(bp->b_data + base, size);
|
|
|
|
} else {
|
|
|
|
BUF_CHECK_UNMAPPED(bp);
|
|
|
|
n = PAGE_SIZE - (base & PAGE_MASK);
|
|
|
|
for (i = base / PAGE_SIZE; size > 0 && i < bp->b_npages; ++i) {
|
|
|
|
m = bp->b_pages[i];
|
|
|
|
if (n > size)
|
|
|
|
n = size;
|
|
|
|
pmap_zero_page_area(m, base & PAGE_MASK, n);
|
|
|
|
base += n;
|
|
|
|
size -= n;
|
|
|
|
n = PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-11-23 17:53:07 +00:00
|
|
|
/*
|
|
|
|
* Update buffer flags based on I/O request parameters, optionally releasing the
|
|
|
|
* buffer. If it's VMIO or direct I/O, the buffer pages are released to the VM,
|
|
|
|
* where they may be placed on a page queue (VMIO) or freed immediately (direct
|
|
|
|
* I/O). Otherwise the buffer is released to the cache.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
b_io_dismiss(struct buf *bp, int ioflag, bool release)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT((ioflag & IO_NOREUSE) == 0 || (ioflag & IO_VMIO) != 0,
|
|
|
|
("buf %p non-VMIO noreuse", bp));
|
|
|
|
|
|
|
|
if ((ioflag & IO_DIRECT) != 0)
|
|
|
|
bp->b_flags |= B_DIRECT;
|
This is an additional fix for bug report 230962. When using
extended attributes, the kernel can panic with either "ffs_truncate3"
or with "softdep_deallocate_dependencies: dangling deps".
The problem arises because the flushbuflist() function which is
called to clear out buffers is passed either the V_NORMAL flag to
indicate that it should flush the buffers associated with the contents
of the file or the V_ALT flag to indicate that it should flush the
buffers associated with the extended attribute data. The buffers
containing the extended attribute data are identified by having
their BX_ALTDATA flag set in the buffer's b_xflags field. The
BX_ALTDATA flag is set on the buffer when the extended attribute
block is first allocated or when its contents are read in from the
disk.
On a busy system, a buffer may be reused for another purpose, but
the contents of the block that it contained continues to be held
in the main page cache. Each physical page is identified as holding
the contents of a logical block within a specified file (identified
by a vnode). When a request is made to read a file, the kernel first
looks for the block in the existing buffers. If it is not found
there, it checks the page cache to see if it is still there. If
it is found in the page cache, then it is remapped into a new
buffer thus avoiding the need to read it in from the disk.
The bug is that when a buffer request made for an extended attribute
is fulfilled by reconstituting a buffer from the page cache rather
than reading it in from disk, the BX_ALTDATA flag was not being
set. Thus the flushbuflist() function would never clear it out and
the "ffs_truncate3" panic would occur because the vnode being cleared
still had buffers on its clean-buffer list. If the extended attribute
was being updated, it is first read, then updated, and finally
written. If the read is fulfilled by reconstituting the buffer
from the page cache the BX_ALTDATA flag was not set and thus the
dirty buffer would never be flushed by flushbuflist(). Eventually
the buffer would be recycled. Since it was never written it would
have an unfinished dependency which would trigger the
"softdep_deallocate_dependencies: dangling deps" panic.
The fix is to ensure that the BX_ALTDATA flag is set when a buffer
has been reconstituted from the page cache.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
2019-03-12 19:08:41 +00:00
|
|
|
if ((ioflag & IO_EXT) != 0)
|
|
|
|
bp->b_xflags |= BX_ALTDATA;
|
2016-11-23 17:53:07 +00:00
|
|
|
if ((ioflag & (IO_VMIO | IO_DIRECT)) != 0 && LIST_EMPTY(&bp->b_dep)) {
|
|
|
|
bp->b_flags |= B_RELBUF;
|
|
|
|
if ((ioflag & IO_NOREUSE) != 0)
|
|
|
|
bp->b_flags |= B_NOREUSE;
|
|
|
|
if (release)
|
|
|
|
brelse(bp);
|
|
|
|
} else if (release)
|
|
|
|
bqrelse(bp);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vfs_bio_brelse(struct buf *bp, int ioflag)
|
|
|
|
{
|
|
|
|
|
|
|
|
b_io_dismiss(bp, ioflag, true);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
vfs_bio_set_flags(struct buf *bp, int ioflag)
|
|
|
|
{
|
|
|
|
|
|
|
|
b_io_dismiss(bp, ioflag, false);
|
|
|
|
}
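The release decision made by b_io_dismiss() can be summarized as a small pure function. This is only a sketch under stated assumptions: the `SK_IO_*` bit values are invented for the example (the real IO_* values live in the kernel headers), `has_deps` stands in for `!LIST_EMPTY(&bp->b_dep)`, and the function reports which release routine would be called rather than calling it.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_IO_VMIO	0x01
#define SK_IO_DIRECT	0x02

/* Which release routine the helper would end up calling. */
enum sk_release {
	SK_NONE,	/* flags updated only, buffer stays with the caller */
	SK_BRELSE,	/* pages handed back to the VM (B_RELBUF set) */
	SK_BQRELSE	/* buffer returned to the buffer cache */
};

static enum sk_release
sk_dismiss_path(int ioflag, bool has_deps, bool release)
{

	/* VMIO or direct I/O with no pending dependencies: drop the pages. */
	if ((ioflag & (SK_IO_VMIO | SK_IO_DIRECT)) != 0 && !has_deps)
		return (release ? SK_BRELSE : SK_NONE);
	/* Otherwise keep the buffer cached if it is being released. */
	return (release ? SK_BQRELSE : SK_NONE);
}
```

In these terms, vfs_bio_brelse() corresponds to `release == true` and vfs_bio_set_flags() to `release == false`.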
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2001-05-23 22:24:49 +00:00
|
|
|
* vm_hold_load_pages and vm_hold_free_pages get pages into
|
1995-02-22 09:16:07 +00:00
|
|
|
* a buffer's address space. The pages are anonymous and are
|
|
|
|
* not associated with a file object.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2001-05-19 01:28:09 +00:00
|
|
|
static void
|
2004-09-15 20:54:23 +00:00
|
|
|
vm_hold_load_pages(struct buf *bp, vm_offset_t from, vm_offset_t to)
|
1995-01-09 16:06:02 +00:00
|
|
|
{
|
1994-05-25 09:21:21 +00:00
|
|
|
vm_offset_t pg;
|
|
|
|
vm_page_t p;
|
1996-01-19 04:00:31 +00:00
|
|
|
int index;
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2013-03-19 14:13:12 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
|
1996-01-06 23:23:02 +00:00
|
|
|
to = round_page(to);
|
1996-01-19 04:00:31 +00:00
|
|
|
from = round_page(from);
|
1998-10-13 08:24:45 +00:00
|
|
|
index = (from - trunc_page((vm_offset_t)bp->b_data)) >> PAGE_SHIFT;
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
MPASS((bp->b_flags & B_MAXPHYS) == 0);
|
|
|
|
KASSERT(to - from <= maxbcachebuf,
|
|
|
|
("vm_hold_load_pages too large %p %#jx %#jx %u",
|
|
|
|
bp, (uintmax_t)from, (uintmax_t)to, maxbcachebuf));
|
1996-01-06 23:23:02 +00:00
|
|
|
|
1996-01-19 04:00:31 +00:00
|
|
|
for (pg = from; pg < to; pg += PAGE_SIZE, index++) {
|
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often; it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
|
|
|
/*
|
|
|
|
* note: must allocate system pages since blocking here
|
2009-05-29 18:35:51 +00:00
|
|
|
* could interfere with paging I/O, no matter which
|
2000-11-18 23:06:26 +00:00
|
|
|
* process we are.
|
|
|
|
*/
|
2021-10-20 00:23:39 +00:00
|
|
|
p = vm_page_alloc_noobj(VM_ALLOC_SYSTEM | VM_ALLOC_WIRED |
|
|
|
|
VM_ALLOC_COUNT((to - pg) >> PAGE_SHIFT) | VM_ALLOC_WAITOK);
|
2002-03-17 00:56:41 +00:00
|
|
|
pmap_qenter(pg, &p, 1);
|
1996-01-19 04:00:31 +00:00
|
|
|
bp->b_pages[index] = p;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1997-09-21 04:49:30 +00:00
|
|
|
bp->b_npages = index;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1994-05-25 09:21:21 +00:00
|
|
|
|
2002-03-05 15:38:49 +00:00
|
|
|
/* Return pages associated with this buf to the vm system */
|
2002-09-28 17:15:38 +00:00
|
|
|
static void
|
2010-07-11 20:11:44 +00:00
|
|
|
vm_hold_free_pages(struct buf *bp, int newbsize)
|
1994-09-25 19:34:02 +00:00
|
|
|
{
|
2010-07-11 20:11:44 +00:00
|
|
|
vm_offset_t from;
|
1994-05-25 09:21:21 +00:00
|
|
|
vm_page_t p;
|
1997-09-21 04:49:30 +00:00
|
|
|
int index, newnpages;
|
1996-01-06 23:23:02 +00:00
|
|
|
|
Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
2013-03-19 14:13:12 +00:00
|
|
|
BUF_CHECK_MAPPED(bp);
|
|
|
|
|
2010-07-11 20:11:44 +00:00
|
|
|
from = round_page((vm_offset_t)bp->b_data + newbsize);
|
|
|
|
newnpages = (from - trunc_page((vm_offset_t)bp->b_data)) >> PAGE_SHIFT;
|
|
|
|
if (bp->b_npages > newnpages)
|
|
|
|
pmap_qremove(from, bp->b_npages - newnpages);
|
|
|
|
for (index = newnpages; index < bp->b_npages; index++) {
|
1995-12-11 04:58:34 +00:00
|
|
|
p = bp->b_pages[index];
|
2010-07-11 20:11:44 +00:00
|
|
|
bp->b_pages[index] = NULL;
|
2019-08-28 18:01:54 +00:00
|
|
|
vm_page_unwire_noq(p);
|
2010-07-11 20:11:44 +00:00
|
|
|
vm_page_free(p);
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1997-09-21 04:49:30 +00:00
|
|
|
bp->b_npages = newnpages;
|
1994-05-25 09:21:21 +00:00
|
|
|
}
|
1997-05-10 09:09:42 +00:00
|
|
|
|
2003-01-15 23:54:35 +00:00
|
|
|
/*
|
|
|
|
* Map an IO request into kernel virtual address space.
|
|
|
|
*
|
|
|
|
* Requests are mapped into KVA if mapbuf is set or unmapped I/O is disabled.
|
|
|
|
* Notice that we use b_bufsize for the size of the buffer
|
|
|
|
* to be mapped. b_bcount might be modified by the driver.
|
2003-01-20 17:46:48 +00:00
|
|
|
*
|
|
|
|
* Note that even if the caller determines that the address space should
|
|
|
|
* be valid, a race or a smaller-file mapped into a larger space may
|
|
|
|
* actually cause vmapbuf() to fail, so all callers of vmapbuf() MUST
|
|
|
|
* check the return value.
|
2015-07-23 19:13:41 +00:00
|
|
|
*
|
|
|
|
* This function only works with pager buffers.
|
2003-01-15 23:54:35 +00:00
|
|
|
*/
|
2003-01-20 17:46:48 +00:00
|
|
|
int
|
2020-10-21 16:00:15 +00:00
|
|
|
vmapbuf(struct buf *bp, void *uaddr, size_t len, int mapbuf)
|
2003-01-15 23:54:35 +00:00
|
|
|
{
|
2003-09-13 04:29:55 +00:00
|
|
|
vm_prot_t prot;
|
2010-12-25 21:26:56 +00:00
|
|
|
int pidx;
|
2003-01-15 23:54:35 +00:00
|
|
|
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
MPASS((bp->b_flags & B_MAXPHYS) != 0);
|
2004-09-15 20:54:23 +00:00
|
|
|
prot = VM_PROT_READ;
|
|
|
|
if (bp->b_iocmd == BIO_READ)
|
|
|
|
prot |= VM_PROT_WRITE; /* Less backwards than it looks */
|
2020-11-28 12:12:51 +00:00
|
|
|
pidx = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
|
|
|
|
(vm_offset_t)uaddr, len, prot, bp->b_pages, PBUF_PAGES);
|
|
|
|
if (pidx < 0)
|
2010-12-25 21:26:56 +00:00
|
|
|
return (-1);
|
2020-10-21 16:00:15 +00:00
|
|
|
bp->b_bufsize = len;
|
2003-01-15 23:54:35 +00:00
|
|
|
bp->b_npages = pidx;
|
2020-10-21 16:00:15 +00:00
|
|
|
bp->b_offset = ((vm_offset_t)uaddr) & PAGE_MASK;
|
2013-03-19 14:43:57 +00:00
|
|
|
if (mapbuf || !unmapped_buf_allowed) {
|
2015-07-23 19:13:41 +00:00
|
|
|
pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages, pidx);
|
|
|
|
bp->b_data = bp->b_kvabase + bp->b_offset;
|
|
|
|
} else
|
2013-03-19 14:43:57 +00:00
|
|
|
bp->b_data = unmapped_buf;
|
2020-11-28 12:12:51 +00:00
|
|
|
return (0);
|
2003-01-15 23:54:35 +00:00
|
|
|
}
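The commit log for the MAXPHYS change sizes b_pages[] for pbufs to atop(maxphys) + 1 so that vmapbuf() can take unaligned user buffers. The userspace sketch below is not kernel code; it only illustrates why the extra slot is needed: a buffer of a given length that does not start on a page boundary can touch one more page than an aligned one. The 4 KiB page size and every sk_-prefixed name are assumptions made for illustration. In the kernel the page array is filled by vm_fault_quick_hold_pages().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page geometry for this sketch (4 KiB pages). */
#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	((uintptr_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/* Offset of the first byte within its page, like the value kept in b_offset. */
static inline uintptr_t
sk_page_offset(uintptr_t uaddr)
{
	return (uaddr & SK_PAGE_MASK);
}

/* Number of pages touched by the byte range [uaddr, uaddr + len). */
static inline size_t
sk_pages_spanned(uintptr_t uaddr, size_t len)
{
	uintptr_t first, last;

	if (len == 0)
		return (0);
	/* The range must not wrap around the address space. */
	assert(uaddr + len - 1 >= uaddr);
	first = uaddr >> SK_PAGE_SHIFT;
	last = (uaddr + len - 1) >> SK_PAGE_SHIFT;
	return ((size_t)(last - first) + 1);
}
```

With a 1 MiB transfer, an aligned start address spans 256 pages while a start address 0x200 bytes into a page spans 257, which is the case the + 1 covers.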
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free the io map PTEs associated with this IO operation.
|
|
|
|
* We also invalidate the TLB entries and reset b_data to unmapped_buf.
|
2015-07-23 19:13:41 +00:00
|
|
|
*
|
|
|
|
* This function only works with pager buffers.
|
2003-01-15 23:54:35 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
vunmapbuf(struct buf *bp)
|
|
|
|
{
|
|
|
|
int npages;
|
|
|
|
|
|
|
|
npages = bp->b_npages;
|
2015-07-23 19:13:41 +00:00
|
|
|
if (buf_mapped(bp))
|
2013-03-19 14:43:57 +00:00
|
|
|
pmap_qremove(trunc_page((vm_offset_t)bp->b_data), npages);
|
2010-12-17 22:41:22 +00:00
|
|
|
vm_page_unhold_pages(bp->b_pages, npages);
|
2015-07-23 19:13:41 +00:00
|
|
|
|
|
|
|
bp->b_data = unmapped_buf;
|
2003-01-15 23:54:35 +00:00
|
|
|
}
|
1997-05-10 09:09:42 +00:00
|
|
|
|
2003-03-13 07:31:45 +00:00
|
|
|
void
|
|
|
|
bdone(struct buf *bp)
|
|
|
|
{
|
2008-03-21 10:00:05 +00:00
|
|
|
struct mtx *mtxp;
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2008-03-21 10:00:05 +00:00
|
|
|
mtxp = mtx_pool_find(mtxpool_sleep, bp);
|
|
|
|
mtx_lock(mtxp);
|
2003-03-13 07:31:45 +00:00
|
|
|
bp->b_flags |= B_DONE;
|
|
|
|
wakeup(bp);
|
2008-03-21 10:00:05 +00:00
|
|
|
mtx_unlock(mtxp);
|
2003-03-13 07:31:45 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
bwait(struct buf *bp, u_char pri, const char *wchan)
|
|
|
|
{
|
2008-03-21 10:00:05 +00:00
|
|
|
struct mtx *mtxp;
|
2004-09-15 20:54:23 +00:00
|
|
|
|
2008-03-21 10:00:05 +00:00
|
|
|
mtxp = mtx_pool_find(mtxpool_sleep, bp);
|
|
|
|
mtx_lock(mtxp);
|
2003-03-13 07:31:45 +00:00
|
|
|
while ((bp->b_flags & B_DONE) == 0)
|
2008-03-21 10:00:05 +00:00
|
|
|
msleep(bp, mtxp, pri, wchan, 0);
|
|
|
|
mtx_unlock(mtxp);
|
2003-03-13 07:31:45 +00:00
|
|
|
}
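bdone() and bwait() form a simple completion handshake: the completing side sets B_DONE and issues a wakeup while holding a mutex, and the waiting side re-checks the flag in a loop under the same mutex. A minimal userspace sketch of that pattern with POSIX threads follows. It is an analogy, not the kernel implementation: the condition variable stands in for msleep()/wakeup(), the per-object mutex stands in for the pool mutex returned by mtx_pool_find(), and all sk_-prefixed names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct buf: only the completion state. */
struct sk_buf {
	pthread_mutex_t	lock;	/* plays the role of the pool mutex */
	pthread_cond_t	cv;	/* plays the role of the sleep channel */
	bool		done;	/* plays the role of B_DONE */
};

void
sk_buf_init(struct sk_buf *bp)
{
	int rc;

	rc = pthread_mutex_init(&bp->lock, NULL);
	assert(rc == 0);
	rc = pthread_cond_init(&bp->cv, NULL);
	assert(rc == 0);
	(void)rc;
	bp->done = false;
}

/* Completion side: set the flag and wake waiters while holding the lock. */
void
sk_bdone(struct sk_buf *bp)
{
	pthread_mutex_lock(&bp->lock);
	bp->done = true;
	pthread_cond_broadcast(&bp->cv);
	pthread_mutex_unlock(&bp->lock);
}

/* Waiting side: loop on the flag so spurious wakeups are harmless. */
void
sk_bwait(struct sk_buf *bp)
{
	pthread_mutex_lock(&bp->lock);
	while (!bp->done)
		pthread_cond_wait(&bp->cv, &bp->lock);
	pthread_mutex_unlock(&bp->lock);
}

/* Thread entry point that completes the buffer from another thread. */
void *
sk_completer(void *arg)
{
	sk_bdone(arg);
	return (NULL);
}
```

Because the flag is both set and tested under the lock, the waiter cannot miss a completion that happens between its test and its sleep.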
|
|
|
|
|
2005-01-11 10:43:08 +00:00
|
|
|
int
|
2008-10-10 21:23:50 +00:00
|
|
|
bufsync(struct bufobj *bo, int waitfor)
|
2005-01-11 10:43:08 +00:00
|
|
|
{
|
|
|
|
|
2016-09-30 17:11:03 +00:00
|
|
|
return (VOP_FSYNC(bo2vnode(bo), waitfor, curthread));
|
2005-01-11 10:43:08 +00:00
|
|
|
}
|
|
|
|
|
2004-10-21 15:53:54 +00:00
|
|
|
void
|
|
|
|
bufstrategy(struct bufobj *bo, struct buf *bp)
|
|
|
|
{
|
2018-05-19 04:59:39 +00:00
|
|
|
int i __unused;
|
2004-10-21 15:53:54 +00:00
|
|
|
struct vnode *vp;
|
|
|
|
|
|
|
|
vp = bp->b_vp;
|
2004-10-29 10:52:31 +00:00
|
|
|
KASSERT(vp == bo->bo_private, ("Inconsistent vnode bufstrategy"));
|
2004-10-21 15:53:54 +00:00
|
|
|
KASSERT(vp->v_type != VCHR && vp->v_type != VBLK,
|
|
|
|
("Wrong vnode in bufstrategy(bp=%p, vp=%p)", bp, vp));
|
2004-10-29 10:52:31 +00:00
|
|
|
i = VOP_STRATEGY(vp, bp);
|
2004-10-21 15:53:54 +00:00
|
|
|
KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp));
|
|
|
|
}
|
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
/*
|
|
|
|
* Initialize a struct bufobj before use. Memory is assumed zero filled.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
bufobj_init(struct bufobj *bo, void *private)
|
|
|
|
{
|
|
|
|
static volatile int bufobj_cleanq;
|
|
|
|
|
|
|
|
bo->bo_domain =
|
2018-03-17 18:14:49 +00:00
|
|
|
atomic_fetchadd_int(&bufobj_cleanq, 1) % buf_domains;
|
2018-02-20 00:06:07 +00:00
|
|
|
rw_init(BO_LOCKPTR(bo), "bufobj interlock");
|
|
|
|
bo->bo_private = private;
|
|
|
|
TAILQ_INIT(&bo->bo_clean.bv_hd);
|
|
|
|
TAILQ_INIT(&bo->bo_dirty.bv_hd);
|
|
|
|
}
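bufobj_init() spreads new bufobjs across the buffer domains by taking an atomic fetch-and-add on a shared counter and reducing the result modulo buf_domains. The sketch below shows the same round-robin idea in portable C11. It is illustrative only: sk_pick_domain() and sk_next_ticket are hypothetical names, and an unsigned counter is assumed so that wraparound is well defined.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared ticket counter; zero-initialized because it has static storage. */
static atomic_uint sk_next_ticket;

/*
 * Return a domain index in [0, ndomains), handing out indices in the
 * order callers arrive. Safe to call from several threads at once.
 */
unsigned int
sk_pick_domain(unsigned int ndomains)
{
	assert(ndomains != 0);
	return (atomic_fetch_add(&sk_next_ticket, 1) % ndomains);
}
```

The fetch-and-add needs no lock, so assignment stays cheap even when many objects are initialized concurrently.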
|
|
|
|
|
2005-05-30 07:01:18 +00:00
|
|
|
void
|
|
|
|
bufobj_wrefl(struct bufobj *bo)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(bo != NULL, ("NULL bo in bufobj_wrefl"));
|
2013-05-31 00:43:41 +00:00
|
|
|
ASSERT_BO_WLOCKED(bo);
|
2005-05-30 07:01:18 +00:00
|
|
|
bo->bo_numoutput++;
|
|
|
|
}
|
|
|
|
|
2004-10-21 15:53:54 +00:00
|
|
|
void
|
|
|
|
bufobj_wref(struct bufobj *bo)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(bo != NULL, ("NULL bo in bufobj_wref"));
|
|
|
|
BO_LOCK(bo);
|
|
|
|
bo->bo_numoutput++;
|
|
|
|
BO_UNLOCK(bo);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
bufobj_wdrop(struct bufobj *bo)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(bo != NULL, ("NULL bo in bufobj_wdrop"));
|
|
|
|
BO_LOCK(bo);
|
|
|
|
KASSERT(bo->bo_numoutput > 0, ("bufobj_wdrop non-positive count"));
|
|
|
|
if ((--bo->bo_numoutput == 0) && (bo->bo_flag & BO_WWAIT)) {
|
|
|
|
bo->bo_flag &= ~BO_WWAIT;
|
|
|
|
wakeup(&bo->bo_numoutput);
|
|
|
|
}
|
|
|
|
BO_UNLOCK(bo);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
bufobj_wwait(struct bufobj *bo, int slpflag, int timeo)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
KASSERT(bo != NULL, ("NULL bo in bufobj_wwait"));
|
2013-05-31 00:43:41 +00:00
|
|
|
ASSERT_BO_WLOCKED(bo);
|
2004-10-21 15:53:54 +00:00
|
|
|
error = 0;
|
|
|
|
while (bo->bo_numoutput) {
|
|
|
|
bo->bo_flag |= BO_WWAIT;
|
2013-05-31 00:43:41 +00:00
|
|
|
error = msleep(&bo->bo_numoutput, BO_LOCKPTR(bo),
|
2004-10-21 15:53:54 +00:00
|
|
|
slpflag | (PRIBIO + 1), "bo_wwait", timeo);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2013-03-19 14:13:12 +00:00
|
|
|
/*
|
|
|
|
* Set bio_data or bio_ma for struct bio from the struct buf.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
bdata2bio(struct buf *bp, struct bio *bip)
|
|
|
|
{
|
|
|
|
|
2015-07-23 19:13:41 +00:00
|
|
|
if (!buf_mapped(bp)) {
|
2013-03-19 14:13:12 +00:00
|
|
|
KASSERT(unmapped_buf_allowed, ("unmapped"));
|
|
|
|
bip->bio_ma = bp->b_pages;
|
|
|
|
bip->bio_ma_n = bp->b_npages;
|
|
|
|
bip->bio_data = unmapped_buf;
|
|
|
|
bip->bio_ma_offset = (vm_offset_t)bp->b_offset & PAGE_MASK;
|
|
|
|
bip->bio_flags |= BIO_UNMAPPED;
|
|
|
|
KASSERT(round_page(bip->bio_ma_offset + bip->bio_length) /
|
|
|
|
PAGE_SIZE == bp->b_npages,
|
2013-07-07 21:39:37 +00:00
|
|
|
("Buffer %p too short: %d %lld %d", bp, bip->bio_ma_offset,
|
|
|
|
(long long)bip->bio_length, bip->bio_ma_n));
|
2013-03-19 14:13:12 +00:00
|
|
|
} else {
|
|
|
|
bip->bio_data = bp->b_data;
|
|
|
|
bip->bio_ma = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-11-15 01:41:45 +00:00
|
|
|
/*
|
|
|
|
* The MIPS pmap code currently doesn't handle aliased pages.
|
|
|
|
* The VIPT caches may not handle page aliasing themselves, leading
|
|
|
|
* to data corruption.
|
|
|
|
*
|
|
|
|
* As such, this code makes a system extremely unhappy if said
|
|
|
|
* system doesn't support unaliasing the above situation in hardware.
|
|
|
|
* Some "recent" systems (eg some mips24k/mips74k cores) don't enable
|
|
|
|
* this feature at build time, so it has to be handled in software.
|
|
|
|
*
|
|
|
|
* Once the MIPS pmap/cache code grows to support this function on
|
|
|
|
* earlier chips, it should be flipped back off.
|
|
|
|
*/
|
|
|
|
#ifdef __mips__
|
|
|
|
static int buf_pager_relbuf = 1;
|
|
|
|
#else
|
|
|
|
static int buf_pager_relbuf = 0;
|
|
|
|
#endif
|
2016-10-28 11:43:59 +00:00
|
|
|
SYSCTL_INT(_vfs, OID_AUTO, buf_pager_relbuf, CTLFLAG_RWTUN,
|
|
|
|
&buf_pager_relbuf, 0,
|
|
|
|
"Make buffer pager release buffers after reading");
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The buffer pager. It uses buffer reads to validate pages.
|
|
|
|
*
|
|
|
|
* In contrast to the generic local pager from vm/vnode_pager.c, this
|
|
|
|
* pager correctly and easily handles volumes where the underlying
|
|
|
|
* device block size is greater than the machine page size. The
|
|
|
|
* buffer cache transparently extends the requested page run to be
|
|
|
|
* aligned at the block boundary, and does the necessary bogus page
|
|
|
|
* replacements in the added pages to avoid obliterating already valid
|
|
|
|
* pages.
|
|
|
|
*
|
|
|
|
* The only non-trivial issue is that the exclusive busy state for
|
|
|
|
* pages, which is assumed by the vm_pager_getpages() interface, is
|
|
|
|
* incompatible with the VMIO buffer cache's desire to share-busy the
|
|
|
|
* pages. This function performs a trivial downgrade of the pages'
|
|
|
|
* state before reading buffers, and a less trivial upgrade from the
|
|
|
|
* shared-busy to excl-busy state after the read.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
vfs_bio_getpages(struct vnode *vp, vm_page_t *ma, int count,
|
|
|
|
int *rbehind, int *rahead, vbg_get_lblkno_t get_lblkno,
|
|
|
|
vbg_get_blksize_t get_blksize)
|
|
|
|
{
|
|
|
|
vm_page_t m;
|
|
|
|
vm_object_t object;
|
|
|
|
struct buf *bp;
|
2016-11-08 10:10:55 +00:00
|
|
|
struct mount *mp;
|
2016-10-28 11:43:59 +00:00
|
|
|
daddr_t lbn, lbnp;
|
|
|
|
vm_ooffset_t la, lb, poff, poffe;
|
2021-09-16 23:53:58 +00:00
|
|
|
long bo_bs, bsize;
|
|
|
|
int br_flags, error, i, pgsin, pgsin_a, pgsin_b;
|
2016-10-28 11:43:59 +00:00
|
|
|
bool redo, lpart;
|
|
|
|
|
|
|
|
object = vp->v_object;
|
2016-11-08 10:10:55 +00:00
|
|
|
mp = vp->v_mount;
|
2018-03-14 22:11:45 +00:00
|
|
|
error = 0;
|
2016-10-28 11:43:59 +00:00
|
|
|
la = IDX_TO_OFF(ma[count - 1]->pindex);
|
|
|
|
if (la >= object->un_pager.vnp.vnp_size)
|
|
|
|
return (VM_PAGER_BAD);
|
2018-01-18 12:59:04 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Change the meaning of la from where the last requested page starts
|
|
|
|
* to where it ends, because that's the end of the requested region
|
|
|
|
* and the start of the potential read-ahead region.
|
|
|
|
*/
|
|
|
|
la += PAGE_SIZE;
|
|
|
|
lpart = la > object->un_pager.vnp.vnp_size;
|
2021-09-16 23:53:58 +00:00
|
|
|
error = get_blksize(vp, get_lblkno(vp, IDX_TO_OFF(ma[0]->pindex)),
|
|
|
|
&bo_bs);
|
|
|
|
if (error != 0)
|
|
|
|
return (VM_PAGER_ERROR);
|
2016-11-22 10:06:39 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate read-ahead, behind and total pages.
|
|
|
|
*/
|
|
|
|
pgsin = count;
|
|
|
|
lb = IDX_TO_OFF(ma[0]->pindex);
|
|
|
|
pgsin_b = OFF_TO_IDX(lb - rounddown2(lb, bo_bs));
|
|
|
|
pgsin += pgsin_b;
|
|
|
|
if (rbehind != NULL)
|
|
|
|
*rbehind = pgsin_b;
|
|
|
|
pgsin_a = OFF_TO_IDX(roundup2(la, bo_bs) - la);
|
|
|
|
if (la + IDX_TO_OFF(pgsin_a) >= object->un_pager.vnp.vnp_size)
|
|
|
|
pgsin_a = OFF_TO_IDX(roundup2(object->un_pager.vnp.vnp_size,
|
|
|
|
PAGE_SIZE) - la);
|
|
|
|
pgsin += pgsin_a;
|
|
|
|
if (rahead != NULL)
|
|
|
|
*rahead = pgsin_a;
|
- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
|
|
|
VM_CNT_INC(v_vnodein);
|
|
|
|
VM_CNT_ADD(v_vnodepgsin, pgsin);
|
2016-11-22 10:06:39 +00:00
|
|
|
|
2016-11-08 10:10:55 +00:00
|
|
|
br_flags = (mp != NULL && (mp->mnt_kern_flag & MNTK_UNMAPPED_BUFS)
|
|
|
|
!= 0) ? GB_UNMAPPED : 0;
|
2016-10-28 11:43:59 +00:00
|
|
|
again:
|
2020-03-30 21:42:46 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
|
|
|
if (ma[i] != bogus_page)
|
|
|
|
vm_page_busy_downgrade(ma[i]);
|
|
|
|
}
|
2016-10-28 11:43:59 +00:00
|
|
|
|
|
|
|
lbnp = -1;
|
|
|
|
for (i = 0; i < count; i++) {
|
|
|
|
m = ma[i];
|
2020-03-30 21:42:46 +00:00
|
|
|
if (m == bogus_page)
|
|
|
|
continue;
|
2016-10-28 11:43:59 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Pages are shared busy and the object lock is not
|
|
|
|
* owned, which together allow for the pages'
|
|
|
|
* invalidation. The racy test for validity avoids
|
|
|
|
* useless creation of the buffer for the most typical
|
|
|
|
* case when invalidation is not used in redo or for
|
|
|
|
* parallel read. The shared->excl upgrade loop at
|
|
|
|
* the end of the function catches the race in a
|
|
|
|
* reliable way (protected by the object lock).
|
|
|
|
*/
|
2019-10-15 03:45:41 +00:00
|
|
|
if (vm_page_all_valid(m))
|
2016-10-28 11:43:59 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
poff = IDX_TO_OFF(m->pindex);
|
|
|
|
poffe = MIN(poff + PAGE_SIZE, object->un_pager.vnp.vnp_size);
|
|
|
|
for (; poff < poffe; poff += bsize) {
|
|
|
|
lbn = get_lblkno(vp, poff);
|
|
|
|
if (lbn == lbnp)
|
|
|
|
goto next_page;
|
|
|
|
lbnp = lbn;
|
|
|
|
|
2021-09-16 23:53:58 +00:00
|
|
|
error = get_blksize(vp, lbn, &bsize);
|
|
|
|
if (error == 0)
|
|
|
|
error = bread_gb(vp, lbn, bsize,
|
|
|
|
curthread->td_ucred, br_flags, &bp);
|
2016-10-28 11:43:59 +00:00
|
|
|
if (error != 0)
|
|
|
|
goto end_pages;
|
2020-03-05 15:52:34 +00:00
|
|
|
if (bp->b_rcred == curthread->td_ucred) {
|
|
|
|
crfree(bp->b_rcred);
|
|
|
|
bp->b_rcred = NOCRED;
|
|
|
|
}
|
2016-10-28 11:43:59 +00:00
|
|
|
if (LIST_EMPTY(&bp->b_dep)) {
|
|
|
|
/*
|
|
|
|
* Invalidation clears m->valid, but
|
|
|
|
* may leave B_CACHE flag if the
|
|
|
|
* buffer existed at the invalidation
|
|
|
|
* time. In this case, recycle the
|
|
|
|
* buffer to do real read on next
|
|
|
|
* bread() after redo.
|
|
|
|
*
|
|
|
|
* Otherwise B_RELBUF is not strictly
|
|
|
|
* necessary, but can be enabled to reduce buf
|
|
|
|
* cache pressure.
|
|
|
|
*/
|
|
|
|
if (buf_pager_relbuf ||
|
2019-10-15 03:45:41 +00:00
|
|
|
!vm_page_all_valid(m))
|
2016-10-28 11:43:59 +00:00
|
|
|
bp->b_flags |= B_RELBUF;
|
|
|
|
|
|
|
|
bp->b_flags &= ~B_NOCACHE;
|
|
|
|
brelse(bp);
|
|
|
|
} else {
|
|
|
|
bqrelse(bp);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
KASSERT(1 /* racy, enable for debugging */ ||
|
2019-10-15 03:45:41 +00:00
|
|
|
vm_page_all_valid(m) || i == count - 1,
|
2016-10-28 11:43:59 +00:00
|
|
|
("buf %d %p invalid", i, m));
|
|
|
|
if (i == count - 1 && lpart) {
|
2019-10-15 03:45:41 +00:00
|
|
|
if (!vm_page_none_valid(m) &&
|
|
|
|
!vm_page_all_valid(m))
|
2016-10-28 11:43:59 +00:00
|
|
|
vm_page_zero_invalid(m, TRUE);
|
|
|
|
}
|
|
|
|
next_page:;
|
|
|
|
}
|
|
|
|
end_pages:
|
|
|
|
|
|
|
|
redo = false;
|
|
|
|
for (i = 0; i < count; i++) {
|
2020-03-30 21:42:46 +00:00
|
|
|
if (ma[i] == bogus_page)
|
|
|
|
continue;
|
2020-02-28 20:34:30 +00:00
|
|
|
if (vm_page_busy_tryupgrade(ma[i]) == 0) {
|
|
|
|
vm_page_sunbusy(ma[i]);
|
|
|
|
ma[i] = vm_page_grab_unlocked(object, ma[i]->pindex,
|
|
|
|
VM_ALLOC_NORMAL);
|
|
|
|
}
|
2016-10-28 11:43:59 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since the pages were only sbusy while neither the
|
|
|
|
* buffer nor the object lock was held by us, or
|
|
|
|
* reallocated while vm_page_grab() slept for busy
|
|
|
|
* relinquish, they could have been invalidated.
|
|
|
|
* Recheck the valid bits and re-read as needed.
|
|
|
|
*
|
|
|
|
* Note that the last page is made fully valid in the
|
|
|
|
* read loop, and partial validity for the page at
|
|
|
|
* index count - 1 could mean that the page was
|
|
|
|
* invalidated or removed, so we must restart for
|
|
|
|
* safety as well.
|
|
|
|
*/
|
2019-10-15 03:45:41 +00:00
|
|
|
if (!vm_page_all_valid(ma[i]))
|
2016-10-28 11:43:59 +00:00
|
|
|
redo = true;
|
|
|
|
}
|
|
|
|
if (redo && error == 0)
|
|
|
|
goto again;
|
|
|
|
return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK);
|
|
|
|
}
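The read-behind and read-ahead counts in vfs_bio_getpages() come from extending the requested page run to filesystem block boundaries and then clipping the read-ahead at the end of the file. The userspace sketch below reproduces that arithmetic in isolation. It is a sketch under stated assumptions: 4 KiB pages, a power-of-two block size, and hypothetical sk_-prefixed names; it performs no I/O and does not model the busy-state handling.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_PAGE_SIZE	((int64_t)1 << SK_PAGE_SHIFT)
/* Power-of-two rounding, in the spirit of rounddown2()/roundup2(). */
#define SK_ROUNDDOWN2(x, y)	((x) & ~((y) - 1))
#define SK_ROUNDUP2(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

struct sk_readaround {
	int behind;	/* extra pages before the first requested page */
	int ahead;	/* extra pages after the last requested page */
};

/*
 * first_idx: page index of the first requested page
 * count:     number of contiguous requested pages
 * blksize:   block size in bytes, a power of two
 * filesize:  file size in bytes; the last requested page starts below it
 */
struct sk_readaround
sk_read_around(int64_t first_idx, int count, int64_t blksize, int64_t filesize)
{
	struct sk_readaround ra;
	int64_t lb, la, eof;

	assert(count > 0 && blksize > 0 && (blksize & (blksize - 1)) == 0);
	lb = first_idx << SK_PAGE_SHIFT;		/* start of request */
	la = (first_idx + count) << SK_PAGE_SHIFT;	/* end of request */
	eof = SK_ROUNDUP2(filesize, SK_PAGE_SIZE);	/* page-rounded EOF */

	ra.behind = (int)((lb - SK_ROUNDDOWN2(lb, blksize)) >> SK_PAGE_SHIFT);
	ra.ahead = (int)((SK_ROUNDUP2(la, blksize) - la) >> SK_PAGE_SHIFT);
	/* Never read ahead past the last page of the file. */
	if (la + ((int64_t)ra.ahead << SK_PAGE_SHIFT) >= filesize)
		ra.ahead = (int)((eof - la) >> SK_PAGE_SHIFT);
	return (ra);
}
```

For example, with 32 KiB blocks, a request for pages 3 and 4 of a 1 MiB file is widened by three pages on each side so that the whole first block is read.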
|
|
|
|
|
1997-05-10 09:09:42 +00:00
|
|
|
#include "opt_ddb.h"
|
|
|
|
#ifdef DDB
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
|
2002-03-05 15:38:49 +00:00
|
|
|
/* DDB command to show buffer data */
|
1997-05-10 09:09:42 +00:00
|
|
|
DB_SHOW_COMMAND(buffer, db_show_buffer)
|
|
|
|
{
|
|
|
|
/* get args */
|
|
|
|
struct buf *bp = (struct buf *)addr;
|
2016-10-31 23:09:52 +00:00
|
|
|
#ifdef FULL_BUF_TRACKING
|
|
|
|
uint32_t i, j;
|
|
|
|
#endif
|
1997-05-10 09:09:42 +00:00
|
|
|
|
|
|
|
if (!have_addr) {
|
|
|
|
db_printf("usage: show buffer <addr>\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2005-03-25 00:20:37 +00:00
|
|
|
db_printf("buf at %p\n", bp);
|
2019-01-25 21:24:09 +00:00
|
|
|
db_printf("b_flags = 0x%b, b_xflags=0x%b\n",
|
|
|
|
(u_int)bp->b_flags, PRINT_BUF_FLAGS,
|
|
|
|
(u_int)bp->b_xflags, PRINT_BUF_XFLAGS);
|
|
|
|
db_printf("b_vflags=0x%b b_ioflags=0x%b\n",
|
|
|
|
(u_int)bp->b_vflags, PRINT_BUF_VFLAGS,
|
|
|
|
(u_int)bp->b_ioflags, PRINT_BIO_FLAGS);
|
2002-03-19 04:09:21 +00:00
|
|
|
db_printf(
|
|
|
|
"b_error = %d, b_bufsize = %ld, b_bcount = %ld, b_resid = %ld\n"
|
2019-03-11 21:49:44 +00:00
|
|
|
"b_bufobj = (%p), b_data = %p\nb_blkno = %jd, b_lblkno = %jd, "
|
|
|
|
"b_vp = %p, b_dep = %p\n",
|
2002-03-19 04:09:21 +00:00
|
|
|
bp->b_error, bp->b_bufsize, bp->b_bcount, bp->b_resid,
|
2008-09-16 11:19:38 +00:00
|
|
|
bp->b_bufobj, bp->b_data, (intmax_t)bp->b_blkno,
|
2019-03-11 21:49:44 +00:00
|
|
|
(intmax_t)bp->b_lblkno, bp->b_vp, bp->b_dep.lh_first);
|
2015-07-23 19:13:41 +00:00
|
|
|
db_printf("b_kvabase = %p, b_kvasize = %d\n",
|
|
|
|
bp->b_kvabase, bp->b_kvasize);
|
1997-09-21 04:49:30 +00:00
|
|
|
if (bp->b_npages) {
|
|
|
|
int i;
|
|
|
|
db_printf("b_npages = %d, pages(OBJ, IDX, PA): ", bp->b_npages);
|
|
|
|
for (i = 0; i < bp->b_npages; i++) {
|
|
|
|
vm_page_t m;
|
|
|
|
m = bp->b_pages[i];
|
2016-07-14 18:49:05 +00:00
|
|
|
if (m != NULL)
|
|
|
|
db_printf("(%p, 0x%lx, 0x%lx)", m->object,
|
|
|
|
(u_long)m->pindex,
|
|
|
|
(u_long)VM_PAGE_TO_PHYS(m));
|
|
|
|
else
|
|
|
|
db_printf("( ??? )");
|
1997-09-21 04:49:30 +00:00
|
|
|
if ((i + 1) < bp->b_npages)
|
|
|
|
db_printf(",");
|
|
|
|
}
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
2018-03-17 18:14:49 +00:00
|
|
|
BUF_LOCKPRINTINFO(bp);
|
2016-10-31 23:09:52 +00:00
|
|
|
#if defined(FULL_BUF_TRACKING)
|
|
|
|
db_printf("b_io_tracking: b_io_tcnt = %u\n", bp->b_io_tcnt);
|
|
|
|
|
|
|
|
i = bp->b_io_tcnt % BUF_TRACKING_SIZE;
|
2017-04-23 17:39:31 +00:00
|
|
|
for (j = 1; j <= BUF_TRACKING_SIZE; j++) {
|
|
|
|
if (bp->b_io_tracking[BUF_TRACKING_ENTRY(i - j)] == NULL)
|
|
|
|
continue;
|
2016-10-31 23:09:52 +00:00
|
|
|
db_printf(" %2u: %s\n", j,
|
|
|
|
bp->b_io_tracking[BUF_TRACKING_ENTRY(i - j)]);
|
2017-04-23 17:39:31 +00:00
|
|
|
}
|
2016-10-31 23:09:52 +00:00
|
|
|
#elif defined(BUF_TRACKING)
|
|
|
|
db_printf("b_io_tracking: %s\n", bp->b_io_tracking);
|
|
|
|
#endif
|
2009-02-06 20:06:48 +00:00
|
|
|
db_printf(" ");
|
1997-05-10 09:09:42 +00:00
|
|
|
}
|
2005-03-25 00:20:37 +00:00
|
|
|
|
2018-02-20 00:06:07 +00:00
|
|
|
DB_SHOW_COMMAND(bufqueues, bufqueues)
|
|
|
|
{
|
|
|
|
struct bufdomain *bd;
|
2018-03-17 18:14:49 +00:00
|
|
|
struct buf *bp;
|
|
|
|
long total;
|
|
|
|
int i, j, cnt;
|
2018-02-20 00:06:07 +00:00
|
|
|
|
|
|
|
db_printf("bqempty: %d\n", bqempty.bq_len);
|
|
|
|
|
2018-03-17 18:14:49 +00:00
|
|
|
for (i = 0; i < buf_domains; i++) {
|
|
|
|
bd = &bdomain[i];
|
2018-02-20 00:06:07 +00:00
|
|
|
db_printf("Buf domain %d\n", i);
|
|
|
|
db_printf("\tfreebufs\t%d\n", bd->bd_freebuffers);
|
|
|
|
db_printf("\tlofreebufs\t%d\n", bd->bd_lofreebuffers);
|
|
|
|
db_printf("\thifreebufs\t%d\n", bd->bd_hifreebuffers);
|
|
|
|
db_printf("\n");
|
|
|
|
db_printf("\tbufspace\t%ld\n", bd->bd_bufspace);
|
|
|
|
db_printf("\tmaxbufspace\t%ld\n", bd->bd_maxbufspace);
|
|
|
|
db_printf("\thibufspace\t%ld\n", bd->bd_hibufspace);
|
|
|
|
db_printf("\tlobufspace\t%ld\n", bd->bd_lobufspace);
|
|
|
|
db_printf("\tbufspacethresh\t%ld\n", bd->bd_bufspacethresh);
|
|
|
|
db_printf("\n");
|
2018-03-17 18:14:49 +00:00
|
|
|
db_printf("\tnumdirtybuffers\t%d\n", bd->bd_numdirtybuffers);
|
|
|
|
db_printf("\tlodirtybuffers\t%d\n", bd->bd_lodirtybuffers);
|
|
|
|
db_printf("\thidirtybuffers\t%d\n", bd->bd_hidirtybuffers);
|
|
|
|
db_printf("\tdirtybufthresh\t%d\n", bd->bd_dirtybufthresh);
|
|
|
|
db_printf("\n");
|
|
|
|
total = 0;
|
|
|
|
TAILQ_FOREACH(bp, &bd->bd_cleanq->bq_queue, b_freelist)
|
|
|
|
total += bp->b_bufsize;
|
|
|
|
db_printf("\tcleanq count\t%d (%ld)\n",
|
|
|
|
bd->bd_cleanq->bq_len, total);
|
|
|
|
total = 0;
|
|
|
|
TAILQ_FOREACH(bp, &bd->bd_dirtyq.bq_queue, b_freelist)
|
|
|
|
total += bp->b_bufsize;
|
|
|
|
db_printf("\tdirtyq count\t%d (%ld)\n",
|
|
|
|
bd->bd_dirtyq.bq_len, total);
|
2018-02-20 00:06:07 +00:00
|
|
|
db_printf("\twakeup\t\t%d\n", bd->bd_wanted);
|
|
|
|
db_printf("\tlim\t\t%d\n", bd->bd_lim);
|
|
|
|
db_printf("\tCPU ");
|
2018-02-25 00:35:21 +00:00
|
|
|
for (j = 0; j <= mp_maxid; j++)
|
2018-02-20 00:06:07 +00:00
|
|
|
db_printf("%d, ", bd->bd_subq[j].bq_len);
|
|
|
|
db_printf("\n");
|
2018-03-17 18:14:49 +00:00
|
|
|
cnt = 0;
|
|
|
|
total = 0;
|
Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
|
|
|
for (j = 0; j < nbuf; j++) {
|
|
|
|
bp = nbufp(j);
|
|
|
|
if (bp->b_domain == i && BUF_ISLOCKED(bp)) {
|
2018-03-17 18:14:49 +00:00
|
|
|
cnt++;
|
2020-11-28 12:12:51 +00:00
|
|
|
total += bp->b_bufsize;
|
2018-03-17 18:14:49 +00:00
|
|
|
}
|
2020-11-28 12:12:51 +00:00
|
|
|
}
|
2018-03-17 18:14:49 +00:00
|
|
|
db_printf("\tLocked buffers: %d space %ld\n", cnt, total);
|
|
|
|
cnt = 0;
|
|
|
|
total = 0;
|
2020-11-28 12:12:51 +00:00
|
|
|
for (j = 0; j < nbuf; j++) {
|
|
|
|
bp = nbufp(j);
|
|
|
|
if (bp->b_domain == i) {
|
2018-03-17 18:14:49 +00:00
|
|
|
cnt++;
|
2020-11-28 12:12:51 +00:00
|
|
|
total += bp->b_bufsize;
|
2018-03-17 18:14:49 +00:00
|
|
|
}
|
2020-11-28 12:12:51 +00:00
|
|
|
}
|
2018-03-17 18:14:49 +00:00
|
|
|
db_printf("\tTotal buffers: %d space %ld\n", cnt, total);
|
2018-02-20 00:06:07 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-03-25 00:20:37 +00:00
|
|
|
DB_SHOW_COMMAND(lockedbufs, lockedbufs)
|
|
|
|
{
|
|
|
|
struct buf *bp;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < nbuf; i++) {
|
2020-11-28 12:12:51 +00:00
|
|
|
bp = nbufp(i);
|
2008-01-19 17:36:23 +00:00
|
|
|
if (BUF_ISLOCKED(bp)) {
|
2005-03-25 00:20:37 +00:00
|
|
|
db_show_buffer((uintptr_t)bp, 1, 0, NULL);
|
|
|
|
db_printf("\n");
|
2017-04-23 22:20:25 +00:00
|
|
|
if (db_pager_quit)
|
|
|
|
break;
|
2005-03-25 00:20:37 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2008-09-16 11:19:38 +00:00
|
|
|
|
|
|
|
DB_SHOW_COMMAND(vnodebufs, db_show_vnodebufs)
|
|
|
|
{
|
|
|
|
struct vnode *vp;
|
|
|
|
struct buf *bp;
|
|
|
|
|
|
|
|
if (!have_addr) {
|
|
|
|
db_printf("usage: show vnodebufs <addr>\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
vp = (struct vnode *)addr;
|
|
|
|
db_printf("Clean buffers:\n");
|
|
|
|
TAILQ_FOREACH(bp, &vp->v_bufobj.bo_clean.bv_hd, b_bobufs) {
|
|
|
|
db_show_buffer((uintptr_t)bp, 1, 0, NULL);
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
db_printf("Dirty buffers:\n");
|
|
|
|
TAILQ_FOREACH(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs) {
|
|
|
|
db_show_buffer((uintptr_t)bp, 1, 0, NULL);
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
}
|
2010-06-11 17:03:26 +00:00
|
|
|
|
|
|
|
DB_COMMAND(countfreebufs, db_countfreebufs)
|
|
|
|
{
|
|
|
|
struct buf *bp;
|
|
|
|
int i, used = 0, nfree = 0;
|
|
|
|
|
|
|
|
if (have_addr) {
|
|
|
|
db_printf("usage: countfreebufs\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < nbuf; i++) {
|
2020-11-28 12:12:51 +00:00
|
|
|
bp = nbufp(i);
|
2015-10-14 02:10:07 +00:00
|
|
|
if (bp->b_qindex == QUEUE_EMPTY)
|
2010-06-11 17:03:26 +00:00
|
|
|
nfree++;
|
|
|
|
else
|
|
|
|
used++;
|
|
|
|
}
|
|
|
|
|
|
|
|
db_printf("Counted %d free, %d used (%d tot)\n", nfree, used,
|
|
|
|
nfree + used);
|
|
|
|
db_printf("numfreebuffers is %d\n", numfreebuffers);
|
|
|
|
}
|
1997-05-10 09:09:42 +00:00
|
|
|
#endif /* DDB */
|