Commit Graph

716 Commits

Author SHA1 Message Date
Mark Johnston
90e17792c8 Do not set BIO_DONE if the BIO specifies a completion handler.
biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.

Reviewed by:	mav
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9070
2017-01-10 21:41:28 +00:00
Gleb Smirnoff
bfc8c24c73 Move bogus_page declaration to vm_page.h and initialization to vm_page.c.
Reviewed by:	kib
2017-01-04 22:27:19 +00:00
Mark Johnston
99e6e1930c Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D8589
2016-11-23 17:53:07 +00:00
Konstantin Belousov
eb962424ba Restore vnode pager statistic for buffer pagers.
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8585
2016-11-22 10:06:39 +00:00
Adrian Chadd
8ffa01a061 [mips] enable relbuf on mips for now to work around page aliasing in mips hardware.
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.

So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.

Discussed with: kib, bsdimp, juli
2016-11-15 01:41:45 +00:00
Konstantin Belousov
9a639daf77 Tweaks for the buffer pager.
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.

For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-08 10:10:55 +00:00
Conrad Meyer
8532d381a9 Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by:	markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8366
2016-10-31 23:09:52 +00:00
Konstantin Belousov
c39baa7480 Generalize UFS buffer pager to allow it serving other filesystems
which also use buffer cache.

Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-10-28 11:43:59 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Conrad Meyer
f43292ecf4 vfs_bio: Remove a leading space (style)
Introduced in r282085.

Sponsored by:	Dell EMC Isilon
2016-10-05 23:42:02 +00:00
Mateusz Guzik
8660b707ff vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
Mark Johnston
5004817335 Remove b_pin_count from struct buf.
It was added in r153192 for XFS and doesn't appear to have been used for
anything else. XFS was disconnected in r241607 and removed entirely in
r247631.

Reported by:	mlaier
Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D7468
2016-08-11 07:58:23 +00:00
Mark Johnston
0649caae32 Let DDB's buf printer handle NULL pointers in the buf page array.
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-07-14 18:49:05 +00:00
Pedro F. Giffuni
31b6732008 sys/kern: spelling fixes.
Mostly on comments but affects some debug messages.

MFC after: 2 weeks
2016-04-29 21:54:28 +00:00
Pedro F. Giffuni
f3b827f3ec bufs: make B_DIRTY and B_PERSISTENT flags available
It appears these flags were related to ext2fs but are completely
unused nowadays. Retire them.

Suggested by: mckusick
2016-04-29 16:32:28 +00:00
Pedro F. Giffuni
d9c9c81c08 sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
2016-04-21 19:57:40 +00:00
Edward Tomasz Napierala
ae34b6ff96 Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
Pedro F. Giffuni
7559912039 Minor grammar fix in comment. 2016-02-07 16:18:12 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Adrian Chadd
45130aa12e Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by:	jeff
Reviewed by:	kib
2015-12-16 00:13:16 +00:00
Steven Hartland
f1b13a89b0 Don't use 0 for pointer comparison
Use NULL instead of 0 for comparison with panicstr.

MFC after:	1 week
Sponsored by:	Multiplay
2015-12-08 18:38:33 +00:00
Adrian Chadd
2eb936acb9 Add a sched_yield() to work around low memory conditions in the current code.
Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.

This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.

Tested:

* Carambola2, 64MB (and 32MB by manual config.)
2015-11-07 04:04:00 +00:00
Warner Losh
a1f26210f3 The error classification from lower layers is a poor indicator of
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.

Reviewed by: mckusick@, kib@
2015-10-31 04:53:07 +00:00
Bryan Drewery
156c04f793 getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse().
This came in recently in r289279.

Coverity CID:	1331561
2015-10-29 19:02:24 +00:00
Jeff Roberson
21fae96123 Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

 - Get buffer headers from a per-cpu uma cache that sits in from of the
   free queue.
 - Use a per-cpu quantum cache in vmem to eliminate contention for kva.
 - Use multiple clean queues according to buffer cache size to eliminate
   clean queue lock contention.
 - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
   from blocking or doing direct recycling.
 - Close some bufspace allocation races that could lead to endless
   recycling.
 - Further the transition to a more modern style of small functions grouped
   by prefix in order to improve growing complexity.

Sponsored by:	EMC / Isilon
Reviewed by:	kib
Tested by:	pho
2015-10-14 02:10:07 +00:00
Alan Cox
acada7aef0 Perform a single batched update to the object's paging-in-progress count
rather than updating it for each page.
2015-10-03 17:04:52 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Jeff Roberson
4615830db2 - Collapse vfs_vmio_truncate & vfs_vmio_release into a single function.
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
   single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
 - Eliminate the special B_ASYNC handling on free that has not been
   relevant for some time.
 - Remove the extraneous page busy from vfs_vmio_truncate().

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon storage division
2015-09-27 05:16:06 +00:00
Jeff Roberson
589c956a5a - Fix a nonsense reordering that somehow slipped into my last diff.
Reported by:	pho
2015-09-23 07:44:07 +00:00
Jeff Roberson
8264830c95 Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
   invalidation loop in brelse.  This has been a no-op in modern times as
   biodone() is responsible for cleaning up after bogus pages.  This
   would've spammed the console with printfs at a minimum.
 - Allow the compiler and human readers alike to reason about allocbuf()
   by splitting it into constituent parts.
 - Separate the VM manipulating and buf manipulating code in brelse() and
   bufdone() so that the intentions are clear.  This makes it evident that
   there are several duplicated buf pages loops that will be consolidated
   at a later time.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 23:57:52 +00:00
Alan Cox
15aaea7892 Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 18:16:52 +00:00
Warner Losh
7297c5e535 bufdonebio is now unused. Retire it too. 2015-09-11 04:20:04 +00:00
Warner Losh
ad8d57a99d dev_strategy and dev_strategy_csw are unused since r281825. Remove
them.

Differential Revision: https://reviews.freebsd.org/D3620
2015-09-11 00:38:58 +00:00
Roger Pau Monné
c023d8234b vfs: fill fallout from r286076
This right operator is >= not =>.

Reported by: cem
2015-07-30 15:43:26 +00:00
Roger Pau Monné
8f89a299e2 vfs: fix off-by-one error in vfs_buf_check_mapped
The check added in r285872 can trigger for valid buffers if the buffer space
used happens to be just after unmapped_buf in KVA space.

Discussed with: kib
Sponsored by: Citrix Systems R&D
2015-07-30 15:28:06 +00:00
Konstantin Belousov
6cebf7e2be Move bufshutdown() out of the #ifdef INVARIANTS block. 2015-07-29 09:57:34 +00:00
Jeff Roberson
98082691bb - Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has lead me to some irritating to discover
   bugs over the years.  It also makes it more challenging to refactor
   the buf allocation system.
 - Move swbuf and declare it as an extern in vfs_bio.c.  This is still
   not perfect but better than it was before.
 - Eliminate the unused ffs function that relied on knowledge of the buf
   array.
 - Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-29 02:26:57 +00:00
Jeff Roberson
38750ada8f - Eliminate the EMPTYKVA queue. It served as a cache of KVA allocations
attached to bufs to avoid the overhead of the vm.  This purposes is now
   better served by vmem.  Freeing the kva immediately when a buf is
   destroyed leads to lower fragmentation and a much simpler scan algorithm.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-07-28 20:24:09 +00:00
Konstantin Belousov
6fd04eff66 With the removal of b_saveaddr in the r285819, b_data must be reset to
b_kvabase when the buffer is reclaimed.  Otherwise, if b_data for the
mapped buffer was adjusted with the page-offset portion of b_offset,
nothing would re-adjust the b_data, which breaks buffer management
code which expects page-aligned b_data (see e.g. bpman_qenter(), which
skips partial pages).

Fix a minor issue with the GB_KVAALLOC requests, which could result in
returning the mapped buffer if the reused buffer is mapped and have
the right amount of KVA reserved.

Improve assertion in the vfs_buf_check_mapped() to catch unmapped
buffers which have their b_data incorrectly adjusted with offset.

Reported and tested by:	pho (previous version)
Reviewed by:	jeff (previous version)
Sponsored by:	The FreeBSD Foundation
2015-07-25 15:00:14 +00:00
Jeff Roberson
fade8dd714 Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
   flags to switch buffers between unmapped and mapped.  This eliminates
   multiple flags and generally simplifies the logic.
 - Eliminate b_saveaddr since it is only used with pager bufs which have
   their b_data re-initialized on each allocation.
 - Gather up some convenience routines in the buffer cache for
   manipulating buf space and buf malloc space.
 - Add an inline, buf_mapped(), to standardize checks around unmapped
   buffers.

In collaboration with: mlaier
Reviewed by:	kib
Tested by:	pho (many small revisions ago)
Sponsored by:	EMC / Isilon Storage Division
2015-07-23 19:13:41 +00:00
Jeff Roberson
1c1ddc0351 - Don't defeat the FIFO nature of the buffer cache by eliminating the
most recently used buffer when we are under paging pressure.  This is
   a perversion of the buffer and page replacement algorithms and recent
   improvements to the page daemon have rendered it unnecessary.  In the
   event that low-memory deadlocks become an issue it would be possible
   to make a daemon or event handler that performs a similar action on
   the oldest buffers rather than the newest.  Since the buf cache
   is analogous to the page cache and some minimum working set is desired
   another possibility is to simply shrink the minimum working set which
   has less downside now that file pages are not directly mapped.

Sponsored by:	EMC / Isilon
Reviewed by:	alc, kib (with some minor objection)
Tested by:	pho
2015-07-23 02:20:41 +00:00
Konstantin Belousov
cf88021ab1 Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed.  Buffer
cache cannot write such buffers.  Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks.  Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-11 11:21:56 +00:00
Konstantin Belousov
b2c3df842b Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer.  Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed.  Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock.  Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with:	Conrad Meyer <cse.cem@gmail.com>
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation (kib),
	  EMC/Isilon storage division (Conrad)
MFC after:	2 weeks
2015-06-27 09:44:14 +00:00
Konstantin Belousov
b05c401ff6 Only take previous buffer queue lock (olock) when needed for REMFREE
in binsfree().

Submitted by:	Conrad Meyer
Sponsored by:	EMC / Isilon Storage Division
Review:	https://reviews.freebsd.org/D2882
MFC after:	1 week
2015-06-23 06:12:14 +00:00
Konstantin Belousov
9a2c85350a Partially revert r255986: do not call VOP_FSYNC() when helping
bufdaemon in getnewbuf(), do use buf_flush().  The difference is that
bufdaemon uses TRYLOCK to get buffer locks, which allows calls to
getnewbuf() while another buffer is locked.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-27 11:13:19 +00:00
Rick Macklem
7cfdc2a7bc MAXBSIZE defines both the largest UFS block size and the
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.

Differential Revision:	https://reviews.freebsd.org/D2330
Reviewed by:	mav, kib
MFC after:	2 weeks
2015-04-25 00:52:01 +00:00
Benno Rice
43348dc2ad Reset bp->bio_done to unmapped_buf when removing a transient map in biodone.
Submitted by:	Scott Ferris <scott.ferris@isilon.com>
Sponsored by:	EMC / Isilon Storage Division
Reviewed by:	kib
2015-03-16 20:00:09 +00:00
Konstantin Belousov
904ed548bb When getnewbuf_reuse_bp() is called to reclaim some (clean) buffer,
the vnode owning the buffer is not locked.  More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.

As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object.  Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock.  Also, seeing stale pointer to the v_object is safe due to vm
object type stability.

Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:42:34 +00:00
Alexander Motin
ccf8a5688a Revert somewhat hackish geom_disk optimization, committed as part of r256880,
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.

While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load.  I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.

Requested by:	scottl@
MFC after:	3 days
2014-10-25 15:16:19 +00:00
Alexander Motin
99b9076c21 Remove setting BIO_DONE flag for BIOs that have done() method.
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.

MFC after:	1 week
2014-10-15 18:36:34 +00:00