rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.
Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.
To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.
Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.
Reviewed by: kib
Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.
This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.
Tested:
* Carambola2, 64MB (and 32MB by manual config.)
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.
Reviewed by: mckusick@, kib@
8x performance improvement in a micro benchmark on a 4 socket machine.
- Get buffer headers from a per-cpu uma cache that sits in from of the
free queue.
- Use a per-cpu quantum cache in vmem to eliminate contention for kva.
- Use multiple clean queues according to buffer cache size to eliminate
clean queue lock contention.
- Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
from blocking or doing direct recycling.
- Close some bufspace allocation races that could lead to endless
recycling.
- Further the transition to a more modern style of small functions grouped
by prefix in order to improve growing complexity.
Sponsored by: EMC / Isilon
Reviewed by: kib
Tested by: pho
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.
This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
- Eliminate the special B_ASYNC handling on free that has not been
relevant for some time.
- Remove the extraneous page busy from vfs_vmio_truncate().
Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon storage division
- Eliminate bogus page replacement that is inconsistently applied in the
invalidation loop in brelse. This has been a no-op in modern times as
biodone() is responsible for cleaning up after bogus pages. This
would've spammed the console with printfs at a minimum.
- Allow the compiler and human readers alike to reason about allocbuf()
by splitting it into constituent parts.
- Separate the VM manipulating and buf manipulating code in brelse() and
bufdone() so that the intentions are clear. This makes it evident that
there are several duplicated buf pages loops that will be consolidated
at a later time.
Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.
Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
The check added in r285872 can trigger for valid buffers if the buffer space
used happens to be just after unmapped_buf in KVA space.
Discussed with: kib
Sponsored by: Citrix Systems R&D
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
attached to bufs to avoid the overhead of the vm. This purposes is now
better served by vmem. Freeing the kva immediately when a buf is
destroyed leads to lower fragmentation and a much simpler scan algorithm.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
b_kvabase when the buffer is reclaimed. Otherwise, if b_data for the
mapped buffer was adjusted with the page-offset portion of b_offset,
nothing would re-adjust the b_data, which breaks buffer management
code which expects page-aligned b_data (see e.g. bpman_qenter(), which
skips partial pages).
Fix a minor issue with the GB_KVAALLOC requests, which could result in
returning the mapped buffer if the reused buffer is mapped and have
the right amount of KVA reserved.
Improve assertion in the vfs_buf_check_mapped() to catch unmapped
buffers which have their b_data incorrectly adjusted with offset.
Reported and tested by: pho (previous version)
Reviewed by: jeff (previous version)
Sponsored by: The FreeBSD Foundation
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.
In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division
most recently used buffer when we are under paging pressure. This is
a perversion of the buffer and page replacement algorithms and recent
improvements to the page daemon have rendered it unnecessary. In the
event that low-memory deadlocks become an issue it would be possible
to make a daemon or event handler that performs a similar action on
the oldest buffers rather than the newest. Since the buf cache
is analogous to the page cache and some minimum working set is desired
another possibility is to simply shrink the minimum working set which
has less downside now that file pages are not directly mapped.
Sponsored by: EMC / Isilon
Reviewed by: alc, kib (with some minor objection)
Tested by: pho
objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.
BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.
Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks
bufdaemon in getnewbuf(), do use buf_flush(). The difference is that
bufdaemon uses TRYLOCK to get buffer locks, which allows calls to
getnewbuf() while another buffer is locked.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.
Differential Revision: https://reviews.freebsd.org/D2330
Reviewed by: mav, kib
MFC after: 2 weeks
the vnode owning the buffer is not locked. More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.
As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object. Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock. Also, seeing stale pointer to the v_object is safe due to vm
object type stability.
Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.
While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load. I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.
Requested by: scottl@
MFC after: 3 days
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.
MFC after: 1 week
set bo_bsize on a bufobj.
This is a slight modification of the patch provided.
PR: 193146
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC Isilon Storage Division
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
flags, to rwlock. Lock it in read mode when used from subroutines
called from buffer release code paths.
The needsbuffer is now updated using atomics, while read lock of
nblock prevents loosing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().
In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly. This causes brelse() and bqrelse() from
different threads to content on the nblock. Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko
module' sources.
In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.
One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option. This is similar to
the option QUOTA and ufs_quota.c.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
bio_completed, only manage bio_resid, e.g. sa(4).
Reported and tested by: Manfred Antar <null@pozo.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the excess code in g_io_check(), bio_resid is also truncated by
g_io_deliver(). As result, bufdonebio() assigns truncated value to
the buffer b_resid field.
Use the residual bio_completed to calculate buffer b_resid from
b_bcount in bufdonebio(), instead of bio_resid, calculated from
bio_length in g_io_deliver().
The issue is seemingly caused by the code rearrange into g_io_check(),
which is not present in stable/10. The change still looks as the
useful change to have in 10 nevertheless.
Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).
Move mapping removal before calling bio_done() method. I believe that it is
very wrong to do anything to BIO after reporting completion. kib@ thinks
it was done for some forgotten now case when bio_done() method needed mapped
buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called
in the same way; 2) IMO there is no guatantee that buffer will be mapped at
this point at all, for example, if all underlying stack supports unmapped
I/O, so bio_done() handler can not expect that.