objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.
BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.
Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks
bufdaemon in getnewbuf(), do use buf_flush(). The difference is that
bufdaemon uses TRYLOCK to get buffer locks, which allows calls to
getnewbuf() while another buffer is locked.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.
Differential Revision: https://reviews.freebsd.org/D2330
Reviewed by: mav, kib
MFC after: 2 weeks
the vnode owning the buffer is not locked. More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.
As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object. Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock. Also, seeing stale pointer to the v_object is safe due to vm
object type stability.
Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.
While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load. I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.
Requested by: scottl@
MFC after: 3 days
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.
MFC after: 1 week
set bo_bsize on a bufobj.
This is a slight modification of the patch provided.
PR: 193146
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC Isilon Storage Division
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
flags, to rwlock. Lock it in read mode when used from subroutines
called from buffer release code paths.
The needsbuffer is now updated using atomics, while read lock of
nblock prevents loosing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().
In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly. This causes brelse() and bqrelse() from
different threads to content on the nblock. Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko
module' sources.
In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.
One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option. This is similar to
the option QUOTA and ufs_quota.c.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
bio_completed, only manage bio_resid, e.g. sa(4).
Reported and tested by: Manfred Antar <null@pozo.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the excess code in g_io_check(), bio_resid is also truncated by
g_io_deliver(). As result, bufdonebio() assigns truncated value to
the buffer b_resid field.
Use the residual bio_completed to calculate buffer b_resid from
b_bcount in bufdonebio(), instead of bio_resid, calculated from
bio_length in g_io_deliver().
The issue is seemingly caused by the code rearrange into g_io_check(),
which is not present in stable/10. The change still looks as the
useful change to have in 10 nevertheless.
Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).
Move mapping removal before calling bio_done() method. I believe that it is
very wrong to do anything to BIO after reporting completion. kib@ thinks
it was done for some forgotten now case when bio_done() method needed mapped
buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called
in the same way; 2) IMO there is no guatantee that buffer will be mapped at
this point at all, for example, if all underlying stack supports unmapped
I/O, so bio_done() handler can not expect that.
- Take BIO lock in biodone() only when there is no completion callback set
and so we should wake up thread waiting in biowait().
- Remove msleep() timeout from biowait(). It was added 11 years ago, when
there was no locks used, and it should not be needed any more.
called. This probably should be fixed eventually, but for now it is
not needed to try to flush such vnodes from the buffer allocation
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)
is no sense to walk the whole dirty buffer queue. We are only
interested in, and can operate on, the buffers owned by the current
vnode [1]. Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.
Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.
Discussed with: Jeff [1]
Reported by: pho [2]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)
The r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.
PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.
PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.
The race results in an invalidated page mapped wired by usermode.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)
run. After that, the pager put method is called, usually translated
to VOP_WRITE(). For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain. The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).
Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them. Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy. Switch to sbusy.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
shared busy without first draining the hard busy state. Previously it
went unnoticed since VPO_BUSY and m->busy fields were distinct, and
vm_page_io_start() did not verified that the passed page has VPO_BUSY
flag cleared, but such page state is wrong. New implementation is
more strict and catched this case.
Drain the busy state as needed, before calling vm_page_sbusy().
Tested by: pho, jkim
Sponsored by: The FreeBSD Foundation
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
error if any user wired mappings exist. Doing the invalidation
destroys the user wiring.
The change is the temporal measure to close the bug, the more proper
fix is to delegate the invalidation of the page to upper layers
always.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.
Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
- Split the bqlock into bqclean and bqdirty locks.
- Only acquire the wakeup synchronization locks when we cross a
threshold requiring them.
- Restructure the way flushbufqueues() targets work so they are more
smp friendly and sane.
Reviewed by: kib
Discussed with: mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
M vfs_bio.c
used as the estimation of size, to 32GB. This provides around 100K of
buffer headers and corresponding KVA for buffer map at the peak.
Sizing the cache larger is not useful, also resulting in the wasting
and exhausting of KVA for large machines.
Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.
Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf
but assumes that a thread reference was already obtained on the passed
device. Use the function from physio(), to avoid two extra dev_mtx
lock and unlock. Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.
dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().
Do some style cleanup in physio().
Requested and reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.
Tested by: David Wolfskill
Sponsored by: The FreeBSD Foundation
bufobj counter of the writes in progress is incremented. Other thread
inspecting the bufobj would consider it clean.
For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation. On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.
Increment the write ref counter for the buffer object before calling
bundirty().
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks