Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).
Move mapping removal before calling bio_done() method. I believe that it is
very wrong to do anything to BIO after reporting completion. kib@ thinks
it was done for some forgotten now case when bio_done() method needed mapped
buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called
in the same way; 2) IMO there is no guatantee that buffer will be mapped at
this point at all, for example, if all underlying stack supports unmapped
I/O, so bio_done() handler can not expect that.
- Take BIO lock in biodone() only when there is no completion callback set
and so we should wake up thread waiting in biowait().
- Remove msleep() timeout from biowait(). It was added 11 years ago, when
there was no locks used, and it should not be needed any more.
called. This probably should be fixed eventually, but for now it is
not needed to try to flush such vnodes from the buffer allocation
context.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)
is no sense to walk the whole dirty buffer queue. We are only
interested in, and can operate on, the buffers owned by the current
vnode [1]. Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.
Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.
Discussed with: Jeff [1]
Reported by: pho [2]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)
The r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.
PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.
PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.
The race results in an invalidated page mapped wired by usermode.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)
run. After that, the pager put method is called, usually translated
to VOP_WRITE(). For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain. The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).
Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them. Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy. Switch to sbusy.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
shared busy without first draining the hard busy state. Previously it
went unnoticed since VPO_BUSY and m->busy fields were distinct, and
vm_page_io_start() did not verified that the passed page has VPO_BUSY
flag cleared, but such page state is wrong. New implementation is
more strict and catched this case.
Drain the busy state as needed, before calling vm_page_sbusy().
Tested by: pho, jkim
Sponsored by: The FreeBSD Foundation
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
error if any user wired mappings exist. Doing the invalidation
destroys the user wiring.
The change is the temporal measure to close the bug, the more proper
fix is to delegate the invalidation of the page to upper layers
always.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.
Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
- Split the bqlock into bqclean and bqdirty locks.
- Only acquire the wakeup synchronization locks when we cross a
threshold requiring them.
- Restructure the way flushbufqueues() targets work so they are more
smp friendly and sane.
Reviewed by: kib
Discussed with: mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
M vfs_bio.c
used as the estimation of size, to 32GB. This provides around 100K of
buffer headers and corresponding KVA for buffer map at the peak.
Sizing the cache larger is not useful, also resulting in the wasting
and exhausting of KVA for large machines.
Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.
Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf
but assumes that a thread reference was already obtained on the passed
device. Use the function from physio(), to avoid two extra dev_mtx
lock and unlock. Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.
dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().
Do some style cleanup in physio().
Requested and reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.
Tested by: David Wolfskill
Sponsored by: The FreeBSD Foundation
bufobj counter of the writes in progress is incremented. Other thread
inspecting the bufobj would consider it clean.
For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation. On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.
Increment the write ref counter for the buffer object before calling
bundirty().
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
buffer, transparently handling mapped or unmapped buffers. Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.
Sponsored by: The FreeBSD Foundation
Tested by: pho
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
cluster_write() and cluster_wbuild() functions. The flags to be
allowed are a subset of the GB_* flags for getblk().
Sponsored by: The FreeBSD Foundation
Tested by: pho
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
intact if getblk() is done on the already owned buffer. Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.
Sponsored by: The FreeBSD Foundation
Noted by: mckusick
Tested by: pho
MFC after: 2 weeks
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceeded it before any future blocks may be written to the drive.
Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).
Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.
Reviewed by: kib
Tested by: Peter Holm
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.
Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week
call. The function indicates a failure by the TRUE return value. To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.
Fix style issues in the nearby lines, reformulate the comment.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.
The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.
The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.
A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.
Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib
Approved by: re (ksmith)
MFC after: 10 days