check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.
Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.
Specifics of the changes:
sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.
sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.
sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.
sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.
sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.
sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.
sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.
Four utilities are affected:
sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.
sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.
sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)
sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)
sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.
Reviewed by: kib
Tested by: Peter Holm (pho)
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.
For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.
The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.
MFC after: 1 week
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().
Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926
Atomic updates to v_wire_count are a significant source of contention, so
combine multiple updates into one in this easy case. Also remove an old
printf that gets executed if the page is shared-busied, which is a case
that will lead to a panic anyway.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11791
By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1Mbyte
for the NFS client when mounting the AmazonEFS file service.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10991
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
This fixes some panics after disconnecting mounted disks.
Submitted by: imp (slightly different version, which I've then lost)
Reviewed by: kib, imp, mckusick
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9674
biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.
Reviewed by: mav
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9070
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.
So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.
Discussed with: kib, bsdimp, juli
Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.
For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.
Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366
which also use buffer cache.
Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.
More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.
Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
It was added in r153192 for XFS and doesn't appear to have been used for
anything else. XFS was disconnected in r241607 and removed entirely in
r247631.
Reported by: mlaier
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D7468
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.
Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.
To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.
Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.
Reviewed by: kib
Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.
This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.
Tested:
* Carambola2, 64MB (and 32MB by manual config.)
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.
Reviewed by: mckusick@, kib@
8x performance improvement in a micro benchmark on a 4 socket machine.
- Get buffer headers from a per-cpu uma cache that sits in from of the
free queue.
- Use a per-cpu quantum cache in vmem to eliminate contention for kva.
- Use multiple clean queues according to buffer cache size to eliminate
clean queue lock contention.
- Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
from blocking or doing direct recycling.
- Close some bufspace allocation races that could lead to endless
recycling.
- Further the transition to a more modern style of small functions grouped
by prefix in order to improve growing complexity.
Sponsored by: EMC / Isilon
Reviewed by: kib
Tested by: pho
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.
This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
- Eliminate the special B_ASYNC handling on free that has not been
relevant for some time.
- Remove the extraneous page busy from vfs_vmio_truncate().
Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon storage division
- Eliminate bogus page replacement that is inconsistently applied in the
invalidation loop in brelse. This has been a no-op in modern times as
biodone() is responsible for cleaning up after bogus pages. This
would've spammed the console with printfs at a minimum.
- Allow the compiler and human readers alike to reason about allocbuf()
by splitting it into constituent parts.
- Separate the VM manipulating and buf manipulating code in brelse() and
bufdone() so that the intentions are clear. This makes it evident that
there are several duplicated buf pages loops that will be consolidated
at a later time.
Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.
Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.
(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
The check added in r285872 can trigger for valid buffers if the buffer space
used happens to be just after unmapped_buf in KVA space.
Discussed with: kib
Sponsored by: Citrix Systems R&D
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
attached to bufs to avoid the overhead of the vm. This purposes is now
better served by vmem. Freeing the kva immediately when a buf is
destroyed leads to lower fragmentation and a much simpler scan algorithm.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division