All I/O requests through the taste consumers are synchronous, done
with g_read_data() and without any locks held. It makes no sense
to delegate the I/O to g_down/g_up threads.
This removes many of context switches during disk retaste.
MFC after: 2 weeks
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.
This cleanup of the superblock eases the task of the commit
that immediately follows this one.
Suggested by: brooks
Reviewed by: kib
PR: 246983
Sponsored by: Netflix
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
While some geom layers pass unknown commands down, not all do. For the ones that
don't, pass BIO_SPEEDUP down to the providers that constittue the geom, as
applicable. No changes to vinum or virstor because I was unsure how to add this
support, and I'm also unsure how to test these. gvinum doesn't implement
BIO_FLUSH either, so it may just be poorly maintained. gvirstor is for testing
and not supportig BIO_SPEEDUP is fine.
Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23183
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.
This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.
Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.
Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.
Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix
GEOM ELI may double ask the password during boot. Once at loader time, and
once at init time.
This happens due a module loading bug. By default GEOM ELI caches the
password in the kernel, but without the MODULE_VERSION annotation, the
kernel loads over the kernel module, even if the GEOM ELI was compiled into
the kernel. In this case, the newly loaded module
purges/invalidates/overwrites the GEOM ELI's password cache, which causes
the double asking.
MFC Note: There's a pc98 component to the original submission that is
omitted here due to pc98 removal in head. This part will need to be revived
upon MFC.
Reviewed by: imp
Submitted by: op
Obtained from: opBSD
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14992
to fix the memory leak that I introduced in r328426. Instead of
trying to clear up the possible memory leak in all the clients, I
ensure that it gets cleaned up in the source (e.g., ffs_sbget ensures
that memory is always freed if it returns an error).
The original change in r328426 was a bit sparse in its description.
So I am expanding on its description here (thanks cem@ and rgrimes@
for your encouragement for my longer commit messages).
In preparation for adding check hashing to superblocks, r328426 is
a refactoring of the code to get the reading/writing of the superblock
into one place. Unlike the cylinder group reading/writing which
ends up in two places (ffs_getcg/ffs_geom_strategy in the kernel
and cgget/cgput in libufs), I have the core superblock functions
just in the kernel (ffs_sbfetch/ffs_sbput in ffs_subr.c which is
already imported into utilities like fsck_ffs as well as libufs to
implement sbget/sbput). The ffs_sbfetch and ffs_sbput functions
take a function pointer to do the actual I/O for which there are
four variants:
ffs_use_bread / ffs_use_bwrite for the in-kernel filesystem
g_use_g_read_data / g_use_g_write_data for kernel geom clients
ufs_use_sa_read for the standalone code (stand/libsa/ufs.c
but not stand/libsa/ufsread.c which is size constrained)
use_pread / use_pwrite for libufs
Uses of these interfaces are in the UFS filesystem, geoms journal &
label, libsa changes, and libufs. They also permeate out into the
filesystem utilities fsck_ffs, newfs, growfs, clri, dump, quotacheck,
fsirand, fstyp, and quot. Some of these utilities should probably be
converted to directly use libufs (like dumpfs was for example), but
there does not seem to be much win in doing so.
Tested by: Peter Holm (pho@)
ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.
Reported and tested by: pho
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D14390
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.
No functional change intended.
Reviewed by: kib
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
the g_journal level needs to check whether it is holding a newer
copy of the block than that which exists on the disk. If so, it
needs to return its copy. If not, it should pass the request down
to the disk to fulfill. It currently considers six queues:
0) delayed queue,
1) unsent (current queue),
2) in-flight to the journal (flush queue),
3) active journal (active queue),
4) inactive journal (inactive queue), and
5) inflight to the disk (copy queue).
Checking on two of these queues is unnecessary:
0) The delayed requests should not be used for reads because they
have not yet been entered into the journal, so their value should
reflect the disk contents, not the future contents that are not
yet committed.
2) Because all the bio's in the flush queue are also found on the
active queue, there is no need to inspect the flush queue for
reads since they will be found when searching the active queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
in the flush_queue:
1 2 3 4 5 6 7 8 9 10
and another 10 bio's go into the flush queue after only the first five
bio's are removed from the flush queue, the queue should look like:
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
but because of the bug we end up with
6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
So the sequence of the bio's is damaged in the flush queue (and
therefore in the journal on disk !). This error can be triggered by
ffs_snapshot() when a block is read with readblock() and gjournal finds
this block in the broken flush queue before it goes to the correct
active queue.
The fix is to place all new blocks at the end of the queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
system having over 4GB RAM. That's due to:
1) the limit being u_int instead of u_long like vm.kmem_size (the limit is
half of vm.kmem_size by default for amd64);
2) sysctl handler g_journal_cache_limit_sysctl() using u_int instead of u_long.
The fix is to replace u_int with u_long for the kern.geom.journal.cache.limit
sysctl variable.
PR: 198500
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Reported by: Eugene Grosbein
Discussed with: kib
MFC after: 1 week
Don't start switcher kproc until the first GEOM is created.
Reviewed by: pjd
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8576
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
going on in here than can be fixed, and I introduced some of my own. Rather
than fix the whole host of them, back out my bugs.
Found by: bde
X-MFC with: r259080
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.
Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.
Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.
Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.