Previous code closed and detached consumer even with I/O still in progress.
This patch adds locking and request counting to postpone the close till
the last of running requests completes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Change the "count_until_fail" option of gnop, now it enables the failing
rating instead of setting them to 100%.
The original patch introduced the new flag, which sets the fail/rate to 100%
after N requests. In some cases, we don't want to have 100% of failure
probabilities. We want to start failing at some point.
For example, on the early stage, we may like to allow some read/writes requests
before having some requests delayed - when we try to mount the partition,
or when we are trying to import the pool.
Another case may be to check how scrub in ZFS will behave on different stages.
This allows us to cover more cases.
The previous behavior still may be configured.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22632
Thanks to this option we can create more then one gnop provider from
single provider. This may be useful for temporary labeling some data
on the disk.
Reviewed by: markj, allanjude, bcr
Differential Revision: https://reviews.freebsd.org/D22304
Previous code closed and destroyed consumer even with I/O in progress.
This patch postpones the destruction till the last close, identical to
GEOM_STRIPE, since they seem to have common origin.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previous code closed and destroyed direct read consumer even with I/O still
in progress. This patch adds locking and request counting to postpone the
close till the last of running requests completes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previous code destroyed softc even with provider still open, that resulted
in panic under load. This change postpones the free till the final close,
when we know for sure there will be no more I/O requests.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
gvinum was the only GEOM class, using consumer nstart/nend fields. Making
it do its own accounting for orphanization purposes allows in perspective
to remove burden of that expensive for SMP accounting from GEOM.
Also the previous implementation spinned in a tight event loop, waiting
for all active BIOs to complete, while the new one knows exactly when it
is possible to close the consumer.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Add BIO_SPEEDUP bio command and g_io_speedup wrapper. It tells the
lower layers that the upper layers are dealing with some shortage
(dirty pages and/or disk blocks). The lower layers should do what they
can to speed up anything that's been delayed.
The first use will be to tell the CAM I/O scheduler that any TRIM
shaping should be short-circuited because the system needs
blocks. We'll also call it when there's too many resources used by
UFS.
Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
to specify an optional separator to insert before partition name;
eg if it's set to "c/", you'll get "ada0c/s1" instead of "ada0s1".
(It cannot be set to just “/“, since ada0 is a device node, not
a directory.)
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D22193
pp->private just can not be NULL in those places.
In g_disk_start() and g_disk_ioctl() both dp != NULL and !dp->d_destroyed
should always be true if disk_gone() and disk_destroy() are used properly,
since GEOM does not send requests to errored providers. If the protocol is
not followed, then no amount of additional checks here give real safety.
In g_disk_access() though the checks are useful, since GEOM blocks only
new opens for errored providers, but allows closes. It should not happen
if disk_gone() and disk_destroy() are used properly, but may otherwise.
To improve cases when disk_gone() is not used, call it from disk_destroy().
It does not give full guaranties, but it errors the provider and makes
GEOM block unwanted requests at least after some race.
MFC after: 2 weeks
For normal I/Os consumer and provider statuses are checked by g_io_check().
But ioctl calls often do not go through it, being dispatched directly. This
change makes their semantics more alike, protecting lower levels.
MFC after: 2 weeks
Add ATF tests for most gmultipath operations. Add some dtrace probes too,
primarily for configuration changes that happen in response to provider
errors.
PR: 178473
MFC after: 2 weeks
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D22235
In most cases with debug disabled this function does nothing, but argument
passing and the call still cost measurable time due to cache misses, etc.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It closes the race condition and so allows to remove few NULL checks.
Also while there, use dev->si_drv1 in addition to cp->private to store
softc pointer. For calls coming from the dev side it gives reliable cache
hit instead of often miss before.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
to geom, and nothing we call requires it to be held. It's left over
from a time when the latter wasn't the case. Retire it.
Reviewed in concept: scottl@
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, eg you can use it on a partition
instead of disk itself.
Reviewed by: allanjude, imp
Sponsored by: Klara Inc
Differential Revision: https://reviews.freebsd.org/D22249
filling in the same defaults that the current userland module uses.
This allows an old geom_nop.so userland module to work with a new kernel.
Approved by: imp (mentor)
Reviewed by: cem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21972
the request. It is the same as gctl_get_paraml() except that the request
is not marked with an error if the parameter is not present.
Approved by: imp (mentor)
Reviewed by: cem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21972
I/O requests after the given number have been allowed though.
Approved by: imp (mentor)
Reviewed by: rpokala kib 0mp mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21593
GEOM is supposed to be topology-agnostic, but the GPT and BSD partition code
has arbitrary restrictions on nesting that are annoying in cases such as
running VMs on raw partitions (since the VM's partitioning scheme is not
visible to the host).
This patch adds sysctls to disable the restrictions except in the case of
BSD label (and similar) partitions with offset 0 (where we need to avoid
recursively recognizing the label).
Submitted by: Andrew Gierth
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21350
The Zstd format bumps the CLOOP major number to 4 to avoid incompatibility
with older systems. Support in geom_uzip(4) is conditional on the ZSTDIO
kernel option, which is enabled in amd64 GENERIC, but not all in-tree
configurations.
mkuzip(8) was modified slightly to always initialize the nblocks + 1'th
offset in the CLOOP file format. Previously, it was only initialized in the
case where the final compressed block happened to be unaligned w.r.t.
DEV_BSIZE. The "Fake" last+1 block change in r298619 means that the final
compressed block's 'blen' was never correct unless the compressed uzip image
happened to be BSIZE-aligned. This happened in about 1 out of every 512
cases. The zlib and lzma decompressors are probably tolerant of extra trash
following the frame they were told to decode, but Zstd complains that the
input size is incorrect.
Correspondingly, geom_uzip(4) was modified slightly to avoid trashing the
nblocks + 1'th offset when it is known to be initialized to a good value.
This corrects the calculated final real cluster compressed length to match
that printed by mkuzip(8).
mkuzip(8) was refactored somewhat to reduce code duplication and increase
ease of adding other compression formats.
* Input block size validation was pulled out of individual compression
init routines into main().
* Init routines now validate a user-provided compression level or select
an algorithm-specific default, if none was provided.
* A new interface for calculating the maximal compressed size of an
incompressible input block was added for each driver. The generic code
uses it to validate against MAXPHYS as well as to allocate compression
result buffers in the generic code.
* Algorithm selection is now driven by a table lookup, to increase ease of
adding other formats in the future.
mkuzip(8) gained the ability to explicitly specify a compression level with
'-C'. The prior defaults -- 9 for zlib and 6 for lzma -- are maintained.
The new zstd default is 9, to match zlib.
Rather than select lzma or zlib with '-L' or its absense, respectively, a
new argument '-A <algorithm>' is provided to select 'zlib', 'lzma', or
'zstd'. '-L' is considered deprecated, but will probably never be removed.
All of the new features were documented in mkuzip.8; the page was also
cleaned up slightly.
Relnotes: yes
Follow-up on r322318 and r322319 and remove the deprecated modules.
Shift some now-unused kernel files into userspace utilities that incorporate
them. Remove references to removed GEOM classes in userspace utilities.
Reviewed by: imp (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21249
- Use new zlib headers;
- Removed z_alloc and z_free to use the common sys/dev/zlib version.
- Replace z_compressBound with compressBound from zlib.
While there, limit LZMA CFLAGS to apply only for g_uzip_lzma.c.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp> (with changes,
bugs are mine)
Differential Revision: https://reviews.freebsd.org/D20271
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
This allows to simulated disk that is responding slowly to the IO requests.
Reviewed by: markj, bcr, pjd (previous version)
Differential Revision: https://reviews.freebsd.org/D21052
If g_mirror_taste encountered an error at g_mirror_add_disk, it might
try to g_mirror_destroy the device with the G_MIRROR_DEVICE_FLAG_TASTING
flag still set. This would wait on a worker to complete the destruction
with g_mirror_try_destroy, but that function bails out if the tasting
flag is set, resulting in a deadlock. Clear the tasting flag before
trying to destroy the device.
Test Plan:
sysctl debug.fail_point.mnowait="1%return"
kyua test -k /usr/tests/sys/geom/class/mirror/Kyuafile
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20744
NANDFS has been broken for years. Remove it. The NAND drivers that
remain are for ancient parts that are no longer relevant. They are
polled, have terrible performance and just for ancient arm
hardware. NAND parts have evolved significantly from this early work
and little to none of it would be relevant should someone need to
update to support raw nand. This code has been off by default for
years and has violated the vnode protocol leading to panics since it
was committed.
Numerous posts to arch@ and other locations have found no actual users
for this software.
Relnotes: Yes
No Objection From: arch@
Differential Revision: https://reviews.freebsd.org/D20745
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.
This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
operations already in its queue were not being properly drained.
The GEOM framework does the queue draining, but the module needs
to wait for the draining to happen. The waiting is done by adding
a g_nop_providergone() function to wait for the I/O operations to
finish up. This change is similar to change -r345758 made to the
memory-disk driver.
Submitted by: Chuck Silvers
Tested by: Chuck Silvers
MFC after: 1 week
Sponsored by: Netflix
- Triple DES has been formally deprecated in Kerberos (RFC 8429)
and is soon to be deprecated in IPsec (RFC 8221).
- Blowfish is deprecated. FreeBSD doesn't support its successor
(Twofish).
- MD5 is generally considered a weak digest that has known attacks.
geli refuses to create new volumes using these algorithms via 'geli
init'. It also warns when attaching to existing volumes or creating
temporary volumes via 'geli onetime' . The plan is to fully remove
support for these algorithms in FreeBSD 13.
Note that none of these algorithms have ever been the default
algorithm used by geli(8). Users would have had to explicitly select
these algorithms when creating volumes in the past.
Reviewed by: cem, delphij
MFC after: 3 days
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20344
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
Reviewed by: markj, scottl (earlier version)
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19996
There's a race between the initialization of devsoftc.mtx (by devinit)
and the creation of the geom worker thread g_run_events, which calls
devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
and SI_ORDER_FIRST, which means the geom worked thread can be created
before the mutex has been initialized, leading to the panic below:
wpanic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
cpuid = 3
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
panic() at panic+0x43/frame 0xfffffe003b9687c0
__mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100029 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why
Fix this by initializing geom at SI_ORDER_SECOND instead of
SI_ORDER_FIRST.
Sponsored by: Citrix Systems R&D
Reviewed by: kevans, markj
Differential revision: https://reviews.freebsd.org/D20148
destroy_dev_sched_cb() is excessively asynchronous, and during media change
retaste new provider may appear sooner then device of the previous one get
destroyed.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
provider grows, GELI will expand automatically and will move the metadata
to the new location of the last sector.
This functionality is turned on by default. It can be turned off with the
-R flag, but it is not recommended - if the underlying provider grows and
automatic expansion is turned off, it won't be possible to attach this
provider again, as the metadata is no longer located in the last sector.
If the automatic expansion is turned off and the underlying provider grows,
GELI will only log a message with the previous size of the provider, so
recovery can be easier.
Obtained from: Fudo Security
providers mediasize changes.
While here, use GEOM nomenclature to describe providers instead of calling
them device nodes.
Obtained from: Fudo Security
Tested in: AWS
While geom_flashmap has always supported label names for its slices, it does
so by appending "s.labelname" to the provider device name, meaning you still
have to know the name and unit of the hardware device to use the labels.
These changes add support for device-independent geom_flashmap labels, using
the standard geom_label infrastructure. geom_flashmap now creates a softc
struct attached to its geom, and as it creates slices it stores the label
into an array in the softc. The new geom_label_flashmap uses those labels
when tasting a geom_flashmap provider.
Differential Revision: https://reviews.freebsd.org/D19535
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are layed
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
PR: 229977
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19438