Add the ability to set two goals for trims in the I/O scheduler. The
first goal is the number of BIO_DELETEs to accumulate
(kern.cam.XX.U.trim_goal). When non-zero, this many trims will be
accumulated before we start to transfer them to lower layers. This is
useful for devices that like to get lots of trims all at once in one
transaction (not all devices are like this, and some vary by workload).
The second is a number of ticks to defer trims. If you've set a trim
goal, then kern.cam.XX.U.trim_ticks controls how long the system will
defer those trims before timing out and sending them anyway. It has no
effect when trim_goal is 0.
In any event, a BIO_FLUSH will cause all the TRIMs to be released to
the periph drivers. This may be a minor overloading of what BIO_FLUSH
is supposed to mean, but it's useful to preserve other ordering
semantics that users of BIO_FLUSH rely on.
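Roughly, the accumulation policy looks like the sketch below. This is only
an illustration of the rules described above; the structure, field and
function names are made up, not the actual cam_iosched.c code.

#include <sys/param.h>
#include <sys/kernel.h>         /* for the global "ticks" counter */
#include <sys/bio.h>

/* Illustrative state, not the real cam_iosched softc layout. */
struct iosched_trim_state {
        int trim_goal;          /* kern.cam.XX.U.trim_goal */
        int trim_ticks;         /* kern.cam.XX.U.trim_ticks */
        int queued_trims;       /* BIO_DELETEs currently held back */
        int first_trim_tick;    /* "ticks" when the oldest queued trim arrived */
};

/* Should the queued trims be released to the periph driver now? */
static bool
iosched_trims_ready(struct iosched_trim_state *ts, bool saw_flush)
{
        if (ts->trim_goal == 0)
                return (true);  /* feature disabled: never hold trims */
        if (saw_flush)
                return (true);  /* BIO_FLUSH releases everything queued */
        if (ts->queued_trims >= ts->trim_goal)
                return (true);  /* accumulation goal reached */
        if (ts->trim_ticks != 0 &&
            ticks - ts->first_trim_tick >= ts->trim_ticks)
                return (true);  /* deferral timed out: send them anyway */
        return (false);         /* keep accumulating */
}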
Sponsored by: Netflix, Inc
It's often useful to have a callback when an I/O takes more than a
threshold amount of time. This adds the infrastructure for periph
devices to register one.
One use-case is as a debugging aid when you need a semi-realtime
indication of an I/O outlier so you can trigger bus capture gear for
vendor analysis.
Sponsored by: Netflix, Inc
Add a read bias so we do reads in preference to TRIMs. This helps a lot when
many trims are delivered at once from the upper layers as they tend to
delay READs due to priority inversion in the code today.
The non-iosched case will be fixed when the trim combining changes
needed for nvme come in later this year.
Sponsored by: Netflix
When the ccb passed to cam_iosched_bio_complete() is NULL, just update
the other statistics, but not the time. When many operations are
collapsed together, this is needed to keep the stats correct for the
grouped bp.
This should fix trim accounting.
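In sketch form (illustrative fields, with a plain timestamp standing in
for the ccb; this is not the actual cam_iosched_bio_complete() body):

#include <sys/types.h>
#include <sys/time.h>
#include <sys/bio.h>

/* Illustrative stats block, not the real cam_iosched structures. */
struct iosched_stats {
        long ops;               /* completed operations */
        long bytes;             /* completed bytes */
        sbintime_t total_lat;   /* summed latency for time-based stats */
};

static void
iosched_bio_complete_sketch(struct iosched_stats *st, struct bio *bp,
    const sbintime_t *start)    /* NULL plays the role of a NULL ccb */
{
        st->ops++;
        st->bytes += bp->bio_completed;
        /*
         * With no ccb there is no per-request start time: the bp was one
         * of several collapsed into a single request, so count it but
         * leave the latency statistics alone.
         */
        if (start != NULL)
                st->total_lat += sbinuptime() - *start;
}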
Sponsored by: Netflix
Introduce a flags word to describe the capabilities of the peripheral.
The first bit describes whether the periph driver allows multiple
outstanding TRIMs to be active on a device.
Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
an I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slots in the NVMe controller).
This marginally helps the read latency issues we see,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.
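A sketch of both pieces together (the flag name, structure and fields are
illustrative, not the actual interface):

#include <sys/param.h>

/* Illustrative capability bit supplied by the periph driver. */
#define IOSCHED_PERIPH_MULTI_TRIM       0x01    /* >1 outstanding trim allowed */

struct iosched_trim_sched {
        uint32_t periph_flags;  /* capability word from the periph */
        bool saw_tick;          /* an I/O scheduler tick has occurred */
        int queued_trims;       /* trims held since the last tick */
        int trims_active;       /* trims currently at the device */
        int max_trims;          /* e.g. free slots in the NVMe controller */
};

/* How many of the queued trims may be handed to the periph driver now? */
static int
iosched_trims_to_schedule(const struct iosched_trim_sched *ts)
{
        int allowed;

        if (!ts->saw_tick)
                return (0);     /* keep queueing until the next tick */
        if ((ts->periph_flags & IOSCHED_PERIPH_MULTI_TRIM) == 0)
                allowed = ts->trims_active == 0 ? 1 : 0;        /* one at a time */
        else
                allowed = ts->max_trims - ts->trims_active;     /* controller slots */
        return (allowed < ts->queued_trims ? allowed : ts->queued_trims);
}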
Sponsored by: Netflix
To help implement a 'queue all trims until the next I/O scheduler
tick' policy that coalesces them, note when we tick so we can do
something special on the first call after the tick to get more work.
Sponsored by: Netflix
While the code for both ada and da assumes that the trim list is
ordered when coalescing TRIMs, it turns out that creating the sorted
list uses more resources than are saved by sending slightly fewer
trims to the device.
Sponsored by: Netflix
When limiting I/O, a value of 0 makes no sense as a limit: no progress
can be made. Trade the possibility that someone might be doing
something clever to achieve ultra-low I/O limits against the damage of
never making progress on an I/O, and come down in favor of making
progress. Now the machine won't be useless if this accidentally gets
requested.
Sponsored by: Netflix
Prevent cam_iosched_iops_tick() from discarding 'unspent' ios unless
it's a new accounting interval.
Previously, ios that weren't used between ticks were lost; as a
result, the iops limiter could enforce a limit below the configured
maximum.
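In sketch form (the l_value1 name appears in these commit messages;
everything else is illustrative), the intended behaviour is:

/*
 * Each ticker call grants roughly max/quanta I/Os.  Grants left unspent
 * by earlier ticks are carried forward, and only dropped when a new
 * one-second accounting interval starts.
 */
struct iops_limiter {
        int max;                /* configured IOPS limit */
        int quanta;             /* ticker calls per second */
        int total_ticks;        /* ever-increasing tick counter */
        int l_value1;           /* I/Os the limiter will still admit */
};

static void
iops_tick_sketch(struct iops_limiter *ios)
{
        int grant = ios->max / ios->quanta;

        if (ios->total_ticks % ios->quanta == 0)
                ios->l_value1 = grant;          /* new interval: start over */
        else
                ios->l_value1 += grant;         /* carry unspent I/Os forward */
        ios->total_ticks++;
}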
Obtained from: ElectroBSD
Submitted by: Fabian Keil
PR: 221974
Previously the iops limiter would always allow at least quanta ios
per second, since cam_iosched_iops_tick() runs quanta times per second
and never set ios->l_value1 below 1.
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
PR: 221974
Previously ios->current was set to 0 until the first
cam_iosched_cl_maybe_steer() call.
PR: 221954
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12349
Previously callout_reset() was called with a "ticks" value that was
off by one. As a result cam_iosched_ticker() was called a bit too
frequently: on systems with hz=1000, a quanta value of 200 resulted in
~250 calls per second and a value of 100 in ~111 calls (the intended
periods of hz/quanta = 5 and 10 ticks became 4 and 9, giving 1000/4
and 1000/9 calls).
For the "queue_depth" and "bandwidth" limiters the difference doesn't
matter but the "iops" limiter depends on the scheduling to enforce the
correct maximum.
PR: 221956
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12350
Invalid values can result in division-by-zero panics or other
undefined behaviour, so let's not allow them.
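For illustration, a range check of this kind can live in a standard
sysctl handler; which value is being checked, the handler name and the
1..hz bounds below are assumptions, not the exact checks added here:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static int
iosched_quanta_sysctl(SYSCTL_HANDLER_ARGS)
{
        int error, value;

        value = *(int *)arg1;
        error = sysctl_handle_int(oidp, &value, 0, req);
        if (error != 0 || req->newptr == NULL)
                return (error);
        /* Reject values that code dividing by this setting could choke on. */
        if (value < 1 || value > hz)
                return (EINVAL);
        *(int *)arg1 = value;
        return (0);
}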
PR: 221957
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12351
Use the write queue for BIO_ZONE commands so they can't get executed
ahead of writes that were sent after them. More generally, since they
introduce strong ordering into the list, they need to go to the write
queue (which is the only queue that BIO_ORDERED is honored for at the
moment). In fact, fix a mismatch between the queueing and dequeueing
code by changing this to queue all non-reads (and non-trims) to the write
queue.
As a side effect this prevents the kernel message:
kernel: Found bio_cmd = 0x9
which cam_iosched_next_bio() emits when finding commands
other than BIO_READ in the read queue.
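The queueing rule boils down to something like the sketch below (the
names are illustrative, not the actual cam_iosched.c identifiers):

#include <sys/param.h>
#include <sys/bio.h>

enum iosched_queue { IOSCHED_Q_READ, IOSCHED_Q_TRIM, IOSCHED_Q_WRITE };

static enum iosched_queue
iosched_pick_queue(const struct bio *bp)
{
        switch (bp->bio_cmd) {
        case BIO_READ:
                return (IOSCHED_Q_READ);
        case BIO_DELETE:
                return (IOSCHED_Q_TRIM);        /* trims accumulate separately */
        default:
                /*
                 * Everything else (BIO_WRITE, BIO_FLUSH, BIO_ZONE, ...) goes
                 * to the write queue, the only queue where BIO_ORDERED is
                 * honored, so ordered commands keep their ordering and
                 * nothing unexpected shows up on the read queue.
                 */
                return (IOSCHED_Q_WRITE);
        }
}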
PR: 221973
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12353
It's intended only for those situations where the periph driver
wants to limit the number of trims active to one and only one.
Also update comments on associated functions.
Sponsored by: Netflix
cam_iosched_ticker() can't be scheduled more than once per clock tick.
Some limiters depend on quanta matching the number of calls per second
to enforce the proper limits, so cap quanta at one call per clock tick
(i.e. at hz). This fixes some features when running in VMs where the
default HZ is 100.
PR: 221953
Obtained from: ElectroBSD
Differential Revision: https://reviews.freebsd.org/D12337
Submitted by: Fabian Keil
Track latencies out far enough to capture extreme outliers from dodgy
drives. Adjust comments to reflect this, and make sure that the number
of latency buckets matches in the two
places where it matters.
o Allow I/O scheduler to gather times on 32-bit systems. We do this by shifting
the sbintime_t over by 8 bits and truncating to 32-bits. This gives us 8.24
time. This is sufficient both in range (256 seconds is about 128x what current
users need) and precision (60ns easily meets the 1ms smallest bucket size
measurements). 64-bit systems are unchanged. Centralize all the time math so
it's easy to tweak the range / precision tradeoffs in the future (see the
sketch below).
o While I'm here, the I/O scheduler should be using periph_data rather than
sim_data since it is operating on behalf of the periph.
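A sketch of the 8.24 packing (the helper names are made up; the
shift-by-8 and 32-bit truncation follow the description above):

#include <stdint.h>

/*
 * sbintime_t is 32.32 fixed-point seconds.  Dropping 8 low-order
 * fractional bits and truncating to 32 bits leaves 8 integer bits
 * (a 256 second range) and 24 fractional bits (2^-24 s, roughly 60 ns
 * of precision).
 */
static inline uint32_t
isched_sbt_to_8_24(int64_t sbt)         /* sbt: an sbintime_t value */
{
        return ((uint32_t)(sbt >> 8));
}

static inline int64_t
isched_8_24_to_sbt(uint32_t t)
{
        return ((int64_t)t << 8);
}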
Differential Revision: https://reviews.freebsd.org/D12119
we have queued up, normalized to the queue size. Also compute buckets
of latency to help estimate, in userland, the median, P90, P95
and P99 values.
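A sketch of the bucketing (the bucket count and doubling boundaries are
assumptions, apart from the 1ms smallest bucket size mentioned earlier):

#include <stdint.h>

#define LAT_BUCKETS     15      /* illustrative count */

/*
 * Bucket 0 counts I/Os that completed in under 1 ms; each later bucket
 * doubles the boundary, and the last one catches everything slower.
 * Userland can walk the counters to estimate the median, P90, P95 and P99.
 */
static void
isched_record_latency(uint64_t buckets[LAT_BUCKETS], uint64_t latency_ns)
{
        uint64_t bound = 1000000;       /* 1 ms in nanoseconds */
        int i;

        for (i = 0; i < LAT_BUCKETS - 1; i++) {
                if (latency_ns < bound)
                        break;
                bound <<= 1;
        }
        buckets[i]++;
}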
Sponsored by: Netflix, Inc