r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
While it works on nda, it fails on ada and/or da for at least zfs with a modify
after free issue on a trim BIO. Revert while I rework it to fix those devices.
React to the BIO_SPEED command in the cam io scheduler by completing
as successful BIO_DELETE commands that are pending, up to the length
passed down in the BIO_SPEEDUP cmomand. The length passed down is a
hint for how much space on the drive needs to be recovered. By
completing the BIO_DELETE comomands, this allows the upper layers to
allocate and write to the blocks that were about to be trimmed. Since
FreeBSD implements TRIMSs as advisory, we can eliminliminate them and
go directly to writing.
The biggest benefit from TRIMS coomes ffrom the drive being able t
ooptimize its free block pool inthe log run. There's little nto no
bene3efit in the shoort term. , sepeciall whn the trim is followed by
a write. Speedup lets us make this tradeoff.
Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
Rather than a trim active flag, have a counter that can be used to
have a absolute limit on the number of trims in flight independent of
any I/O limiting factors.
Sponsored by: Netflix
For each of the different queue types, list the name of the
queue. While it can be worked out from context, this makes it more
useful and clearer.
Sponsored by: Netflix
Add rate limiters to trims. Trims are a bit different than reads or
writes in that they can be combined, so some care needs to be taken
where we rate limit them. Additional work will be needed to push the
working rate limit below the I/O quanta rate for things like IOPS.
Sponsored by: Netflix
Add the ability to set two goals for trims in the I/O scheduler. The
first goal is the number of BIO_DELETEs to accumulate
(kern.cam.XX.U.trim_goal). When non-zero, this many trims will be
accumulated before we start to transfer them to lower layers. This is
useful for devices that like to get lots of trims all at once in one
transaction (not all devices are like this, and some vary by workload).
The second is a number of ticks to defer trims. If you've set a trim
goal, then kern.cam.XX.U.trim_ticks controls how long the system will
defer those trims before timing out and sending them anyway. It has no
effect when trim_goal is 0.
In any event, a BIO_FLUSH will cause all the TRIMs to be released to
the periph drivers. This may be a minor overloading of what BIO_FLUSH
is supposed to mean, but it's useful to preserve other ordering
semantics that users of BIO_FLUSH reply on.
Sponsored by: Netflix, Inc
It's often useful to have a callback when an I/O takes more than a
threshold amount of time. This adds the infrastructure for periph
devices to register one.
One use-case is as a debugging aide when you need a semi-realtime
indication of an I/O outlier so you can trigger bus capture gear for
vendor analysis.
Sponsored by: Netflix, Inc
read bias so we do reads in preference to TRIMs. This helps a lot when
many trims are delivered at once from the upper layers as they tend to
delay READs due to priority inversion in the code today.
The non iosched case will be fixed when the trim comibing changes
needed for nvme come in later this year.
Sponsored by: Netflix
When the ccb is NULL to cam_iosched_bio_complete, just update the
other statistics, but not the time. If many operations are collapsed
together, this is needed to keep stats properly for the grouped bp.
This should fix trim accounting.
Sponsored by: Netflix
Introduce flags word to describe the capacities of the peripheral.
First bit will describe if the periph driver allows multiple
outstanding TRIMS to be active in a device.
Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
a I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slocts in the NVMe controller).
This mariginally helps the read latency issues we see with reads,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.
Sponsored by: Netflix
To help implement a policy of 'queue all trims until next I/O sched
tick' policy to help coalesce them, note when we tick so we can do
something special on the first call after the tick to get more work.
Sponsored by: Netflix
While the code for ada and da both assume that the trim list is
ordered when doing the coaleascing the TRIMs, it turns out that
creating the sorted list uses more resources than are saved by having
slightly fewer trims sent to the device.
Sponsored by: Netflix
When limiting I/O, a value of 0 makes no sense as a limit. No progress
can be made. Trade the possibility that someone might be doing
something clever to achieve ultra-low I/O limits vs the damage of not
ever making progress on an I/O in favor of making progress. Now the
machine won't be useless if this accidentally gets requested.
Sponsored by: Netflix
Prevent cam_iosched_iops_tick() from discarding 'unspent' ios unless
it's a new accounting interval.
Previously ios that weren't used between ticks were lost, as a result
the iops limiter could enforce a limit below the configured maximum.
Obtained from: ElectroBSD
Submitted by: Fabian Keil
PR: 221974
Previously the iops limiter would always allow at least
quanta ios per second as cam_iosched_iops_tick() never set
ios->l_value1 below 1.
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
PR: 221974
Previously ios->current was set to 0 until the first
cam_iosched_cl_maybe_steer() call.
PR: 221954
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12349
Previously callout_reset() was called with a "ticks" value that was
off by one. As a result cam_iosched_ticker() was called a bit too
frequently: On systems with hz=1000 a quanta value of 200 resulted in
~250 calls and a value of 100 in ~111 calls.
For the "queue_depth" and "bandwidth" limiters the difference doesn't
matter but the "iops" limiter depends on the scheduling to enforce the
correct maximum.
PR: 221956
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12350
Invalid values can result in devision-by-zero panics or other
undefined behaviour so lets not allow them.
PR: 221957
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12351
Use the write queue for BIO_ZONE commands so they can't get executed
ahead of writes that were sent after them. More generally, since they
introduce strong ordering into the list, they need to go to the write
queue (which is the only queue that BIO_ORDERED is honored for at the
moment). In fact, fix mismatch between queueing and dequeueing code by
changing this to queue all non-reads (and non-trims) to the write
queue.
As a side effect this prevents the kernel message:
kernel: Found bio_cmd = 0x9
which cam_iosched_next_bio() emits when finding commands
other than BIO_READ in the read queue.
PR: 221973
Obtained from: ElectroBSD
Submitted by: Fabian Keil
Differential Revision: https://reviews.freebsd.org/D12353
It's intended only for those situations where the periph driver
ones to limit the number of trims active to one and only one.
Also update comments on associated functions.
Sponsored by: Netflix
The cam_iosched_ticker() can't be scheduled more than once per tick.
Some limiters depend on quanta matching the number of calls per second
to enforce the proper limits. Limit the quanta to no faster than 1 per
clock tick. This fixes some features when running in VMs where the
default HZ is 100.
PR: 221953
Obtained from: ElectroBSD
Differential Revision: https://reviews.freebsd.org/D12337
Submitted by: Fabian Keil
extreme outliers from dodgy drives. Adjust comments to reflect this,
and make sure that the number of latency buckets match in the two
places where it matters.
o Allow I/O scheduler to gather times on 32-bit systems. We do this by shifting
the sbintime_t over by 8 bits and truncating to 32-bits. This gives us 8.24
time. This is sufficient both in range (256 seconds is about 128x what current
users need) and precision (60ns easily meets the 1ms smallest bucket size
measurements). 64-bit systems are unchanged. Centralize all the time math so
it's easy to tweak tha range / precision tradeoffs in the future.
o While I'm here, the I/O scheduler should be using periph_data rather than
sim_data since it is operating on behalf of the periph.
Differential Review: https://reviews.freebsd.org/D12119
we have queued up normaliazed to the queue size. Also compute buckets
of latency to help compute, in userland, estimates of Median, P90, P95
and P99 values.
Sponsored by: Netflix, Inc