$FreeBSD$

For the lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------

Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called, and g_std_done()
increments bio_inbed on every call; when the two counters are equal,
g_std_done() calls g_io_deliver() on the parent bio.
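
In outline, the bookkeeping amounts to the following simplified
sketch of g_std_done() (see sys/geom/geom_io.c for the authoritative
version; error propagation is shown for context):

	void
	g_std_done(struct bio *bp)
	{
		struct bio *bp2 = bp->bio_parent;

		if (bp2->bio_error == 0)
			bp2->bio_error = bp->bio_error;
		bp2->bio_completed += bp->bio_completed;
		g_destroy_bio(bp);		/* child is finished */
		bp2->bio_inbed++;		/* one more child came in */
		if (bp2->bio_children == bp2->bio_inbed)
			g_io_deliver(bp2, bp2->bio_error);
	}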

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread, and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
so g_std_done() sees both counters equal to 1 and finishes off
the bio prematurely.

There is no race in the common case where the bio is split into
multiple parts in the class's start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread, and
the class below will not get its start method run until we return
from our start method, and consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is called only in the g_up thread.
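
A minimal sketch of the safe pattern (the two consumers cp1/cp2 and
the offset arithmetic are hypothetical; error handling for a failed
clone is omitted):

	struct bio *c1, *c2;

	/*
	 * Clone both children before requesting any I/O, so that
	 * bio_children has its final value before g_std_done() can
	 * compare it against bio_inbed.
	 */
	c1 = g_clone_bio(bp);
	c2 = g_clone_bio(bp);
	c1->bio_done = g_std_done;
	c2->bio_done = g_std_done;
	/* ... split bio_offset/bio_length between c1 and c2 ... */
	g_io_request(c1, cp1);
	g_io_request(c2, cp2);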

-----------------------------------------------------------------------

Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the numbers of transactions started and completed
are counted, and this is only because GEOM internally uses the
difference between the two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.  Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.
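
For example, full statistics collection is enabled with:

	sysctl kern.geom.collectstats=1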

The statistics collection falls into two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started, and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the
number of requests, the number of bytes, the number of ENOMEM errors,
the number of other errors, and the duration of the request, for
each of the three major request types: BIO_READ, BIO_WRITE and
BIO_DELETE.
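
For illustration only, the per-consumer and per-provider record
described above can be pictured like this (the field names are
invented here, not the actual layout in geom_stats.c):

	struct stat_sketch {
		uint64_t	nop;		/* number of requests */
		uint64_t	nbyte;		/* number of bytes */
		uint64_t	nmem_err;	/* ENOMEM errors */
		uint64_t	nerr;		/* other errors */
		struct bintime	dt;		/* duration of requests */
	};	/* one each for BIO_READ, BIO_WRITE and BIO_DELETE */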

The second part tries to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.

In g_io_deliver() we calculate the delta-T from wentbusy, add it
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which, through a device,
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.

-----------------------------------------------------------------------

maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set, there is no upper bound on the size of a request, and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the width of a stripe on a RAID-5 unit or
of one zone in GBDE.  The idea with this field is to hint to
clustering-type code not to trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes remain until
the end of the stripe as:

	stripesize - (53 * sectorsize + stripeoffset) % stripesize
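
For example, with 512-byte sectors, stripeoffset is 63 * 512 = 32256
and sector#53 of the slice starts 53 * 512 = 27136 bytes in, so:

	65536 - (27136 + 32256) % 65536 = 65536 - 59392 = 6144

i.e. 6144 bytes (12 sectors) remain before the request would cross
a stripe boundary.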