freebsd-nq/sys/geom/sched
Justin T. Gibbs f03f7a0ca3 Correct bioq_disksort so that bioq_insert_tail() offers barrier semantics.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

 o In bioq_disksort(), an added bio could be inserted at the head of
   the queue, even when a barrier was present, if the sort key for
   the new entry was less than that of the last queued barrier bio.

 o The last_offset used to generate the sort key for newly queued bios
   did not stay at the position of the barrier until either the
   barrier was de-queued, or a new barrier (which updates last_offset)
   was queued.  When a barrier is in effect, we know that the disk
   will pass through the barrier position just before the
   "blocked bios" are released, so using the barrier's offset for
   last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
	o Update last_offset in bioq_insert_tail().

	o Only update last_offset in bioq_remove() if the removed bio is
	  at the head of the queue (typically due to a call via
	  bioq_takefirst()) and no barrier is active.

	o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
	  set prev to the barrier and cur to its next element.  Now that
	  last_offset is kept at the barrier position, this change isn't
	  strictly necessary, but since we have to take a decision branch
	  anyway, it does avoid one no-op loop iteration in the while
	  loop that immediately follows.

	o In bioq_disksort(), bypass the normal sort for bios with the
	  BIO_ORDERED attribute and instead insert them into the queue
	  with bioq_insert_tail().  bioq_insert_tail() not only gives
	  the desired command order during insertion, but also provides
	  barrier semantics so that commands disksorted in the future
	  cannot pass the just enqueued transaction.
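
A simplified user-space model of the behavior described above may help.
It is only a sketch under stated assumptions, not the kernel code: the
names struct req, struct queue, insert_tail() and disksort() are
hypothetical stand-ins for struct bio, the bio queue head,
bioq_insert_tail() and bioq_disksort(), and removal from the queue is
left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bio: an offset plus an "ordered" flag. */
struct req {
	uint64_t	 offset;
	int		 ordered;	/* models BIO_ORDERED */
	struct req	*next;
};

/* Hypothetical stand-in for the bio queue head. */
struct queue {
	struct req	*head;
	struct req	*tail;
	struct req	*barrier;	/* models insert_point */
	uint64_t	 last_offset;
};

/*
 * Sort key: distance ahead of last_offset.  Unsigned wrap-around makes
 * offsets behind last_offset sort after every offset ahead of it.
 */
static uint64_t
sort_key(const struct queue *q, const struct req *r)
{
	return (r->offset - q->last_offset);
}

/* Append at the tail; the request becomes the active barrier. */
static void
insert_tail(struct queue *q, struct req *r)
{
	r->next = NULL;
	if (q->tail != NULL)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
	q->barrier = r;
	q->last_offset = r->offset;	/* keep last_offset at the barrier */
}

/* Sorted insert that never places a request ahead of the barrier. */
static void
disksort(struct queue *q, struct req *r)
{
	struct req *prev, *cur;
	uint64_t key;

	if (r->ordered) {
		insert_tail(q, r);
		return;
	}
	assert(q->head != NULL || q->barrier == NULL);
	key = sort_key(q, r);
	prev = q->barrier;		/* start the scan after the barrier */
	cur = (prev != NULL) ? prev->next : q->head;
	while (cur != NULL && key >= sort_key(q, cur)) {
		prev = cur;
		cur = cur->next;
	}
	r->next = cur;
	if (prev != NULL)
		prev->next = r;
	else
		q->head = r;
	if (cur == NULL)
		q->tail = r;
}
```

In this model, queueing offsets 100, 500 (ordered), 50 and 600 in that
order leaves the queue as 100, 500, 600, 50: nothing passes the ordered
request, and requests behind it are sorted relative to its offset.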

sys/sys/bio.h:
	Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c:
	Use an ordered command for SCSI/ATA-NCQ commands issued in
	response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c:
	Use an ordered tag when issuing a synchronize cache command.

	Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/geom/geom_io.c:
	Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by:	Spectra Logic Corporation
MFC after:	1 month
2010-09-02 19:40:28 +00:00
g_sched.c Check that gsp is not NULL before access. It can be NULL 2010-08-03 11:21:17 +00:00
g_sched.h fix copyright format, as requested by Joel Dahl 2010-04-13 09:56:17 +00:00
gs_rr.c fix copyright format, as requested by Joel Dahl 2010-04-13 09:56:17 +00:00
gs_scheduler.h fix copyright format, as requested by Joel Dahl 2010-04-13 09:56:17 +00:00
README
subr_disk.c Correct bioq_disksort so that bioq_insert_tail() offers barrier semantics. 2010-09-02 19:40:28 +00:00

	--- GEOM BASED DISK SCHEDULERS FOR FREEBSD ---

This code contains a framework for GEOM-based disk schedulers and a
couple of sample scheduling algorithms that use the framework and
implement two forms of "anticipatory scheduling" (see below for more
details).

As a quick example of what this code can give you, try running "dd",
"tar", or some other program with a highly SEQUENTIAL access pattern
together with "cvs", "cvsup", "svn" or another program with a highly
RANDOM access pattern.  This is not a made-up example: developers
commonly have one or more applications doing random accesses while
others do sequential accesses, e.g., loading large binaries from disk,
checking the integrity of tarballs, watching media streams and so on.

These are the results we get on a local machine (AMD BE2400 dual
core CPU, SATA 250GB disk):

    /mnt is a partition mounted on /dev/ad0s1f

    cvs: 	cvs -d /mnt/home/ncvs-local update -Pd /mnt/ports
    dd-read:	dd bs=128k of=/dev/null if=/dev/ad0 (or ad0.sched.)
    dd-write:	dd bs=128k if=/dev/zero of=/mnt/largefile

                        NO SCHEDULER            RR SCHEDULER
                        dd        cvs           dd        cvs

    dd-read only        72 MB/s   ---           72 MB/s   ---
    dd-write only       55 MB/s   ---           55 MB/s   ---
    dd-read+cvs          6 MB/s   ok            30 MB/s   ok
    dd-write+cvs        55 MB/s   slooow        14 MB/s   ok

As you can see, when cvs runs concurrently with dd, performance
drops dramatically and, depending on whether dd is reading or writing,
one of the two is severely penalized: without a scheduler the dd reader
falls from 72 MB/s to 6 MB/s, while a dd writer keeps its 55 MB/s but
starves cvs.  With the RR scheduler, the dd reader runs five times
faster (30 MB/s) when competing with cvs, and cvs makes progress when
competing with a writer, at the cost of writer throughput (14 MB/s).

To try it out:

1. USERS OF FREEBSD 7, PLEASE READ THE FOLLOWING CAREFULLY:

    On loading, this module patches one kernel function (g_io_request())
    so that I/O requests ("bio's") carry a classification tag, useful
    for scheduling purposes.

    ON FREEBSD 7, the tag is stored in an existing (though rarely used)
    field of the "struct bio", a solution which makes this module
    incompatible with other modules using it, such as ZFS and gjournal.
    Additionally, g_io_request() is patched in-memory to add a call
    to the function that initializes this field (i386/amd64 only;
    for other architectures you need to manually patch sys/geom/geom_io.c).
    See details in the file g_sched.c.

    On FreeBSD 8.0 and above, the above trick is not necessary,
    as the struct bio contains dedicated fields for the classifier,
    and hooks for request classifiers.

    If you don't like the above, don't run this code.
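
    The idea of a classification tag set through a hook can be
    pictured with the user-space sketch below.  Everything in it is
    hypothetical (struct io_request, classify_hook, tag_by_issuer,
    submit_request); it does not reproduce the fields of struct bio
    or the actual kernel hooks, and using the issuer's identity as
    the tag is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request: an offset plus a slot for the classification tag. */
struct io_request {
	long		 offset;
	const void	*class_tag;	/* NULL until a classifier sets it */
};

/* Hypothetical hook type: called once per request at submission time. */
typedef void (*classify_hook_t)(struct io_request *, const void *issuer);

/* NULL while no scheduler has registered a classifier. */
static classify_hook_t classify_hook;

/* Example classifier: tag the request with the identity of its issuer. */
static void
tag_by_issuer(struct io_request *req, const void *issuer)
{
	req->class_tag = issuer;
}

/*
 * Model of the submission path: give the classifier one chance to tag
 * the request before it is handed to the lower layers (not modelled).
 */
static void
submit_request(struct io_request *req, const void *issuer)
{
	assert(req != NULL);
	if (classify_hook != NULL && req->class_tag == NULL)
		classify_hook(req, issuer);
}
```

    A scheduler can then group requests by tag, for example to keep
    one queue per issuer; with no hook installed, requests simply
    pass through untagged.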

2. PLEASE MAKE SURE THAT THE DISK THAT YOU WILL BE USING FOR TESTS
   DOES NOT CONTAIN PRECIOUS DATA.
    This is experimental code, so we make no guarantees, though
    I am routinely using it on my desktop and laptop.

3. EXTRACT AND BUILD THE PROGRAMS
    A 'make install' in the directory should work (with root privs),
    or you can even try the binary modules.
    If you want to build the modules yourself, look at the Makefile.

4. LOAD THE MODULE, CREATE A GEOM NODE, RUN TESTS

    The scheduler's module must be loaded first:

      # kldload gsched_rr

    Substitute gsched_as to test the AS scheduler instead.  Then, supposing
    that you are using /dev/ad0 for testing, a scheduler can be attached
    to it with:

      # geom sched insert ad0

    The scheduler is inserted transparently in the geom chain, so
    mounted partitions and filesystems will keep working, but
    now requests will go through the scheduler.

    To change the scheduler on the fly, you can reconfigure the geom:

      # geom sched configure -a as ad0.sched.

    assuming that gsched_as was loaded previously.

5. SCHEDULER REMOVAL

    In principle it is possible to remove the scheduler module
    even on an active chain by doing

      # geom sched destroy ad0.sched.

    However, there is a race in the geom subsystem which makes
    the removal unsafe if there are active requests on a chain.
    So, to reduce the risk of data loss, make sure you do not
    remove a scheduler from a chain with ongoing transactions.

--- NOTES ON THE SCHEDULERS ---

The important contribution of this code is the framework to experiment
with different scheduling algorithms.  'Anticipatory scheduling'
is a very powerful technique based on the following reasoning:

    The disk throughput is much better if it serves sequential requests.
    If we have a mix of sequential and random requests, and we see a
    non-sequential request, do not serve it immediately but instead wait
    a little bit (2..5ms) to see if there is another one coming that
    the disk can serve more efficiently.

There are many details that should be added to make sure that the
mechanism is effective with different workloads and systems, to
gain a few extra percent in performance, to improve fairness and
isolation among processes, etc.  A discussion of the vast literature
on the subject is beyond the scope of this short note.

--------------------------------------------------------------------------

TRANSPARENT INSERT/DELETE

geom_sched is an ordinary geom module; however, it is convenient
to plug it transparently into the geom graph, so that one can
enable or disable scheduling on a mounted filesystem, and the
names in /etc/fstab do not depend on the presence of the scheduler.

To understand how this works in practice, remember that in GEOM
we have "providers" and "geom" objects.
Say that we want to hook a scheduler on provider "ad0",
accessible through pointer 'pp'.  Originally, pp is attached to
geom "ad0" (same name, different object), accessible through
pointer 'old_gp':

  BEFORE	---> [ pp    --> old_gp ...]

A normal "geom sched create ad0" call would create a new geom node
on top of provider ad0/pp, and export a newly created provider
("ad0.sched." accessible through pointer newpp).

  AFTER create  ---> [ newpp --> gp --> cp ] ---> [ pp    --> old_gp ... ]

On top of newpp, a whole tree will be created automatically, and we
can e.g. mount partitions on /dev/ad0.sched.s1d, and those requests
will go through the scheduler, whereas any partition mounted on
the pre-existing device entries will not go through the scheduler.

With the transparent insert mechanism, the original provider "ad0"/pp
is hooked to the newly created geom, as follows:

  AFTER insert  ---> [ pp    --> gp --> cp ] ---> [ newpp --> old_gp ... ]

so anything that was previously using provider pp will now have
the requests routed through the scheduler node.

A removal ("geom sched destroy ad0.sched.") will restore the original
configuration.
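
The pointer swap behind the transparent insert can be modelled in
user space as follows.  This is a sketch under simplifying
assumptions, not the code in g_sched.c: the structures are
hypothetical, reduced to the links drawn in the diagrams above, and
everything else the kernel has to take care of is ignored.

```c
#include <assert.h>
#include <stddef.h>

struct geom;

/* A provider is served by exactly one geom. */
struct provider {
	const char	*name;
	struct geom	*geom;
};

/* A consumer sends requests to the provider it is attached to. */
struct consumer {
	struct provider	*provider;
};

/* A geom forwards requests through its consumer (NULL at the bottom). */
struct geom {
	const char	*name;
	struct consumer	*consumer;
};

/* AFTER insert: [ pp --> gp --> cp ] ---> [ newpp --> old_gp ... ] */
static void
sched_insert(struct geom *gp, struct consumer *cp, struct provider *newpp,
    struct provider *pp)
{
	struct geom *old_gp = pp->geom;

	assert(old_gp != NULL);
	gp->consumer = cp;
	pp->geom = gp;		/* users of pp now reach the scheduler */
	cp->provider = newpp;	/* the scheduler forwards to newpp */
	newpp->geom = old_gp;	/* newpp takes pp's old place */
}

/* Undo the swap and restore: [ pp --> old_gp ... ] */
static void
sched_remove(struct geom *gp, struct consumer *cp, struct provider *newpp,
    struct provider *pp)
{
	pp->geom = newpp->geom;
	newpp->geom = NULL;
	cp->provider = NULL;
	gp->consumer = NULL;
}

/* Next provider down the chain from p, or NULL at the bottom geom. */
static struct provider *
next_provider(const struct provider *p)
{
	if (p->geom == NULL || p->geom->consumer == NULL)
		return (NULL);
	return (p->geom->consumer->provider);
}
```

In this model nothing that holds a pointer to pp needs to change:
after sched_insert() a request entering at pp passes through gp and
cp before reaching old_gp, and sched_remove() puts pp back on old_gp.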

# $FreeBSD$