This makes crash recovery work for stripe sizes that are not multiples of
DEFAULT_REVIVE_BLOCKSIZE (currently 64 kB).
While we're here, fix a few cosmetic nits.
Reviewed by: grog
Sponsored by: Enitel ASA (http://www.enitel.no/)
1. Don't include <sys/conf.h> in userland. It is not used, and including it
without including its prerequisite <sys/time.h> should have broken the
world.
2. Don't include <sys/mount.h>. It is not used, except in -current it
bogusly includes <sys/stat.h> which bogusly includes <sys/time.h> and
thus accidentally provides the prerequisite in (1).
3. Cleaned up nearby include messes.
Not approved by despite 5 weeks notice: MAINTAINER
Add support for AMD RAID controllers as "disks".
Requested-by: Marius Bendiksen <mbendiks@eunet.no>
Remove potential panic when attempting to open non-existent drivers.
init_drive: Return error codes correctly. Previously it would
occasionally return 0. The error was redetected
elsewhere, but this was causing a number of confusing
error messages.
Fix several instances of breakage in RAID-5 revive code.
Tidy up code.
parityops:
Don't attempt to do anything if the plex is degraded or worse.
parityrebuild:
Add comments.
Perform transfers in correct length.
not gone yet.
format_config: print correct text when a volume has a preferred plex.
This is still broken, but not quite as badly.
Reported-by: Phil Regnauld <regnauld@ftf.net>
Change a rather silly comment.
revive_block: Correct bug introduced in revision 1.25 which caused
Add fields to vinum_ioctl_msgexcessive concurrent requests followed by
system death.
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
which seems to correspond better with what a busy plex needs. This
may also help us avoid race conditions when expanding the table which
may have been contributing to the random corruption, panics and hangs
we've been seeing in RAID-5 plexes, particularly with ata drives.
Eagerly-awaited-by: sos
Get counting volume I/Os right.
launch_requests: Be macho, throw away the safety net and walk the
tightrope with no splbio().
Add some comments explaining the smoke and mirrors.
Remove some redundant braces.
sdio: Set the state of an accessed but down subdisk correctly. This
appears to duplicate an earlier commit that I hadn't seen.
Get counting volume I/Os right.
Count buffer sizes correctly for architectures where ints are not 32 bits.
complete_rqe: Move decrementing active count until after call to
complete_raid5_write, thus possibly avoiding a race condition.
Suggested-by: dillon
Rename user bp to ubp to avoid confusion.
Tidy up comments.
request could be deallocated before the top half had finished
issuing it. The problem seems only to happen with IDE drives
and vn devices, but theoretically it could happen with any
drive. This is the most important part of a possible series
of fixes designed to remove race conditions without locking
out interrupts for longer than absolutely necessary.
Reported-by: sos
Fix-supplied-by: dillon
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
set properly in the struct buf with vinum:
Fix locations where B_READ was cleared in the old code but
b.b_iocmd wasn't set to BIO_WRITE
Fix propogation of b_iocmd
Correct comments to reflect reality
Don't compare b_flags with BIO_READ, it's in b_iocmd.
Submitted by: Bernd Walter <ticso@cicely.de>
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
transferred, do it in complete_rqe instead.
launch_requests: Replace the inadvertently removed splbio() around the
main loop. It may not be necessary, but the biggest
test of this stuff are IDE disks, which I'm not
using.
Remove throttling code, I'm pretty sure it's not
needed any more.
Don't set B_ORDERED, it's not necessary either.
Objected-to-by: alfred
build_rq_buffer: Don't lose the B_ORDERED bit, it still has some
residual meaning. To do this right, Vinum needs to
look at the B_ORDERED bit and order the transfer
across all disks involved. That's an exercise for
another day.
Objected-to-by: alfred
Implicitly-sanctioned-by: jkh
VINUM_BDEV_MAJOR and VINUM_CDEV_MAJOR respectively.
Set DRIVE_MAXACTIVE and VINUM_MAXACTIVE to 30000, effectively
disabling the request limitation code. This code was added as an
attempt to escape from a bug which seems to have gone away, and it's
very likely I'll remove the code Real Soon Now, but I don't want to do
it just yet.
struct drive: Remove references to vnode pointers, including debug
output. Vinum now talks directly to the device driver. Instead, add
a dev_t.
enum plexorg: Add an instance for RAID-4.
Change checks for striped or RAID-5 plexes to a macro 'isstriped',
which now also includes RAID-4.
Change checks for RAID-5 plexes to a macro 'isparity', which now also
includes RAID-4.
Approved-by: jkh
set_sd_state: update the state of a subdisk in a multi-plex volume
more correctly.
update_plex_state: Bring the plex up correctly when the last subdisk
comes up.
checksdstate: Update comments.
vpstate: Don't return an "up" state on a degraded, unattached plex.
start_object: Return a sensible error message when trying to revive a
subdisk whose drive is down. Previously it returned EBUSY.
Approved-by: jkh
data corruption. It's a wonder it worked at all.
Led-on-the-right-path-by: dillon
revive_block: Add treatment for RAID-4.
Add function parityrebuild, called by revive_block and parityops.
Approved-by: jkh
the tsleep call flags.
Submitted-by: Bernd Walter <ticso@cicely.de>
Remove references to vnode pointers, including debug output. Vinum
now talks directly to the device driver.
bre: Add case for RAID-4.
sdio: Don't try to write to a down drive. Set the sd state instead.
Approved-by: jkh
vn_open. This is necessary in order to be able to open drives before
the root file system is mounted. This also involves restructuring the
drive struct, which no longer contains a vnode pointer. Instead,
open_drive sets an open flag. It's a horrible kludge, and I'll gladly
borrow a Danish axe and hack it in little pieces when devfs comes.
read_drive, write_drive, drive_io_done: Replace with driveio. The
function names are now macros.
driveio: Fix horrible, embarrassing breakage which was the reason why
read_drive and write_drive existed in the first place.
Code-torn-to-shreds-by: dillon
format_config: Don't save config of objects in referenced state. They
get rebuilt automatically.
Change checks for striped or RAID-5 plexes to a macro 'isstriped',
which now also includes RAID-4.
Change checks for RAID-5 plexes to a macro 'isparity', which now also
includes RAID-4.
Replace the preprocessor variable names BDEV_MAJOR and CDEV_MAJOR with
VINUM_BDEV_MAJOR and VINUM_CDEV_MAJOR respectively.
vinum_scandisk: Don't free memory twice on error, once is enough.
Approved-by: jkh
to RAID-5. peter claims that it might be faster for sequential
reading, since the drive caches don't trip over the parity blocks. I
have seen no evidence to support this, but it's a trivial change.
Requested-by: peter
Change checks for striped or RAID-5 plexes to a macro 'isstriped',
which now also includes RAID-4.
Change checks for RAID-5 plexes to a macro 'isparity', which now also
includes RAID-4.
atoi(): Remove, nobody was talking to it.
give_sd_to_drive: If no space is available, make the subdisk down,
don't delete it.
Change the manner in which the subdisk count was maintained to avoid
cases where the count was not adjusted correctly.
config_drive: Check if we have subdisks referencing us, and add them
if so. This fixes problems which arose when a drive is replaced in a
running system.
config_sd: Add support for a keyword 'partition', whose meaning will
be revealed in the fullness of time.
Cosmetic: Shorten some console messages.
Approved-by: jkh
to SI_SUB_VINUM, thus making it possible for Vinum to access I/O
devices and start.
Replace the preprocessor variable names BDEV_MAJOR and CDEV_MAJOR with
VINUM_BDEV_MAJOR and VINUM_CDEV_MAJOR respectively.
Style fixes: replace NULL with 0 where appropriate.
Submitted-by: Charlie Root <root@sms-1.follo.net> (yup, that's all I
have to go on).
Approved-by: jkh
on alpha.
Submitted-by: Bernd Walter <ticso@cicely.de>
struct sd: Add a field for the pid of the reviver when the subdisk is
reviving.
Replace block device macros with generalized device macros.
alpha.
Explicitly type large scalar parameters to avoid compilation warnings
on alpha.
Submitted-by: Bernd Walter <ticso@cicely.de>
Make better checks that the revive block size is valid, silently set
it to the defaults if not.
Replace block device macros with generalized device macros.
alpha.
Modify the manner in which we lock RAID-5 plexes. This appears to
solve some of the elusive panics we have seen with corrupted buffer
headers (specifically the zeroed-out b_iodone field).
Submitted-by: Bernd Walter <ticso@cicely.de>
solve some of the elusive panics we have seen with corrupted buffer
headers (specifically the zeroed-out b_iodone field).
Submitted-by: Bernd Walter <ticso@cicely.de>
'iswhite'. The original change was required because of name
conflicts.
Add key pairs for the keywords 'mv' and 'move' (part of the move
command).
Add comments.
drives. This function just does the low-level configuration changes;
the resultant subdisk is stale if it previously had any contents,
otherwise it is empty (i.e. in need of initializing if it's RAID-5).
We still need to handle getting the contents moved over, but the
current version will suffice to migrate subdisks from a disk which has
failed.
Submitted-by: Marius Bendiksen <marius@marius.scancall.no>
on alpha.
Submitted-by: Bernd Walter <ticso@cicely.de>
Remove #include of vm/vm_zone.h.
Submitted-by: Someone, I'm sure, but I seem to have lost the
attribution. Sorry.
Get the check for disk devices correct, and return an appropriate
message if the check fails.
shutdown.
Submitted-by: Alfred Perlstein <bright@wintelcom.net>
Correct printf format for pointers to avoid compilation warnings on
alpha.
Submitted-by: Bernd Walter <ticso@cicely.de>
Identify daemon as 'vinum', not 'vinumd', in messages. This
corresponds to the name in ps.
on alpha.
Submitted-by: Bernd Walter <ticso@cicely.de>
Get parameters right for some error messages returned via
throw_rude_remark().
Fix typo in comment.
Remove the 'static' attribute from give_sd_to_drive. This is needed
for the implementation of moveobject() in vinumioctl.c.
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
Put splbio protection around the main launch loop. We've seen cases where
the bottom half was cutting off the branch on which we're sitting.
Experienced-by: Michael Reifenberger <root@nihil.plaut.de>
maximum of 64 kB.
vinum_conf, struct drive: add fields for the current number of active
requests and the maximum ever active.
struct sd: Add fields for initialize progress.
struct mc: set the size of saved file names to MCFILENAMELEN instead
of the previous explicit constant.
limit the number of outstanding requests on a specific drive and
overall.
Change the way we set the active request count. This enables us to
start the requests without being in splbio for the duration, which
could be very long for IDE drives in PIO mode.
context. Be prepared to fail instead.
MMalloc, FFree: ensure that saved file names are properly terminated.
Use MCFILENAMELEN instead of the previous explicit constant.
and then doing it itself, resulting in a panic downed drives.
Sleuth-work-by: Christopher Masto <chris@netmonger.net>
Remove dummy function initsd, which is now (implemented) in
vinumrevive.c
vinum_scandisk: Check that a drive is up before reading from it. This
is probably excessive paranoia.
- Move intrhook stuff into kernel.h
- Remove all occurrences of #device <device.h>
- Add kernel.h were necessary (nowhere)
- delete device.h
This file contained the structures for cfdata (old style config) and is no
longer used. It was included by most drivers.
It confuses the remote debugger as the definition of 'struct device' in
device.h is found before the one in bus_private.h.
of parity check and rebuild operations. This enables us
to stop the operation and restart at a later time.
enum parityop: Trivial enum to decide what parityops() is going to do.
could stand.
Define the correct return lengths for a number of ioctls.
Add ioctls VINUM_CHECKPARITY and VINUM_RESETPARITY, still to be fully
implemented.
ourselves. This breaks a vicious circle which caused
vinum to dereference a null vp if device nodes were
missing.
Reported-by: Brad Chisholm <sasblc@unx.sas.com>
Alec Wolman <wolman@cs.washington.edu>
check_drive: Don't take a drive down if it's only referenced.
read_drive: Remove unused variable.
get_empty_volume: initialize plexes to -1 (not allocated)
remove_drive_entry:
Remove recurse parameter (there's nothing below a drive in the hierarchy).
Use remove_sd_entry to remove sds, don't do it ourselves.
Log errors, don't throw rude remarks.
remove_plex_entry:
Don't use plex->subdisks as a loop limit, it gets changed in the
loop. This caused some removals to only remove half the subdisks.
Change logging of some "impossible" situations.
remove_volume_entry:
Use remove_plex_entry to remove plexes, don't do it ourselves.
update_sd_config:
Use set_sd_state to do the work.
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.
The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
the highly non-recommended option ALLOW_BDEV_ACCESS is used.
(bdev access is evil because you don't get write errors reported.)
Kill si_bsize_best before it kills Matt :-)
Use the specfs routines rather having cloned copies in devfs.
some swapper problems analogous to those experienced with ccd.
This fix is a kludge: since we currently don't track the "sector size"
in a volume label, we guess a worst case (4 kB, as used by vnode
devices). If the concept of sector size is here to stay, I'll make
some changes to track the "sector size" of a volume. This will
probably be the maximum of the sector sizes of all component drives,
but things could get ugly if we start allowing non-standard sector
sizes such as 524 bytes.
Unkludged-version-submitted-by: phk
lockrange: correctly expand rangelock struct, including expanding a
null struct. Previously lockrange would attempt to lock a
NULL pointer under these circumstances.
Reported-by: Ian Freislich <iang@uunet.co.za>
initialized subdisks.
Tidy up some comments.
Eliminate sddownstate(); it wasn't being used any more. Return
REQUEST_DOWN instead.
Add setstate_by_force() to implement the VINUM_SETSTATE_FORCE ioctl
for diddling individual object states. This is a repair tool which
can also be used for panicing the system. Use with utmost care if at
all.
avoids a race condition where multiple RAID-5 subdisks are being
revived at the same time. The locks should also prevent conflicts
with user requests on concatenated and striped plexes, but this needs
more work.
Tidy up some comments.
format_config: code preening.
vinum_scandisk: If we find a partition in the first pass over a drive,
note the fact so we don't grab the compatibility partition as well.
Submitted-by: peter
goes into initialized state, not 'up'. This makes it easier to ensure
consistency in multi-plex volumes.
update_plex_state: redo transitions from empty and initialized
subdisks to up or reviving, depending on the number of plexes.
Reported-by: Bernd Walter <ticso@cicely.de>
Remy Nonnenmacher <remy@synx.com>
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.
Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.
Remove bogus and unused strategy1 routines.
Remove open/close arguments from dssize(). Pick them up from dev_t.
Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
Don't return "can't do it" when the user requests a state change to
the current state. This previously caused silly messages like "Can't
start <foo>: invalid argument", when in fact <foo> was already
started.
set_plex_state: don't set state for non-existent plexes.
update_plex_status: as long as we have initializing subdisks, we're
initializing.
Move the declaration of freerq() to request.h.
logrq: add support for lock events.
vinumstart: solve a problem where removing a plex from an active
volume could cause attempts to access non-existent plexes.
launch_requests: don't set a request group active until we're sure we
can launch it. This caused some hangs under unusual
circumstances.
bre: don't set XFR_BAD_SUBDISK if we're not going to use it.
build_read_request: correct recovery, which caused some hangs under
(other) unusual circumstances.
build_rq_buffer: don't set bp->b_dev if we don't have a dev.
sdio: clean up, remove obsolete code.
deallocrqg: unlock any locks the rqg may have.
bre5:
Shorten some lines.
Desired-by: bde
If we're reading from a short plex, return EOF indication.
Always lock the stripe before starting a transfer. Hopefully the
current version will solve some data integrity problems that have
been reported with degraded RAID-5 plexes.
Reported-by: Bernd Walter <ticso@cicely.de>
Remy Nonnenmacher <remy@synx.com>
solve some data integrity problems that have been reported with
degraded RAID-5 plexes.
Reported-by: Bernd Walter <ticso@cicely.de>
Remy Nonnenmacher <remy@synx.com>
Tidy other comments.
open_drive: don't call set_drive_state if we decide to take it down.
This could help avoid some race conditions with the daemon.
init_drive: don't set the drive down, we'll let close_locked_drive do
that.
close_locked_drive: set drive state to down without calling
set_drive_state. This could help avoid some race conditions with the
daemon.
driveio: remove the function, it wasn't being used.
get_volume_label: remove volume dependencies so that we can return a
label for plexes and subdisks as well. What a kludge.
Remove declarations for freerq and free_rqg.
Remove DEBUG_RESID code.
freerq: check whether the request is holding a lock, free if so.
free_rqg: remove. It wasn't being used any more.
Change the Debugger calls to panics.
checkdiskconfig(): remove. It didn't make any sense to complain about
kernel keywords in user config files; it just made it more difficult
to convert. Now we ignore kernel keywords if we're not in kernel
mode.
get_empty_sd: initialize sectors.
free_drive: don't close if we don't have a vp. Maybe this will help
fix the problem that peter had, but I wouldn't count on it.
config_plex: If the plex is RAID-5, give it a rangelock structure.
start_config: Reset current drive, plex and volume so that a new
'create' command doesn't get long-dead defaults.
struct rqelement, enum rqinfo_type, struct rqinfo, union rqinfou: add
lock requests.
Add declarations for freerq and unlockrange. Since they include
request structures, they can't go in vinumext.h
- %q -> %ll.
Fixed nearby errors not reported by gcc -Wformat on i386's:
- don't assume that the promotion of [u_]int64_t is [u_]quad_t.
- don't use signed formats for unsigned args.
Add Cybernet copyright.
OK'd-by: Chuck Jacobus <chuck@cybernet.com>
update_plex_state:
If any subdisk in the plex is initializing, set the plex to
initializing state. This gets rid of the ugly corrupt/degraded/up
transitions which previously occurred.
Desired-by: Steve Taylor <staylor@cybernet.com>
sddownstate:
Add new function, used by checksdstate.
checksdstate:
Let sddownstate decide what status to return.
Add Cybernet copyright.
OK'd-by: Chuck Jacobus <chuck@cybernet.com>
logrq: save device major and minor numbers to compensate for lost
dev_t.
launch_requests: Don't issue requests which are marked
XFR_BAD_SUBDISK. This may make things easier in bre().
bre:
Rearrange.
- Change some comments
- Recognize holes in plex structure. Formerly this could lead to
incorrect write to the plex. Return REQUEST_DEGRADED on a read
request, but carry on to the bitter end on a write request, and
mark the requests for the inaccessible subdisks with
XFR_BAD_SUBDISK.
- return REQUEST_EOF if the requested transfer goes beyond the end
of the plex. This is not an error, since other plexes may go
further into the volume address space.
build_read_request:
Handle REQUEST_DEGRADED returned from bre().
sdio:
Lock buffer before issuing the requests.
will only accept partitions of type 'vinum'.
format_config: Use the new %q format option in kvprintf, thus getting
rid of some of the filthiest code I've written in a long time. Also
remove the lltoa() function.
With-great-thanks-to: peter
format_config: Accept the fact that a subdisk might not be attached to
a plex, and save the config correctly.
vinum_scandisk: Scan all slices on a drive with a Microsoft partition
table. Only look at the compatibility slice if nothing was found in
the Microsoft slices.
This change removes a frequently employed method of shooting
yourself in the foot: people would decide that the Vinum drives
belonged on their own slice, and they wouldn't be able to start the
subsystem after a reboot. Documentation updates to follow.
initialize subdisks. Probably the plex-related subdisk type will die
a death.
vinumconfig.c:
Accept (and ignore) kernel state information in userland config
files. This saves a lot of error recovery and also makes it possible
to use the output of printconfig to create new configuration.
Remove checkdiskconfig(). It wasn't needed any more.
Start adding support for hot spare drives. You can't put anything on
them (yet).
Change message formats from %lld to %qd.
get_empty_sd: Initialize size to -1. Previously this was done in
config_subdisk, which is the wrong place.
start_config: set current drive, plex and volume to -1, thus stopping
update configurations from taking their defaults from old configs.
the device numbers are now minor number only, so that we can still
compare them after dev_t has turned into a blob.
Broken-by: dev_t changes
Reported-by: Vallo Kallaste <vallo@matti.ee>
"Niels Chr. Bank-Pedersen" <ncbp@bank-pedersen.dk>
Correct race condition between caller and daemon.
Tripped-over-by: Zach Heilig <zach@uffdaonline.net>
Bernd Walter <ticso@cicely.de>
Niels Chr. Bank-Pedersen <ncbp@bank-pedersen.dk>
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
If the drive goes down, queue a close to the daemon. In many cases
this function gets called in process context, so it could do it
directly, but it's more trouble finding out where we came from than
getting the daemon to do it.
Don't bzero the buffer structure, it's been done already by
allocrqg.
sdio:
Build up a correct buffer header, don't steal linkages from system
buffer headers.
Noticed-by: mckusick
fix; it doesn't address the problem of removing the module. If you do
the following:
vinum stop
fsck /dev/vinum/VOLUME
you *will* get a system crash. What we need is a cdevsw_remove
corresponding to cdevsw_add, but that hasn't been written yet.
Submitted-by: phk
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)