seconds of idling, DRITY flags are removed.
- If mirror is in idle state or is not open for writing, sleep without
timeout when waiting for I/O requests.
- Don't use atomic operations, for now sysctls are protected by Giant.
- Update debugs.
- Fix for good (I hope) force-stopping mirrors and some filure cases
(e.g. the last good component dies when synchronization is in progress).
Don't use ->nstart/->nend consumer's fields, as this could be racy,
because those fields are used in g_down/g_up, use ->index consumer's
field instead for tracking number of not finished requests.
Reported by: marcel
- After 5 seconds of idle time (this should be configurable) mark all
dirty providers as clean, so when mirror is not used in 5 seconds
and there will be power failure, no synchronization on boot is needed.
Idea from: sorry, I can't find who suggested this
- When there are no ACTIVE components and no NEW components destroy whole
mirror, not only provider.
- Fix one debug to show information about I/O request, before we change
its command.
buf->b-dev.
Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.
Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.
This flag gets set whenever the thread posts an event on the GEOM
event queue, and if the flag is set when the thread is prepared
to return to userland from the kernel, g_waitidle() will be called
to make sure that the posted events have completed.
This can replace an insufficient number of g_waitidle() calls in
various other places, and has the advantage of being failsafe: Any
system call which does a VOP_OPEN()/VOP_CLOSE will now correctly
wait for any geom events it posted as part of spoils or tastes.
Assert that topology and Giant is not held in g_waitidle().
in the g_up and g_down threads. Each time a bio is propelled up and
down the stack, an event is generating showing the provider, offset,
and length, as well as thread wakeup and work status information.
providers for tasting. Before this hack, race below is possible:
SI_SUB_RAID (no not-fully-configured geoms, so don't block)
GEOM tasting (now geoms are created)
SI_SUB_MOUNT_ROOT (if root file system is placed on a mirror, it is
possible that this mirror is not fully configured yet)
There is a lot of work to do to avoid such hacks and I need a working
solution before 5.3, sorry.
Reported by: John Hay <jhay@icomtek.csir.co.za>
We have to use our own destroy_geom method, because default one, which
is a part of geom_slice is broken.
MT5 candidate.
PR: kern/72467
Submitted by: Vladimir Novoseltsev
by the time that kldload(8) returns. Satisfy that by making the GEOM
module load event -- only when the kernel is !cold -- wait until the
GEOM module init function has finished instead of returning immediately.
This is the other half of fixing md(8) (actually, "mfs" in fstab(5))
that is similar to r1.128 of src/sys/dev/md/md.c. This bug would be
why RAM disks would often fail on boot and the first call to mdconfig(8)
would probably fail.
pjd has ideas for not requiring kldload(8) to work synchronously for
control devices that could make this obsolete.
Silence on: -arch
This also adds support for bigger disks on the controller I have access to,
and maybe others if I understood the adhoc methods used on those.
Those with more PC98 bigdrive controllers it is hereby invited to add/fix
support for those in geom_pc98.c and not using #ifdef PC98 all over the place.
but it is possible:
1. Read data from good component for synchronization.
2. Write data to the same area.
3. Write synchronization data, which are now stale.
Found by: tegge (for gmirror)
but it is possible:
1. Read data from good component for synchronization.
2. Write data to the same area.
3. Write synchronization data, which are now stale.
Found by: tegge
Actually, it can even cause some problems, because GEOM requires sectorsize
to be more than 0 on first access, not on provider creation, so we can skip
valid providers by doing this check here.
Reported by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
Sven Willenberger <sven@dmv.com>
Analogous to the drive level, give each volume and plex a worker thread
that picks up and processes incoming and completed BIOs.
This should fix the data corruption issues that have come up a few
weeks ago and improve performance, especially of RAID5 plexes.
The volume level needs a little work, though.
sectorsize in order to avoid a lot of checks around various divisions etc.
Enforce the sectorsize being > 0 with a KASSERT on successful open.
Fix scsi_cd.c to return 2k sectors when no media inserted.
verification of regular data when device is in complete state.
On verification error, EIO error is returned for the bio and sysctl
kern.geom.raid3.stat.parity_mismatch is increased.
Suggested by: phk
as well, even if device is in complete state.
I observe 40% of speed-up with this option for random read operations,
but slowdown for sequential reads.
Basically, without this option reading from a RAID3 device built from 5
components (c0-c4) looks like this:
Request no. Used components
1 c0+c1+c2+c3
2 c0+c1+c2+c3
3 c0+c1+c2+c3
With the new feature:
Request no. Used components
1 c0+c1+c2+c3
2 (c1^c2^c3^c4)+c1+c2+c3
3 c0+(c0^c2^c3^c4)+c2+c3
4 c0+c1+(c0^c1^c3^c4)+c3
5 c0+c1+c2+(c0^c1^c2^c4)
6 c0+c1+c2+c3
[...]
drive is known to the configuration check also if it already has a geom.
Without this check several needless geoms are created and valid
configuration data was overwritten.
This change obsoletes the need for a separate geom to taste an
offered provider and the consumer doesn't need to be opened with the
exclusive bit set.
- Remove kern.geom.mirror.sync_block_size sysctl. It is quite obvious that we
want to use the biggest size possible.
- Do not use UMA zone for sync data allocations. There could be only one
synchronization request per synchronized disk at a time, so allocate memory
for one request on whole synchronization process related to one disk.
Tested by synchronizing one component (out of three) and by synchronizing
two components (out of three) in parallel.
and bio_inbed fields to 0. Without this change we can end up with
I/O leakage in some rare situations.
I tested this change by putting failure probability mechanism simlar
to this used in NOP class into g_clone_bio(9) function, so it was
able to return NULL with the given probability.
Discussed with: phk
bio_driver1 (as all the rest).
This introduced a small memory leak, but it wasn't really critical,
because maximum memory for g_stripe_zone is always set, so after few
requests gstripe was working in "economic" mode.
It allows to fix problems when last provider's sector is shared between few
providers.
- Bump version number for CONCAT and STRIPE and add code for backward
compatibility.
- Do not bump version number of MIRROR, as it wasn't officially introduced yet.
Even if someone started to play with it, there is no big deal, because
wrong MD5 sum of metadata will deny those providers.
- Update manual pages.
- Add version history to g_(stripe|concat).h files.
understood. This makes room for additional binary compatibility in the
future.
Put fields in the class for the geom's methods and initialize the methods
of a new geom from these fields. This saves some code in all classes.
something goes wrong while running in "fast" mode, we free all bios and
falling back to "economic" mode. Freeing bios, doesn't mean decrease
bio_children, so bio_inbed couldn't be equal to bio_children and request
was never finished.
Decrease bio_children manually when destroying bios.
Reported by: Sam Lawrance <boris@brooknet.com.au>, simon
consumer and 'bio_pflags' which can be used by provider.
- Remove BIO_FLAG1 and BIO_FLAG2 flags. From now on new fields should be
used for internal flags.
- Update g_bio(9) manual page.
- Update some comments.
- Update GEOM_MIRROR, which was the only one using BIO_FLAGs.
Idea from: phk
Reviewed by: phk
set gp->softc to NULL and return ENXIO when it is NULL, so GEOM
will not panic or hang, but unload one device on every 'unload'.
This make 'unload' command usable, but it have to be executed
<number of devices> + 1 times.
- Made use of 'pp' variable.
The different between the new function and g_mirror_orphan() (which was
used previously) is that syncid is bumped immediately, instead of on
first write, because when consumer was spoiled, it means, that its
provider was opened for writing, so we can't trust that its data
will be valid when it will be connected again.
features. The gmirror(8) utility should be used for control of this class.
There is no manual page yet, but I'm working on it with keramida@.
Many useful tests provided by: simon (thank you!)
Some ideas from: scottl, simon, phk
This is really ugly way to do this, but there is no other way for now.
It allows to mount root file system from providers which belong to
those classes.
Approved by: phk
provider.
- Bump version number.
This allows for a quite interesting trick. One can setup a stripe with
stripe size of 512 bytes and create transparent provider on top of it
with sector size equal to <ndisks> * 512. The result will be something
like RAID3 without parity disk (every access will touch all disks).
vinumdrive geom with an exclusive bit. This should fix the problem
when underlying partitions overlap (i.e. the 'a' partition is at
the same offset as the 'c' partition).
Ideas borrowed from pjd@, quite a bit of testing by
Matthias Schuendehuette <msch@snafu.de>.
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
In this mode you can setup even very small stripe size and you can be
sure that only one I/O request will be send to every disks in stripe.
It consumes some more memory, but if allocation fails, it will fall
back to "ECONOMIC" mode.
It is about 10 times faster for small stripe size than "ECONOMIC" mode
and other RAID0 implementations. It is even recommended to use this
mode and small stripe size, so our requests are always splitted.
One can still use "ECONOMIC" mode by setting kern.geom.stripe.fast to 0.
It is also possible to setup maximum memory which "FAST" mode can consume,
by setting kern.geom.stripe.maxmem from /boot/loader.conf.
When we orphan/wither a provider, an attached geom+consumer could
end up being withered as a result and it may be in front of us in
the normal object scanning order so we need to do multi-pass. On
the other hand, there may be withering stuff we can't get rid off
(yet), so we need to keep track of both the existence of withering
stuff and if there is more we can do at this time.
This class is used for detecting volume labels on file systems:
UFS, MSDOSFS (FAT12, FAT16, FAT32) and ISO9660.
It also provide native labelization (there is no need for file system).
g_label_ufs.c is based on geom_vol_ffs from Gordon Tetlow.
g_label_msdos.c and g_label_iso9660.c are probably hacks, I just found
where volume labels are stored and I use those offsets here,
but with this class it should be easy to do it as it should be done by
someone who know how.
Implementing volume labels detection for other file systems also should
be trivial.
New providers are created in those directories:
/dev/ufs/ (UFS1, UFS2)
/dev/msdosfs/ (FAT12, FAT16, FAT32)
/dev/iso9660/ (ISO9660)
/dev/label/ (native labels, configured with glabel(8))
Manual page cleanups and some comments inside were submitted by
Simon L. Nielsen, who was, as always, very helpful. Thanks!
Now, when trying to mount file system in read-only mode it tries to
opened a device for writting to be able to update to read-write mode
latter. Ehh.
Discussed with: phk
to warn about attempts to sleep in the I/O path. This change pushes the
definition and use of 'mymutex' behind #ifdef WITNESS to avoid the cost
in non-debugging cases. This results in a clear .22% performance win for
512 byte and 1k I/O tests on my SMP test box. Not much, but every bit
counts.
not active GEOM providers, it will result in a kernel panic.
If the GEOM provider or disk goes away before the volume
configuration data gets written to the disk, it will result
in another kernel panic.
o Make sure that the drives specified for volume creation
are active GEOM providers.
o When writing out volume configuration data to associated drives,
make sure that the GEOM provider is active, otherwise continue
to the next drive in the volume.
Approved by: le, bmilekic (mentor)
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
allocated ressouces should be ultimately freed in gv_destroy_geom()
(when unloading the module and not earlier), but I need to look at this
more closely.
is intend to be fast. Just like CONCAT class it provides manual and
auto configuration methods.
Supported by: Wheel - Open Technologies - http://www.wheel.pl
it is very useful for tests. One is able to destroy its provider
forcibly if wants to test how other class handle such events.
One is also able to specify failure probability to check how other
classes handle I/O errors.
Supported by: Wheel - Open Technologies - http://www.wheel.pl
Retire g_sanity() and corresponding debugflag (0x8)
Retire g_{stall,release}_events().
Under #ifdef DIAGNOSTIC:
Make g_valid_obj() an official function and have it return an an
non-zero integer which indicates the kind of object when found.
Implement G_VALID_{CLASS,GEOM,CONSUMER,PROVIDER}() macros based
on g_valid_obj().
Sprinkle calls to these macros liberally over the infrastructure.
Always check that we do not free a live object.
least common multiple of all disks sector sizes.
This will allow to safely concatenate disks with different sector sizes.
- Mark unused function arguments.
- Other minor cleanups.
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
Previously the "struct disk" were owned by the device driver and this
gave us problems when the device disappared and the users of that device
were not immediately disappearing.
Now the struct disk is allocate with a new call, disk_alloc() and owned
by geom_disk and just abandonned by the device driver when disk_create()
is called.
Unfortunately, this results in a ton of "s/\./->/" changes to device
drivers.
Since I'm doing the sweep anyway, a couple of other API improvements
have been carried out at the same time:
The Giant awareness flag has been flipped from DISKFLAG_NOGIANT to
DISKFLAG_NEEDSGIANT
A version number have been added to disk_create() so that we can detect,
report and ignore binary drivers with old ABI in the future.
Manual page update to follow shortly.
shown that it is not useful.
Rename the relative count g_access_rel() function to g_access(), only
the name has changed.
Change all g_access_rel() calls in our CVS tree to call g_access() instead.
Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source
code compatibility.
rather than right before and right after. This allows these routines
to manipulate the mesh.
KASSERT that nobody creates a geom on an alien class.
Assert topology in g_valid_obj().
Approved by: re@
provide no methods does not make any sense, and is not used by any
driver.
It is a pretty hard to come up with even a theoretical concept of
a device driver which would always fail open and close with ENODEV.
Change the defaults to be nullopen() and nullclose() which simply
does nothing.
Remove explicit initializations to these from the drivers which
already used them.
This replaces the current ioctl processing with a direct call path
from geom_dev() where the ioctl arrives (from SPECFS) to any directly
connected GEOM class.
The inverse of the above is no longer supported. This is the
situation were you have one or more intervening GEOM classes, for
instance a BSDlabel on top of a MBR or PC98. If you want to issue
MBR or PC98 specific ioctls, you will need to issue them on a MBR
or PC98 providers.
This paves the way for inviting CD's, FD's and other special cases
inside GEOM.
device should handle them.
This prevents for instance GEOM::ioctl requests from reaching a
lower BSDlabel node, which ps@ found would confuse newfs(8).
redundant paths to the same device.
This class reacts to a label in the first sector of the device,
which is created the following way:
# "0123456789abcdef012345..."
# "<----magic-----><-id-...>
echo "GEOM::FOX someid" | dd of=/dev/da0 conv=sync
NB: Since the fact that multiple disk devices are in fact the same
device is not known to GEOM, the geom taste/spoil process cannot
fully catch all corner cases and this module can therefore be
confused if you do the right wrong things.
NB: The disk level drivers need to do the right thing for this to
be useful, and that is not by definition currently the case.
Attach to the component devices using GEOM semantics.
Create a GEOM provider instead of using disk_create()
Use the GEOM OAM api for configuration.
I saw approx ~1% speedup in througput and ~7% in latency in a
simple minded test of a two-disk striped device.
This file was repo-copied from src/sys/dev/ccd/ccd.c.
This is not yet linked into the build.
Change the list interface to simplify things.
Remove old list ioctls which bogusly exported the softc to userland.
Move the softc and associated structures from the public header to
the source file.
Make CCD a GEOM class.
For now only use this for implementing a OAM config method which
can return a list of configured CCD devices in the format which
"ccdconfig -g[v]" would normally output.
hinge on the "verb" parameter which the class gets to interpret as
it sees fit.
Move the entire request into the kernel and move changed parameters
back when done.
Use ->init() and ->fini() to handle the mutex in geom_disk.c
Remove the g_add_class() function and replace it with a standardized
g_modevent() function.
This adds the basic infrastructure for loading/unloading GEOM classes
still outstanding, give them a chance to complete.
If after 10 seconds we still find outstanding I/O requests, complete
the close with a console warning that the system is likely to panic
later on.
This is a workaround for umount -f not quite doing the right thing.
Approved by: re/scottl
(If there is a legitimate need to correctly encode and pack a
disklabel with an invalid checksum custom tools can be built for
that.)
Make bsd_disklabel_le_dec() validate the magics, number of partitions
(against a new parameter) and the checksum.
Vastly simplify the logic of the GEOM::BSD class implementation:
Let g_bsd_modify() always take a byte-stream label.
This simplifies all users, except the ioctl's which now have to
convert to a byte-stream first. Their loss.
g_bsd_modify() is called with topology held now, and it returns
with it held.
Always update the md5sum in g_bsd_modify(), otherwise the check
is no use after the first modification of the label. Make the
MD5 over the bytestream version of the label.
Move the rawoffset hack to g_bsd_modify() and remove all the
inram/ondisk conversions.
Don't configure hotspots in g_bsd_modify(), do it in taste instead,
we do not support moving the label to a different location on the
fly anyway.
This passes all current regression tests.