resides on. Fix the special case of the filesystem fragment size not
evenly dividing the size of the provider. Fixing the general case
probably requires better superblock validation (left as an exercise to
the reader).
While we wait for holds to be released, print a list of who holds us
back once per second.
Use the new KPI from GEOM instead of vfs_mount.c calling g_waitidle().
Use the new KPI also from ata.
With ATAmkIII's newbusification, ata could narrowly miss the window
and ad0 would not exist when we tried to mount root.
in BSD and MBR classes, ie. if provider below us uses the same metadata,
don't create labels based on the metadata.
This allows to create labels on geoms with rank != 1 without hacks.
Tested by: Chris Elsworth <chris@shagged.org> on sparc64
OK'ed by: phk
MFC after: 2 weeks
completed I/O requests here.
- First allocate all needed bios, so if any of allocations fail, we can
free memory before sending any I/O requests down.
Reported by: Pawel Malachowski
MFC after: 3 days
seem to be necessary anymore, and it prevents tasting a valid drive
when booting with geom_vinum already loaded, since SCSI disks set their
sectorsize not until first opening them.
shared-last-sector problem.
After this change, even if there is more than one provider with the same
last sector, the proper one will be chosen based on its size.
It still doesn't fix the 'c' partition problem (when da0s1 can be confused
with da0s1c) and situation when 'a' partition starts at offset 0
(then da0s1a can be confused with da0s1 and da0s1c). One can use '-h'
option there, when creating device or avoid sharing last sector.
Actually, when providers share the same last sector and their size is equal,
they provide exactly the same data, so the name (da0s1, da0s1a, da0s1c)
isn't important at all.
- Provide backward compatibility.
- Update copyright's year.
MFC after: 1 week
the previous one failed and there are more than one plex in the volume.
This could have led to a flood of error messages on the console and
probably a deadlock in certain situations.
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
This flag means "wait for all pending requests before returning to userland".
There are pending events for sure, because we just created new provider and
other classes want to taste it, but we cannot answer on I/O requests until
we're here.
4 mutex operations per I/O requests.
- Use only one mutex to protect both (incoming and outgoing) queue.
As MUTEX_PROFILING(9) shows, there is no big contention for this lock.
- Protect sc_queue_count with queue mutex, instead of doing atomic
operations on it.
- Remove DROP_GIANT()/PICKUP_GIANT() - ggate is marked as MPSAFE and no
Giant there.
in BSD class, ie. if provider below us uses the same metadata, don't
create slices based on the metadata.
This allows to create slices on geoms with rank != 1 without hacks.
Discussed with: phk
Approved by: phk
MFC after: 2 weeks
the given providers. Without even one of the configured components there
should be no way to get the secret.
Supported by: WHEEL Sp. z o.o.
http://www.wheel.pl
(we ignore it).
- Remove code used for handling spoil events, as spoiling is not possible
anymore, because we keep consumers open for writing all the time.
MFC after: 4 days
After this change, when component is disconnected because of an I/O error,
it will not be connected and synchronized automatically, it will be logged
as broken and skipped. Autosynchronization can occur, when component is
disconnected (on orphan event) and connected again - there were no I/O
error, so there is no need to not connected the component, but when there were
writes while it wasn't connected, it will be synchronized.
This fix cases, when component is disconnected because of I/O error and can be
connected again and again.
- Bump version number.
- Implement backward compatibility mechanism. After this change when metadata in
old version is detected, it is automatically upgraded to the new (current)
version.
After this change, when component is disconnected because of an I/O error,
it will not be connected and synchronized automatically, it will be logged
as broken and skipped. Autosynchronization can occur, when component is
disconnected (on orphan event) and connected again - there were no I/O
error, so there is no need to not connected the component, but when there were
writes while it wasn't connected, it will be synchronized.
This fix cases, when component is disconnected because of I/O error and can be
connected again and again.
- Bump version number.
- Add version change history.
- Implement backward compatibility mechanism. After this change when metadata in
old version is detected, it is automatically upgraded to the new (current)
version.
while doing g_(read|write)_data() (e.g. BSD). This can cause a deadlock
in MIRROR class. Not sure if this is safe to drop the topology lock in BSD
class, so change the code in MIRROR class to avoid this deadlock.
Keeping consumers open when device is closed is very hard. We need to
open consumers sometimes to update metadata, etc.
Many hacks was introduced in the past to made it possible. You cannot
be sure that you can open consumer for writing always, even if you think
it should be allowed. If one of the mirror components is for example da0
and you try to open it, you can get EPERM when da0s1 is opened for reading
(because BSD class opens consumers (da0) with an extra 'e' bit set).
Waiting for the events queue to be empty may do the trick, but it makes
code much uglier (as you cannot always call g_waitidle()), it doesn't
solve all edge cases and it can introduce deadlocks if there are events
in the queue that wait for gmirror.
I removed those hacks. Now all consumers are open r1w1e1 always, even if
device is closed. Maybe it is less clean from GEOM perspective, but simpify
code a lot and make it much more reliable.
The only issue was retaste event which is sent when we close consumers
opened for writing. I ignore retaste event by not detaching consumer
immediately (so retaste event is not send to my class) and sending event
right after it to detach and destroy consumer.
anywhere in the DAG. This includes configurations that are not
allowed by the EFI specification.
o Reject a GPT partition table if it's not preceeded by a PMBR.
There's no need to preserve the MBR partitioning anymore as GPT
is mature and with the first bullet extending the applicability
of GPT, it's better to be a bit more strict.
because we know it then and we need it when inserting a component which
wasn't destroyed while device was running.
Reported by: Michael Handler <handler@grendel.net>
MFC after: 1 week
observations lead me to believe that the convetion for pc98 boot
loaders is to have a jump unstruction, followed by a string, followed
by code. The jump usually doesn't have a nop after it and usually the
string is NUL terminated, but Grub/98 breaks both of these rules.
# I looked for, but failed to find the Minux boot blocks for PC-9801 port.
512. If I had an audio cdrom in my cd player when I booted my system,
I'd get a panic from geom because you can't read 8192 bytes from an
audio cdrom.
Remove XXX comment about IPL1 and replace it with some information
from my soon to be published web page on the pc98 disk layout. The
IPL1 test was the result of an observation of a disk with FreeBSD's
boot0 program. It was testing part of an area what appears to be
reserved for a boot loader name, which comes after a jump over this
area. I don't yet know if it is required to be any specific jump
instruction, or if the destination has to be location 11. [1]
[1] FreeBSD Press No. 13, page 115, poorly translated by myself. The
picture there shows offset 8 as the destination of the jump, but
FreeBSD's boot0 program has three padding NULs after the IPL1 name and
uses a 16-bit 'jmp' instruction.
correct open/close behaviour of filesystems:
When an ioctl to modify the MBR arrives, we cannot take for granted that
we have the consumer open.
The symptom is that one cannot run 'boot0cfg -s2 /dev/ad0' in single-user
mode because / is the only open partition in only open r1w0e1.
If it is not, we attempt to increase the write count by one and
decrease it again afterwards.
Presumably most if not all other slices suffer from the same problem.
that the events queue is empty. In other case we're able to hit the race
where for example da0s1 is tasted by some other class, which means that
da0 is open with exclusive bit set, which means that we can't open da0
for writing if it is our component.
Reported by: Attila Nagy <bra@fsn.hu> (and somebody else sometime ago,
but I cannot find who it was)
seconds of idling, DRITY flags are removed.
- If mirror is in idle state or is not open for writing, sleep without
timeout when waiting for I/O requests.
- Don't use atomic operations, for now sysctls are protected by Giant.
- Update debugs.
- Fix for good (I hope) force-stopping mirrors and some filure cases
(e.g. the last good component dies when synchronization is in progress).
Don't use ->nstart/->nend consumer's fields, as this could be racy,
because those fields are used in g_down/g_up, use ->index consumer's
field instead for tracking number of not finished requests.
Reported by: marcel
- After 5 seconds of idle time (this should be configurable) mark all
dirty providers as clean, so when mirror is not used in 5 seconds
and there will be power failure, no synchronization on boot is needed.
Idea from: sorry, I can't find who suggested this
- When there are no ACTIVE components and no NEW components destroy whole
mirror, not only provider.
- Fix one debug to show information about I/O request, before we change
its command.
buf->b-dev.
Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.
Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.
This flag gets set whenever the thread posts an event on the GEOM
event queue, and if the flag is set when the thread is prepared
to return to userland from the kernel, g_waitidle() will be called
to make sure that the posted events have completed.
This can replace an insufficient number of g_waitidle() calls in
various other places, and has the advantage of being failsafe: Any
system call which does a VOP_OPEN()/VOP_CLOSE will now correctly
wait for any geom events it posted as part of spoils or tastes.
Assert that topology and Giant is not held in g_waitidle().
in the g_up and g_down threads. Each time a bio is propelled up and
down the stack, an event is generating showing the provider, offset,
and length, as well as thread wakeup and work status information.
providers for tasting. Before this hack, race below is possible:
SI_SUB_RAID (no not-fully-configured geoms, so don't block)
GEOM tasting (now geoms are created)
SI_SUB_MOUNT_ROOT (if root file system is placed on a mirror, it is
possible that this mirror is not fully configured yet)
There is a lot of work to do to avoid such hacks and I need a working
solution before 5.3, sorry.
Reported by: John Hay <jhay@icomtek.csir.co.za>
We have to use our own destroy_geom method, because default one, which
is a part of geom_slice is broken.
MT5 candidate.
PR: kern/72467
Submitted by: Vladimir Novoseltsev
by the time that kldload(8) returns. Satisfy that by making the GEOM
module load event -- only when the kernel is !cold -- wait until the
GEOM module init function has finished instead of returning immediately.
This is the other half of fixing md(8) (actually, "mfs" in fstab(5))
that is similar to r1.128 of src/sys/dev/md/md.c. This bug would be
why RAM disks would often fail on boot and the first call to mdconfig(8)
would probably fail.
pjd has ideas for not requiring kldload(8) to work synchronously for
control devices that could make this obsolete.
Silence on: -arch
This also adds support for bigger disks on the controller I have access to,
and maybe others if I understood the adhoc methods used on those.
Those with more PC98 bigdrive controllers it is hereby invited to add/fix
support for those in geom_pc98.c and not using #ifdef PC98 all over the place.
but it is possible:
1. Read data from good component for synchronization.
2. Write data to the same area.
3. Write synchronization data, which are now stale.
Found by: tegge (for gmirror)
but it is possible:
1. Read data from good component for synchronization.
2. Write data to the same area.
3. Write synchronization data, which are now stale.
Found by: tegge
Actually, it can even cause some problems, because GEOM requires sectorsize
to be more than 0 on first access, not on provider creation, so we can skip
valid providers by doing this check here.
Reported by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
Sven Willenberger <sven@dmv.com>
Analogous to the drive level, give each volume and plex a worker thread
that picks up and processes incoming and completed BIOs.
This should fix the data corruption issues that have come up a few
weeks ago and improve performance, especially of RAID5 plexes.
The volume level needs a little work, though.
sectorsize in order to avoid a lot of checks around various divisions etc.
Enforce the sectorsize being > 0 with a KASSERT on successful open.
Fix scsi_cd.c to return 2k sectors when no media inserted.
verification of regular data when device is in complete state.
On verification error, EIO error is returned for the bio and sysctl
kern.geom.raid3.stat.parity_mismatch is increased.
Suggested by: phk
as well, even if device is in complete state.
I observe 40% of speed-up with this option for random read operations,
but slowdown for sequential reads.
Basically, without this option reading from a RAID3 device built from 5
components (c0-c4) looks like this:
Request no. Used components
1 c0+c1+c2+c3
2 c0+c1+c2+c3
3 c0+c1+c2+c3
With the new feature:
Request no. Used components
1 c0+c1+c2+c3
2 (c1^c2^c3^c4)+c1+c2+c3
3 c0+(c0^c2^c3^c4)+c2+c3
4 c0+c1+(c0^c1^c3^c4)+c3
5 c0+c1+c2+(c0^c1^c2^c4)
6 c0+c1+c2+c3
[...]
drive is known to the configuration check also if it already has a geom.
Without this check several needless geoms are created and valid
configuration data was overwritten.
This change obsoletes the need for a separate geom to taste an
offered provider and the consumer doesn't need to be opened with the
exclusive bit set.
- Remove kern.geom.mirror.sync_block_size sysctl. It is quite obvious that we
want to use the biggest size possible.
- Do not use UMA zone for sync data allocations. There could be only one
synchronization request per synchronized disk at a time, so allocate memory
for one request on whole synchronization process related to one disk.
Tested by synchronizing one component (out of three) and by synchronizing
two components (out of three) in parallel.
and bio_inbed fields to 0. Without this change we can end up with
I/O leakage in some rare situations.
I tested this change by putting failure probability mechanism simlar
to this used in NOP class into g_clone_bio(9) function, so it was
able to return NULL with the given probability.
Discussed with: phk
bio_driver1 (as all the rest).
This introduced a small memory leak, but it wasn't really critical,
because maximum memory for g_stripe_zone is always set, so after few
requests gstripe was working in "economic" mode.
It allows to fix problems when last provider's sector is shared between few
providers.
- Bump version number for CONCAT and STRIPE and add code for backward
compatibility.
- Do not bump version number of MIRROR, as it wasn't officially introduced yet.
Even if someone started to play with it, there is no big deal, because
wrong MD5 sum of metadata will deny those providers.
- Update manual pages.
- Add version history to g_(stripe|concat).h files.
understood. This makes room for additional binary compatibility in the
future.
Put fields in the class for the geom's methods and initialize the methods
of a new geom from these fields. This saves some code in all classes.
something goes wrong while running in "fast" mode, we free all bios and
falling back to "economic" mode. Freeing bios, doesn't mean decrease
bio_children, so bio_inbed couldn't be equal to bio_children and request
was never finished.
Decrease bio_children manually when destroying bios.
Reported by: Sam Lawrance <boris@brooknet.com.au>, simon
consumer and 'bio_pflags' which can be used by provider.
- Remove BIO_FLAG1 and BIO_FLAG2 flags. From now on new fields should be
used for internal flags.
- Update g_bio(9) manual page.
- Update some comments.
- Update GEOM_MIRROR, which was the only one using BIO_FLAGs.
Idea from: phk
Reviewed by: phk
set gp->softc to NULL and return ENXIO when it is NULL, so GEOM
will not panic or hang, but unload one device on every 'unload'.
This make 'unload' command usable, but it have to be executed
<number of devices> + 1 times.
- Made use of 'pp' variable.
The different between the new function and g_mirror_orphan() (which was
used previously) is that syncid is bumped immediately, instead of on
first write, because when consumer was spoiled, it means, that its
provider was opened for writing, so we can't trust that its data
will be valid when it will be connected again.
features. The gmirror(8) utility should be used for control of this class.
There is no manual page yet, but I'm working on it with keramida@.
Many useful tests provided by: simon (thank you!)
Some ideas from: scottl, simon, phk
This is really ugly way to do this, but there is no other way for now.
It allows to mount root file system from providers which belong to
those classes.
Approved by: phk
provider.
- Bump version number.
This allows for a quite interesting trick. One can setup a stripe with
stripe size of 512 bytes and create transparent provider on top of it
with sector size equal to <ndisks> * 512. The result will be something
like RAID3 without parity disk (every access will touch all disks).
vinumdrive geom with an exclusive bit. This should fix the problem
when underlying partitions overlap (i.e. the 'a' partition is at
the same offset as the 'c' partition).
Ideas borrowed from pjd@, quite a bit of testing by
Matthias Schuendehuette <msch@snafu.de>.