to problems when the geli device is used with file system or as a swap.
Hopefully will prevent problems like kern/98742 in the future.
MFC after: 1 week
arrangement that has no intrinsic internal knowledge of whether devices
it is given are truly multipath devices. As such, this is a simplistic
approach, but still a useful one.
The basic approach is to (at present- this will change soon) use camcontrol
to find likely identical devices and and label the trailing sector of the
first one. This label contains both a full UUID and a name. The name is
what is presented in /dev/multipath, but the UUID is used as a true
distinguishor at g_taste time, thus making sure we don't have chaos
on a shared SAN where everyone names their data multipath as "Fred".
The first of N identical devices (and N *may* be 1!) becomes the active
path until a BIO request is failed with EIO or ENXIO. When this occurs,
the active disk is ripped away and the next in a list is picked to
(retry and) continue with.
During g_taste events new disks that meet the match criteria for existing
multipath geoms get added to the tail end of the list.
Thus, this active/passive setup actually does work for devices which
go away and come back, as do (now) mpt(4) and isp(4) SAN based disks.
There is still a lot to do to improve this- like about 5 of the 12
recommendations I've received about it, but it's been functional enough
for a while that it deserves a broader test base.
Reviewed by: pjd
Sponsored by: IronPort Systems
MFC: 2 months
flash card reader.
Also remove an 'Opened da0 -> <random number>' which is not needed on a daily
basis (available through bootverbose).
Reviewed by: phk, ken
MFC after: 1 week
partitioning class that supports multiple schemes. Current
schemes supported are APM (Apple Partition Map) and GPT.
Change all GEOM_APPLE anf GEOM_GPT options into GEOM_PART_APM
and GEOM_PART_GPT (resp).
The ctlreq interface supports verbs to create and destroy
partitioning schemes on a disk; to add, delete and modify
partitions; and to commit or undo changes made.
We can't bind to a CPU which is not yet on-line, so add code that wait for
CPUs to go on-line before binding to them.
Reported by: Alin-Adrian Anton <aanton@spintech.ro>
MFC after: 2 weeks
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
gmirror and graid3 in a way that it is not resynchronized after a
power failure or system crash.
It is safe when gjournal is running on top of gmirror/graid3.
we won't be able to exit from the thread.
Function g_eli_cpu_is_disabled() stoled from kern_pmc.c.
PR: 104669
Reported by: Nikolay Mirin <nik@optim.com.ru>
MFC after: 1 week
- Do not modify mnt_flag without mount interlock held.
- Do not touch MNT_ASYNC flag, as this can lead to a race with nmount(2).
Pointed out by: tegge
Reviewed by: tegge
journaling and can be tought about marking file system as clean before
doing journal switch, which easly allows to add journaling to file
systems that don't have this feature.
Sponsored by: home.pl
read requests to its consumer. It has been developed to address
the problem of a horrible read performance of a 64k blocksize FS
residing on a RAID3 array with 8 data components, where a single
disk component would only get 8k read requests, thus effectively
killing disk performance under high load. Documentation will be
provided later. I'd like to thank Vsevolod Lobko for his bright
ideas, and Pawel Jakub Dawidek for helping me fix the nasty bug.
request can still have bio_to set to sc_provider (this is READ part of a
synchronization request) and in this case g_{mirror,raid3}_sync() wasn't
called as it should be.
MFC after: 1 week
This way GEOM classes can safely detach from provider when an orphan
event is received. This fixes 'detach with active requests' panic for
gstripe/gconcat under load.
PR: kern/102766
Submitted by: mjacob
OK'ed by: phk
MFC after: 1 week
add count of active and total components to the launched line so you can
see at a glance if your mirror/raid3 is complete...
now:
GEOM_MIRROR: Device mirror/sam launched (2/2).
Reviewed by: pjd
- hold/release device in start/done routines, this will probably slow
down things a bit, but previous code was racy;
- only release device if g_gate_destroy() failed - if it succeeded device
is dead and there is nothing to release;
- various other changes which makes forcible destruction reliable.
MFC after: 3 days
created on Windows XP (and others maybe) were not detected.
We detected only those created with newfs_msdos(8).
Submitted by: Tobias Reifenberger <treif@mayn.de>
style(9)ified by: pjd
This way one will be able to use provider encrypted on eg. i386 on
eg. sparc64. This doesn't really buy us much today, because UFS isn't
endian agnostic.
We retain backward compatibility by setting G_ELI_FLAG_NATIVE_BYTE_ORDER
flag on devices with version number less than 2 and not converting the
offset.
o PMBR partitions count to the number of partitions on the disk, which
means that if a PMBR entry is invalid we will not treat the MBR as a
PMBR by virtue of it not describing any partitions.
Previously the checks were inconsistent in that an invalid PMBR entry
would be harmless when no other partitions exist (we would treat the
MBR as a PMBR by virtue of it being empty), but it would be fatal when
there is at least one other partition.
o The partition size of a PMBR partition is one less than the media size
because the GPT starts at the second sector (LBA 1) and extends to
the end of the media. For backward bug-compatibility we accept a size
that's exactly the media size (FreeBSD bug).
Also, when the partition size can not be represented in a 32-bit
integral, the partition size in the MBR is to be set to 0xFFFFFFFF.
Accept this as a valid size, even if the size can be represented.
we obtained access. It is possible that GPT gets to taste a disk
first, which means the disk has not been opened before and it will
not get opened until after we checked the mediasize and sectorsize.
However, since the mediasize and sectorsize are determined at open
and that happens when access is optained, checking the mediasize
and sectorsize before obtaining access may result in GPT rejecting
the disk.
uma(9) will be used for memory allocation.
In case of problems or tracking bugs, there are more useful tools for malloc(9)
debugging than for uma(9) debugging, like memguard(9) and redzone(9).
MFC after: 1 week
MBR should have only one entry of type 0xEE, consider protective MBR
to be one, that has at least one entry of type 0xEE covering the whole
unit. This makes GEOM_GPT compatible with disks partitioned by the
Apple's BootCamp.
Approved in principle by: marcel
MFC After: 1 month
offset or request size which is not a multiple of the sector size, make
sure that the bio is set to indicate that no data has actually been
transferred.
The result of this is that the file offset is no longer incremented for
these requests. The fact that the file offset was incremented broke
fdisk(8)'s probing of sector size for non-512 byte sector sizes.
Reviewed by: phk, cperciva
Submitted by: mdodd
MFC after: 2 weeks
By using a pointer to struct dos_partition, we implicitly tell the
compiler that the pointer is 4-bytes aligned, even though we know
that's not the case. The fact that we only dereference the pointer
to access a byte-wide field (field dp_ptyp) is not a guarantee that
the compiler will in fact use a byte-wide load. On some platforms
it's more efficient to use long word or quad word loads and use
bit-shifting and bit-masking to get the intended byte. On those
platforms an misaligned load will be the result.
The fix is to use byte-wide pointer arithmetic based on sizeof() and
offsetof() to avoid invalid casts which avoids that the compiler
makes invalid assumptions.
Backtrace provided by: wilko@
MFC after: 1 week
two places where g_io_request() is called. g_io_request() can free bio
structure so we can't reference it after and G_RAID3_FOREACH_BIO() macro
was doing this.
Found by: Coverity Prevent analysis tool (with my new models)
MFC after: 1 day
- Prevent possible live-lock in case of memory problems by freeing
already completed requests first.
Reported and tested by: markus, Bradley W. Dutton <brad-fbsd-stable@duttonbros.com>
MFC after: 1 day
- Comment possible event miss, which isn't critical, but probably can be
fixed by replacing the event lock usage with the queue lock.
MFC after: 2 weeks
stored in metadata instead of an offset in single disk.
After reboot/crash synchronization process started from a wrong offset
skipping (not synchronizing) part of the component which can lead to data
corrutpion (when synchronization process was interrupted on initial
synchronization) or other strange situations like 'graid3 status' showing
value more than 100%.
Reported, reviewed and tested by: ru
Reported by: Dmitry Morozovsky <marck@rinet.ru>
MFC after: 1 day
which means that devices will be destroyed on last close.
This fixes destruction order problems when, eg. RAID3 array is build on
top of RAID1 arrays.
Requested, reviewed and tested by: ru
MFC after: 2 weeks
o Implement the remove verb to remove a partition entry.
o Improve error reporting by first checking that the verb is valid.
o Add an entry parameter to the add verb. this parameter can be
both read-only as welll as read-write and specifies the entry
number of the newly added partition.
o Make sure that the provider is alive when passed to us. It may
be withering away.
o When adding a new partition entry, test for overlaps with existing
partitions.
particular provider. Use this function where g_orphan_provider()
is being called so that the flags are updated correctly and
g_orphan_provider() is called only when allowed.
error on the request. Add a wrapper, gctl_set_param_err(), that
sets the error on the request from the error returned by
gctl_set_param() and update current callers of gctl_set_param()
to call gctl_set_param_err() instead.
This makes gctl_set_param() much more usable in situations where
the caller knows better what to do with certain (apparent) error
conditions and setting an error on the request is not one of the
things that need to be done.
case panic on sparc64.
The problem is in MD5(9) implementation. The Encode() function takes
'unsigned char *output' as its first argument, which is then assigned to
'u_int32_t *op'. If the 'output' argument is not 4 byte aligned (and in
geli(8) case it is not), sparc64 machine will panic.
I don't know how to fix MD5(9) in a clean way, so I'm implementing a
work-around in geli(8).
Reported by: brueffer
MFC after: 3 days
Submitted by: green
- Speed up synchronization process by using configurable number of I/O
requests in parallel.
+ Add kern.geom.raid3.sync_requests tunable which defines how many parallel
I/O requests should be used.
+ Retire kern.geom.raid3.reqs_per_sync and kern.geom.raid3.syncs_per_sec
sysctls.
- Fix race between regular and synchronization requests.
- Reimplement raid3's data synchronization - do not use the topology lock
for this purpose, as it may case deadlocks.
- Stop synchronization from pre-sync hook.
- Fix some other minor issues.
Tested by: Mike Tancsa <mike@sentex.net>
MFC after: 3 days
requests in parallel.
+ Add kern.geom.mirror.sync_requests tunable which defines how many parallel
I/O requests should be used.
+ Retire kern.geom.mirror.reqs_per_sync and kern.geom.mirror.syncs_per_sec
sysctls.
- Fix race between regular and synchronization requests.
- Reimplement mirror's data synchronization - do not use the topology lock
for this purpose, as it may case deadlocks.
- Stop synchronization from pre-sync hook.
- Fix some other minor issues.
MFC after: 3 days
means that old problem was triggered (when two providers end at the same
offset, eg. ad0 and ad0s1 and the wrong was is picked up by gmirror/graid3).
Reported by: Michal Suszko <dry@dry.pl>
MFC after: 3 days
sysinstall(8) still bogusly puts first partition at offset 0 instead of 16,
so glabel/ufs will find file system on slice instead of partition.
Before sysinstall is fixed, we must keep this code, which means that we
wont't be able to detect UFS file systems created with 'newfs -s ...'.
PS. bsdlabel(8) creates partitions properly.
MFC after: 3 days
to preserve currect behaviour). When set to 0, components are not
disconnected - graid3 will try to still use them (only first error will
be logged). This is helpful when we have two broken components, but in
different places, so actually all data is available.
Such buggy component will be visible in 'graid3 list' output with flag
BROKEN.
- Never disconnect the last valid component. If we detect errors there we
will just pass them up. This wasn't reasonable to deny access to the
whole provider because of one broken sector.
Prodded by: ru
MFC after: 3 days
to preserve currect behaviour). When set to 0, components are not
disconnected - gmirror will try to still use them (only first error will
be logged). This is helpful when we have two broken components, but in
different places, so actually all data is available.
Such buggy component will be visible in 'gmirror list' output with flag
BROKEN.
- Never disconnect the last valid component. If we detect errors there we
will just pass them up. This wasn't reasonable to deny access to the
whole provider because of one broken sector.
Prodded by: ru
MFC after: 3 days
An example entries for loader.conf to make it possible:
geli_da0_keyfile0_load="YES"
geli_da0_keyfile0_type="da0:geli_keyfile0"
geli_da0_keyfile0_name="/boot/keys/da0.key0"
geli_da0_keyfile1_load="YES"
geli_da0_keyfile1_type="da0:geli_keyfile1"
geli_da0_keyfile1_name="/boot/keys/da0.key1"
geli_da0_keyfile2_load="YES"
geli_da0_keyfile2_type="da0:geli_keyfile2"
geli_da0_keyfile2_name="/boot/keys/da0.key2"
geli_da1s3a_keyfile0_load="YES"
geli_da1s3a_keyfile0_type="da1s3a:geli_keyfile0"
geli_da1s3a_keyfile0_name="/boot/keys/da1s3a.key"
Thanks for jhb and kan who showed me the right direction.
MFC after: 3 days
- number of read I/O requests,
- number of write I/O requests,
- number of read bytes,
- number of written bytes.
Add 'reset' subcommand for resetting statistics.
plain file bsdlabel(8) always writes label at a fixed offset from
its beginning (512 bytes), regardless of the sector size. At the same
time, bsdlabel geom class expects label to be available at the very
beginning of the second sector.
As a result, images prepared in userland for media with sector size
different from 512 bytes (i.e. 2k for cdroms) are not recognized by
the tasting mechanism.
Solve the problem by always looking for the label at 512-byte offset
if we can't find it at the beginning of the second sector and sector
size is not 512 bytes.
o The only indication of error condition is NULL value returned by
the function;
o value pointed to by error argument is undefined in the case when
operation completes successfully.
Discussed with: phk
the geom creation to a seperate init function and ignore the tasting.
The config is now parsed only in the vinumdrive geom, which hopefully
fixes the problem, that the drive class tasted before the vinum class
had a chance, for good.
Also restore the behaviour that the module can be loaded at boot time
and on a running system.
Don't allocate potentially large variables on the stack.
Check strsep() return values when the string comes from userland.
Shorten variable names for lucidity's sake.
most of the stuff:
Pointed out by: njl@
Add functions to rename objects and to move a subdisk from one drive
to another.
Obtained from: Chris Jones <chris.jones@ualberta.ca>
Sponsored by: Google Summer of Code 2005
MFC in: 1 week
the underlying drive had been hot-unplugged from the system. Here
is a specific example. Filesystem code had opened /dev/da1s1e.
Subsequently, the drive was hot-unplugged. This (correctly) caused
all of the associated /dev/da1* entries to be deleted. When the
filesystem later realized that the drive was gone it closed the
device, reducing the write-access counts to 0 on the geom providers
for da1s1e, da1s1, and da1. This caused geom to re-taste the
providers, resulting in the devices being created again. When the
drive was hot-plugged back in, it resulted in duplicate /dev entries
for da1s1e, da1s1, and da1.
This fix adds a new disk_gone() function which is called by CAM when a
drive goes away. It orphans all of the providers associated with the
drive, setting an error condition of ENXIO in each one. In addition,
we prevent a re-taste on last close for writing if an error condition
has been set in the provider.
Sponsored by: Isilon Systems
Reviewed by: phk
MFC after: 1 week
verbs. Only the create verb operates on a provider. All other verbs
operate on a GPT geom. Also, the GPT entry oriented verbs require
a non-downgraded GPT.
o Have all verbs take an optional flags parameter. The flags parameter
is a string of single-letter flags. The typical use of these flags
is to enable certain behaviour in support fo the gpt(8) tool.
o Add dummy implementations for the destroy and recover verbs.
This change causes test 2 of the GPT regression test suite to fail.
The presence of a geom parameter is now required even for unknown
verbs.
MD class. Previously only the DISK class was dumped. The only
consumer of this sysctl is libdisk (i.e. sysinstall) and it tests
explicitly for instances of the DISK class. Dumping other classes
is therefore harmless.
By also dumping the MD class regression tests can be written that
use the MD class for operations that would normally be done on the
DISK class. The sysctl can now be used to test if those operations
took an effect. An example is partitioning.
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
memory for request.
I was sure graid3 should handle such situations well, but green@ reported
it is not and we want to fix it before 6.0.
Submitted by: green
up. This make iostat report operations passed down to the device driver
instead of operations passed down to GEOM disk. The transfer size limit
imposed by the device driver is no longer hidden, improving the correlation
between iostat output and device driver workload.
requests. The following features have been added:
1. Extensive checking and validation of both the primary and
secondary headers to protect against corrupted data and to
take advantage of the redundancy to allow the GPT to be
used in the face of recoverable corruption.
2. Dynamic data-structures to avoid hardcoding gratuitous
table limits so as to support the creation of GPT tables
of (as of yet) unspecified size.
3. Only allow kernel dumps to swap partitions to provide the
necessary anti-footshooting measures. Linux swap partitions
are allowed.
4. Complete dump of the GPT configuration, including labels.
5. Supports Byte Order Mark (U+FEFF) handling for big-endian,
little-endian and mixed-endian partition names.
state where sleeping on a sleep queue is not allowed. The facility
doesn't support recursion but uses a simple private per-thread flag
(TDP_NOSLEEPING). The sleepq_add() function will panic if the flag is
set and INVARIANTS is enabled.
- Use this new facility to replace the g_xup and g_xdown mutexes that were
(ab)used to achieve similar behavior.
- Disallow sleeping in interrupt threads when invoking interrupt handlers.
MFC after: 1 week
Reviewed by: phk
it is destroyed in GEOM, in addition to being removed from /dev.
Before this patch, if you applied a new MBR which deleted a slice,
the deleted slice would not be in /dev, but it would still appear
in kern.geom.conftxt and kern.geom.confxml, which would confused
the diskPartitionEditor in sysinstall.
Submitted by: pjd
Tested by: pjd, rodrigc
MFC after: 1 week
waiting for geom events to happen:
Instead of maintaining a count of outstanding events, simply look if
the queue is empty. Make sure to not remove events from the queue
until they are executed in order to not open a new race.
Much work by: pjd
Tested by: kris
MT6: yes, should be.
This way, the VINUMDRIVE class is loaded before the VINUM class,
but since geom does the tasting for newly arrived classes
last-in-first-out, the VINUM class tastes first.
This removes the need to call gv_parse_config() in the drive
taste path.
sizeof(struct g_eli_metadata) will return the exact number of bytes needed
for storing it on the disk.
Without this change GELI was unusable on amd64 (and probably other 64-bit
archs), because sizeof(struct g_eli_metadata) was greater than 512 bytes
and geli(8) was failing on assertion.
Reported by: Michael Reifenberger <mike@Reifenberger.com>
MFC after: 3 days
When a drive is newly created, it's state is initially set to 'down',
so it won't allow saving the config to it (thus it will never know of
itself being created). Work around this by adding a new flag, that's
also checked when saving the config to a drive.
Actually, one cannot setup root file system on RAID3 device, but when
other file system exist in /etc/fstab which are placed on RAID3 device,
boot process will be interrupted when these devices are missing.
MFC after: 3 days
X-MFC-note: MFC only to RELENG_6, as RELENG_5 doesn't have root_mount KPI.
the assumption that performance was more important that beancounter
quality statistics.
As it transpires the microoptimization is not measurable in the
real world and the inconsistent statistics confuse users, so revert
the decision.
MT6 candidate: possibly
MT5 candidate: possibly
It creates very huge provider (41PB) /dev/gzero.
On BIO_READ request it zero-fills bio_data and on BIO_WRITE it does nothing.
You can also set kern.geom.zero.clear sysctl to 0 to do nothing even for
BIO_READ.
I'm using it for performance testing where it is very helpful.
MFC after: 3 days
post an event to the geom event queue that will take care of it,
letting outstanding bios finish, and closing the consumers.
Plus some cosmetic clean ups.
resides on. Fix the special case of the filesystem fragment size not
evenly dividing the size of the provider. Fixing the general case
probably requires better superblock validation (left as an exercise to
the reader).
While we wait for holds to be released, print a list of who holds us
back once per second.
Use the new KPI from GEOM instead of vfs_mount.c calling g_waitidle().
Use the new KPI also from ata.
With ATAmkIII's newbusification, ata could narrowly miss the window
and ad0 would not exist when we tried to mount root.
in BSD and MBR classes, ie. if provider below us uses the same metadata,
don't create labels based on the metadata.
This allows to create labels on geoms with rank != 1 without hacks.
Tested by: Chris Elsworth <chris@shagged.org> on sparc64
OK'ed by: phk
MFC after: 2 weeks
completed I/O requests here.
- First allocate all needed bios, so if any of allocations fail, we can
free memory before sending any I/O requests down.
Reported by: Pawel Malachowski
MFC after: 3 days
seem to be necessary anymore, and it prevents tasting a valid drive
when booting with geom_vinum already loaded, since SCSI disks set their
sectorsize not until first opening them.
shared-last-sector problem.
After this change, even if there is more than one provider with the same
last sector, the proper one will be chosen based on its size.
It still doesn't fix the 'c' partition problem (when da0s1 can be confused
with da0s1c) and situation when 'a' partition starts at offset 0
(then da0s1a can be confused with da0s1 and da0s1c). One can use '-h'
option there, when creating device or avoid sharing last sector.
Actually, when providers share the same last sector and their size is equal,
they provide exactly the same data, so the name (da0s1, da0s1a, da0s1c)
isn't important at all.
- Provide backward compatibility.
- Update copyright's year.
MFC after: 1 week
the previous one failed and there are more than one plex in the volume.
This could have led to a flood of error messages on the console and
probably a deadlock in certain situations.
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
This flag means "wait for all pending requests before returning to userland".
There are pending events for sure, because we just created new provider and
other classes want to taste it, but we cannot answer on I/O requests until
we're here.
4 mutex operations per I/O requests.
- Use only one mutex to protect both (incoming and outgoing) queue.
As MUTEX_PROFILING(9) shows, there is no big contention for this lock.
- Protect sc_queue_count with queue mutex, instead of doing atomic
operations on it.
- Remove DROP_GIANT()/PICKUP_GIANT() - ggate is marked as MPSAFE and no
Giant there.
in BSD class, ie. if provider below us uses the same metadata, don't
create slices based on the metadata.
This allows to create slices on geoms with rank != 1 without hacks.
Discussed with: phk
Approved by: phk
MFC after: 2 weeks
the given providers. Without even one of the configured components there
should be no way to get the secret.
Supported by: WHEEL Sp. z o.o.
http://www.wheel.pl
(we ignore it).
- Remove code used for handling spoil events, as spoiling is not possible
anymore, because we keep consumers open for writing all the time.
MFC after: 4 days
After this change, when component is disconnected because of an I/O error,
it will not be connected and synchronized automatically, it will be logged
as broken and skipped. Autosynchronization can occur, when component is
disconnected (on orphan event) and connected again - there were no I/O
error, so there is no need to not connected the component, but when there were
writes while it wasn't connected, it will be synchronized.
This fix cases, when component is disconnected because of I/O error and can be
connected again and again.
- Bump version number.
- Implement backward compatibility mechanism. After this change when metadata in
old version is detected, it is automatically upgraded to the new (current)
version.
After this change, when component is disconnected because of an I/O error,
it will not be connected and synchronized automatically, it will be logged
as broken and skipped. Autosynchronization can occur, when component is
disconnected (on orphan event) and connected again - there were no I/O
error, so there is no need to not connected the component, but when there were
writes while it wasn't connected, it will be synchronized.
This fix cases, when component is disconnected because of I/O error and can be
connected again and again.
- Bump version number.
- Add version change history.
- Implement backward compatibility mechanism. After this change when metadata in
old version is detected, it is automatically upgraded to the new (current)
version.
while doing g_(read|write)_data() (e.g. BSD). This can cause a deadlock
in MIRROR class. Not sure if this is safe to drop the topology lock in BSD
class, so change the code in MIRROR class to avoid this deadlock.
Keeping consumers open when device is closed is very hard. We need to
open consumers sometimes to update metadata, etc.
Many hacks was introduced in the past to made it possible. You cannot
be sure that you can open consumer for writing always, even if you think
it should be allowed. If one of the mirror components is for example da0
and you try to open it, you can get EPERM when da0s1 is opened for reading
(because BSD class opens consumers (da0) with an extra 'e' bit set).
Waiting for the events queue to be empty may do the trick, but it makes
code much uglier (as you cannot always call g_waitidle()), it doesn't
solve all edge cases and it can introduce deadlocks if there are events
in the queue that wait for gmirror.
I removed those hacks. Now all consumers are open r1w1e1 always, even if
device is closed. Maybe it is less clean from GEOM perspective, but simpify
code a lot and make it much more reliable.
The only issue was retaste event which is sent when we close consumers
opened for writing. I ignore retaste event by not detaching consumer
immediately (so retaste event is not send to my class) and sending event
right after it to detach and destroy consumer.
anywhere in the DAG. This includes configurations that are not
allowed by the EFI specification.
o Reject a GPT partition table if it's not preceeded by a PMBR.
There's no need to preserve the MBR partitioning anymore as GPT
is mature and with the first bullet extending the applicability
of GPT, it's better to be a bit more strict.
because we know it then and we need it when inserting a component which
wasn't destroyed while device was running.
Reported by: Michael Handler <handler@grendel.net>
MFC after: 1 week
observations lead me to believe that the convetion for pc98 boot
loaders is to have a jump unstruction, followed by a string, followed
by code. The jump usually doesn't have a nop after it and usually the
string is NUL terminated, but Grub/98 breaks both of these rules.
# I looked for, but failed to find the Minux boot blocks for PC-9801 port.
512. If I had an audio cdrom in my cd player when I booted my system,
I'd get a panic from geom because you can't read 8192 bytes from an
audio cdrom.
Remove XXX comment about IPL1 and replace it with some information
from my soon to be published web page on the pc98 disk layout. The
IPL1 test was the result of an observation of a disk with FreeBSD's
boot0 program. It was testing part of an area what appears to be
reserved for a boot loader name, which comes after a jump over this
area. I don't yet know if it is required to be any specific jump
instruction, or if the destination has to be location 11. [1]
[1] FreeBSD Press No. 13, page 115, poorly translated by myself. The
picture there shows offset 8 as the destination of the jump, but
FreeBSD's boot0 program has three padding NULs after the IPL1 name and
uses a 16-bit 'jmp' instruction.
correct open/close behaviour of filesystems:
When an ioctl to modify the MBR arrives, we cannot take for granted that
we have the consumer open.
The symptom is that one cannot run 'boot0cfg -s2 /dev/ad0' in single-user
mode because / is the only open partition in only open r1w0e1.
If it is not, we attempt to increase the write count by one and
decrease it again afterwards.
Presumably most if not all other slices suffer from the same problem.
that the events queue is empty. In other case we're able to hit the race
where for example da0s1 is tasted by some other class, which means that
da0 is open with exclusive bit set, which means that we can't open da0
for writing if it is our component.
Reported by: Attila Nagy <bra@fsn.hu> (and somebody else sometime ago,
but I cannot find who it was)
seconds of idling, DRITY flags are removed.
- If mirror is in idle state or is not open for writing, sleep without
timeout when waiting for I/O requests.
- Don't use atomic operations, for now sysctls are protected by Giant.
- Update debugs.
- Fix for good (I hope) force-stopping mirrors and some filure cases
(e.g. the last good component dies when synchronization is in progress).
Don't use ->nstart/->nend consumer's fields, as this could be racy,
because those fields are used in g_down/g_up, use ->index consumer's
field instead for tracking number of not finished requests.
Reported by: marcel
- After 5 seconds of idle time (this should be configurable) mark all
dirty providers as clean, so when mirror is not used in 5 seconds
and there will be power failure, no synchronization on boot is needed.
Idea from: sorry, I can't find who suggested this
- When there are no ACTIVE components and no NEW components destroy whole
mirror, not only provider.
- Fix one debug to show information about I/O request, before we change
its command.
buf->b-dev.
Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.
Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.
This flag gets set whenever the thread posts an event on the GEOM
event queue, and if the flag is set when the thread is prepared
to return to userland from the kernel, g_waitidle() will be called
to make sure that the posted events have completed.
This can replace an insufficient number of g_waitidle() calls in
various other places, and has the advantage of being failsafe: Any
system call which does a VOP_OPEN()/VOP_CLOSE will now correctly
wait for any geom events it posted as part of spoils or tastes.
Assert that topology and Giant is not held in g_waitidle().
in the g_up and g_down threads. Each time a bio is propelled up and
down the stack, an event is generating showing the provider, offset,
and length, as well as thread wakeup and work status information.
providers for tasting. Before this hack, race below is possible:
SI_SUB_RAID (no not-fully-configured geoms, so don't block)
GEOM tasting (now geoms are created)
SI_SUB_MOUNT_ROOT (if root file system is placed on a mirror, it is
possible that this mirror is not fully configured yet)
There is a lot of work to do to avoid such hacks and I need a working
solution before 5.3, sorry.
Reported by: John Hay <jhay@icomtek.csir.co.za>
We have to use our own destroy_geom method, because default one, which
is a part of geom_slice is broken.
MT5 candidate.
PR: kern/72467
Submitted by: Vladimir Novoseltsev
by the time that kldload(8) returns. Satisfy that by making the GEOM
module load event -- only when the kernel is !cold -- wait until the
GEOM module init function has finished instead of returning immediately.
This is the other half of fixing md(8) (actually, "mfs" in fstab(5))
that is similar to r1.128 of src/sys/dev/md/md.c. This bug would be
why RAM disks would often fail on boot and the first call to mdconfig(8)
would probably fail.
pjd has ideas for not requiring kldload(8) to work synchronously for
control devices that could make this obsolete.
Silence on: -arch