a da(4) instance going away while GEOM is still probing it.
In this case, the GEOM disk class instance has been created by
disk_create(), and the taste of the disk is queued in the GEOM
event queue.
While that event is queued, the da(4) instance goes away. When the
open call comes into the da(4) driver, it dereferences the freed
(but non-NULL) peripheral pointer provided by GEOM, which results
in a panic.
The solution is to add a callback to the GEOM disk code that is
called when all of its resources are cleaned up. This is
implemented inside GEOM by adding an optional callback that is
called when all consumers have detached from a provider, and the
provider is about to be deleted.
scsi_cd.c,
scsi_da.c: In the register routine for the cd(4) and da(4)
routines, acquire a reference to the CAM peripheral
instance just before we call disk_create().
Use the new GEOM disk d_gone() callback to register
a callback (dadiskgonecb()/cddiskgonecb()) that
decrements the peripheral reference count once GEOM
has finished cleaning up its resources.
In the cd(4) driver, clean up open and close
behavior slightly. GEOM makes sure we only get one
open() and one close call, so there is no need to
set an open flag and decrement the reference count
if we are not the first open.
In the cd(4) driver, use cam_periph_release_locked()
in a couple of error scenarios to avoid extra mutex
calls.
geom.h: Add a new, optional, providergone callback that
is called when a provider is about to be deleted.
geom_disk.h: Add a new d_gone() callback to the GEOM disk
interface.
Bump the DISK_VERSION to version 2. This probably
should have been done after a couple of previous
changes, especially the addition of the d_getattr()
callback.
geom_disk.c: Add a providergone callback for the disk class,
g_disk_providergone(), that calls the user's
d_gone() callback if it exists.
Bump the DISK_VERSION to 2.
geom_subr.c: In g_destroy_provider(), call the providergone
callback if it has been provided.
In g_new_geomf(), propagate the class's
providergone callback to the new geom instance.
blkfront.c: Callers of disk_create() are supposed to pass in
DISK_VERSION, not an explicit disk API version
number. Update the blkfront driver to do that.
disk.9: Update the disk(9) man page to include information
on the new d_gone() callback, as well as the
previously added d_getattr() callback, d_descr
field, and HBA PCI ID fields.
MFC after: 5 days
Without it, it fails to create labels for filesystems resized by
growfs(8).
PR: kern/165962
Submitted by: Olivier Cochard-Labbe <olivier at cochard dot me>
into partitions.
Partitions are created based on data in dts file which are
extracted and interpreted by slicer.
Obtained from: Semihalf
Supported by: FreeBSD Foundation, Juniper Networks
failed while write to some other succeeded. Instead mark disk as failed.
- Make RAID1E less aggressive in failing disks to avoid volume breakage.
MFC after: 2 weeks
defined by the SNIA Common RAID Disk Data Format Specification v2.0.
Supports multiple volumes per array and multiple partitions per disk.
Supports standard big-endian and Adaptec's little-endian byte ordering.
Supports all single-layer RAID levels. Dual-layer RAID levels except
RAID10 are not supported now because of GEOM RAID design limitations.
Some work is still to be done, but the present code already manages basic
interoperation with RAID BIOS of the Adaptec 1430SA SATA RAID controller.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
- Implement "configure" command to allow switching operation mode of
running device on-fly without destroying and recreation.
- Implement Active/Read mode as hybrid of Active/Active and Active/Passive.
In this mode all paths not marked FAIL may handle reads same time,
but unlike Active/Active only one path handles write requests at any
point in time. It allows to closer follow original write request order
if above layers need it for data consistency (not waiting for requisite
write completion before sending dependent write).
- Hide duplicate messages about device status change.
- Remove periodic thread wake up with 10Hz rate.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
accounting for I/O counts at completion of I/O operation. Also switch
from using global devmtx to vnode mutex to reduce contention.
Suggested and reviewed by: kib
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.
Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.
This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.
Reviewed by: kib
KLD is preloaded with loader(8) and leads to infinity loops.
Also do not return EEXIST error code from MOD_LOAD handler, because
we have undocumented(?) ability replace kernel's module with preloaded one.
And if we have so, then preloaded module will be initialized first.
Thus error in MOD_LOAD handler will be triggered for the kernel.
PR: kern/165573
MFC after: 3 weeks
scheme. The LDM is a logical volume manager for MS Windows NT and it
is also known as dynamic volumes. It supports about 2000 partitions
and also provides the capability for software RAID implementations.
This version implements only partitioning scheme capability and based
on the linux-ntfs project documentation and several publications across
the Web. NOTE: JBOD, RAID0 and RAID5 volumes aren't supported.
An access to the LDM metadata is read-only. When LDM is on the disk
partitioned with MBR we can also destroy metadata. For the GPT
partitioned disks destroy action is not supported.
Reviewed by: ivoras (previous version)
MFC after: 1 month
It's not clear to a user what they should do after seeing the "geometry
does not match label" kernel message, and it does not appear to present
a problem in practice. Thus, just remove the messages.
Approved by: marcel
types 0x05 and 0x0f, but 0x05 is preferred and used when partition is
created with "gpart add -t ebr ...".
This should keep EBR partitions accessible after r231754 for those,
who have EBR on the partition with type 0x0f.
don't try probe and create EBR scheme when parent partition type
is not "ebr". This fixes error messages about corrupted EBR for
some partitions where is actually another partition scheme.
NOTE: if you have EBR on the partition with different than "ebr"
(0x05) type, then you will lost access to partitions until it will be
changed.
MFC after: 2 weeks
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.
Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to td_*locks members as
they will be accessed mostly in the same codepaths as td_stopsched and
this results in avoiding a further cache-line pollution, possibly.
STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to
take advantage of already cached curthread, but in the end there should
not really be a performance benefit, while introducing a KPI breakage.
In collabouration with: flo
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424
as the system dump device. This was already allowed for GPT. The Linux
swap metadata at the beginning of the partition should not be disturbed
because the crash dump is written at the end.
Reviewed by: alfred, pjd, marcel
MFC after: 2 weeks
have reserved free space in the APM area.
Also instead of one write request per each APM entry, use MAXPHY
sized writes when we are updating APM.
MFC after: 1 month
Before r215687, if some withered geom or provider could not be destroyed,
g_event thread went to sleep for 0.1s before retrying. After that change
it is just restarting immediately. r227009 made orphaned (withered) provider
to not detach immediately, but only after context switch. That made loop
inside g_event thread infinite on UP systems without PREEMPTION.
To address original problem with possible dead lock addressed by r227009
we have to fix r215687 change first, that needs some time to think and test.
- Improved locking and destruction process to fix crashes.
- Improved "automatic" configuration method to make it consistent and safe
by reading metadata back from all specified paths after writing to one.
- Added provider size check to reduce chance of ordering conflict with
other GEOM classes.
- Added "manual" configuration method without using on-disk metadata.
- Added "add" and "remove" commands to allow manage paths manually.
- Failed paths are no longer dropped from geom, but only marked as FAIL
and excluded from I/O operations.
- Automatically restore failed paths when all others paths are marked
as failed, for example, because of device-caused (not transport) errors.
- Added "fail" and "restore" commands to manually control FAIL flag.
- geom is now destroyed on last path disconnection.
- Added optional Active/Active mode support. Unlike Active/Passive
mode, load evenly distributed between all working paths. If supported by
the device, it allows to significantly improve performance, utilizing
bandwidth of all paths. It is controlled by -A option during creation.
Disabled by default now.
- Improved `status` and `list` commands output.
Sponsored by: iXsystems, inc.
MFC after: 1 month
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
- delay consumer closing and detaching on orphan() until all I/Os complete;
- prevent new I/Os submission after orphan() called.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
instead of destroy_dev(). It moves device destruction waiting out of the
topology lock and so fixes dead lock between orphanization and closing.
Real provider and geom destruction called from swi context after device
destroyed as callback of the destroy_dev_sched_cb().
Do not close/destroy opened consumer directly in case of disconnect. Instead
keep it existing until it will be closed in regular way in response to
upstream provider destruction. Delay geom destruction in the same way.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
Do not close/destroy opened consumer directly in case of disconnect. Instead
keep it existing until it will be closed in regular way in response to
upstream provider destruction. Delay geom destruction in the same way.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
r162200 delays provider orphanization until all running requests complete,
to workaround broken orphan() method implementation in some classes.
r215687 removes persistent periodic (10Hz) event thread wake ups.
Together these changes can indefinitely delay orphanization until some
other event wake up the event thread. One consequence of this is inability
of CAM to destroy device disconnected when busy and, as consequence, create
new one after reconnection.
While the best solution would be to revert r162200, it is not easy, as
some classes still look broken in that way. Instead conditionally wake up
event thread if there are some providers waiting for orphanization.
MFC after: 1 week
providers and consumers will be destroyed. Before take some actions
with a geom, check that it is not destroyed at the moment.
Tested by: nwhitehorn
MFC after: 1 week
start only one worker thread. For software crypto it will start by default
N worker threads where N is the number of available CPUs.
This is not optimal if hardware crypto is AES-NI, which uses CPU for AES
calculations.
Change that to always start one worker thread for every available CPU.
Number of worker threads per GELI provider can be easly reduced with
kern.geom.eli.threads sysctl/tunable and even for software crypto it
should be reduced when using more providers.
While here, when number of threads exceeds number of CPUs avilable don't
reduce this number, assume the user knows what he is doing.
Reported by: Yuri Karaban <dev@dev97.com>
MFC after: 3 days
- add support for volumes above 2TiB with Promise metadata format;
- enforse and document other limitations:
- Intel and Promise metadata formats do not support disks above 2TiB;
- NVIDIA metadata format does not support volumes above 2TiB.
Sponsored by: iXsystems, Inc.
MFC after: 2 weeks
with older FreeBSD versions:
- Add -V option to 'geli init' to specify version number. If no -V is given
the most recent version is used.
- If -V is given don't allow to use features not supported by this version.
- Print version in 'geli list' output.
- Update manual page and add table describing which GELI version is
supported by which FreeBSD version, so one can use it when preparing GELI
device for older FreeBSD version.
Inspired by: Garrett Cooper <yanegomi@gmail.com>
MFC after: 3 days
o Detect when Boot Camp is enabled (i.e. the MBR mirrors the GPT).
o When Boot Camp is enabled, update the MBR whenever we write the GPT.
o Creation of a Boot Camp enabled GPT is not supported.
o Automatically disable Boot Camp when the GPT has been changed so that
there's either no EFI partition or no HFS+ partition.
o The first 4 partitions (by index) get mirrored in the MBR.
Requested by, discussed with and tested by: kris@pcbsd.org
MFC after: 1 week
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.
Discussed with: Jeff Roberson
This fixes the problem, when the secondary GPT header is not erased when
partition table destroyed. Move equal operations from g_part_gpt_create
and g_part_gpt_recover to the separate function g_gpt_set_defaults.
Reported by: dwhite
MFC after: 1 week
Remove message about non empty bootcode, we can not break something
while GEOM_PART_EBR_COMPAT is defined.
But without GEOM_PART_EBR_COMPAT any changes in EBR are allowed and we
can accidentally wipe the boot code. To do not break anything save
the first EBR chunk and keep it untouched each time when we are
changing EBR. Note that we are still not support boot code for EBR.
PR: kern/141235
MFC after: 1 month
disk drive. The boot0cfg(8) utility preserves these 4 bytes when is
writing bootcode to keep a multiboot ability.
Change gpart's bootcode method to keep DSN if it is not zero. Also
do not allow writing bootcode with size not equal to MBRSIZE.
PR: kern/157819
Tested by: Eir Nym
MFC after: 1 month
Since the only parameter that we check is size of bootcode, then
allow only two sizes: size of boot1 and size of /boot/boot.
This partially protects users from losing ability to boot if incorrect
bootcode is specified.
Requested by: ru
DEVFS, and make it accessible via the diskinfo utility.
Extend GEOM's generic attribute query mechanism into generic disk consumers.
sys/geom/geom_disk.c:
sys/geom/geom_disk.h:
sys/cam/scsi/scsi_da.c:
sys/cam/ata/ata_da.c:
- Allow disk providers to implement a new method which can override
the default BIO_GETATTR response, d_getattr(struct bio *). This
function returns -1 if not handled, otherwise it returns 0 or an
errno to be passed to g_io_deliver().
sys/cam/scsi/scsi_da.c:
sys/cam/ata/ata_da.c:
- Don't copy the serial number to dp->d_ident anymore, as the CAM XPT
is now responsible for returning this information via
d_getattr()->(a)dagetattr()->xpt_getatr().
sys/geom/geom_dev.c:
- Implement a new ioctl, DIOCGPHYSPATH, which returns the GEOM
attribute "GEOM::physpath", if possible. If the attribute request
returns a zero-length string, ENOENT is returned.
usr.sbin/diskinfo/diskinfo.c:
- If the DIOCGPHYSPATH ioctl is successful, report physical path
data when diskinfo is executed with the '-v' option.
Submitted by: will
Reviewed by: gibbs
Sponsored by: Spectra Logic Corporation
Add generic attribute change notification support to GEOM.
sys/sys/geom/geom.h:
Add a new attrchanged method field to both g_class
and g_geom.
sys/sys/geom/geom.h:
sys/geom/geom_event.c:
- Provide the g_attr_changed() function that providers
can use to advertise attribute changes.
- Perform delivery of attribute change notifications
from a thread context via the standard GEOM event
mechanism.
sys/geom/geom_subr.c:
Inherit the attrchanged method from class to geom (class instance).
sys/geom/geom_disk.c:
Provide disk_attr_changed() to provide g_attr_changed() access
to consumers of the disk API.
sys/cam/scsi/scsi_pass.c:
sys/cam/scsi/scsi_da.c:
sys/geom/geom_dev.c:
sys/geom/geom_disk.c:
Use attribute changed events to track updates to physical path
information.
sys/cam/scsi/scsi_da.c:
Add AC_ADVINFO_CHANGED to the registered asynchronous CAM
events for this driver. When this event occurs, and
the updated buffer type references our physical path
attribute, emit a GEOM attribute changed event via the
disk_attr_changed() API.
sys/cam/scsi/scsi_pass.c:
Add AC_ADVINFO_CHANGED to the registered asynchronous CAM
events for this driver. When this event occurs, update
the physical patch devfs alias for this pass instance.
Submitted by: gibbs
Sponsored by: Spectra Logic Corporation
They are media-dependent and may change in run-time, same as sectorsize
and/or mediasize.
SCSI devices return physical sector size and offset via READ CAPACITY(16)
command and so can not report it until media inserted or at least until
probe sequence completed. UNMAP support is also reported there.
geometry and partitions may start from withing the first track.
If we found such partitions, then do not reserve space of the
first track, only first sector.