originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.
Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division
SPC-4 specification states that serial number may be property of device,
but not a specific logical unit. People reported about FC storages using
serial number in that way, making it unusable for purposes of LUN multipath
detection. SPC-4 states that designators associated with logical unit from
the VPD page 83h "Device Identification" should be used for that purpose.
Report first of them in the new attribute in such preference order: NAA,
EUI-64, T10 and SCSI name string.
While there, make GEOM DISK properly report GEOM::ident in XML output also
using d_getattr() method, if available. This fixes serial numbers reporting
for SCSI disks in `geom disk list` output and confxml.
Discussed with: gibbs, ken
Sponsored by: iXsystems, Inc.
MFC after: 2 weeks
This allows setting attributes on tables. One simply does not provide
an index in that case. Otherwise the entry corresponding the index has
the attribute set or unset.
Use this change to fix a relatively longstanding bug in our GPT scheme
that's the result of rev 198097 (relatively harmless) followed by rev
237057 (damaging). The damaging part being that our GPT scheme always
has the active flag set on the PMBR slice. This is in violation with
EFI. Existing EFI implementions for both x86 and ia64 reject the GPT.
As such, GPT disks created by us aren't usable under EFI because of
that.
After this change, GPT disks never have the active flag set on the PMBR
slice. In order to make the GPT disk bootable under some x86 BIOSes,
the reason of rev 198097, one must now set the active attribute on the
gpt table. The kernel will apply this to the PMBR slice For (S)ATA:
gpart set -a active ada0
To fix an existing GPT disk that has the active flag set in the PMBR,
and that does not need the flag, use (again for (S)ATA):
gpart unset -a active ada0
The EBR, MBR & PC98 schemes, which also impement at least 1 attribute,
now check to make sure the entry passed is valid. They do not have
attributes that apply to the table.
used previously caused probe failure on platforms where char is unsigned
(e.g. ARM), as mftrecsz can be negative.
Submitted by: Ilya Bakulin <ilya@bakulin.de>
MFC after: 2 weeks
requests.
sys/geom/geom_disk.h:
- Added d_delmaxsize which represents the maximum size of individual
device delete requests in bytes. This can be used by devices to
inform geom of their size limitations regarding delete operations
which are generally different from the read / write limits as data
is not usually transferred from the host to physical device.
sys/geom/geom_disk.c:
- Use new d_delmaxsize to calculate the size of chunks passed through to
the underlying strategy during deletes instead of using read / write
optimised values. This defaults to d_maxsize if unset (0).
- Moved d_maxsize default up so it can be used to default d_delmaxsize
sys/cam/ata/ata_da.c:
- Added d_delmaxsize calculations for TRIM and CFA
sys/cam/scsi/scsi_da.c:
- Added re-calculation of d_delmaxsize whenever delete_method is set.
- Added kern.cam.da.X.delete_max sysctl which allows the max size for
delete requests to be limited. This is useful in preventing timeouts
on devices who's delete methods are slow. It should be noted that
this limit is reset then the device delete method is changed and
that it can only be lowered not increased from the device max.
Reviewed by: mav
Approved by: pjd (mentor)
size of a delete request sent to the providing device performed by g_dev_ioctl.
This allows the kernel and apps via ioctl e.g. newfs -E to request large LBA
deletes which siginificantly improves performance.
Previously this was hard coded to 65536 sectors, the new default is 262144
which doubles the throughput of deletes on commonly available SSD's.
In tests on a Intel 520 120GB FW: 400i disk it improved the delete throughput
from 1.6GB/s to over 2.6GB/s on a full disk delete such as that done via
newfs -E
For some SSD's where delete time is pretty much constant, no matter what
the request, setting this to 0 will provide significantly better throughput
e.g. Samsung 840 240GB FW DXT07B0Q @ 262144 = 79G/s, @ 0 = 2259G/s
Reviewed by: mav
Approved by: pjd (mentor)
MFC after: 2 weeks
Pointy-hat to: me, for not realizing snprintf() is available in kernel.
Thanks to: jh, for bringing me the good news of snprintf(), Pawel Worach, for
noting that the panic can be provoked in i386 and not in amd64
Only look for FDT partitions if our potential parent is a DISK device.
Excluding direct recursion on the flashmap geoms was insufficient
because it did not prevent the underlying device from being retrieved if
flashmap geoms were further partitioned.
Reviewed by: imp
Sponsored by: DARPA, AFRL
implementation, error on the side of conservatism and only create labels
for GEOMs of classes DISK and MULTIPATH.
Discussed with: trasz
Approved by: silence from freebsd-geom@
In physio, check if device can handle unmapped IO and pass an
appropriately mapped buffer to the driver strategy routine. The
only driver in the tree that can handle unmapped buffers is one
exposed by GEOM, so mark it as such with the new flag in the
driver cdevsw structure.
This fixes insta-panics on hosts, running dconschat, as /dev/fwmem
is an example of the driver that makes use of physio routine, but
bypasses the g_down thread, where the buffer gets mapped normally.
Discussed with: kib (earlier version)
- Replace single done mutex with per-disk ones. On system with several
disks on several HBAs that removes small, but measurable lock congestion.
- Modify disk destruction process to not destroy the mutex prematurely.
- Remove some extra pointer derefences.
Use destroy_dev_sched_cb() to not wait for device destruction while holding
GEOM topology lock (that actually caused deadlock). Use request counting
protected by mutex to properly wait for outstanding requests completion in
cases of device closing and geom destruction. Unlike r227009, this code
does not block taskqueue thread for indefinite time, waiting for completion.
more topology change done that may require its attention. Add few missing
g_do_wither() calls in respective places to signal it.
This fixes potential infinite loop here when some provider is withered, but
still opened or connected for some reason and so can not be destroyed. For
example, see r227009 and r227510.
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
of upgrading older machines using ataraid(4) to newer releases.
This optional parameter is controlled via kern.geom.raid.legacy_aliases
and will create a /dev/ar0 device that will point at /dev/raid/r0 for
example.
Tested on Dell SC 1425 DDF-1 format software raid controllers installing from
stable/7 and upgrading to stable/9 without having to adjust /etc/fstab
Reviewed by: mav
Obtained from: Yahoo!
MFC after: 2 Weeks
This will avoid a 0-byte read (in g_read_data()) leading to a panic, if
previously read data are erroneous.
Suggested by: John-Mark Gurney <jmg@funkthat.com>
Without this, read data is mis-interpreted. This could trigger a panic,
as was the case on one computer where computed "recsize" was zero,
leading to a call to g_read_page() asking for 0 bytes.
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceeded it before any future blocks may be written to the drive.
Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).
Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.
Reviewed by: kib
Tested by: Peter Holm
as clean on shutdown and move that action from shutdown_pre_sync stage to
shutdown_post_sync to avoid extra flapping.
ZFS tends to not close devices on shutdown, that doesn't allow GEOM RAID
to shutdown gracefully. To handle that, mark volume as clean just when
shutdown time comes and there are no active writes.
MFC after: 2 weeks
as clean on shutdown and move that action from shutdown_pre_sync stage to
shutdown_post_sync to avoid extra flapping.
ZFS tends to not close devices on shutdown, that doesn't allow GEOM RAID
to shutdown gracefully. To handle that, mark volume as clean just when
shutdown time comes and there are no active writes.
PR: kern/113957
MFC after: 2 weeks
unsupported metadata types like Intel Smart Response to not corrupt them.
- Improve setting of these things during metadata writing to protect from
incapable BIOS'es and other implementations.
disks should be rebuilt. Our rebuild code is same time disk-centric. To
handle this situation properly check all disks for RBLD flags, and if no
disk specified try rebuild/resync all of them except newly inserted.
Windows driver uses such migration when it creates new arrays. While GEOM
RAID has no mechanism to implement migration in general case, this specifc
case still can be handled easily via degraded RAID1 creation followed by
regular rebuild.
It is alike to RAID1, but with dedicating master and recovery disks and
providing manual control over synchronization. It allows to use recovery
disk as snapshot of the master disk from the time of the last sync.
This implementation is not functionaly complete comparing to Windows,
but it is better then silent conversion to RAID1 on first boot.
'"'. Mangling is only done for label names read from file system
metadata. Encoding resembles URL encoding. For example, the space
character becomes %20.
Help by: kib
Discussed with: imp, kib, pjd
extended using growfs(8). The problem here is that geom_label checks if
the filesystem size recorded in UFS superblock is equal to the provider
(i.e. device) size. This check cannot be removed due to backward
compatibility. On the other hand, in most cases growfs(8) cannot set
fs_size in the superblock to match the provider size, because, differently
from newfs(8), it cannot recompute cylinder group sizes.
To fix this problem, add another superblock field, fs_providersize, used
only for this purpose. The geom_label(4) will attach if either fs_size
(filesystem created with newfs(8)) or fs_providersize (filesystem expanded
using growfs(8)) matches the device size.
PR: kern/165962
Reviewed by: mckusick
Sponsored by: FreeBSD Foundation
Alike to BIO_WRITE, report success if at least one subdisk succeeded with
BIO_DELETE. But unlike BIO_WRITE don't fail disk on BIO_DELETE error.
Sponsored by: iXsystems, Inc.
MFC after: 1 month
If at least one subdisk in the volume supports it, BIO_DELETE requests
will be propagated down. Unfortunatelly, for RAID levels with redundancy
unmapped blocks will be mapped back during first rebuild/resync process.
Sponsored by: iXsystems, Inc.
MFC after: 1 month
and move that action from shutdown_pre_sync to shutdown_post_sync stage
to avoid extra flapping.
ZFS tends to not close devices on shutdown, that doesn't allow GEOM RAID
to shutdown gracefully. To handle that, mark volume as clean just when
shutdown time comes and there are no active writes.
MFC after: 2 weeks
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
GIANT from VFS. This code is particulary broken and fragile and other
in-kernel implementations around, found in other operating systems,
don't really seem clean and solid enough to be imported at all.
If someone wants to reconsider in-kernel NTFS implementation for
inclusion again, a fair effort for completely fixing and cleaning it
up is expected.
In the while NTFS regular users can use FUSE interface and ntfs-3g
port to work with their NTFS partitions.
This is not targeted for MFC.
provider name to be specified instead of geom name (first argument in all
subcommands except label). In most cases there is only one array used
any way, so it is not really useful to make user type ugly geom names like
Intel-f0bdf223 or SiI-732c2b9448cf. Though they can be used in some cases.
Sponsored by: iXsystems, Inc.
MFC after: 1 month
mutexes held and the topology lock is an sx lock.
The topology lock was there to protect traversing through the list of providers
of disk's geom, but it seems that disk's geom has always exactly one provider.
Change the code to call g_wither_provider() for this one provider, which is
safe to do without holding the topology lock and assert that there is indeed
only one provider.
Discussed with: ken
MFC after: 1 week
It is possible that provider is destroyed while we are iterating over the
list.
Reported by: Brian Parkison <parkison@panzura.com>
Discussed with: phk
MFC after: 1 week
bytes syncronized.
The rationale behind this is the following: for large disks the
percent synchronisation counter ticks too seldom, and monitoring
software (as well as human operator) can't tell whether
synchronisation goes on or one of disks got stuck. On an idle
server one can look into gstat and see whether synchronisation goes
on or not, but on a busy server that won't work. Also, new value
monitored can be differentiated obtaining the synchronisation speed
quite precisely.
Submitted by: Konstantin Kukushkin <dark ramtel.ru>
Reviewed by: pjd
If GELI provider was created on FreeBSD HEAD r238116 or later (but before this
change), it is using very weak keys and the data is not protected.
The bug was introduced on 4th July 2012.
One can verify if its provider was created with weak keys by running:
# geli dump <provider> | grep version
If the version is 7 and the system didn't include this fix when provider was
initialized, then the data has to be backed up, underlying provider overwritten
with random data, system upgraded and provider recreated.
Reported by: Fabian Keil <fk@fabiankeil.de>
Tested by: Fabian Keil <fk@fabiankeil.de>
Discussed with: so
MFC after: 3 days
This fixes "Negative sc_ref" panic possible when sysctl_kern_geom_confxml()
is run simultaneously with destroying GATE device.
Reviewed by: pjd
MFC after: 3 days
This change triggered interesting foot shooting condition in GEOM when
RW access to root partition by fsck spoils VFS geom there, which has it
opened RO at the same time. Seems spoiling concept needs some rework.
It includes three parts:
1) Modifications to CAM to detect media media changes and report them to
disk(9) layer. For modern SATA (and potentially UAS) devices it utilizes
Asynchronous Notification mechanism to receive events from hardware.
Active polling with TEST UNIT READY commands with 3 seconds period is used
for incapable hardware. After that both CD and DA drivers work the same way,
detecting two conditions: "NOT READY: Medium not present" after medium was
detected previously, and "UNIT ATTENTION: Not ready to ready change, medium
may have changed". First one reported to disk(9) as media removal, second
as media insert/change. To reliably receive second event new
AC_UNIT_ATTENTION async added to make UAs broadcasted to all periphs by
generic error handling code in cam_periph_error().
2) Modifications to GEOM core to handle media remove and change events.
Media removal handled by spoiling all consumers attached to the provider.
Media change event also schedules provider retaste after spoiling to probe
new media. New flag G_CF_ORPHAN was added to consumers to reflect that
consumer is in process of destruction. It allows retaste to create new
geom instance of the same class, while previous one is still dying.
3) Modifications to some GEOM classes: DEV -- to report media change
events to devd; VFS -- to handle spoiling same as orphan to prevent
accessing replaced media. PART class already handles spoiling alike to
orphan.
Reviewed by: silence on geom@ and scsi@
Tested by: avg
Sponsored by: iXsystems, Inc. / PC-BSD
MFC after: 2 months
is an error set on the provider. With GEOM resizing, class can become
orphaned when it doesn't implement resize() method and the provider size
decreases.
Reviewed by: mav
Sponsored by: FreeBSD Foundation
This will allow HAST to read directly from the local component without
even communicating userland daemon.
Sponsored by: Panzura, http://www.panzura.com
MFC after: 1 month
Before this change the IV-Key was used to generate encryption keys,
which was incorrect, but safe - for the XTS mode this key was unused
anyway and for CBC mode it was used differently to generate IV
vectors, so there is no risk that IV vector collides with encryption
key somehow.
Bump version number and keep compatibility for older versions.
MFC after: 2 weeks
we need to pass BIO_DELETE requests down to providers that support
it. Also, we need to announce our support for BIO_DELETE to upper
consumer. This requires:
- In g_mirror_start() return true for "GEOM::candelete" request.
- In g_mirror_init_disk() probe below provider for "GEOM::candelete"
attribute, and mark disk with a flag if it does support BIO_DELETE.
- In g_mirror_register_request() distribute BIO_DELETE requests only
to those disks, that do support it.
Note that we announce "GEOM::candelete" as true unconditionally of
whether we have TRIM-capable media down below or not. This is made
intentionally, because upper consumer (usually UFS) requests the
attribite only once at mount time. And if user ever migrates his
mirror from HDDs to SSDs, then he/she would get TRIM working without
remounting filesystem.
Reviewed by: pjd
a da(4) instance going away while GEOM is still probing it.
In this case, the GEOM disk class instance has been created by
disk_create(), and the taste of the disk is queued in the GEOM
event queue.
While that event is queued, the da(4) instance goes away. When the
open call comes into the da(4) driver, it dereferences the freed
(but non-NULL) peripheral pointer provided by GEOM, which results
in a panic.
The solution is to add a callback to the GEOM disk code that is
called when all of its resources are cleaned up. This is
implemented inside GEOM by adding an optional callback that is
called when all consumers have detached from a provider, and the
provider is about to be deleted.
scsi_cd.c,
scsi_da.c: In the register routine for the cd(4) and da(4)
routines, acquire a reference to the CAM peripheral
instance just before we call disk_create().
Use the new GEOM disk d_gone() callback to register
a callback (dadiskgonecb()/cddiskgonecb()) that
decrements the peripheral reference count once GEOM
has finished cleaning up its resources.
In the cd(4) driver, clean up open and close
behavior slightly. GEOM makes sure we only get one
open() and one close call, so there is no need to
set an open flag and decrement the reference count
if we are not the first open.
In the cd(4) driver, use cam_periph_release_locked()
in a couple of error scenarios to avoid extra mutex
calls.
geom.h: Add a new, optional, providergone callback that
is called when a provider is about to be deleted.
geom_disk.h: Add a new d_gone() callback to the GEOM disk
interface.
Bump the DISK_VERSION to version 2. This probably
should have been done after a couple of previous
changes, especially the addition of the d_getattr()
callback.
geom_disk.c: Add a providergone callback for the disk class,
g_disk_providergone(), that calls the user's
d_gone() callback if it exists.
Bump the DISK_VERSION to 2.
geom_subr.c: In g_destroy_provider(), call the providergone
callback if it has been provided.
In g_new_geomf(), propagate the class's
providergone callback to the new geom instance.
blkfront.c: Callers of disk_create() are supposed to pass in
DISK_VERSION, not an explicit disk API version
number. Update the blkfront driver to do that.
disk.9: Update the disk(9) man page to include information
on the new d_gone() callback, as well as the
previously added d_getattr() callback, d_descr
field, and HBA PCI ID fields.
MFC after: 5 days
Without it, it fails to create labels for filesystems resized by
growfs(8).
PR: kern/165962
Submitted by: Olivier Cochard-Labbe <olivier at cochard dot me>
into partitions.
Partitions are created based on data in dts file which are
extracted and interpreted by slicer.
Obtained from: Semihalf
Supported by: FreeBSD Foundation, Juniper Networks
failed while write to some other succeeded. Instead mark disk as failed.
- Make RAID1E less aggressive in failing disks to avoid volume breakage.
MFC after: 2 weeks
defined by the SNIA Common RAID Disk Data Format Specification v2.0.
Supports multiple volumes per array and multiple partitions per disk.
Supports standard big-endian and Adaptec's little-endian byte ordering.
Supports all single-layer RAID levels. Dual-layer RAID levels except
RAID10 are not supported now because of GEOM RAID design limitations.
Some work is still to be done, but the present code already manages basic
interoperation with RAID BIOS of the Adaptec 1430SA SATA RAID controller.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
- Implement "configure" command to allow switching operation mode of
running device on-fly without destroying and recreation.
- Implement Active/Read mode as hybrid of Active/Active and Active/Passive.
In this mode all paths not marked FAIL may handle reads same time,
but unlike Active/Active only one path handles write requests at any
point in time. It allows to closer follow original write request order
if above layers need it for data consistency (not waiting for requisite
write completion before sending dependent write).
- Hide duplicate messages about device status change.
- Remove periodic thread wake up with 10Hz rate.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
accounting for I/O counts at completion of I/O operation. Also switch
from using global devmtx to vnode mutex to reduce contention.
Suggested and reviewed by: kib