Commit Graph

1683 Commits

Author SHA1 Message Date
mav
fdb56713f2 Add to GEOM RAID class module for reading non-degraded RAID5 volumes and
some environment to differentiate 4 possible RAID5 on-disk layouts.

Tested with Intel and AMD RAID BIOSes.

MFC after:	2 weeks
2012-04-19 12:30:12 +00:00
marck
a91354cc7e VMware environments are not unusual now. Add VMware partitions recognition
(both MBR for ESXi <= 4.1 and GPT for ESXi 5) to g_part.

Reviewed by:	ae
Approved by:	ae
MFC after:	2 weeks
2012-04-18 11:59:03 +00:00
mav
91acf8bc72 Some improvements to GEOM MULTIPATH:
- Implement "configure" command to allow switching operation mode of
running device on-fly without destroying and recreation.
 - Implement Active/Read mode as hybrid of Active/Active and Active/Passive.
In this mode all paths not marked FAIL may handle reads same time,
but unlike Active/Active only one path handles write requests at any
point in time. It allows to closer follow original write request order
if above layers need it for data consistency (not waiting for requisite
write completion before sending dependent write).
 - Hide duplicate messages about device status change.
 - Remove periodic thread wake up with 10Hz rate.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2012-04-18 09:42:14 +00:00
mckusick
614fd4fe5e Expand locking around identification of filesystem mount point when
accounting for I/O counts at completion of I/O operation. Also switch
from using global devmtx to vnode mutex to reduce contention.

Suggested and reviewed by: kib
2012-04-08 06:20:21 +00:00
ae
0c8f817a74 VMDB offset should be greater than logical volume size only for MBR. 2012-03-29 07:29:27 +00:00
ae
c982232271 Do proper cleanup for the GPT case when an error occurs. 2012-03-29 06:37:02 +00:00
mckusick
9a7982e5a0 Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib
2012-03-28 20:49:11 +00:00
ae
7232ff0e78 Check that scheme is not already registered. This may happens when a
KLD is preloaded with loader(8) and leads to infinity loops.

Also do not return EEXIST error code from MOD_LOAD handler, because
we have undocumented(?) ability replace kernel's module with preloaded one.
And if we have so, then preloaded module will be initialized first.
Thus error in MOD_LOAD handler will be triggered for the kernel.

PR:		kern/165573
MFC after:	3 weeks
2012-03-23 07:26:17 +00:00
ae
7049ead2d4 Add CTLFLAG_TUN to sysctls.
MFC after:	1 month
2012-03-19 13:21:10 +00:00
ae
63a3c6125e Add new GEOM_PART_LDM module that implements the Logical Disk Manager
scheme. The LDM is a logical volume manager for MS Windows NT and it
is also known as dynamic volumes. It supports about 2000 partitions
and also provides the capability for software RAID implementations.

This version implements only partitioning scheme capability and based
on the linux-ntfs project documentation and several publications across
the Web. NOTE: JBOD, RAID0 and RAID5 volumes aren't supported.

An access to the LDM metadata is read-only. When LDM is on the disk
partitioned with MBR we can also destroy metadata. For the GPT
partitioned disks destroy action is not supported.

Reviewed by:	ivoras (previous version)
MFC after:	1 month
2012-03-19 13:14:44 +00:00
ae
1fa040c42d Make kern.geom.part node not static. Also add CTLFLAG_TUN to the
check_integrity sysctl.

MFC after:	1 month
2012-03-19 12:57:52 +00:00
ae
924dbaa61e Add MODULE_DEPEND() to geom_part modules.
MFC after:	2 weeks
2012-03-15 08:39:10 +00:00
emaste
459dac9ae1 Remove unactionable message about label geometry
It's not clear to a user what they should do after seeing the "geometry
does not match label" kernel message, and it does not appear to present
a problem in practice.  Thus, just remove the messages.

Approved by:	marcel
2012-03-08 01:48:44 +00:00
ae
d92d49e46f If nested scheme allows dump kernel to its partition, we may allow
dump for the parent partition too.

MFC after:	2 weeks
2012-02-20 06:35:52 +00:00
ae
db69f3b1ae Add alias for the partition type 0x0f. Now "ebr" name is used for both
types 0x05 and 0x0f, but 0x05 is preferred and used when partition is
created with "gpart add -t ebr ...".
This should keep EBR partitions accessible after r231754 for those,
who have EBR on the partition with type 0x0f.
2012-02-20 05:48:57 +00:00
ae
e5861657ac Add additional check to EBR probe and create methods:
don't try probe and create  EBR scheme when parent partition type
is not "ebr". This fixes error messages about corrupted EBR for
some partitions where is actually another partition scheme.

NOTE: if you have EBR on the partition with different than "ebr"
(0x05) type, then you will lost access to partitions until it will be
changed.

MFC after:	2 weeks
2012-02-15 10:33:29 +00:00
ae
3ccaf222af Add PART::type attribute handler. It returns partition type as string.
MFC after:	2 weeks
2012-02-15 10:02:19 +00:00
ae
3c2d86e2b5 Add alias for the partition with type 0x42 to the MBR scheme.
MFC after:	1 week
2012-02-10 09:55:18 +00:00
ae
7a50922fdb Let's be more realistic and limit maximum number of partition to 4k.
MFC after:	1 week
2012-02-10 06:44:30 +00:00
kib
52c17430bc Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
emaste
d6c0a587a3 Correct typo in comment (numbver) 2012-02-04 18:14:39 +00:00
ae
cc4809fb9a The scheme code may not know about some inconsistency in the metadata.
So, add an integrity check after recovery attempt.

MFC after:	1 week
2012-02-01 09:28:16 +00:00
attilio
1521eb4479 Avoid to check the same cache line/variable from all the locking
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to td_*locks members as
they will be accessed mostly in the same codepaths as td_stopsched and
this results in avoiding a further cache-line pollution, possibly.

STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to
take advantage of already cached curthread, but in the end there should
not really be a performance benefit, while introducing a KPI breakage.

In collabouration with:	flo
Reviewed by:	avg
MFC after:	3 months (or never)
X-MFC:		r228424
2012-01-28 14:00:21 +00:00
nwhitehorn
b41629d499 Experimental support for booting CHRP-type PowerPC systems from hard disks. 2012-01-25 03:37:39 +00:00
truckman
906ade7185 Allow an MBR primary or extended Linux swap partition to be specified
as the system dump device.  This was already allowed for GPT.  The Linux
swap metadata at the beginning of the partition should not be disturbed
because the crash dump is written at the end.

Reviewed by:	alfred, pjd, marcel
MFC after:	2 weeks
2012-01-13 18:32:56 +00:00
jimharris
7b24e93323 Add support for >2TB disks in GEOM RAID for Intel metadata format.
Reviewed by: mav
Approved by: scottl
MFC after: 1 week
2012-01-09 23:01:42 +00:00
ray
f86cbc8446 GEOM_UNCOMPRESS module, can be used with uzip images and with new ulzma images.
Approved by:	adrian (mentor)
2012-01-04 23:39:11 +00:00
avg
abb1713421 replace uses of libkern gets with cngets
MFC after:	2 months
2011-12-17 15:26:34 +00:00
mav
96d9af62bb Close race between geom destruction on g_vfs_close() when softc destroyed
and g_vfs_orphan() call that tries to access softc, intruced at r227015.

PR:		kern/162997
2011-12-02 17:09:48 +00:00
ae
65be46df50 Add an ability to increase number of allocated APM entries when we
have reserved free space in the APM area.
Also instead of one write request per each APM entry, use MAXPHY
sized writes when we are updating APM.

MFC after:	1 month
2011-11-28 16:07:26 +00:00
ae
a267596c2a The size of APM could be bigger than number of already allocated entries.
And the first usable sector should not start from the inside of APM area.

MFC after:	1 month
2011-11-28 12:38:24 +00:00
mav
37d5d5108c Temporary revert r227009 to fix freeze on UP systems without PREEMPTION.
Before r215687, if some withered geom or provider could not be destroyed,
g_event thread went to sleep for 0.1s before retrying. After that change
it is just restarting immediately. r227009 made orphaned (withered) provider
to not detach immediately, but only after context switch. That made loop
inside g_event thread infinite on UP systems without PREEMPTION.

To address original problem with possible dead lock addressed by r227009
we have to fix r215687 change first, that needs some time to think and test.
2011-11-14 19:32:05 +00:00
mav
ec5f778a89 Major GEOM MULTIPATH class rewrite:
- Improved locking and destruction process to fix crashes.
 - Improved "automatic" configuration method to make it consistent and safe
by reading metadata back from all specified paths after writing to one.
 - Added provider size check to reduce chance of ordering conflict with
other GEOM classes.
 - Added "manual" configuration method without using on-disk metadata.
 - Added "add" and "remove" commands to allow manage paths manually.
 - Failed paths are no longer dropped from geom, but only marked as FAIL
and excluded from I/O operations.
 - Automatically restore failed paths when all others paths are marked
as failed, for example, because of device-caused (not transport) errors.
 - Added "fail" and "restore" commands to manually control FAIL flag.
 - geom is now destroyed on last path disconnection.
 - Added optional Active/Active mode support. Unlike Active/Passive
mode, load evenly distributed between all working paths. If supported by
the device, it allows to significantly improve performance, utilizing
bandwidth of all paths. It is controlled by -A option during creation.
Disabled by default now.
 - Improved `status` and `list` commands output.

Sponsored by:	iXsystems, inc.
MFC after:	1 month
2011-11-12 09:52:27 +00:00
ed
0c56cf839d Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
ed
e97eae1577 Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.
This means that their use is restricted to a single C file.
2011-11-07 06:44:47 +00:00
mav
f082fe73d7 Add mutex and two flags to make orphan() call properly asynchronous:
- delay consumer closing and detaching on orphan() until all I/Os complete;
 - prevent new I/Os submission after orphan() called.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
2011-11-02 09:24:59 +00:00
mav
98d097d322 Make orphan() method in geom_dev asynchronous using destroy_dev_sched_cb()
instead of destroy_dev(). It moves device destruction waiting out of the
topology lock and so fixes dead lock between orphanization and closing.
Real provider and geom destruction called from swi context after device
destroyed as callback of the destroy_dev_sched_cb().
2011-11-01 23:12:22 +00:00
mav
973c472be5 Refactor disk disconnection and geom destruction handling sequences.
Do not close/destroy opened consumer directly in case of disconnect. Instead
keep it existing until it will be closed in regular way in response to
upstream provider destruction. Delay geom destruction in the same way.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
2011-11-01 20:56:19 +00:00
mav
1cbd492c70 Refactor disk disconnection and geom destruction handling sequences.
Do not close/destroy opened consumer directly in case of disconnect. Instead
keep it existing until it will be closed in regular way in response to
upstream provider destruction. Delay geom destruction in the same way.
Previous implementation could destroy consumers still having active
requests and worked only because of global workaround made on GEOM level.
2011-11-01 17:04:42 +00:00
mav
80b1c68a33 Workaround the problem introduced by combination of r162200 and r215687.
r162200 delays provider orphanization until all running requests complete,
to workaround broken orphan() method implementation in some classes.
r215687 removes persistent periodic (10Hz) event thread wake ups.
Together these changes can indefinitely delay orphanization until some
other event wake up the event thread. One consequence of this is inability
of CAM to destroy device disconnected when busy and, as consequence, create
new one after reconnection.

While the best solution would be to revert r162200, it is not easy, as
some classes still look broken in that way. Instead conditionally wake up
event thread if there are some providers waiting for orphanization.

MFC after:	1 week
2011-11-01 08:57:49 +00:00
ae
97fe037955 Our geom withering function could take some time before geom with its
providers and consumers will be destroyed.  Before take some actions
with a geom, check that it is not destroyed at the moment.

Tested by:	nwhitehorn
MFC after:	1 week
2011-10-28 11:45:24 +00:00
pjd
247ca8dfcf Before this change when GELI detected hardware crypto acceleration it will
start only one worker thread. For software crypto it will start by default
N worker threads where N is the number of available CPUs.

This is not optimal if hardware crypto is AES-NI, which uses CPU for AES
calculations.

Change that to always start one worker thread for every available CPU.
Number of worker threads per GELI provider can be easly reduced with
kern.geom.eli.threads sysctl/tunable and even for software crypto it
should be reduced when using more providers.

While here, when number of threads exceeds number of CPUs avilable don't
reduce this number, assume the user knows what he is doing.

Reported by:	Yuri Karaban <dev@dev97.com>
MFC after:	3 days
2011-10-27 16:12:25 +00:00
mav
a4f906fd74 Clarify disks/volumes above 2TiB support in geom_raid:
- add support for volumes above 2TiB with Promise metadata format;
 - enforse and document other limitations:
   - Intel and Promise metadata formats do not support disks above 2TiB;
   - NVIDIA metadata format does not support volumes above 2TiB.

Sponsored by:	iXsystems, Inc.
MFC after:	2 weeks
2011-10-26 21:50:10 +00:00
pjd
2f532a77b7 Allow upper layers to discover than BIO_DELETE and/or BIO_FLUSH is not
supported by returning EOPNOTSUPP instead of 0 or ENODEV.

MFC after:	3 days
2011-10-25 14:07:17 +00:00
pjd
2536f7a375 Improve style a bit.
MFC after:	3 days
2011-10-25 14:05:39 +00:00
pjd
44f9590751 Simplify disk_alloc().
MFC after:	3 days
2011-10-25 14:04:59 +00:00
pjd
df493c630e Add support for creating GELI devices with older metadata version for use
with older FreeBSD versions:
- Add -V option to 'geli init' to specify version number. If no -V is given
  the most recent version is used.
- If -V is given don't allow to use features not supported by this version.
- Print version in 'geli list' output.
- Update manual page and add table describing which GELI version is
  supported by which FreeBSD version, so one can use it when preparing GELI
  device for older FreeBSD version.

Inspired by:	Garrett Cooper <yanegomi@gmail.com>
MFC after:	3 days
2011-10-25 13:57:50 +00:00
pjd
c819441658 When decoding metadata, check magic string, so we know this is not GELI device
before we check its version. We don't want to report that some garbage is
unsupported version if this is not even GELI provider.

MFC after:	3 days
2011-10-25 13:44:23 +00:00
pjd
48cda601eb Prefer G_ELI_VERSION_* defines for version numbers over plain digits.
MFC after:	3 days
2011-10-25 13:09:22 +00:00
pjd
3304c99aeb Fit lines into 80 chars.
MFC after:	3 days
2011-10-25 13:08:03 +00:00