freebsd-dev

Author	SHA1	Message	Date
Brian Behlendorf	451041db53	Shorten zio_* thread names Linux kernel thread names are expected to be short. This change shortens the zio thread names to 10 characters leaving a few characters to append the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.	2010-11-08 14:03:35 -08:00
Ned Bass	b1c5821375	Fix panic mounting unformatted zvol On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL file pointer if the user tries to mount a zvol without a filesystem on it. This change adds checks to prevent a null pointer dereference. Closes #73. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-29 14:46:33 -07:00
Brian Behlendorf	baa40d45cb	Fix missing 'zpool events' It turns out that 'zpool events' over 1024 bytes in size were being silently dropped. This was discovered while writing the zfault.sh tests to validate common failure modes. This could occur because the zfs interface for passing an arbitrary size nvlist_t over an ioctl() is to provide a buffer for the packed nvlist which is usually big enough. In this case 1024 bytes is the default. If the kernel determines the buffer is too small it returns ENOMEM and the minimum required size of the nvlist_t. This was working properly but in the case of 'zpool events' the event stream was advanced despite the error. Thus the retry with the bigger buffer would succeed but it would skip over the previous event. The fix is to pass this size to zfs_zevent_next() and determine before removing the event from the list if it will fit. This was preferable to checking after the event was returned because this avoids the need to rewind the stream.	2010-10-12 14:55:03 -07:00
Brian Behlendorf	a69052be7f	Initial zio delay timing While there is no right maximum timeout for a disk IO we can start laying the groundwork to measure how long they do take in practice. This change simply measures the IO time and if it exceeds 30s an event is posted for 'zpool events'. This value was carefully selected because for sd devices it implies that at least one timeout (SD_TIMEOUT) has occurred. Unfortunately, even with FAILFAST set we may retry a request and not get an error. This behavior is strongly dependent on the device driver and how it is hooked in to the scsi error handling stack. However by setting the limit at 30s we can log the event even if no error was returned. Slightly longer term we can start recording these delays perhaps as a simple power-of-two histogram. This histogram can then be reported as part of the 'zpool status' command when given a command line option. None of this code changes the internal behavior of ZFS. Currently it is simply for reporting excessively long delays.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	2959d94a0a	Add FAILFAST support ZFS works best when it is notified as soon as possible when a device failure occurs. This allows it to immediately start any recovery actions which may be needed. In theory Linux supports a flag which can be set on bio's called FAILFAST which provides this quick notification by disabling the retry logic in the lower scsi layers. That's the theory at least. In practice it turns out that while the flag exists you oddly have to set it with the BIO_RW_AHEAD flag. And even when it's set you may get retries if the low level driver decides that's the right behavior, or if you don't get the right error codes reported to the scsi midlayer. Unfortunately, without additional kernel patches there's not much which can be done to improve this. Basically, this just means that it may take 2-3 minutes before ZFS is notified properly that a device has failed. This can be improved and I suspect I'll be submitting patches upstream to handle this.	2010-10-12 14:55:02 -07:00
Brian Behlendorf	312c07edfd	Generate zevents for speculative and soft errors By default the Solaris code does not log speculative or soft io errors in either 'zpool status' or post an event. Under Linux we don't want to change the expected behavior of 'zpool status' so these io errors are still suppressed there. However, since we do need to know about these events for Linux FMA and the 'zpool events' interface is new we do post the events. With the addition of the zio_flags field the posted events now contain enough information that a user space consumer can identify and discard these events if it sees fit.	2010-10-12 14:55:00 -07:00
Brian Behlendorf	d148e95156	Fix negative zio->io_error which must be positive. All the upper layers of zfs expect zio->io_error to be positive. I was careful but I missed one instance in vdev_disk_physio_completion() which could return a negative error. To ensure all cases are always caught I have additionally added an ASSERT() to check this before zio_interpret(). Finally, as a debugging aid when zfs is built with --enable-debug all errors from the backing block devices will be reported to the console with an error message like this: ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440	2010-10-12 14:55:00 -07:00
Brian Behlendorf	398f129ca3	Suppress large kmem_alloc() warning. Observed during failure mode testing, dsl_scan_setup_sync() allocates 73920 bytes. This is way over the limit of what is wise to do with a kmem_alloc() and it should probably be moved to a slab. For now I'm just flagging it with KM_NODEBUG to quiet the error until this can be revisited.	2010-10-12 14:54:59 -07:00
Ned Bass	3a7381e531	Use stored whole_disk property when opening a vdev This commit fixes a bug in vdev_disk_open() in which the whole_disk property was getting set to 0 for disk devices, even when it was stored as a 1 when the zpool was created. The whole_disk property lets us detect when the partition suffix should be stripped from the device name in CLI output. It is also used to determine how writeback cache should be set for a device. When an existing zpool is imported its configuration is read from the vdev label by user space in zpool_read_label(). The whole_disk property is saved in the nvlist which gets passed into the kernel, where it in turn gets saved in the vdev struct in vdev_alloc(). Therefore, this value is available in vdev_disk_open() and should not be overridden by checking the provided device path, since that path will likely point to a partition and the check will return the wrong result. We also add an ASSERT that the whole_disk property is set. We are not aware of any cases where vdev_disk_open() should be called with a config that doesn't have this property set. The ASSERT is there so that when debugging is enabled we can identify any legitimate cases that we are missing. If we never hit the ASSERT, we can at some point remove it along with the conditional whole_disk check. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-04 13:53:18 -07:00
Ricardo M. Correia	0151834d65	Register the space accounting callback even when we don't have the ZPL. This callback is needed for properly accounting the per-uid and per-gid space usage. Even if we don't have the ZPL, we still need this callback in order to have proper on-disk ZPL compatibility and to be able to use Lustre quotas. Fortunately, the callback doesn't have any ZPL/VFS dependencies so we can just move it out of #ifdef HAVE_ZPL. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-10-04 11:34:39 -07:00
Ricardo M. Correia	368f4c10ae	Export ZFS symbols needed by Lustre. Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-09-17 16:24:15 -07:00
Ricardo M. Correia	1e411a4c12	Quiet down very frequent large allocation warning in ZFS. In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG to quiet down this known large alloc. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-09-17 16:24:15 -07:00
Brian Behlendorf	6283f55ea1	Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is to allow configuration/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to be verified on each of those supported distributions preferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual include/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.	2010-09-08 12:38:56 -07:00
Brian Behlendorf	f5e79474f0	Fix zfsdev_compat_ioctl() case For the !CONFIG_COMPAT case fix the zfsdev_compat_ioctl() compatibility function name. This was caught by the chaos4.3 builder.	2010-09-01 16:00:15 -07:00
Brian Behlendorf	302ef1517e	Add linux zpios support Linux kernel implementation of PIOS test app. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:42:01 -07:00
Brian Behlendorf	d603ed6c27	Add linux user disk support This topic branch contains all the changes needed to integrate the user side zfs tools with Linux style devices. Primarily this includes fixing up the Solaris libefi library to be Linux friendly, and integrating with the libblkid library which is provided by e2fsprogs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:42:00 -07:00
Brian Behlendorf	054bc00b4c	Add linux compatibility Resolve minor Linux compatibility issues. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:59 -07:00
Brian Behlendorf	7b89a54996	Add linux spa thread support Disable the spa thread under Linux until it can be implemented. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:59 -07:00
Brian Behlendorf	9c905c550b	Add linux sha2 support The upstream ZFS code has correctly moved to a faster native sha2 implementation. Unfortunately, under Linux that's going to be a little problematic so we revert the code to the more portable version contained in earlier ZFS releases. Using the native sha2 implementation in Linux is possible but the API is slightly different in kernel versus user space depending on which libraries are used. Ideally, we need a fast implementation of SHA256 which builds as part of ZFS; this shouldn't be that hard to do but it will take some effort. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:59 -07:00
Brian Behlendorf	c28b227942	Add linux kernel module support Setup linux kernel module support, this includes: - zfs context for kernel/user - kernel module build system integration - kernel module macros - kernel module symbol export - kernel module options Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:58 -07:00
Brian Behlendorf	00b46022c6	Add linux kernel memory support Required kmem/vmem changes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:57 -07:00
Brian Behlendorf	60101509ee	Add linux kernel disk support Native Linux vdev disk interfaces Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:57 -07:00
Brian Behlendorf	325f023544	Add linux kernel device support This branch contains the majority of the changes required to cleanly integrate with Linux style special devices (/dev/zfs). Mainly this means dropping all the Solaris style callbacks and replacing them with the Linux equivalents. This patch also adds the onexit infrastructure needed to track some minimal state between ioctls. Under Linux it would be easy to do this simply using the file->private_data. But under Solaris they apparently need to pass the file descriptor as part of the ioctl data and then perform a lookup in the kernel. Once again to keep code change to a minimum I've implemented the Solaris solution. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:50 -07:00
Brian Behlendorf	47d0ed1e6f	Add linux spl debug support Use spl debug if HAVE_SPL defined Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:50 -07:00
Brian Behlendorf	d2c15e84e9	Add linux mlslabel support The ZFS update to onnv_141 brought with it support for a security label attribute called mlslabel. This feature depends on zones to work correctly and thus I am disabling it under Linux. Equivalent functionality could be added at some point in the future. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:49 -07:00
Brian Behlendorf	266852767f	Add linux events This topic branch leverages the Solaris style FMA call points in ZFS to create a user space visible event notification system under Linux. This new system is called zevent and it unifies all previous Solaris style ereports and sysevent notifications. Under this Linux specific scheme when a sysevent or ereport event occurs an nvlist describing the event is created which looks almost exactly like a Solaris ereport. These events are queued up in the kernel when they occur and conditionally logged to the console. It is then up to a user space application to consume the events and do whatever it likes with them. To make this possible the existing /dev/zfs ABI has been extended with two new ioctls which behave as follows. * ZFS_IOC_EVENTS_NEXT Get the next pending event. The kernel will keep track of the last event consumed by the file descriptor and provide the next one if available. If no new events are available the ioctl() will block waiting for the next event. This ioctl may also be called in a non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the non-blocking case if no events are available ENOENT will be returned. It is possible that ESHUTDOWN will be returned if the ioctl() is called while module unloading is in progress. And finally ENOMEM may occur if the provided nvlist buffer is not large enough to contain the entire event. * ZFS_IOC_EVENTS_CLEAR Clear all events queued by the kernel. The kernel will keep a fairly large number of recent events queued, use this ioctl to clear the in kernel list. This will affect all user space processes consuming events. The zpool command has been extended to use this events ABI with the 'events' subcommand. You may run 'zpool events -v' to output a verbose log of all recent events. This is very similar to the Solaris 'fmdump -ev' command with the key difference being it also includes what would be considered sysevents under Solaris. 
You may also run in follow mode with the '-f' option. To clear the in kernel event queue use the '-c' option. $ sudo cmd/zpool/zpool events -fv TIME CLASS May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync class = "ereport.fs.zfs.config.sync" ena = 0x40982b7897700001 detector = (embedded nvlist) version = 0x0 scheme = "zfs" pool = 0xed976600de75dfa6 (end detector) time = 0x4bec8bc3 0x2e5aed98 pool = "zpios" pool_guid = 0xed976600de75dfa6 pool_context = 0x0 While the 'zpool events' command is handy for interactive debugging it is not expected to be the primary consumer of zevents. This ABI was primarily added to facilitate the addition of a user space monitoring daemon. This daemon would consume all events posted by the kernel and based on the type of event perform an action. For most events simply forwarding them on to syslog is likely enough. But this interface also cleanly allows for more sophisticated actions to be taken such as generating an email for a failed drive. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:36 -07:00
Brian Behlendorf	c9c0d073da	Add build system Add autoconf style build infrastructure to the ZFS tree. This includes autogen.sh, configure.ac, m4 macros, some scripts/*, and makefiles for all the core ZFS components.	2010-08-31 13:41:27 -07:00
Brian Behlendorf	6656bf5621	Fix stack traverse_visitbp() Due to limited stack space recursive functions are frowned upon in the Linux kernel. However, they often are the most elegant solution to a problem. The following code preserves the recursive function traverse_visitbp() but moves the local variables AND function arguments to the heap to minimize the stack frame size. Enough space is initially allocated on the stack for 20 levels of recursion. This change does ugly-up-the-code but it reduces the worst case usage from roughly 4160 bytes to 960 bytes on x86_64 archs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:50 -07:00
Ned Bass	da6b4005c9	Fix stack zio_execute() Implement zio_execute() as a wrapper around the static function __zio_execute() so that we can force __zio_execute() to be inlined. This reduces stack overhead which is important because __zio_execute() is called recursively in several zio code paths. zio_execute() itself cannot be inlined because it is externally visible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:50 -07:00
Brian Behlendorf	c776b317e4	Fix stack zio_done() Eliminated local variables pointing to members of the zio struct. Just refer to the struct members directly. This saved about 32 bytes per call, but this function can be called recursively up to 19 levels deep, so we potentially save up to 608 bytes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:50 -07:00
Brian Behlendorf	5fed499def	Fix stack vdev_cache_read() Moving the vdev_cache_entry_t struct ve_search from the stack to the heap saves ~100 bytes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:49 -07:00
Brian Behlendorf	47050a88ac	Fix stack traverse_impl() Stack use reduced from 560 bytes to 128 bytes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:49 -07:00
Brian Behlendorf	60948de1ef	Fix stack noinline Certain functions must never be automatically inlined by gcc because they are stack heavy or called recursively. This patch flags all such functions I've found as 'noinline' to prevent gcc from making the optimization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:49 -07:00
Brian Behlendorf	18a89ba43d	Fix stack lzjb Reduce kernel stack usage by lzjb_compress() by moving uint16 array off the stack and on to the heap. The exact performance implications of this I have not measured but we absolutely need to keep stack usage to a minimum. If/when this becomes an issue we optimize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:49 -07:00
Brian Behlendorf	bf701a83c5	Fix stack inline Decrease stack usage for various call paths by forcing certain functions to be inlined. By inlining the functions the overhead of a new stack frame is removed at the cost of increased code size. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:48 -07:00
Brian Behlendorf	161ce7ce3c	Fix stack dsl_scan_visitbp() To reduce stack overhead this topic branch moves the 128 byte blkptr_t data structure in dsl_scan_visitbp() to the heap. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:48 -07:00
Brian Behlendorf	fcf37ec6c2	Fix stack dsl_dir_open_spa() Reduce stack usage by 256 bytes by moving buf char array from the stack to the heap. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:48 -07:00
Brian Behlendorf	48c67dc8f8	Fix stack dsl_deleg_get() Reduce stack usage in dsl_deleg_get, gcc flagged it as consuming a whopping 1040 bytes or potentially 1/4 of a 4K stack. This patch moves all the large structures and buffers off the stack and on to the heap. This includes 2 zap_cursor_t structs each 52 bytes in size, 2 zap_attribute_t structs each 280 bytes in size, and 1 256 byte char array. The total savings on the stack is 880 bytes after you account for the 5 new pointers added. Also the source buffer length has been increased from MAXNAMELEN to MAXNAMELEN+strlen(MOS_DIR_NAME)+1 as described by the comment in dsl_dir_name(). A buffer overrun may have been possible with the slightly smaller buffer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:48 -07:00
Brian Behlendorf	81a4966389	Fix stack dsl_dataset_destroy() Move dsl_dataset_t local variable from the stack to the heap. This reduces the stack usage of this function from 2048 bytes to 176 bytes for x86_64 arches. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:48 -07:00
Brian Behlendorf	a8ac8e715e	Fix stack dmu_objset_snapshot() Reduce stack usage by 276 bytes by moving the snaparg struct from the stack to the heap. We have limited stack space we must not waste. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:47 -07:00
Brian Behlendorf	fc5bb51f08	Fix stack dbuf_hold_impl() This commit preserves the recursive function dbuf_hold_impl() but moves the local variables and function arguments to the heap to minimize the stack frame size. Enough space is initially allocated on the stack for 20 levels of recursion. This technique was based on commit 34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of traverse_visitbp(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:47 -07:00
Brian Behlendorf	5ac1241a95	Fix dnode_move() scope The dnode_move() functionality is only used in the kernel build. As such we should be careful to wrap all of the related code with '#ifdef _KERNEL' to avoid gcc warnings about unused code.	2010-08-31 08:38:47 -07:00
Brian Behlendorf	8a8f5c6b3c	Fix zfs_ioc_objset_stats Interestingly this looks like an upstream bug as well. If for some reason we are unable to get a zvol's statistics, because perhaps the zpool is hopelessly corrupt, we would trigger the VERIFY. This commit adds the proper error handling just to propagate the error back to user space. Now the user space tools still must handle this properly but in the worst case the tool will crash or perhaps have some missing output. That's far, far better than crashing the host. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:47 -07:00
Brian Behlendorf	5cc556b447	Fix zio_taskq_dispatch to use TQ_NOSLEEP The zio_taskq_dispatch() function may be called at interrupt time and it is critical that we never sleep. Additionally, wrap taskq_dispatch() in a while loop because it may fail. This is non optimal but is OK for now. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:46 -07:00
Brian Behlendorf	ef5319df8e	Fix rw_init() usage Properly initialize rwlock primitives. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:46 -07:00
Brian Behlendorf	eaa8687be3	Fix zmod.h usage in userspace Do not use zmod.h in userspace. This has also been filed with the ZFS team. It makes the userspace libzpool code use the zlib API, instead of the Solaris-only and non-standard zmod.h. The zlib API is almost identical and is a de facto standard, so this is a no-brainer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:46 -07:00
Brian Behlendorf	3f50448292	Fix missing newlines Add missing \n's to dprintf()s Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:46 -07:00
Brian Behlendorf	22c81dd8a9	Fix metaslab If you're only going to allow one allocator to be used and it is defined at compile time there is no point including the others in the build. This patch could/should be refined for Linux to make the metaslab configurable at run time. That might be a bit tricky however since you would need to quiesce all IO. Short of that making it configurable as a module load option would be a reasonable compromise. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:45 -07:00
Brian Behlendorf	98f72a539c	Fix list handling to only use the API Remove all instances of list handling where the API is not used and instead list data members are directly accessed. Doing this sort of thing is bad for portability. Additionally, ensure that list_link_init() is called on newly created list nodes. This ensures the node is properly initialized and does not rely on the assumption that zero'ing the list_node_t via kmem_zalloc() is the same as proper initialization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:45 -07:00
Brian Behlendorf	59e6e7ca85	Fix kstat xuio Move xuio stat structures from a header to the dmu.c source as is done with all the other kstat interfaces. This information is local to dmu.c, which registers the xuio kstat, and should stay that way. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:45 -07:00
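
Commit baa40d45cb above describes checking whether an event fits the caller's buffer before removing it from the list, so that a retry with a larger buffer does not skip an event. The following is a minimal user-space sketch of that idea in Python, not the ZFS kernel code: the `EventQueue` class, `next_event()`, and `read_all()` are hypothetical names invented for illustration, and events are modeled as plain byte strings rather than packed nvlists.

```python
import errno
from collections import deque


class EventQueue:
    """Toy stand-in for the in-kernel event list (hypothetical, not the ZFS API)."""

    def __init__(self, events):
        self._events = deque(events)

    def next_event(self, buf_size):
        """Return (0, event) if the next event fits in buf_size bytes.

        If the event is too large, return (ENOMEM, required_size) and leave
        the event on the queue, so the stream is not advanced on error.
        If the queue is empty, return (ENOENT, None).
        """
        if not self._events:
            return errno.ENOENT, None
        required = len(self._events[0])
        if required > buf_size:
            # Size check happens BEFORE removal: nothing to rewind later.
            return errno.ENOMEM, required
        return 0, self._events.popleft()


def read_all(queue, buf_size=1024):
    """Drain the queue, growing the buffer whenever ENOMEM is reported."""
    received = []
    while True:
        err, value = queue.next_event(buf_size)
        if err == errno.ENOENT:
            return received
        if err == errno.ENOMEM:
            buf_size = value  # retry with the size the queue asked for
            continue
        received.append(value)
```

With a queue holding events of 10, 2000, and 5 bytes and the default 1024-byte buffer, `read_all()` returns all three in order; the 2000-byte event is delivered on the retry instead of being dropped.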


79 Commits